varepsilon / clickmodels

ClickModels is a small set of Python scripts for the user click models initially developed at Yandex. A click model is a probabilistic graphical model used to predict search engine click data from past observations. This project deals with click models used in Information Retrieval (see README.md) and is intended to be easy to read and easy to modify. If it's not, please let me know how to improve it :)

License: BSD 3-Clause "New" or "Revised" License

Python 95.58% CSS 0.99% HTML 3.43%

clickmodels's Issues

Problem About Parameter Estimation of UBM

Thanks for sharing the code.

While testing the UBM model, I ran into a question.
UBM uses the EM algorithm to estimate its parameters. For example, for a specific query q and item u, we calculate the attractiveness parameter Aqu.
In the code, the numerator and denominator of Aqu are accumulated separately (An, Ad) and then combined as Aqu = An / Ad.
The problem is that if the first item u1 has only a few clicks, its attractiveness parameter may still exceed that of the second item u2, which has many clicks.
For example:
the first item u1: An1 = Ad1 = 100
the second item u2: An2 = 9000, Ad2 = 10000
As a result, Aqu1 will be larger than Aqu2.

This doesn't seem right to me, because the second item has a larger An.
Have you run into a similar issue with the UBM model?
Looking forward to any reply. Thanks.
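
The arithmetic in the example can be checked with a short sketch. It assumes An and Ad are plain accumulated sums, and it optionally adds the pseudo-counts [1.0, 2.0] that the UBM training loop quoted further down this page uses as initial values. The function name `attractiveness` is hypothetical and not part of the library.

```python
def attractiveness(numerator, denominator, prior=(0.0, 0.0)):
    """Ratio An / Ad, optionally smoothed with prior pseudo-counts (assumption)."""
    return (numerator + prior[0]) / (denominator + prior[1])

# Raw ratios from the example in the question.
a_u1 = attractiveness(100, 100)      # every counted impression led to a click
a_u2 = attractiveness(9000, 10000)   # 90% of counted impressions led to a click

# Same ratios with the [1.0, 2.0] pseudo-counts added.
a_u1_smoothed = attractiveness(100, 100, prior=(1.0, 2.0))
a_u2_smoothed = attractiveness(9000, 10000, prior=(1.0, 2.0))

print(a_u1, a_u2)
print(round(a_u1_smoothed, 4), round(a_u2_smoothed, 4))
```

Because the estimate is a ratio, it measures how often an item is clicked relative to how often it is counted, not the absolute number of clicks. Under that reading, u1 ranking above u2 follows from the definition, although the estimate for u1 rests on far less data.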

Add UnitTests

We have lots of class methods, but only "test" a subset of them by running ./inference.py < data/click_log_sample.tsv 2>inference.log

We need proper unit testing.
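
As a rough idea of what such tests could look like, here is a minimal sketch using the standard `unittest` module. The helper `m_step_ratio` is hypothetical: it is defined here only so the example is self-contained, and it does not exist in the repository. Real tests would target the actual class methods.

```python
import unittest


def m_step_ratio(fraction):
    """Hypothetical helper: turn a [numerator, denominator] pair into a probability."""
    numerator, denominator = fraction
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


class MStepRatioTest(unittest.TestCase):
    def test_prior_only(self):
        # With only the pseudo-counts [1.0, 2.0] the estimate is 0.5.
        self.assertAlmostEqual(m_step_ratio([1.0, 2.0]), 0.5)

    def test_result_is_probability(self):
        self.assertTrue(0.0 <= m_step_ratio([9001.0, 10002.0]) <= 1.0)

    def test_rejects_non_positive_denominator(self):
        with self.assertRaises(ValueError):
            m_step_ratio([1.0, 0.0])
```

Saved in a file, tests like these could be run with `python -m unittest`.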

output explain?

What do the output fields mean?

problem about parameter estimation of Au

def _getSessionEstimate(self, positionRelevances, layout, clicks, intent):
    ....
    gamma = self.getGamma(self.gammas, k, layout, intent)
    # E_k_multiplier --- P(S_k = 0 | C_k) P(C_k | E_k = 1)
    if C_k == 0:
        sessionEstimate['a'][k] = a_u * varphi[k][0]
        sessionEstimate['s'][k] = 0.0
    else:
        sessionEstimate['a'][k] = 1.0
        sessionEstimate['s'][k] = varphi[k + 1][0] * s_u / (s_u + (1 - gamma) * (1 - s_u))
    ....

In the function above, sessionEstimate['a'][k] = a_u * varphi[k][0] may be wrong. Should it be a_u * varphi[k][1]?

Reason: a_u = P(C_k = 1 | E_k = 1)
varphi[k][1] = P(E_k = 1 | C_1, C_2, C_3, ..., C_N)

So:
sessionEstimate['a'][k] = P(C_k = 1 | C_1, C_2, C_3, ..., C_N) = P(C_k = 1 | E_k = 1) * P(E_k = 1 | C_1, C_2, C_3, ..., C_N) = a_u * varphi[k][1]

Do you agree? Please help me resolve this. Thank you.
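
One way to sanity-check the two candidate formulas is to enumerate a toy model by brute force. The sketch below makes several assumptions. It uses a single position, treats attractiveness A and examination E as independent a priori, and defines a click as C = A and E. It also assumes that sessionEstimate['a'][k] is meant to hold the posterior of A_k = 1 rather than of C_k = 1, which the source does not confirm. All function names are hypothetical.

```python
from itertools import product


def posteriors_given_no_click(a_u, p_exam):
    """Return (P(A=1 | C=0), P(E=0 | C=0)) by enumerating all (A, E) pairs."""
    p_no_click = 0.0
    p_attr_and_no_click = 0.0
    p_unexamined_and_no_click = 0.0
    for attractive, examined in product((0, 1), repeat=2):
        p = (a_u if attractive else 1 - a_u) * (p_exam if examined else 1 - p_exam)
        clicked = attractive and examined
        if not clicked:
            p_no_click += p
            if attractive:
                p_attr_and_no_click += p
            if not examined:
                p_unexamined_and_no_click += p
    return (p_attr_and_no_click / p_no_click,
            p_unexamined_and_no_click / p_no_click)


a_u, p_exam = 0.4, 0.7
post_attr, post_not_examined = posteriors_given_no_click(a_u, p_exam)
print(post_attr, a_u * post_not_examined)
```

In this toy model the two printed values coincide: an unclicked document can only be attractive if it was not examined, so P(A = 1 | C = 0) = a_u * P(E = 0 | C = 0). That has the same shape as a_u * varphi[k][0], provided varphi[k][0] is the posterior of E_k = 0. Whether the same argument carries over to the full model in the library is not established here.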

How to generate DBN_output.log

I was trying to generate DBN_output.log using the command below:
python2 bin/run_inference.py < content/stats.log

I used the default config file.

The content of stats.log is shown below. How can I generate DBN_output.log?

1c2aa70e-b31b-43c9-a4e5-769377682d2b U2 0 ["350696","351003","351098","350967","350640","350897","351105","351029"] [false,false,false,false,false,false,false,false] [1,1,1,1,1,1,1,1]

be58c52b-2379-4628-b637-fdde90ee47ff U2 0 0 ["351098","350967","351003","350640","350696","350897","351105","351029"] [false,false,false,false,false,false,false,false] [1,0,0,0,0,0,0,0]

e66984a6-dde5-4e4d-b54f-dcb6a0d8fb18 U2 0 0 ["351003","350696","351098","350967","350640","350897","351105","351029"] [false,false,false,false,false,false,false,false] [1,1,0,0,1,0,0,0]
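
A quick way to inspect lines like these before running inference is to parse them and check the field count. The sketch below assumes that each line should carry seven whitespace-separated fields and that the last three are JSON lists. The field names are guesses made for illustration and are not taken from this page. `parse_session_line` is a hypothetical helper, not part of the library.

```python
import json

# Assumed field order; the names are illustrative only.
FIELDS = ("session_id", "query", "region", "intent_weight",
          "urls", "layout", "clicks")


def parse_session_line(line):
    """Parse one log line into a dict, or return None if the field count is off."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    # The last three fields look like JSON lists in the sample above.
    for name in ("urls", "layout", "clicks"):
        record[name] = json.loads(record[name])
    return record


sample = ('be58c52b-2379-4628-b637-fdde90ee47ff U2 0 0 '
          '["351098","350967","351003","350640","350696","350897","351105","351029"] '
          '[false,false,false,false,false,false,false,false] [1,0,0,0,0,0,0,0]')
print(parse_session_line(sample)["clicks"])
```

With this check, the first sample line comes back as None because it has six fields where the other two have seven, which may be worth verifying in the original log.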

ubm's model doesn't update through iteration?

I am reading the code.
The UBM code below looks weird to me.
alphaFractions is re-initialized at the start of each iteration, to values like [1.0, 2.0].
But the code then assigns each iteration's alphaFractions result directly to self.alpha:
self.alpha[i][q][url] = new_alpha
Does this mean it will never change across iterations?
Should it instead be initialized like this at the start of each iteration?
alphaFractions = copy.deepcopy(self.alpha)

        for iteration_count in xrange(self.config.get('MAX_ITERATIONS', MAX_ITERATIONS)):
            self.queryIntentsWeights = defaultdict(lambda: [])
            # not like in DBN! xxxFractions[0] is a numerator while xxxFraction[1] is a denominator
            alphaFractions = dict((i, [defaultdict(lambda: [1.0, 2.0]) for q in xrange(max_query_id)]) for i in possibleIntents)
            gammaFractions = [[[[1.0, 2.0] \
                for d in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY))] \
                    for r in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY))] \
                        for g in xrange(self.gammaTypesNum)]
            if self.explorationBias:
                eFractions = [[1.0, 2.0] \
                        for p in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY))]
            # E-step
            for s in sessions:
                query = s.query
                layout = [False] * len(s.layout) if self.ignoreLayout else s.layout
                if self.explorationBias:
                    explorationBiasPossible = any((l and c for (l, c) in zip(s.layout, s.clicks)))
                    firstVerticalPos = -1 if not any(s.layout[:-1]) else [k for (k, l) in enumerate(s.layout) if l][0]
                if self.ignoreIntents:
                    p_I__C_G = {False: 1.0, True: 0}
                else:
                    a = self._getSessionProb(s) * (1 - s.intentWeight)
                    b = 1 * s.intentWeight
                    p_I__C_G = {False: a / (a + b), True: b / (a + b)}
                self.queryIntentsWeights[query].append(p_I__C_G[True])
                prevClick = -1
                for rank, c in enumerate(s.clicks):
                    url = s.results[rank]
                    for intent in possibleIntents:
                        a = self.alpha[intent][query][url]
                        if self.explorationBias and explorationBiasPossible:
                            e = self.e[firstVerticalPos]
                        if c == 0:
                            g = self.getGamma(self.gamma, rank, prevClick, layout, intent)
                            gCorrection = 1
                            if self.explorationBias and explorationBiasPossible and not s.layout[k]:
                                gCorrection = 1 - e
                                g *= gCorrection
                            alphaFractions[intent][query][url][0] += a * (1 - g) / (1 - a * g) * p_I__C_G[intent]
                            self.getGamma(gammaFractions, rank, prevClick, layout, intent)[0] += g / gCorrection * (1 - a) / (1 - a * g) * p_I__C_G[intent]
                            if self.explorationBias and explorationBiasPossible:
                                eFractions[firstVerticalPos][0] += (e if s.layout[k] else e / (1 - a * g)) * p_I__C_G[intent]
                        else:
                            alphaFractions[intent][query][url][0] += 1 * p_I__C_G[intent]
                            self.getGamma(gammaFractions, rank, prevClick, layout, intent)[0] += 1 * p_I__C_G[intent]
                            if self.explorationBias and explorationBiasPossible:
                                eFractions[firstVerticalPos][0] += (e if s.layout[k] else 0) * p_I__C_G[intent]
                        alphaFractions[intent][query][url][1] += 1 * p_I__C_G[intent]
                        self.getGamma(gammaFractions, rank, prevClick, layout, intent)[1] += 1 * p_I__C_G[intent]
                        if self.explorationBias and explorationBiasPossible:
                            eFractions[firstVerticalPos][1] += 1 * p_I__C_G[intent]
                    if c != 0:
                        prevClick = rank
            if not self.config.get('PRETTY_LOG', PRETTY_LOG):
                sys.stderr.write('E')
            # M-step
            sum_square_displacement = 0.0
            for i in possibleIntents:
                for q in xrange(max_query_id):
                    for url, aF in alphaFractions[i][q].iteritems():
                        new_alpha = aF[0] / aF[1]
                        sum_square_displacement += (self.alpha[i][q][url] - new_alpha) ** 2
                        self.alpha[i][q][url] = new_alpha
            for g in xrange(self.gammaTypesNum):
                for r in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY)):
                    for d in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY)):
                        gF = gammaFractions[g][r][d]
                        new_gamma = gF[0] / gF[1]
                        sum_square_displacement += (self.gamma[g][r][d] - new_gamma) ** 2
                        self.gamma[g][r][d] = new_gamma
            if self.explorationBias:
                for p in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY)):
                    new_e = eFractions[p][0] / eFractions[p][1]
                    sum_square_displacement += (self.e[p] - new_e) ** 2
                    self.e[p] = new_e
            if not self.config.get('PRETTY_LOG', PRETTY_LOG):
                sys.stderr.write('M\n')
            rmsd = math.sqrt(sum_square_displacement)
            if self.config.get('PRETTY_LOG', PRETTY_LOG):
                sys.stderr.write('%d..' % (iteration_count + 1))
            else:
                print >>sys.stderr, 'Iteration: %d, ERROR: %f' % (iteration_count + 1, rmsd)
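
One possible reading of the loop above is that the accumulators restart from the pseudo-counts [1.0, 2.0] in every iteration, while the E-step still reads the current self.alpha (the line a = self.alpha[intent][query][url]). Each M-step result would then feed into the next iteration. The sketch below is a heavily simplified Python 3 illustration of that reading. It assumes a single attractiveness parameter and a fixed examination probability. `em_alpha` and its arguments are hypothetical names, not part of the library.

```python
def em_alpha(clicks, gamma, iterations=3, alpha=0.5):
    """Simplified EM for one attractiveness parameter with fixed examination gamma."""
    history = [alpha]
    for _ in range(iterations):
        # Accumulators restart from the prior pseudo-counts in every iteration.
        numerator, denominator = 1.0, 2.0
        for clicked in clicks:
            if clicked:
                numerator += 1.0
            else:
                # E-step: P(A = 1 | C = 0) depends on the *current* alpha.
                numerator += alpha * (1 - gamma) / (1 - alpha * gamma)
            denominator += 1.0
        # M-step: the new estimate replaces alpha and is used in the next pass.
        alpha = numerator / denominator
        history.append(alpha)
    return history


history = em_alpha([1, 1, 0, 0], gamma=0.5)
print(history)
```

In this toy setting the estimate moves from one iteration to the next even though the accumulators are reset, because the posterior added for unclicked results is computed from the previous estimate.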

Make it easy to not use intent probabilities

Currently all the new code has to support the intent probabilities. This requires thinking about correct probabilities and marginalization all the time and just makes the code complex. We should make the core easier to extend, possibly by dropping support of these probabilities.

the difference about your dbn model and other dbn model in github？

Hi,
I found another version of the DBN model. It differs a lot from your version. Can you explain which one is correct?

The other version is at:
        https://github.com/markovi/PyClick

Question about when I have multiple logs for the same query

Hi, this is just a question. I want to start experimenting with your library. I am not sure whether the input should have one line per query, or whether a query can appear on multiple lines. So if two visitors execute the same query but click differently, is that two lines?

Thanks
