varepsilon / clickmodels

ClickModels is a small set of Python scripts for the user click models initially developed at Yandex. A click model is a probabilistic graphical model used to predict search engine click data from past observations. This project deals with click models used in Information Retrieval (see README.md) and is intended to be easy to read and easy to modify. If it's not, please let me know how to improve it :)

License: BSD 3-Clause "New" or "Revised" License

Python 95.58% CSS 0.99% HTML 3.43%

clickmodels's Issues

Problem About Parameter Estimation of UBM

Thanks for sharing the code.

I ran into a question while testing the UBM model.
UBM uses the EM algorithm to estimate its parameters. For example, for a specific query q and item u, we estimate the attractiveness parameter Aqu.
In the code, the numerator and denominator of Aqu are accumulated separately (An, Ad) and then combined as Aqu = An / Ad.
The problem is that an item u1 with only a few clicks can end up with a higher attractiveness than an item u2 with many clicks.
For example:
the first item u1: An1 = Ad1 = 100.
the second item u2: An2 = 9000, Ad2 = 10000.
Aqu1 = 100 / 100 = 1.0 will be larger than Aqu2 = 9000 / 10000 = 0.9.

This does not seem right to me, because the second item has a much larger An.
Have you run into a similar problem with the UBM model?
Looking forward to any reply. Thanks.
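The arithmetic in the example can be checked with a short sketch. It assumes the estimate is a plain ratio of accumulated counts, optionally smoothed with the [1.0, 2.0] starting values that appear in the UBM loop quoted further down this page. The function name `attractiveness` and its `prior` parameter are hypothetical and are not part of the library.

```python
def attractiveness(numerator, denominator, prior=(1.0, 2.0)):
    """Ratio estimate with pseudo-counts (hypothetical helper, not library code)."""
    return (prior[0] + numerator) / (prior[1] + denominator)

# Counts taken from the example in the issue text.
a_u1 = attractiveness(100, 100)      # (1 + 100) / (2 + 100)
a_u2 = attractiveness(9000, 10000)   # (1 + 9000) / (2 + 10000)

# A ratio measures the fraction of impressions judged attractive,
# not the raw volume of clicks, so u1 still ranks above u2.
u1_ranks_higher = a_u1 > a_u2
```

Under these assumptions the pseudo-counts pull the small-sample estimate slightly toward 0.5, but they do not change the ordering of the two items.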

Add UnitTests

We have lots of class methods, but only "test" a subset of them by running ./inference.py < data/click_log_sample.tsv 2>inference.log

We need proper unit tests.
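As a rough illustration of what such tests could look like, here is a minimal sketch using the standard `unittest` module. The function under test, `estimate_ratio`, is a hypothetical stand-in for an M-step style ratio and is not a method of this project; real tests would target the project's own classes.

```python
import unittest


def estimate_ratio(numerator, denominator):
    """Hypothetical helper: ratio estimate with a [1.0, 2.0] pseudo-count prior."""
    return (1.0 + numerator) / (2.0 + denominator)


class EstimateRatioTest(unittest.TestCase):
    def test_no_observations_gives_prior(self):
        # With no data the estimate falls back to 1.0 / 2.0.
        self.assertAlmostEqual(estimate_ratio(0.0, 0.0), 0.5)

    def test_result_stays_in_unit_interval(self):
        for num, den in [(0.0, 10.0), (5.0, 10.0), (10.0, 10.0)]:
            self.assertTrue(0.0 < estimate_ratio(num, den) < 1.0)


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(EstimateRatioTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` returns a result object whose `wasSuccessful()` method reports whether every test passed.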

problem about parameter estimation of Au

def _getSessionEstimate(self, positionRelevances, layout, clicks, intent):
    ....
    gamma = self.getGamma(self.gammas, k, layout, intent)
    # E_k_multiplier --- P(S_k = 0 | C_k) P(C_k | E_k = 1)
    if C_k == 0:
        sessionEstimate['a'][k] = a_u * varphi[k][0]
        sessionEstimate['s'][k] = 0.0
    else:
        sessionEstimate['a'][k] = 1.0
        sessionEstimate['s'][k] = varphi[k + 1][0] * s_u / (s_u + (1 - gamma) * (1 - s_u))
    ....

In the function above, sessionEstimate['a'][k] = a_u * varphi[k][0] may be wrong. Should it be a_u * varphi[k][1]?

Reason: a_u = P(C_k = 1 | E_k = 1)
varphi[k][1] = P(E_k = 1 | C_1, C_2, C_3, ..., C_N)

So:
sessionEstimate['a'][k] = P(C_k = 1 | C_1, C_2, C_3, ..., C_N) = P(C_k = 1 | E_k = 1) * P(E_k = 1 | C_1, C_2, C_3, ..., C_N) = a_u * varphi[k][1]

Do you agree? Please help me resolve this, thank you.
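One way to reason about this question is a single-position toy check. It is a sketch under assumptions, not the library's code: it assumes that sessionEstimate['a'][k] is meant to be the posterior of the attractiveness variable A_k (rather than of the click C_k), that A_k and E_k are independent a priori, and that a click happens only when the result is both attractive and examined. All function names and the example values are hypothetical.

```python
from itertools import product


def posterior_attractive_given_no_click(a_u, p_exam):
    """P(A = 1 | C = 0) by brute-force enumeration over (A, E), with C = A and E."""
    p_no_click = 0.0
    p_attractive_and_no_click = 0.0
    for attractive, examined in product([0, 1], repeat=2):
        p = (a_u if attractive else 1 - a_u) * (p_exam if examined else 1 - p_exam)
        clicked = attractive and examined
        if not clicked:
            p_no_click += p
            if attractive:
                p_attractive_and_no_click += p
    return p_attractive_and_no_click / p_no_click


def posterior_examined_given_no_click(a_u, p_exam):
    """P(E = 1 | C = 0) in the same toy model."""
    return p_exam * (1 - a_u) / (1 - a_u * p_exam)


a_u, p_exam = 0.6, 0.7  # illustrative values only
exact = posterior_attractive_given_no_click(a_u, p_exam)
p_e1 = posterior_examined_given_no_click(a_u, p_exam)
with_not_examined = a_u * (1 - p_e1)  # analogue of a_u * varphi[k][0]
with_examined = a_u * p_e1            # analogue of a_u * varphi[k][1]
```

In this toy model the enumeration agrees with the a_u * P(E = 0 | C) form: when there is no click, an attractive result must have gone unexamined. Whether that carries over to the full session-level estimate depends on how varphi is defined in the actual code.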

How to generate DBN_output.log

I was trying to generate DBN_output.log using the command below.
python2 bin/run_inference.py < content/stats.log

I used the default config file.

The content of stats.log is below. How can I generate DBN_output.log?

1c2aa70e-b31b-43c9-a4e5-769377682d2b U2 0 ["350696","351003","351098","350967","350640","350897","351105","351029"] [false,false,false,false,false,false,false,false] [1,1,1,1,1,1,1,1]

be58c52b-2379-4628-b637-fdde90ee47ff U2 0 0 ["351098","350967","351003","350640","350696","350897","351105","351029"] [false,false,false,false,false,false,false,false] [1,0,0,0,0,0,0,0]

e66984a6-dde5-4e4d-b54f-dcb6a0d8fb18 U2 0 0 ["351003","350696","351098","350967","350640","350897","351105","351029"] [false,false,false,false,false,false,false,false] [1,1,0,0,1,0,0,0]

ubm's model doesn't update through iteration?

I am reading the code and find the following UBM code weird.
alphaFractions is re-initialized at the start of each iteration, to values like [1.0, 2.0].
But the code then assigns each iteration's alphaFractions result directly to self.alpha:
    self.alpha[i][q][url] = new_alpha
Does that mean alpha never changes across iterations?
Should alphaFractions instead be initialized like this at the start of each iteration?
    alphaFractions = copy.deepcopy(self.alpha)

        for iteration_count in xrange(self.config.get('MAX_ITERATIONS', MAX_ITERATIONS)):
            self.queryIntentsWeights = defaultdict(lambda: [])
            # not like in DBN! xxxFractions[0] is a numerator while xxxFraction[1] is a denominator
            alphaFractions = dict((i, [defaultdict(lambda: [1.0, 2.0]) for q in xrange(max_query_id)]) for i in possibleIntents)
            gammaFractions = [[[[1.0, 2.0] \
                for d in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY))] \
                    for r in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY))] \
                        for g in xrange(self.gammaTypesNum)]
            if self.explorationBias:
                eFractions = [[1.0, 2.0] \
                        for p in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY))]
            # E-step
            for s in sessions:
                query = s.query
                layout = [False] * len(s.layout) if self.ignoreLayout else s.layout
                if self.explorationBias:
                    explorationBiasPossible = any((l and c for (l, c) in zip(s.layout, s.clicks)))
                    firstVerticalPos = -1 if not any(s.layout[:-1]) else [k for (k, l) in enumerate(s.layout) if l][0]
                if self.ignoreIntents:
                    p_I__C_G = {False: 1.0, True: 0}
                else:
                    a = self._getSessionProb(s) * (1 - s.intentWeight)
                    b = 1 * s.intentWeight
                    p_I__C_G = {False: a / (a + b), True: b / (a + b)}
                self.queryIntentsWeights[query].append(p_I__C_G[True])
                prevClick = -1
                for rank, c in enumerate(s.clicks):
                    url = s.results[rank]
                    for intent in possibleIntents:
                        a = self.alpha[intent][query][url]
                        if self.explorationBias and explorationBiasPossible:
                            e = self.e[firstVerticalPos]
                        if c == 0:
                            g = self.getGamma(self.gamma, rank, prevClick, layout, intent)
                            gCorrection = 1
                            if self.explorationBias and explorationBiasPossible and not s.layout[rank]:
                                gCorrection = 1 - e
                                g *= gCorrection
                            alphaFractions[intent][query][url][0] += a * (1 - g) / (1 - a * g) * p_I__C_G[intent]
                            self.getGamma(gammaFractions, rank, prevClick, layout, intent)[0] += g / gCorrection * (1 - a) / (1 - a * g) * p_I__C_G[intent]
                            if self.explorationBias and explorationBiasPossible:
                                eFractions[firstVerticalPos][0] += (e if s.layout[rank] else e / (1 - a * g)) * p_I__C_G[intent]
                        else:
                            alphaFractions[intent][query][url][0] += 1 * p_I__C_G[intent]
                            self.getGamma(gammaFractions, rank, prevClick, layout, intent)[0] += 1 * p_I__C_G[intent]
                            if self.explorationBias and explorationBiasPossible:
                                eFractions[firstVerticalPos][0] += (e if s.layout[rank] else 0) * p_I__C_G[intent]
                        alphaFractions[intent][query][url][1] += 1 * p_I__C_G[intent]
                        self.getGamma(gammaFractions, rank, prevClick, layout, intent)[1] += 1 * p_I__C_G[intent]
                        if self.explorationBias and explorationBiasPossible:
                            eFractions[firstVerticalPos][1] += 1 * p_I__C_G[intent]
                    if c != 0:
                        prevClick = rank
            if not self.config.get('PRETTY_LOG', PRETTY_LOG):
                sys.stderr.write('E')
            # M-step
            sum_square_displacement = 0.0
            for i in possibleIntents:
                for q in xrange(max_query_id):
                    for url, aF in alphaFractions[i][q].iteritems():
                        new_alpha = aF[0] / aF[1]
                        sum_square_displacement += (self.alpha[i][q][url] - new_alpha) ** 2
                        self.alpha[i][q][url] = new_alpha
            for g in xrange(self.gammaTypesNum):
                for r in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY)):
                    for d in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY)):
                        gF = gammaFractions[g][r][d]
                        new_gamma = gF[0] / gF[1]
                        sum_square_displacement += (self.gamma[g][r][d] - new_gamma) ** 2
                        self.gamma[g][r][d] = new_gamma
            if self.explorationBias:
                for p in xrange(self.config.get('MAX_DOCS_PER_QUERY', MAX_DOCS_PER_QUERY)):
                    new_e = eFractions[p][0] / eFractions[p][1]
                    sum_square_displacement += (self.e[p] - new_e) ** 2
                    self.e[p] = new_e
            if not self.config.get('PRETTY_LOG', PRETTY_LOG):
                sys.stderr.write('M\n')
            rmsd = math.sqrt(sum_square_displacement)
            if self.config.get('PRETTY_LOG', PRETTY_LOG):
                sys.stderr.write('%d..' % (iteration_count + 1))
            else:
                print >>sys.stderr, 'Iteration: %d, ERROR: %f' % (iteration_count + 1, rmsd)
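A stripped-down sketch may help with the question above. In the quoted loop, the accumulators restart from [1.0, 2.0] every iteration, but the E-step reads the current self.alpha when it processes non-clicked results, so the value written in the M-step feeds into the next pass. The toy version below keeps only that mechanism: one alpha, fixed examination probabilities, no intents or layouts. The function `run_em`, the data, and the `gamma` values are all hypothetical.

```python
def run_em(observations, gamma, iterations=20):
    """Toy EM for a single attractiveness parameter with fixed gamma per rank."""
    alpha = 0.5
    history = []
    for _ in range(iterations):
        # Accumulators restart from the pseudo-count prior on every iteration.
        numerator, denominator = 1.0, 2.0
        # E-step: expected attractiveness given the current alpha.
        for rank, clicked in observations:
            g = gamma[rank]
            if clicked:
                numerator += 1.0
            else:
                numerator += alpha * (1 - g) / (1 - alpha * g)
            denominator += 1.0
        # M-step: the new alpha is used by the next iteration's E-step.
        alpha = numerator / denominator
        history.append(alpha)
    return history


# (rank, clicked) pairs for one query-document pair; illustrative data only.
observations = [(0, True), (0, False), (1, False), (1, False), (1, True), (0, True)]
gamma = [0.9, 0.4]
history = run_em(observations, gamma)
```

In this sketch `history` holds a different alpha after the first and second iterations, even though the accumulators are reset each time, because the non-click term depends on the previous alpha.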

Make it easy to not use intent probabilities

Currently all the new code has to support the intent probabilities. This requires thinking about correct probabilities and marginalization all the time and just makes the code complex. We should make the core easier to extend, possibly by dropping support of these probabilities.

Question about when I have multiple logs for the same query

Hi, this is just a question. I want to start experimenting with your library, but I am not sure whether the input should have one line per query or whether a query can appear on multiple lines. So if two visitors run the same query but click different results, is that two lines?

Thanks
