talediff

A prototype word embedding model based on type-level differential operators.

Acknowledgements

This repository contains code from GloVe -- the contents of eval are copied essentially verbatim.

Concept

This project introduces a new way to reason formally about word embedding models. Abstractly, suppose for a family of languages L we have a vector space V of language features, and a scalar function T defined on V that describes something like the "expressive power" of L -- called the tale function, "tale" being a word that has long denoted both a count and an account. We are interested in the effect of "extending" a member of L along some dimension d of V. Concretely, suppose we take a member of L, l_1, and add new words, creating l_2. These two languages have associated feature vectors in V that are equal except along the dimension d, so that l_2 = l_1 + delta_d.

Now let us ask: what is the difference between these two languages in terms of "expressive power"? The fundamental assumption of this model is that the change in expressive power along the changed dimension tells us something important about the dimension itself and its relationship to the language. This amounts to differentiation:

dT/dV_d = lim[delta_d -> 0] (T(l_1 + delta_d) - T(l_1)) / delta_d

Clearly we can do this for any of the dimensions d in V, and so we can imagine a gradient operator that tells us something about the way the tale function T behaves in the neighborhood of l_1. In particular, it tells us the direction of steepest ascent for T -- that is, the direction in language feature space V that points towards the most expressive of l_1's immediate neighbors.

Furthermore, we can imagine a second-order differentiation operator, which would take the tale function T and return a function that generates the Hessian matrix of T in the neighborhood of l_1. This would be the best linear approximation of the gradient of T in the neighborhood of l_1. Thus we can use the Hessian to tell us something interesting about the local "expressivity landscape" around l_1.
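
None of this yet depends on types. As a purely numeric illustration, here is a minimal sketch (all names hypothetical) of estimating the gradient and Hessian of some tale function by finite differences:

```python
import numpy as np

def tale(v):
    # Hypothetical stand-in for a real tale function: any smooth scalar
    # "expressive power" score over a language feature vector.
    return np.sum(v ** 2) + v[0] * v[1]

def gradient(f, v, h=1e-5):
    # Central finite differences along each feature dimension d.
    g = np.zeros_like(v)
    for d in range(len(v)):
        step = np.zeros_like(v)
        step[d] = h
        g[d] = (f(v + step) - f(v - step)) / (2 * h)
    return g

def hessian(f, v, h=1e-4):
    # The Hessian is the Jacobian of the gradient: differentiate again.
    n = len(v)
    H = np.zeros((n, n))
    for d in range(n):
        step = np.zeros(n)
        step[d] = h
        H[:, d] = (gradient(f, v + step) - gradient(f, v - step)) / (2 * h)
    return H

l_1 = np.ones(3)                     # a toy "language" feature vector
print(gradient(tale, l_1))           # steepest ascent direction: [3. 3. 2.]
print(np.round(hessian(tale, l_1)))  # [[2. 1. 0.] [1. 2. 0.] [0. 0. 2.]]
```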

But what exactly does it tell us? What are the features in V? And what would make a good tale function T?

Word Embedding Vectors Are Type-Level Hessian Matrices

This project takes inspiration from McBride (2001) and Abbott et al. (2003), who show that algebraic data types are differentiable, and that all the familiar differential operators from calculus, including the gradient and Hessian operators, follow essentially identical rules. By formulating language families as type-level functions that generate composite types from primitive types, we can construct a range of tale functions and perform exactly the above operations to generate linear approximations of their behavior in the neighborhoods of particular languages.

To begin with, it is easy to construct a tale function with the property that its Hessian is just a word-word co-occurrence count matrix for a corpus. First, define a language as the sum type over all possible sentences in the language. Define a sentence as the product type over the words that constitute it. And define a word type as a singleton type. In other words, define a language as a large, high-dimensional type-level polynomial over singleton word types.

The second derivative of a sentence type defined in this way is a Hessian matrix with zeros for all word pairs not in the sentence at all (because the derivative of a constant is zero), ones for all heterogeneous word pairs in the sentence where both of the given words appear just once, and zeros for all homogeneous word "pairs" (diagonal entries) where the given word appears just once. In (rare) cases where a word appears twice, there are potentially multiple rules for constructing a word-word co-occurrence count matrix; this formalism singles out a particular set of rules corresponding to the power rule in calculus. (For example, the diagonal entry for a word that appears twice will be two -- d/dx x^2 = 2x evaluated at x = 1.)
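
To make this concrete, here is a minimal sketch in sympy (the toy sentence and vocabulary are invented for illustration): a sentence type becomes a polynomial over singleton word types, and its Hessian at the all-ones point is exactly the count matrix described above.

```python
import sympy as sp

# Toy vocabulary: each word is a singleton type, modeled as a symbol.
the, cat, sat, on, mat = words = sp.symbols('the cat sat on mat')

# A sentence is the product type over its words; note that "the"
# appears twice, so the polynomial has a squared term.
sentence = the * cat * sat * on * the * mat  # = the**2 * cat*sat*on*mat

# The second derivative is the Hessian of the sentence polynomial,
# evaluated at the singleton point (every word type = 1).
H = sp.hessian(sentence, words).subs({w: 1 for w in words})
print(H)
# Matrix([
# [2, 2, 2, 2, 2],    <- power rule: the diagonal entry for "the" is 2,
# [2, 0, 1, 1, 1],       and "the" pairs with every other word twice;
# [2, 1, 0, 1, 1],       singly-occurring pairs get 1, and diagonals
# [2, 1, 1, 0, 1],       for singly-occurring words get 0
# [2, 1, 1, 1, 0]])
```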

Thus for each sentence type in the language we can calculate a sparse Hessian matrix. Now note that differentiation is a linear operator, and we have defined our language as a sum over sentence types. To find the Hessian of the language, then, we can simply sum up the matrices for all the sentences.
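
Continuing the sympy sketch (again with an invented toy corpus), linearity means the per-sentence Hessians sum to the Hessian of the whole language polynomial:

```python
import sympy as sp

vocab = sp.symbols('the cat sat dog ran')
the, cat, sat, dog, ran = vocab

# An invented three-sentence corpus; the language is their sum type.
corpus = [the * cat * sat, the * dog * ran, the * cat * ran]
language = sum(corpus)

at_one = {w: 1 for w in vocab}
H_language = sp.hessian(language, vocab).subs(at_one)
H_summed = sum((sp.hessian(s, vocab).subs(at_one) for s in corpus),
               sp.zeros(len(vocab)))

# Differentiation is linear, so the two are identical: a word-word
# co-occurrence count matrix for the toy corpus.
assert H_language == H_summed
print(H_language)
```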

As Goldberg and Levy (2014) and others have shown, many standard word embedding models are essentially just factorizations of this matrix using various weighting and normalization schemes. Goldberg and Levy ask "Why does this produce good word representations?" The above construction provides a new answer to that question, at least if we assume that "good" is an underspecified synonym for "linear":

The word-word co-occurrence matrix is a linear operator that provides a good local approximation of the gradient of a function describing a language's expressive power with respect to the words in the language.
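
For illustration (not the scheme this repository implements), the most familiar such factorization is a truncated SVD of the co-occurrence matrix:

```python
import numpy as np

# H: a word-word co-occurrence (Hessian) matrix; here a random
# symmetric stand-in so the example runs on its own.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(500, 500))
H = (counts + counts.T).astype(float)

# Truncated SVD: keep the top k singular directions as embeddings.
k = 50
U, S, _ = np.linalg.svd(H, full_matrices=False)
embeddings = U[:, :k] * np.sqrt(S[:k])  # one k-dim vector per word
print(embeddings.shape)  # (500, 50)
```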

Deriving Novel Embedding and Reduction Schemes

In this word embedding model, the linearity of the induced semantic structures emerges directly from the linearity of the differential operator. This is guaranteed by the logic of differentiation over types, which produces, in McBride's words, "one-hole contexts." That is, type-level differentiation gives us a way to extract linear structures from context types.

Given that linearity is a fundamental attribute of the word-word co-occurrence matrix, we can expect many different dimension-reduction schemes to work reasonably well -- even random projection, as sketched below. Furthermore, we can expect that more complex sentence types, when they can be represented as algebraic data types, will admit methods similar to the above. For example, this approach should allow tree structures (branching recursive types) to be analyzed as easily as lists (unicursal recursive types). And finally, we can expect that choosing non-singleton word types will yield new and potentially useful results.
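
A minimal sketch of that random-projection point (the function and toy data are hypothetical, not this repository's API):

```python
import numpy as np

def random_projection(H, k=100, seed=0):
    """Reduce an n x n co-occurrence matrix to n k-dimensional word
    vectors with a random Gaussian projection. By the Johnson-
    Lindenstrauss lemma, pairwise distances between rows are
    approximately preserved."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((H.shape[1], k)) / np.sqrt(k)
    return H @ R

# Toy usage with a random symmetric stand-in for a real count matrix:
counts = np.random.default_rng(1).poisson(1.0, size=(500, 500))
vecs = random_projection((counts + counts.T).astype(float), k=64)
print(vecs.shape)  # (500, 64)
```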

The code in this repository is under active development, and takes a few very small steps in the above directions.

Issues

Add an option for dense/negative random projections

When I was first thinking this through I felt very committed to the one-hole-context interpretation, to the point that I didn't want to add behaviors that seemed not to make sense in a type-theoretic context. So I didn't use negative numbers in the random projection code, because negation of types doesn't (yet) make sense. Instead I just sparsified the vectors by using nulls where there would normally be negatives. But now I think I'm more interested in developing the vector calculus interpretation, because the directional derivative idea makes sense even if we just think of the word expressivity values as ordinary scalars. And in that context the negative numbers serve a really important function! So I should add an option to turn negation back on.

Also, I should add some normalization options. In particular, in the directional derivative model, the projection vectors should probably be normalized to unit length (see the sketch below). In fact, come to think of it, this could really be causing problems!
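
A minimal sketch of what those options might look like (hypothetical function, not the current code): a dense signed random projection, with optional unit normalization of the projection vectors.

```python
import numpy as np

def projection_matrix(n, k, dense=True, normalize=True, seed=0):
    """Random projection over n word dimensions and k output dimensions.
    dense=True allows negative entries (ordinary signed random
    projection); dense=False mimics the current sparse scheme by
    zeroing out the would-be negatives. normalize=True rescales each
    word's projection vector to unit length."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((n, k))
    if not dense:
        R[R < 0] = 0.0  # nulls where there would normally be negatives
    if normalize:
        norms = np.linalg.norm(R, axis=1, keepdims=True)
        R = R / np.where(norms == 0, 1.0, norms)  # guard all-zero rows
    return R
```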

Revisit scaling

To get a pointwise mutual information matrix, take the Hessian at (1, 1, 1, ...), scale it by a diagonal 1/(word frequency) matrix on the left and the right, multiply the whole thing by the number of sentences, and take the logarithm.

Given how well this already works, there seems to be no reason to take the logarithm, and the previous step is just multiplication by a constant. So the only part that might still matter is the two scaling multiplications. Add an option to do that. It should be as simple as left-multiplying both the Hessian and the projection vectors by the scaling matrix. (The left multiplication over the projection vectors becomes a right multiplication when the projection is executed.)
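
For reference, a minimal numpy sketch of the full recipe above (hypothetical helper, not the repository's code):

```python
import numpy as np

def pmi_matrix(H, word_freqs, n_sentences, take_log=True):
    """Scale a co-occurrence (Hessian-at-ones) matrix into PMI:
    D^-1 @ H @ D^-1 * n_sentences, optionally followed by the log,
    where D is the diagonal word-frequency matrix."""
    inv_freq = 1.0 / np.asarray(word_freqs, dtype=float)
    scaled = H * inv_freq[:, None] * inv_freq[None, :] * n_sentences
    if take_log:
        with np.errstate(divide='ignore'):
            scaled = np.log(scaled)  # zero counts map to -inf
    return scaled
```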
