what_are_embeddings's Introduction

What are embeddings?

This repository contains the generated LaTeX document, website, and complementary notebook code for "What are Embeddings".

Abstract

Over the past decade, embeddings (numerical representations of non-tabular machine learning features used as input to deep learning models) have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data. As the volume, velocity, and variety of data captured by modern applications have exploded, creating approaches specifically tailored to scale has become increasingly important.

Google's Word2Vec paper made an important step in moving from simple statistical representations to the semantic meaning of words. The subsequent rise of the Transformer architecture and transfer learning, as well as the latest surge in generative methods, has enabled the growth of embeddings as a foundational machine learning data structure. This survey paper aims to provide a deep dive into what embeddings are, their history, and usage patterns in industry.

Running

The LaTeX document is written in Overleaf and deployed to GitHub, where it's compiled via Actions. The site is likewise generated via Actions from the site branch. The notebooks are flying fast and free and not under any kind of CI whatsoever.

Contributing

If you have any changes that you'd like to make to the document, including clarifications or typo fixes, you'll need to build the LaTeX artifact. I use GitHub to track issues and feature requests, as well as to accept pull requests. Pull requests are the best way to propose changes to the codebase:

  1. Fork the repo and create your branch from main.
  2. Make your changes in your fork.
  3. Make sure that your LaTeX document compiles. The GitHub Action that builds the PDF is set to run on pull requests into main.
  4. Ensure that the document compiles to a PDF correctly and inspect the output.
  5. Make sure your code lints.
  6. Issue that pull request!

Citing

@software{Boykis_What_are_embeddings_2023,
author = {Boykis, Vicki},
doi = {10.5281/zenodo.8015029},
month = jun,
title = {{What are embeddings?}},
url = {https://github.com/veekaybee/what_are_embeddings},
version = {1.0.1},
year = {2023}
}

what_are_embeddings's People

Contributors

akgerber, ankush-chander, arfri, balpha, barrald, bfgray3, dleybz, emlyn, graceunderfiero, krishanbhasin, michal-mmm, moshekaplan, pietroppeter, rohanalexander, schicho, tbayer, veekaybee

what_are_embeddings's Issues

License clarification

On the web page (source here) the license is stated as CC-BY-SA 3.0 (Creative Commons, By Attribution, Share Alike 3.0) yet in the paper (here) the license is CC-BY-NC-SA 3.0 (Creative Commons, By Attribution, Non-Commercial, Share Alike 3.0).

If the intent is to restrict commercial re-use, then the text on the web page should be changed to CC-BY-NC-SA 3.0 to reflect what license the PDF document is under. If the intent is to allow for commercial re-use and to make the document libre/free in the open source sense, the license should be changed to CC-BY-SA 3.0 in the PDF document.

Give an intuition of the `dict_a` and `dict_b` in the tf-idf example

Loving your document so far, Vicki! (filing this as I read)

In the tf-idf example on page 33, the objects dict_a and dict_b appear all of a sudden. I could not work out from the example what they represent and had to open the full example on GitHub to understand. Even though it is a truncated implementation, I think some explanation of these objects would enhance the example, if nothing else by adding a code comment saying what they are and where they come from.
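As an illustration of the kind of comment being requested, here is a minimal sketch under one assumption: that dict_a and dict_b are per-document word-count dictionaries keyed by the shared vocabulary. That reading is a guess, not taken from the paper, and the two documents and the names `document_a`, `document_b`, and `vocabulary` are hypothetical.

```python
# Hypothetical documents; the paper's example text may differ.
document_a = "hold fast to dreams".split()
document_b = "dreams die fast".split()

# Shared vocabulary across both documents.
vocabulary = set(document_a) | set(document_b)

# Assumption: dict_a and dict_b map every vocabulary word to its count
# in document A and document B respectively (0 if the word is absent).
dict_a = {word: document_a.count(word) for word in vocabulary}
dict_b = {word: document_b.count(word) for word in vocabulary}

print(dict_a)
print(dict_b)
```

If that reading is right, a one-line comment above the first use of each dictionary would be enough to make the truncated example self-explanatory.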

Gensim Link broken?

Hi! I'm working through the embeddings write up and it's a fantastic read.

In the copy that I downloaded a little while ago, the hyperlink for "Gensim" on page 44 just goes to the first page of the document. I went to make a PR, but this line looks to have the correct URL. I downloaded a fresh copy from the main page, but the link still goes to the top of the document and isn't formatted as a hyperlink.

I think the main site needs to get a fresh version of the pdf?

Kindle friendly PDF?

Hello!

Thanks for writing the book - the stuff I've read so far (the intro) seems very well explained and easy to read!

However, I'm trying to read it on a Kindle Paperwhite and everything is just tiny, as the PDF pages are flowed for A4 pages.
Since it appears to be written in TeX, could you please generate a Kindle-friendly version? The screen dimensions are closest to B7 (and please use narrow margins, the device has a large bezel itself :) ).

Thanks again in any case!

Formula 7

I believe formula 7 should rather be a single vector: [12, 2, 5].
We only start to have tensors when stacking the embeddings of different users. Am I missing something?
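The distinction being drawn can be sketched in a few lines of plain Python. This is only an illustration of the question, not a statement of what formula 7 contains: the second user's values and the helper `shape` are hypothetical. Strictly speaking, a vector is also a rank-1 tensor, so the sketch shows only how the number of dimensions grows when vectors are stacked.

```python
def shape(nested):
    """Return the dimensions of a regularly nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)

user_vector = [12, 2, 5]             # one user's features: a single vector
other_user = [3, 0, 7]               # hypothetical second user
stacked = [user_vector, other_user]  # stacking users adds a dimension

print(shape(user_vector))
print(shape(stacked))
```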

text is a bit muddled in figure 26

It looks like there's a cut-and-paste mishap in figure 26, starting at line 1202 of the tex:

tfidf_df = pd.DataFrame(vector.toarray(), index=text_titles, columns=vectorizer.get_feature_names_out())

tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()
tfidf_df.T

# How common or unique a word is in a given document wrt to the vocabulary
dreams_langstonhughes    quote_william_blake    00_Document Frequency
bird                    0.172503              0.197242              2.0
broken                  0.242447              0.000000              1.0
cannot                  0.242447              0.000000              1.0
die                     0.242447              0.000000              1.0
\end{minted}

Possibly a bug in tf-idf

I believe there may be a bug in the idf function in https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_24_tf_idf_from_scratch.ipynb

Namely, the number of documents in which a term appears does not seem to be actually computed, just initialized to 0.

You can see that the results are wrong by looking at the printed tf-idf values for document 0 for the words "a" and "bird": the printed values are equal, which doesn't make sense. Although tf is the same for both (each appears once in document 0), idf should differ, since "a" appears only in document 0 while "bird" appears in both document 0 and document 1.
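The expected behavior can be sketched as follows. This is not the notebook's code: the toy corpus, the function names, and the unsmoothed formula `log(N / df)` are assumptions for illustration, and the notebook may use different text and a different idf variant. The point is only that document frequency has to be counted before it is used.

```python
import math

# Hypothetical toy corpus: "a" appears only in document 0,
# "bird" appears in both documents.
documents = [
    "a bird cannot fly".split(),
    "bird song".split(),
]

def term_frequency(term, document):
    """Share of the document's tokens that are `term`."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus):
    """log(N / df), where df is the number of documents containing `term`."""
    # Count the documents that contain the term rather than leaving df at 0.
    doc_freq = sum(1 for document in corpus if term in document)
    if doc_freq == 0:
        return 0.0  # term absent from the corpus; avoid division by zero
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, document, corpus):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

score_a = tf_idf("a", documents[0], documents)
score_bird = tf_idf("bird", documents[0], documents)
print(score_a, score_bird)
```

With document frequency counted, the two words get different scores even though their term frequencies in document 0 are equal, which is the behavior the issue expects.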

Please include a link to the website in the PDF

I really appreciate your paper on embeddings, but I feel that a paragraph in the introduction that links back to the book's website, along with a version indicator, would help me establish whether the PDF I have is current and can be read as is, or whether it is outdated and should be deleted and re-downloaded.

Also, it helps if I send the PDF to someone instead of just sending them a link to your website. =)
