embed2word

Turning natural language text into numerical representations has long been a focus of NLP. The idea is to represent each word with a vector that encapsulates the word's meaning. Earlier word vectors such as GloVe proved useful, although they did not capture context: each word had a single vector regardless of the sentence around it. With the advent of large language models like GPT, contextual word representations became more prevalent. Here we use the GPT2 model to obtain word vectors. This approach was explored previously, following a conversation in 2019: huggingface/transformers#1458 (comment). This repo is inspired by the work of @MF-FOOM: https://github.com/MF-FOOM/wikivec2text

Manipulating word embedding vectors and then converting the results back to words is known as semantic arithmetic. Typically, word vectors are low-dimensional representations of tokens, and the mapping is not necessarily invertible: a modified vector rarely coincides with the vector of any vocabulary entry. Here we use GPT2 as the base model, so GPT2LMHeadModel and GPT2Tokenizer (from the Hugging Face transformers library) are required, along with the PyTorch library.

process

The idea is to start from a text in which one sentiment should be replaced with another. To do that, we first turn the text tokens and both sentiment words into vectors. We then perform the arithmetic operations on those vectors and project the results back onto the word vocabulary. GPT2 has a limited vocabulary of around 50K subword tokens, so many words are split into several tokens and have no single vector to project back to. Therefore, we do not expect it to perform well, and it does not.
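The steps above can be sketched with a toy example. The vocabulary, the hand-made vectors, and the names EMBEDDINGS, nearest_word, and swap_sentiment are illustrative assumptions, not code from this repo. In the actual setting the vectors would come from GPT2's token embedding matrix, and the repo's exact arithmetic and projection step may differ from the shift-then-nearest-neighbour reading used here.

```python
import math

# Hypothetical toy vocabulary (NOT GPT2's). Dimension 0 is sentiment polarity,
# dimension 1 marks evaluative adjectives, and each neutral word has its own axis.
NEUTRAL = ["this", "was", "a", "restaurant", "ramen", "is"]
DIM = 2 + len(NEUTRAL)


def _neutral_vector(index):
    """Build a vector that is zero everywhere except on the word's own axis."""
    vec = [0.0] * DIM
    vec[2 + index] = 5.0
    return vec


EMBEDDINGS = {word: _neutral_vector(i) for i, word in enumerate(NEUTRAL)}
EMBEDDINGS["great"] = [1.0, 1.0] + [0.0] * len(NEUTRAL)
EMBEDDINGS["good"] = [0.8, 1.0] + [0.0] * len(NEUTRAL)
EMBEDDINGS["horrible"] = [-1.0, 1.0] + [0.0] * len(NEUTRAL)


def nearest_word(vector):
    """Project a vector back onto the vocabulary by Euclidean nearest neighbour."""
    return min(EMBEDDINGS, key=lambda w: math.dist(vector, EMBEDDINGS[w]))


def swap_sentiment(tokens, source, target):
    """Shift every token vector by (target - source), then project back to words."""
    shift = [t - s for s, t in zip(EMBEDDINGS[source], EMBEDDINGS[target])]
    shifted = ([v + d for v, d in zip(EMBEDDINGS[tok], shift)] for tok in tokens)
    return [nearest_word(vec) for vec in shifted]


print(swap_sentiment(["this", "was", "a", "good", "restaurant"], "great", "horrible"))
# → ['this', 'was', 'a', 'horrible', 'restaurant']
```

In this idealized table only the sentiment words move, because the neutral words sit far from everything else. With real GPT2 embeddings the nearest neighbour is much noisier, which is consistent with the imperfect result in the example section.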

example

$ python test_embed2vec.py "This was a good resturant. Their ramen is great." "great" "horrible"

This was horrible horrible resturant. Their ramen is horrible.
