Colbert AI

What is Colbert AI?

Colbert AI is a deep learning language model that generates text in the style of Stephen Colbert's famous monologues.

How did we build it?

We took OpenAI's GPT-2, a state-of-the-art deep learning language model, and fine-tuned it on text from YouTube video captions.

Technical Details

Libraries used

Downloading Video Captions From YouTube

  • The playlist is specified by PLAYLIST_URL in download.py
  • The youtube_dl module downloads the captions of each video in the playlist and saves them all in the data/captions folder
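
The download step above could be sketched roughly as follows. This is an illustration under assumptions, not the contents of download.py: the helper names build_caption_options and download_captions, the English caption language, the VTT format, and the output file template are all hypothetical. The option keys themselves are standard youtube_dl options.

```python
import os


def build_caption_options(output_dir="data/captions", lang="en"):
    """Build youtube_dl options that fetch captions only (no video)."""
    return {
        "skip_download": True,      # do not download the video itself
        "writesubtitles": True,     # request uploaded captions
        "writeautomaticsub": True,  # also request auto-generated captions
        "subtitleslangs": [lang],   # assumed: English captions
        "subtitlesformat": "vtt",   # assumed caption format
        "outtmpl": os.path.join(output_dir, "%(id)s.%(ext)s"),
        "ignoreerrors": True,       # keep going if one video fails
    }


def download_captions(playlist_url, output_dir="data/captions"):
    """Download captions for every video in the playlist."""
    # Imported here so the options helper can be used without youtube_dl installed.
    import youtube_dl

    os.makedirs(output_dir, exist_ok=True)
    with youtube_dl.YoutubeDL(build_caption_options(output_dir)) as ydl:
        ydl.download([playlist_url])
```

Calling download_captions(PLAYLIST_URL) would then write one caption file per video into data/captions.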

Generating Text Corpus from Captions

  • We kept only the text where the speaker was Stephen Colbert
  • The individual caption files were merged into a single file, separated by an end-of-text marker
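
The merge step could look something like the sketch below. It assumes the captions have already been reduced to plain .txt files, one per video, and that the marker is GPT-2's "<|endoftext|>" token. The function name merge_captions is hypothetical, and the speaker filtering is not shown because it depends on the caption format.

```python
from pathlib import Path

# Assumption: GPT-2's end-of-text token is used as the separator.
END_OF_TEXT = "<|endoftext|>"


def merge_captions(caption_dir, output_file, marker=END_OF_TEXT):
    """Concatenate all .txt caption files into one corpus file.

    Returns the number of non-empty caption files that were merged.
    """
    texts = []
    for path in sorted(Path(caption_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty caption files
            texts.append(text)
    # Put the marker on its own line between consecutive transcripts.
    corpus = f"\n{marker}\n".join(texts)
    Path(output_file).write_text(corpus, encoding="utf-8")
    return len(texts)
```

Placing the marker between transcripts gives the model a signal for where one monologue ends and the next begins.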

Usage Guide:

Using Colbert-AI to generate text based on Stephen Colbert's monologues.

  • Clone this repository, using:
    git clone https://github.com/NextTechLabAP/Colbert-AI.git
    
  • Install the requirements listed in requirements.txt, using:
    pip install -r requirements.txt
    
  • Run python3 download.py to download the captions
  • Run python3 caption_processing.py to process the captions
  • Open the Colbert-AI-v2.ipynb Jupyter Notebook
  • Change the path to point to captions.txt
  • Run all cells

Using Colbert-AI to generate text based on a custom text corpus

  • Clone this repository, using:
    git clone https://github.com/NextTechLabAP/Colbert-AI.git
    
  • Open the Colbert-AI-v2.ipynb Jupyter Notebook
  • Change the path from captions.txt to the custom text corpus file
  • Run all cells

The Model

GPT-2

GPT-2 comes in four sizes:

  • GPT-2 Small (124M Model)
  • GPT-2 Medium (345M Model)
  • GPT-2 Large (774M Model)
  • GPT-2 Extra Large (1558M Model)

We used GPT-2 Medium because we wanted a lighter model that we could fine-tune further.
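
As a rough illustration, the four sizes map to checkpoint names in the Hugging Face transformers library as shown below. This is a sketch that assumes transformers is the loading library, which the text above does not state. GPT2_CHECKPOINTS and load_gpt2 are hypothetical names introduced for the example.

```python
# Size labels from the list above, mapped to Hugging Face checkpoint names.
GPT2_CHECKPOINTS = {
    "124M": "gpt2",
    "345M": "gpt2-medium",
    "774M": "gpt2-large",
    "1558M": "gpt2-xl",
}


def load_gpt2(size="345M"):
    """Load the tokenizer and language-model head for a given GPT-2 size."""
    # Assumed dependency; imported lazily so the mapping is usable without it.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    name = GPT2_CHECKPOINTS[size]
    tokenizer = GPT2Tokenizer.from_pretrained(name)
    model = GPT2LMHeadModel.from_pretrained(name)
    return tokenizer, model
```

With this mapping, load_gpt2("345M") would fetch GPT-2 Medium on first use.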

Functions Used

choose_from_top(Probability, N):
  • Selects the top N tokens from the probability list, then picks one token at random according to the distribution over those N tokens.

generate_text(Input_Text, Length):
  • At each prediction step, the GPT-2 model needs all of the previous sequence elements to predict the next one. This function tokenizes the starting input text; then, in a loop, it predicts one new token per step and appends it to the sequence, which is fed back into the model on the next step. At the end, the token list is decoded back into text.
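
The two functions can be sketched in plain Python as follows. This is not the notebook's code: to keep the example self-contained, the model and tokenizer are passed in as the hypothetical callables next_token_probs, encode, and decode, and sampling uses the standard library rather than a tensor library.

```python
import random


def choose_from_top(probs, n=5, rng=random):
    """Sample one token ID from the n most probable tokens."""
    # Indices of the n largest probabilities.
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    top_probs = [probs[i] for i in top_ids]
    # choices() treats weights as relative, so no explicit renormalization is needed.
    return rng.choices(top_ids, weights=top_probs, k=1)[0]


def generate_text(input_text, length, next_token_probs, encode, decode, n=5):
    """Extend input_text by `length` tokens, one sampled token at a time."""
    tokens = encode(input_text)
    for _ in range(length):
        # The model is given the whole sequence generated so far.
        probs = next_token_probs(tokens)
        tokens.append(choose_from_top(probs, n))
    return decode(tokens)
```

Restricting sampling to the top N tokens keeps some randomness in the output while preventing very unlikely tokens from being chosen.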

Generating Text

Text can be generated using generate_text. One of the text samples generated from the prompt "Artificial Intelligence is ":
  • "Artificial general intelligence is the most likely future of the human race; it's a science which is not just possible but inevitable."

Fine-Tuning

  • The dataset is preprocessed and prepared in the Text_Corpus class.
  • Variable hyperparameters:
    • BATCH_SIZE = 1
    • EPOCHS = 30
    • LEARNING_RATE = 1e-5
    • WARMUP_STEPS = 10000
    • MAX_SEQ_LEN = 550
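
To show how LEARNING_RATE and WARMUP_STEPS might interact, here is a minimal sketch of a linear warmup schedule. The list above does not say which scheduler was used, so the linear shape and the function name warmup_lr are assumptions made for illustration.

```python
BATCH_SIZE = 1
EPOCHS = 30
LEARNING_RATE = 1e-5
WARMUP_STEPS = 10000
MAX_SEQ_LEN = 550


def warmup_lr(step, base_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimizer step.

    Assumed schedule: rises linearly from 0 to base_lr over
    warmup_steps, then stays constant.
    """
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps
```

Under this assumption the rate is half of LEARNING_RATE at step 5000 and reaches the full value at step 10000.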

Training the Model

We trained the model and saved its weights after each epoch, then generated text samples from the saved weights.

Results
  • Now, there are some people out there who think trump's a bad person. For instance, this weekend, I watched the presidential candidate's first candidate round-up, and he was named "The man who can't get anything he wants to get right." ( cheers and applause ) that's a good quality. That's a good quality, because the only person who can't get anything right is Donald Trump. ( laughter ) and I'm not sure he's read the new book, "The man who can't get anything wrong."

  • This is a big day for the president of the united states. Trump is about to be released from impala. (laughter) (applause) and this is huge news because this is a big week for him because the court has decided that he can no longer use the n-word, because, in a letter to his staff, the president said, "If I didn't use the n-word, then why are all the other white house staff members calling me a cuck?!" (laughter) (applause) (cheers and applause) (piano riff) and trump's not the only person who has been in jail for the "N-word." last week, Austin turns out to be a founder of "N-god," which was also the name of a movie. (cheers and applause) and now trump is going to have a new "N-god." (laughter) and, of course, the "N-god"

Contributors

  • cshubhamrao
  • dexter2389
  • iam-abbas
