video-pretrained-transformer's Introduction

📃 Intuition

I'm building my own multimedia GPT, a competitor to MERLOT Reserve & Vid2Seq. It's pre-trained from scratch on YouTube data, mostly the YT-1B dataset of 20M curated YouTube videos containing significant spoken language (English only).

📜 arXiv: https://arxiv.org/abs/2304.10505

👉 Project highlights & intuition with photos, check it out: https://twitter.com/KastanDay/status/1595991960380411905

(No 3D) VPT Architecture Diagram

My design follows the "Embedding + Trunk + Head" pattern I first noticed succeeding in DETR and AlphaFold2. Now in early 2023, it's successful in PaLM-E and Vid2Seq from Google, Prismer from Nvidia, and many more listed in my Twitter announcement.
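To make the "Embedding + Trunk + Head" pattern concrete, here is a toy sketch in plain Python. It is not this project's implementation: the `Segment` class, the `embed`, `trunk`, and `head` functions, and the stand-in encoders are all hypothetical names chosen for illustration. The real trunk would be a transformer rather than a simple concatenation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = List[float]


@dataclass
class Segment:
    modality: str          # e.g. "text" or "frame" (hypothetical labels)
    vectors: List[Vector]  # one embedding per token / frame


def embed(inputs: Dict[str, object],
          encoders: Dict[str, Callable[[object], List[Vector]]]) -> List[Segment]:
    """Embedding stage: each modality is encoded by its own encoder."""
    return [Segment(name, encoders[name](raw)) for name, raw in inputs.items()]


def trunk(segments: Sequence[Segment]) -> List[Vector]:
    """Trunk stage (placeholder): join all modalities into one sequence.
    A real trunk would run a transformer over this sequence."""
    return [vec for seg in segments for vec in seg.vectors]


def head(sequence: List[Vector]) -> Vector:
    """Head stage (placeholder): mean-pool the sequence into one vector."""
    dim = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]


# Stand-in encoders so the sketch runs without any model weights.
toy_encoders = {
    "text": lambda s: [[float(len(word)), 0.0] for word in s.split()],
    "frame": lambda frames: [[0.0, float(f)] for f in frames],
}

segments = embed({"text": "hello world", "frame": [1, 2]}, toy_encoders)
sequence = trunk(segments)
pooled = head(sequence)
```

The point of the pattern is that the embedding stage can swap in any pre-trained encoder per modality, while the trunk and head only ever see a single sequence of vectors.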

🚀 Quickstart

  1. Install Git LFS
# Install `git-lfs` (via apt or brew)
brew install git-lfs
-OR-
conda install -c conda-forge -y git-lfs

Then initialize Git LFS:

git-lfs install
  2. Install ffmpeg

A simple install should work fine, despite how convoluted the library tends to be.

# preferred
sudo apt update && sudo apt install ffmpeg
-OR-
# conda method is not as well tested for this project
conda install -c conda-forge -y ffmpeg
# An update command might be necessary to get all of ffmpeg's codec-specific extensions, which we need.
# This resolves the following error in parallel_whisper.py: ❌❌Error during whisper: Expecting value: line 1 column 1 (char 0)
conda update ffmpeg
  3. Clone the repo with our custom submodules
git clone --recurse-submodules git@github.com:KastanDay/video-pretrained-transformer.git
  4. Install pip requirements
pip install -r ./requirements.txt

Later, if updates are made to submodules, you can pull new changes using:

git submodule update --remote

We use submodules because we needed to modify the internal logic of three libraries used in preprocessing: Lhotse (to make it faster), OpenPSG, and Transformers (to modify the T5 implementation to support modality encodings).

Install is complete!
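As a quick sanity check after the steps above, a small script like the following can confirm that the command-line prerequisites are on your PATH. This is a minimal sketch, not part of the repo; the function name `missing_tools` is hypothetical. It only checks that the executables exist, so it will not detect a missing ffmpeg codec extension.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Executables used by the Quickstart steps.
    missing = missing_tools(["git", "git-lfs", "ffmpeg"])
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```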

Progress

  1. (Oct 2022) Start of project.
  2. (Dec 2022) MVP completed, but messed up the evaluation.
  3. (Dec 2022) Migrated all data to Deeplake database library, overall much cleaner & more reliable for distributed database updates.
  4. (Jan 2023) Migrated all training logic to Composer, by MosaicML. Super cool library for efficient LLM training, even of Hugging Face models.
  5. (Jan 2023) Finished scaling up distributed pre-processing (i.e. inference w/ Whisper, Flan-T5, OpenPSG and CLIP). Rock solid Deeplake distributed dataset.append() operations on any size SLURM cluster.
  6. (Feb 2023) Tested different backbones: T5 vs T5 v1.1 vs Flan-T5. Somehow, v1.1 was terrible and Flan-T5 was by far the best, as another finetuning study also suggested. Its author confirmed this when I followed up.
  7. (Mar 2023) WIP: TVQA evaluation. Need to fit more video frames into our 1024 context window, probably by using fewer final hidden states from CLIP.
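The context-window trade-off in the last item can be sketched with some simple arithmetic. The 1024 figure comes from the text above; everything else is an assumption for illustration: a ViT-style CLIP encoder with 224x224 input and 14x14 patches (which yields 256 patch states plus one class token), and a hypothetical budget of 256 positions reserved for text. The function names are also hypothetical.

```python
def clip_states_per_frame(image_size: int = 224, patch_size: int = 14,
                          cls_token: bool = True) -> int:
    """Number of final hidden states a ViT-style encoder emits per image."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if cls_token else 0)


def frames_that_fit(context_len: int, text_tokens: int,
                    states_per_frame: int) -> int:
    """How many frames fit in the context once text positions are reserved."""
    if states_per_frame <= 0:
        raise ValueError("states_per_frame must be positive")
    remaining = max(context_len - text_tokens, 0)
    return remaining // states_per_frame


CONTEXT_LEN = 1024  # context window mentioned above
TEXT_TOKENS = 256   # hypothetical share reserved for text

all_states = clip_states_per_frame()                               # 257
frames_full = frames_that_fit(CONTEXT_LEN, TEXT_TOKENS, all_states)  # 2
frames_16 = frames_that_fit(CONTEXT_LEN, TEXT_TOKENS, 16)            # 48
```

Under these assumptions, keeping every hidden state leaves room for only two frames, while keeping 16 states per frame leaves room for 48, which is why using fewer final hidden states per frame is attractive.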

Up next:

  • Find a better scene-graph implementation: only 55 classes from COCO is not enough for YouTube data. Ours relies on Detectron2 as a base, which is great for in-domain objects but not general. I think the best we can do is to use the 1k classes from ImageNet.
  • Totally reimplement the sound/audio model to move away from Whisper. I think Google's AudioSet, with 600+ classes based on YouTube data, will enable the best models. Here's my favorite from that competition.

video-pretrained-transformer's People

Contributors

kastanday, danielchristl, rcsalvi
