📃 Intuition

I'm building my own multimedia GPT, a competitor to MERLOT Reserve and Vid2Seq. It is pre-trained from scratch on YouTube data, mostly the YT-1B dataset of 20M curated YouTube videos containing significant spoken language (English only).

📜 arXiv: https://arxiv.org/abs/2304.10505

👉 Project highlights & intuition with photos, check it out: https://twitter.com/KastanDay/status/1595991960380411905

My design follows the "Embedding + Trunk + Head" pattern I first noticed succeeding in DETR and AlphaFold2. Now in early 2023, it's successful in PaLM-E and Vid2Seq from Google, Prismer from NVIDIA, and many more listed in my Twitter announcement.

🚀 Quickstart

Install Git LFS

# Install `git-lfs` with Homebrew
brew install git-lfs
# -OR- with conda
conda install -c conda-forge -y git-lfs

Then initialize Git LFS

git-lfs install

Install ffmpeg

A simple install should work fine, despite how convoluted the library tends to be.

# Preferred
sudo apt update && sudo apt install ffmpeg
# -OR-
# The conda method is not as well tested for this project
conda install -c conda-forge -y ffmpeg
# An update may be necessary to get all of ffmpeg's codec-specific extensions, which we need.
# It solves this error in parallel_whisper.py: ❌❌Error during whisper: Expecting value: line 1 column 1 (char 0)
conda update ffmpeg
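
Before running the preprocessing scripts, it can help to confirm that an `ffmpeg` binary is actually visible from the Python environment. The following is a minimal sketch, not part of the repository; the helper name `ffmpeg_version` is hypothetical, and it only checks that the binary is on `PATH` and responds to `-version`, not that every codec the project needs is present.

```python
import shutil
import subprocess


def ffmpeg_version(binary: str = "ffmpeg"):
    """Return the first line of `<binary> -version`, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of this repository.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    # check=False: a non-zero exit code should not raise, we just report what we got.
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(ffmpeg_version() or "ffmpeg not found on PATH")
```

If this prints "ffmpeg not found on PATH" inside a conda environment, the conda install above probably went into a different environment than the one currently active.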

Clone the repo with our custom submodules

git clone --recurse-submodules git@github.com:KastanDay/video-pretrained-transformer.git

Install pip requirements

pip install -r ./requirements.txt

Later, if updates are made to submodules, you can pull new changes using:

git submodule update --remote

We use submodules because we needed to modify the internal logic of three libraries used in preprocessing: Lhotse (to make it faster), OpenPSG, and Transformers (to modify the T5 implementation to support modality encodings).
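
If the clone was made without `--recurse-submodules`, the modified libraries will be missing. One way to check is to read the output of `git submodule status`, where each line starts with a one-character state: a space for an up-to-date checkout, `-` for a submodule that is not initialized, `+` for a checked-out commit that differs from the one recorded in the superproject, and `U` for merge conflicts. The parser below is a small sketch of that check; the function name and the submodule paths in the usage note are hypothetical, and it assumes paths contain no spaces.

```python
def parse_submodule_status(output: str) -> dict:
    """Map each submodule path to a readable state from `git submodule status` output.

    Hypothetical helper for illustration; assumes submodule paths contain no spaces.
    """
    states = {
        " ": "ok",
        "-": "not initialized",
        "+": "commit differs from superproject",
        "U": "merge conflict",
    }
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # First character is the state flag; the rest is "<sha1> <path> [(describe)]".
        prefix, fields = line[0], line[1:].split()
        result[fields[1]] = states.get(prefix, "unknown")
    return result
```

For example, feeding it the captured output of `subprocess.run(["git", "submodule", "status"], capture_output=True, text=True).stdout` from the repository root would return something like `{"lhotse": "not initialized", ...}` (paths illustrative). Any entry reported as "not initialized" can be fixed with `git submodule update --init --recursive`.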

Install is complete!

Progress

(Oct 2022) Start of project.
(Dec 2022) MVP completed, but messed up the evaluation.
(Dec 2022) Migrated all data to Deeplake database library, overall much cleaner & more reliable for distributed database updates.
(Jan 2023) Migrated all training logic to Composer, by MosaicML. Super cool library for efficient LLM training, even of Hugging Face models.
(Jan 2023) Finished scaling up distributed pre-processing (i.e. inference with Whisper, Flan-T5, OpenPSG and CLIP). Rock-solid Deeplake distributed dataset.append() operations on any size SLURM cluster.
(Feb 2023) Tested different backbones: T5 vs T5 v1.1 vs Flan-T5. Somehow, v1.1 was terrible and Flan-T5 was by far the best, as suggested by another finetuning study. The author confirmed this in reply to my follow-up question.
(Mar 2023) WIP: TVQA evaluation. Need to fit more video frames into our 1024-token context window, probably by using fewer final hidden states from CLIP.
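
The trade-off in the Mar 2023 entry comes down to simple arithmetic. The sketch below assumes each frame contributes a fixed number of CLIP hidden states to the sequence and that the text takes a fixed number of positions. The 512-position text budget and the per-frame counts are illustrative assumptions, not the project's actual settings: 257 and 50 are the full final hidden-state counts of CLIP ViT-L/14 and ViT-B/32 at 224px (patches plus the class token), and 1 stands for a single pooled embedding per frame.

```python
def max_frames(context_len: int, text_tokens: int, states_per_frame: int) -> int:
    """Number of whole frames that fit after reserving room for text.

    Hypothetical helper for illustration; all inputs are assumptions.
    """
    if states_per_frame <= 0:
        raise ValueError("states_per_frame must be positive")
    remaining = max(context_len - text_tokens, 0)
    return remaining // states_per_frame


if __name__ == "__main__":
    # Assumed budget: 1024 positions total, 512 of them reserved for text.
    for states in (257, 50, 1):
        print(states, "states/frame ->", max_frames(1024, 512, states), "frames")
```

Under these assumptions, keeping every hidden state leaves room for only one ViT-L/14 frame, which is why using fewer hidden states per frame is the obvious lever.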

Up next:

Find a better scene-graph implementation: only 55 classes from COCO is not enough for YouTube data. Ours relies on Detectron2 as a base, which is great for in-domain objects but not general. I think the best we can do is to use the 1k classes from ImageNet.
Totally reimplement the sound/audio model to move away from Whisper. I think Google's AudioSet, with 600+ classes based on YouTube data, will enable the best models. Here's my favorite from that competition.
