
kd's Introduction

Areas for research:

  1. CLIP in point clouds/3D.
  2. Open-Vocabulary Object Detection (OVD).
  3. Efficient CLIP training (better use of computation or data).
  4. Applying CLIP models to narrow domains, such as Human-Object Interaction detection, crowd counting, etc.

Papers from CVPR2023:

(this list may be missing some papers)

Pretraining CLIP models:

| Title | Description | Code |
| --- | --- | --- |
| DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training | Reduces memory consumption by decomposing the contrastive-loss gradient. | code |
| Scaling Language-Image Pre-training via Masking | Adds masked image modelling to the image branch of CLIP, improving speed, memory use, and performance. | code |
| Non-Contrastive Learning Meets Language-Image Pre-Training | Adds the loss introduced in SwAV (based on cluster-assignment agreement) on top of CLIP's contrastive loss. Interestingly, the non-contrastive loss alone gives poor zero-shot performance, but combined with the contrastive loss (0.7 SwAV + 0.3 contrastive) it outperforms the contrastive loss on its own. It also reduces the data requirement (trained on only 35 million pairs) and the batch size (4096 compared to 32K). | code |
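The 0.7/0.3 loss mix described in the last row can be sketched as follows (a minimal NumPy sketch, not the paper's code; the function names and the temperature value are illustrative assumptions):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image/text embeddings;
    matched pairs sit on the diagonal of the similarity matrix."""
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature    # (B, B) similarities

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))       # diagonal = correct pairs

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def combined_loss(swav_loss, contrastive_loss, w_swav=0.7):
    """Weighted mix reported in the paper: 0.7 * SwAV + 0.3 * contrastive."""
    return w_swav * swav_loss + (1.0 - w_swav) * contrastive_loss
```

The SwAV term itself (cluster-assignment agreement) is omitted here; only the weighting scheme is shown.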

Finetuning CLIP models:

| Title | Description | Code |
| --- | --- | --- |
| Learning to Name Classes for Vision and Language Models | Creates learnable token embeddings for the class names in an otherwise frozen CLIP model, reducing the need for prompt engineering. | NA |
| Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models | When fine-tuning the model with a linear classifier, training it with multiple modalities is beneficial. | NA |
| MaPLe: Multi-modal Prompt Learning | Learnable prompts on both the image and text branches; the image prompts are derived from a linear layer that takes the text prompts as input. | code |
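The prompt coupling in MaPLe's row can be sketched roughly as follows (dimensions and class names are hypothetical, not from the paper's code): only the text-branch prompts are learned directly, and the image-branch prompts are computed from them by a linear projection.

```python
import numpy as np

class CoupledPrompts:
    """Sketch of MaPLe-style coupled prompting (illustrative dimensions)."""

    def __init__(self, n_prompts=4, d_text=512, d_visual=768, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable text prompt tokens (the only directly learned prompts).
        self.text_prompts = rng.normal(scale=0.02, size=(n_prompts, d_text))
        # Linear coupling layer: maps each text prompt to a visual prompt.
        self.W = rng.normal(scale=0.02, size=(d_text, d_visual))
        self.b = np.zeros(d_visual)

    def visual_prompts(self):
        # Image-branch prompts are a function of the text-branch prompts,
        # so the two branches stay synchronised during training.
        return self.text_prompts @ self.W + self.b  # (n_prompts, d_visual)
```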

CLIP in video:

| Title | Description | Code |
| --- | --- | --- |
| Fine-Tuned CLIP Models Are Efficient Video Learners | Adapts CLIP for videos (ViFi-CLIP). Claims that frame-level CLIP embeddings, though processed independently, can still capture temporal dependencies, and that instead of devising specific modules to address temporal dependency in videos, simply fine-tuning CLIP on videos generalises well. Uses temporal pooling: embeddings from T frames are pooled into one embedding that is used in the contrastive learning process, which is probably why the embeddings stay consistent with image-based CLIP. | code |
| Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting | Performs prompt learning on video data to better fine-tune an image-based CLIP model for videos. Same authors as ViFi-CLIP (above). Need to look into how the prompts are actually learned. | code |
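The temporal pooling step described for ViFi-CLIP can be sketched as follows (a NumPy sketch under the assumption that the pooling is a simple mean over frames; the function name is illustrative):

```python
import numpy as np

def pool_frame_embeddings(frame_embeddings):
    """Average T per-frame CLIP embeddings into one video-level embedding,
    then re-normalise it for use in the contrastive objective."""
    pooled = frame_embeddings.mean(axis=0)   # (T, D) -> (D,)
    return pooled / np.linalg.norm(pooled)
```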

Crowd Counting:

| Title | Description | Code |
| --- | --- | --- |
| CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model | Crowd counting with CLIP: fine-tunes CLIP for the counting task using a ranking loss, without using ground-truth people counts for training. Uses a sequential prompting setup to filter out image parts, keeping only those containing people's heads for counting. | code |
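A generic pairwise ranking loss of the kind CrowdCLIP builds on can be sketched as follows (an illustrative margin ranking loss, not the paper's exact formulation): predicted counts for nested crops, ordered from innermost to outermost, should be non-decreasing, since a larger crop contains at least as many people as any crop inside it.

```python
def crop_ranking_loss(scores, margin=0.0):
    """Hinge penalty whenever a smaller (inner) crop scores higher than
    the larger crop that contains it. `scores` is ordered inner -> outer."""
    pairs = list(zip(scores, scores[1:]))
    loss = sum(max(0.0, inner - outer + margin) for inner, outer in pairs)
    return loss / max(len(pairs), 1)
```

No count labels appear anywhere in this objective, which is what makes the training unsupervised with respect to ground-truth counts.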

Generative:

| Title | Description | Code |
| --- | --- | --- |
| ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency | ... | ... |
| CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language | ... | ... |
| CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search | ... | ... |
| Local 3D Editing via 3D Distillation of CLIP Knowledge | ... | ... |

Continual learning:

| Title | Description | Code |
| --- | --- | --- |
| AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning | Uses prompt tuning with CLIP to tackle continual learning; heavily inspired by CoOp. | code |

3D and Point-cloud:

| Title | Description | Code |
| --- | --- | --- |
| CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data | ... | ... |

Detection:

| Title | Description | Code |
| --- | --- | --- |
| DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment | | |
| CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation | | |
| WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation | | |
| HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models | | |

Contributors: faisal-hajari, sulrash, sara-sulaiman
