
kd's Introduction

Areas for research:

  1. CLIP in point clouds/3D.
  2. Open-Vocabulary Object Detection (OVD).
  3. Efficient CLIP training (better use of computation or data).
  4. Applying CLIP models to narrow domains, such as Human-Object Interaction detection, crowd counting, etc.

Papers from CVPR2023:

(this list may be missing some papers)

Pretraining CLIP models:

| Title | Description | Code |
| --- | --- | --- |
| DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training | Reduces memory consumption by decomposing the contrastive-loss gradient. | code |
| Scaling Language-Image Pre-training via Masking | Adds masked image modelling to the image branch of CLIP, improving speed, memory use, and performance. | code |
| Non-Contrastive Learning Meets Language-Image Pre-Training | Adds the loss introduced in SwAV (based on cluster-assignment agreement) on top of CLIP's contrastive loss. Interestingly, the non-contrastive loss alone gives poor zero-shot performance, but combined with the contrastive loss (0.7 SwAV + 0.3 contrastive) it outperforms the contrastive loss on its own. It also reduces the data requirement (trained on only 35 million pairs) and the batch size (4096 compared to 32K). | code |
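The 0.7/0.3 loss mix described in the last row can be sketched as follows (a minimal NumPy sketch, not the paper's code; the function names and the temperature value are illustrative assumptions):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image/text embeddings;
    matched pairs sit on the diagonal of the similarity matrix."""
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature    # (B, B) similarities

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))       # diagonal = correct pairs

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def combined_loss(swav_loss, contrastive_loss, w_swav=0.7):
    """Weighted mix reported in the paper: 0.7 * SwAV + 0.3 * contrastive."""
    return w_swav * swav_loss + (1.0 - w_swav) * contrastive_loss
```

The SwAV term itself (cluster-assignment agreement) is omitted here; only the weighting scheme is shown.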

Finetuning CLIP models:

| Title | Description | Code |
| --- | --- | --- |
| Learning to Name Classes for Vision and Language Models | Creates learnable token embeddings for the class names in an otherwise frozen CLIP model, reducing the need for prompt engineering. | NA |
| Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models | When fine-tuning the model with a linear classifier, training it with multiple modalities is beneficial. | NA |
| MaPLe: Multi-modal Prompt Learning | Learnable prompts on both the image and text branches; the image prompts are derived from a linear layer that takes the text prompts as input. | code |
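The prompt coupling in MaPLe's row can be sketched roughly as follows (dimensions and class names are hypothetical, not from the paper's code): only the text-branch prompts are learned directly, and the image-branch prompts are computed from them by a linear projection.

```python
import numpy as np

class CoupledPrompts:
    """Sketch of MaPLe-style coupled prompting (illustrative dimensions)."""

    def __init__(self, n_prompts=4, d_text=512, d_visual=768, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable text prompt tokens (the only directly learned prompts).
        self.text_prompts = rng.normal(scale=0.02, size=(n_prompts, d_text))
        # Linear coupling layer: maps each text prompt to a visual prompt.
        self.W = rng.normal(scale=0.02, size=(d_text, d_visual))
        self.b = np.zeros(d_visual)

    def visual_prompts(self):
        # Image-branch prompts are a function of the text-branch prompts,
        # so the two branches stay synchronised during training.
        return self.text_prompts @ self.W + self.b  # (n_prompts, d_visual)
```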

CLIP in video:

| Title | Description | Code |
| --- | --- | --- |
| Fine-Tuned CLIP Models Are Efficient Video Learners | Adapts CLIP for videos (ViFi-CLIP). Claims that frame-level CLIP embeddings, though processed independently, can still capture temporal dependencies, and that instead of devising specific modules to address temporal dependency in videos, simply fine-tuning CLIP on videos generalises well. Uses temporal pooling: embeddings from T frames are pooled into one embedding that is used in the contrastive learning process, which is probably why the embeddings stay consistent with image-based CLIP. | code |
| Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting | Performs prompt learning on video data to better fine-tune an image-based CLIP model for videos. Same authors as ViFi-CLIP (above). Need to look into how the prompts are actually learned. | code |
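The temporal pooling step described for ViFi-CLIP can be sketched as follows (a NumPy sketch under the assumption that the pooling is a simple mean over frames; the function name is illustrative):

```python
import numpy as np

def pool_frame_embeddings(frame_embeddings):
    """Average T per-frame CLIP embeddings into one video-level embedding,
    then re-normalise it for use in the contrastive objective."""
    pooled = frame_embeddings.mean(axis=0)   # (T, D) -> (D,)
    return pooled / np.linalg.norm(pooled)
```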

Crowd Counting:

| Title | Description | Code |
| --- | --- | --- |
| CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model | Crowd counting with CLIP: fine-tunes CLIP for the counting task using a ranking loss, without using ground-truth people counts for training. Uses a sequential prompting setup to filter out image parts, keeping only those containing people's heads for counting. | code |
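A generic pairwise ranking loss of the kind CrowdCLIP builds on can be sketched as follows (an illustrative margin ranking loss, not the paper's exact formulation): predicted counts for nested crops, ordered from innermost to outermost, should be non-decreasing, since a larger crop contains at least as many people as any crop inside it.

```python
def crop_ranking_loss(scores, margin=0.0):
    """Hinge penalty whenever a smaller (inner) crop scores higher than
    the larger crop that contains it. `scores` is ordered inner -> outer."""
    pairs = list(zip(scores, scores[1:]))
    loss = sum(max(0.0, inner - outer + margin) for inner, outer in pairs)
    return loss / max(len(pairs), 1)
```

No count labels appear anywhere in this objective, which is what makes the training unsupervised with respect to ground-truth counts.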

Generative:

| Title | Description | Code |
| --- | --- | --- |
| ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency | ... | ... |
| CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language | ... | ... |
| CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search | ... | ... |
| Local 3D Editing via 3D Distillation of CLIP Knowledge | ... | ... |

Continual learning:

| Title | Description | Code |
| --- | --- | --- |
| AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning | Uses prompt tuning with CLIP to tackle continual learning; heavily inspired by CoOp. | code |

3D and Point-cloud:

| Title | Description | Code |
| --- | --- | --- |
| CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data | ... | ... |

Detection:

| Title | Description | Code |
| --- | --- | --- |
| DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment | | |
| CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation | | |
| WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation | | |
| HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models | | |

Contributors: faisal-hajari, sulrash, sara-sulaiman
