
flmr's Introduction

FLMR

The Hugging Face Transformers implementation of the Fine-grained Late-interaction Multi-modal Retriever (FLMR).

The official implementation is available here.

The details of the model and checkpoints can be found here.

The details for reproducing the datasets and evaluation in the paper can be found here.

Updates

  • [03/09/2024] We have uploaded the images used in the M2KR benchmark here.
  • [10/08/2024] We received many requests regarding adding multilingual abilities to PreFLMR. We announce that we are now training the Chinese version of PreFLMR and will release it very soon. Stay tuned!
  • [05/06/2024] 🔥🔥🔥We made some updates to the implementation
    • Added an evaluation script that reproduces the results in the PreFLMR paper here
    • Added the updated benchmark results with the transformer implementation here
    • Added an example script to fine-tune PreFLMR on a custom retrieval dataset here
    • IMPORTANT: fixed the OVEN data splits in the M2KR benchmark, and updated each entry with a fixed instruction to ensure the evaluation result is not affected by random sampling of instructions. Please delete your local cache and download the dataset again.

Models and Benchmark Results

| Model | WIT Recall@10 | IGLUE Recall@1 | KVQA Recall@5 | MSMARCO Recall@5 | OVEN Recall@5 | LLaVA Recall@1 | EVQA Recall@5 | EVQA Pseudo Recall@5 | OKVQA Recall@5 | OKVQA Pseudo Recall@5 | Infoseek Recall@5 | Infoseek Pseudo Recall@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LinWeizheDragon/PreFLMR_ViT-G 🤗 | 0.619 | 0.718 | 0.419 | 0.783 | 0.643 | 0.726 | 0.625 | 0.721 | 0.302 | 0.674 | 0.392 | 0.577 |
| LinWeizheDragon/PreFLMR_ViT-L 🤗 | 0.605 | 0.699 | 0.440 | 0.779 | 0.608 | 0.729 | 0.609 | 0.708 | 0.314 | 0.690 | 0.374 | 0.578 |
| LinWeizheDragon/PreFLMR_ViT-B 🤗 | 0.427 | 0.574 | 0.294 | 0.786 | 0.468 | 0.673 | 0.550 | 0.663 | 0.272 | 0.658 | 0.260 | 0.496 |

Note: We converted the checkpoints from PyTorch to Hugging Face Transformers, so the benchmark results differ slightly from the numbers reported in the original paper. You can reproduce the results in the paper by following the instructions in this document.

How to use this package

Environment

Create a conda environment:

conda create -n FLMR python=3.10 -y
conda activate FLMR

Install PyTorch:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Install faiss

conda install -c pytorch -c nvidia faiss-gpu=1.7.4 mkl=2021 blas=1.0=mkl

Test that faiss imports without error:

python -c "import faiss"

Install FLMR

git clone https://github.com/LinWeizheDragon/FLMR.git
cd FLMR
pip install -e .

Install ColBERT engine

cd third_party/ColBERT
pip install -e .

Install other dependencies

pip install ujson gitpython easydict ninja datasets transformers

Index a custom document collection

  1. Load pre-trained models

    import os
    import torch
    import pandas as pd
    import numpy as np
    from torchvision.transforms import ToPILImage
    from transformers import AutoImageProcessor
    
    from flmr import index_custom_collection
    from flmr import FLMRQueryEncoderTokenizer, FLMRContextEncoderTokenizer, FLMRModelForRetrieval
    
    # load models
    checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-G"
    image_processor_name = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
    
    query_tokenizer = FLMRQueryEncoderTokenizer.from_pretrained(checkpoint_path, subfolder="query_tokenizer")
    context_tokenizer = FLMRContextEncoderTokenizer.from_pretrained(
        checkpoint_path, subfolder="context_tokenizer"
    )
    
    model = FLMRModelForRetrieval.from_pretrained(
        checkpoint_path,
        query_tokenizer=query_tokenizer,
        context_tokenizer=context_tokenizer,
    )
    image_processor = AutoImageProcessor.from_pretrained(image_processor_name)
  2. Create document collections

    num_items = 100
    feature_dim = 1664
    passage_contents = [f"This is test sentence {i}" for i in range(num_items)]
    # Option 1. text-only documents
    custom_collection = passage_contents
    # Option 2. multi-modal documents with pre-extracted image features
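    # (see the sketch after step 3 for one way such features can be obtained)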
    # passage_image_features = np.random.rand(num_items, feature_dim)
    # custom_collection = [
    #     (passage_content, passage_image_feature, None) for passage_content, passage_image_feature in zip(passage_contents, passage_image_features)
    # ]
    # Option 3. multi-modal documents with images
    # random_images = torch.randn(num_items, 3, 224, 224)
    # to_img = ToPILImage()
    # if not os.path.exists("./test_images"):
    #     os.makedirs("./test_images")
    # for i, image in enumerate(random_images):
    #     image = to_img(image)
    #     image.save(os.path.join("./test_images", "{}.jpg".format(i)))
    
    # image_paths = [os.path.join("./test_images", "{}.jpg".format(i)) for i in range(num_items)]
    
    # custom_collection = [
    #     (passage_content, None, image_path)
    #     for passage_content, image_path in zip(passage_contents, image_paths)
    # ]
  3. Run indexing on the custom collection

    index_custom_collection(
        custom_collection=custom_collection,
        model=model,
        index_root_path=".",
        index_experiment_name="test_experiment",
        index_name="test_index",
        nbits=8, # number of bits in compression
        doc_maxlen=512, # maximum allowed document length
        overwrite=True, # whether to overwrite existing indices
        use_gpu=False, # whether to enable GPU indexing
        indexing_batch_size=64,
        model_temp_folder="tmp",
        nranks=1, # number of GPUs used in indexing
    )
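
For Option 2 in step 2, passage_image_features must be pre-extracted vectors of size feature_dim (1664 matches the hidden size of the ViT-G vision encoder). The snippet below is a minimal sketch of one plausible way to obtain such features, assuming they are pooled CLIP vision embeddings; please check the modeling files to confirm that this matches how FLMR consumes pre-extracted features.

import os
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, CLIPVisionModel

# Assumption: passage image features are pooled CLIP vision embeddings
# (hidden size 1664 for laion/CLIP-ViT-bigG-14-laion2B-39B-b160k).
vision_model_name = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
processor = AutoImageProcessor.from_pretrained(vision_model_name)
vision_encoder = CLIPVisionModel.from_pretrained(vision_model_name).eval()

num_items = 100  # as in step 2
# reuse the images saved in Option 3 of step 2
image_paths = [os.path.join("./test_images", f"{i}.jpg") for i in range(num_items)]

passage_image_features = []
with torch.no_grad():
    for path in image_paths:
        pixel_values = processor(Image.open(path).convert("RGB"), return_tensors="pt")["pixel_values"]
        outputs = vision_encoder(pixel_values=pixel_values)
        passage_image_features.append(outputs.pooler_output.squeeze(0).numpy())
passage_image_features = np.stack(passage_image_features)  # shape: (num_items, 1664)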

Search a custom document collection

  1. Create toy query data

    num_queries = 2
    
    query_instructions = [f"instruction {i}" for i in range(num_queries)]
    query_texts = [f"{query_instructions[i]} : query {i}" for i in range(num_queries)]
    query_images = torch.zeros(num_queries, 3, 224, 224)
    query_encoding = query_tokenizer(query_texts)
    query_pixel_values = image_processor(query_images, return_tensors="pt")['pixel_values']
  2. Obtain query embeddings with model

    inputs = dict(
        input_ids=query_encoding['input_ids'],
        attention_mask=query_encoding['attention_mask'],
        pixel_values=query_pixel_values,
    )
    
    # Run model query encoding
    res = model.query(**inputs)
    
    queries = {i: query_texts[i] for i in range(num_queries)}
    query_embeddings = res.late_interaction_output
  3. Search the collection

    from flmr import search_custom_collection, create_searcher
    
    # initiate a searcher
    searcher = create_searcher(
        index_root_path=".",
        index_experiment_name="test_experiment",
        index_name="test_index",
        nbits=8, # number of bits in compression
        use_gpu=True, # whether to enable GPU searching
    )
    # Search the custom collection
    ranking = search_custom_collection(
        searcher=searcher,
        queries=queries,
        query_embeddings=query_embeddings,
        num_document_to_retrieve=5, # how many documents to retrieve for each query
    )
    
    # Analyse retrieved documents
    ranking_dict = ranking.todict()
    for i in range(num_queries):
        print(f"Query {i} retrieved documents:")
        retrieved_docs = ranking_dict[i]
        retrieved_docs_indices = [doc[0] for doc in retrieved_docs]
        retrieved_doc_scores = [doc[2] for doc in retrieved_docs]
        retrieved_doc_texts = [passage_contents[doc_idx] for doc_idx in retrieved_docs_indices]
    
        data = {
            "Confidence": retrieved_doc_scores,
            "Content": retrieved_doc_texts,
        }
    
        df = pd.DataFrame.from_dict(data)
    
        print(df)
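
If you know which documents are relevant to each query, the ranking can also be turned into a quick Recall@K check. The sketch below is not part of the package; gold_doc_indices is a hypothetical mapping from query id to the set of relevant document indices in passage_contents.

# Hypothetical gold labels: query id -> set of relevant document indices
gold_doc_indices = {0: {0}, 1: {1}}

K = 5
ranking_dict = ranking.todict()
hits = 0
for qid, retrieved_docs in ranking_dict.items():
    top_k_indices = [doc[0] for doc in retrieved_docs[:K]]
    if any(idx in gold_doc_indices.get(qid, set()) for idx in top_k_indices):
        hits += 1
print(f"Recall@{K}: {hits / len(ranking_dict):.3f}")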

Training with contrastive learning

import torch
from flmr import FLMRQueryEncoderTokenizer, FLMRContextEncoderTokenizer, FLMRModelForRetrieval

checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-L"
image_processor_name = "openai/clip-vit-large-patch14"
query_tokenizer = FLMRQueryEncoderTokenizer.from_pretrained(checkpoint_path, subfolder="query_tokenizer")
context_tokenizer = FLMRContextEncoderTokenizer.from_pretrained(checkpoint_path, subfolder="context_tokenizer")

model = FLMRModelForRetrieval.from_pretrained(checkpoint_path,
                                                query_tokenizer=query_tokenizer,
                                                context_tokenizer=context_tokenizer,
                                                )

Q_encoding = query_tokenizer(["Using the provided image, obtain documents that address the subsequent question: What is the capital of France?", "Extract documents linked to the question provided in conjunction with the image: What is the capital of China?"])
D_encoding = context_tokenizer(["Paris is the capital of France.", "Beijing is the capital of China.",
                            "Paris is the capital of France.", "Beijing is the capital of China."])
Q_pixel_values = torch.zeros(2, 3, 224, 224)
inputs = dict(
    query_input_ids=Q_encoding['input_ids'],
    query_attention_mask=Q_encoding['attention_mask'],
    query_pixel_values=Q_pixel_values,
    context_input_ids=D_encoding['input_ids'],
    context_attention_mask=D_encoding['attention_mask'],
    use_in_batch_negatives=True,
)

res = model.forward(**inputs)
print(res)

Note that the examples in this code block are for demonstration purposes only; they show that the pre-trained model assigns higher scores to the correct documents. In real training, you always need to pass in the documents in the order "positive doc for query 1, negative doc 1 for query 1, negative doc 2 for query 1, ..., positive doc for query 2, negative doc 1 for query 2, negative doc 2 for query 2, ...". A minimal sketch of this ordering is shown below; a later section provides a full example finetuning script.
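
As a hedged illustration of that ordering (this helper is not part of the library), the snippet below interleaves one positive and num_negative_examples negatives per query before tokenization. The passages are placeholders, and the num_negative_examples forward argument follows the usage shown in the community finetuning script reproduced in the Issues section later on this page.

num_negative_examples = 2  # negatives per query (placeholder value)

queries = ["What is the capital of France?", "What is the capital of China?"]
positives = ["Paris is the capital of France.", "Beijing is the capital of China."]
negatives = [  # num_negative_examples entries per query
    ["Berlin is the capital of Germany.", "Madrid is the capital of Spain."],
    ["Tokyo is the capital of Japan.", "Seoul is the capital of South Korea."],
]

# Interleave: pos_q1, neg1_q1, neg2_q1, pos_q2, neg1_q2, neg2_q2, ...
passages = []
for pos, negs in zip(positives, negatives):
    passages.append(pos)
    passages.extend(negs)

Q_encoding = query_tokenizer(queries)
D_encoding = context_tokenizer(passages)
inputs = dict(
    query_input_ids=Q_encoding["input_ids"],
    query_attention_mask=Q_encoding["attention_mask"],
    query_pixel_values=torch.zeros(len(queries), 3, 224, 224),
    context_input_ids=D_encoding["input_ids"],
    context_attention_mask=D_encoding["attention_mask"],
    use_in_batch_negatives=True,
    num_negative_examples=num_negative_examples,
)
res = model.forward(**inputs)
print(res.loss)  # contrastive loss over the interleaved batch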

Alternative: use transformers.AutoModel to load pre-trained models

pip install transformers
from transformers import AutoConfig, AutoModel, AutoImageProcessor, AutoTokenizer
import torch

checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-L"
image_processor_name = "openai/clip-vit-large-patch14"
query_tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, subfolder="query_tokenizer", trust_remote_code=True)
context_tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, subfolder="context_tokenizer", trust_remote_code=True)

model = AutoModel.from_pretrained(checkpoint_path,
                                query_tokenizer=query_tokenizer,
                                context_tokenizer=context_tokenizer,
                                trust_remote_code=True,
                                )
image_processor = AutoImageProcessor.from_pretrained(image_processor_name)

Use example scripts

We provide two scripts to show how the pretrained models can be used in evaluation:

  1. examples/example_use_flmr.py: an example script to evaluate FLMR (with 10 ROIs) on OK-VQA.
  2. examples/example_use_preflmr.py: an example script to evaluate PreFLMR on E-VQA.

Use FLMR

cd examples/

Download KBVQA_data from here and unzip the image folders. The ROI/captioning/object detection results have been included.

Run the following command (remove --run_indexing if you have already run indexing once):

python example_use_flmr.py \
            --use_gpu --run_indexing \
            --index_root_path "." \
            --index_name OKVQA_GS \
            --experiment_name OKVQA_GS \
            --indexing_batch_size 64 \
            --image_root_dir /path/to/KBVQA_data/ok-vqa/ \
            --dataset_path BByrneLab/OKVQA_FLMR_preprocessed_data \
            --passage_dataset_path BByrneLab/OKVQA_FLMR_preprocessed_GoogleSearch_passages \
            --use_split test \
            --nbits 8 \
            --Ks 1 5 10 20 50 100 \
            --checkpoint_path LinWeizheDragon/FLMR \
            --image_processor_name openai/clip-vit-base-patch32 \
            --query_batch_size 8 \
            --num_ROIs 9 \

[NEW!] Use PreFLMR

You can download the E-VQA images from https://github.com/google-research/google-research/tree/master/encyclopedic_vqa. We will add a dataset link here soon.

cd examples/

Run the following command (remove --run_indexing if you have already run indexing once):

python example_use_preflmr.py \
            --use_gpu --run_indexing \
            --index_root_path "." \
            --index_name EVQA_PreFLMR_ViT-G \
            --experiment_name EVQA \
            --indexing_batch_size 64 \
            --image_root_dir /rds/project/rds-hirYTW1FQIw/shared_space/vqa_data/KBVQA_data/EVQA/eval_image/ \
            --dataset_hf_path BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR \
            --dataset EVQA \
            --use_split test \
            --nbits 8 \
            --Ks 1 5 10 20 50 100 500 \
            --checkpoint_path LinWeizheDragon/PreFLMR_ViT-G \
            --image_processor_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k \
            --query_batch_size 8 \
            --compute_pseudo_recall \

Here, we have uploaded all the M2KR datasets into one HF dataset, BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR, with each dataset as a subset. To reproduce the results of the other datasets in the paper, change --dataset to OKVQA, KVQA, LLaVA, OVEN, Infoseek, WIT, IGLUE, or EVQA.
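
If you want to inspect a subset before running the script, you can load it with datasets.load_dataset. The config names below follow the "<subset>_data" / "<subset>_passages" pattern used in the community finetuning script later on this page; treat the exact config and split names as assumptions to verify against the dataset card.

from datasets import load_dataset

# Config naming assumed from the "<subset>_data" / "<subset>_passages" pattern
evqa_data = load_dataset(
    "BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR", "EVQA_data"
)
evqa_passages = load_dataset(
    "BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR", "EVQA_passages"
)
print(evqa_data)
print(evqa_passages)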

Updates:

  • Enable --compute_pseudo_recall to compute pseudo recall for datasets like EVQA/OKVQA/Infoseek
  • Enable --Ks 1 5 10 20 50 100 500: max(Ks) needs to be 500 to match the performance reported in the PreFLMR paper.
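
For intuition, pseudo recall treats a retrieved passage as relevant when it contains a gold answer string, rather than requiring it to match an annotated positive passage id. The sketch below only illustrates that idea with made-up inputs; it is not the exact scoring code used by example_use_preflmr.py.

def pseudo_recall_at_k(retrieved_passage_texts, gold_answers, k=5):
    """Hit if any of the top-k passages contains any gold answer string (case-insensitive)."""
    top_k = retrieved_passage_texts[:k]
    return float(any(
        answer.lower() in passage.lower()
        for passage in top_k
        for answer in gold_answers
    ))

# Example with made-up data
retrieved = ["Paris is the capital of France.", "Lyon is a city in France."]
print(pseudo_recall_at_k(retrieved, ["Paris"], k=5))  # 1.0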

[NEW!] Evaluate the PreFLMR models on all M2KR benchmarks

Change the image root paths in examples/evaluate_all.sh and execute:

cd examples
bash evaluate_all.sh

Obtain the report by:

python report.py

[NEW!] Finetune the PreFLMR model on downstream datasets

You will need to install pytorch-lightning:

pip install pytorch-lightning==2.1.0

Run finetuning

python example_finetune_preflmr.py \
    --image_root_dir /path/to/EVQA/images/ \
    --dataset_hf_path BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR \
    --dataset EVQA \
    --freeze_vit \
    --log_with_wandb \
    --model_save_path saved_models \
    --checkpoint_path LinWeizheDragon/PreFLMR_ViT-G \
    --image_processor_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k \
    --batch_size 8 \
    --accumulate_grad_batches 8 \
    --valid_batch_size 16 \
    --test_batch_size 64 \
    --mode train \
    --max_epochs 99999999 \
    --learning_rate 0.000005 \
    --warmup_steps 100 \
    --accelerator auto \
    --devices auto \
    --strategy ddp_find_unused_parameters_true \
    --num_sanity_val_steps 2 \
    --precision bf16 \
    --val_check_interval 2000 \
    --save_top_k -1 \

Run Testing

python example_use_preflmr.py \
    --use_gpu --run_indexing \
    --index_root_path "." \
    --index_name EVQA_PreFLMR_ViT-G_finetuned_model_step_10156 \
    --experiment_name EVQA \
    --indexing_batch_size 64 \
    --image_root_dir /path/to/EVQA/images/ \
    --dataset_hf_path BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR \
    --dataset EVQA \
    --use_split test \
    --nbits 8 \
    --num_gpus 1 \
    --Ks 1 5 10 20 50 100 500 \
    --checkpoint_path saved_models/model_step_10156 \
    --image_processor_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k \
    --query_batch_size 8 \

Example finetuning results

By running the above script, we are able to obtain the following finetuning performance:

| Step  | Pseudo Recall@5 on EVQA |
|-------|-------------------------|
| 2500  | 73.6                    |
| 10000 | 73.55                   |
| 12000 | 74.21                   |
| 14000 | 73.73                   |

(Checkpoints with low validation losses were selected and tested; the runs used 2 A100 GPUs.)


Note

The FLMR model is implemented following the documentation style of transformers. You can find detailed documentation in the modeling files.

Citation

If our work helped your research, please kindly cite our papers on FLMR and PreFLMR.

@inproceedings{lin2023finegrained,
    title={Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering},
    author={Weizhe Lin and Jinghong Chen and Jingbiao Mei and Alexandru Coca and Bill Byrne},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
    url={https://openreview.net/forum?id=IWWWulAX7g}
}

@inproceedings{lin-etal-2024-preflmr,
    title = "{P}re{FLMR}: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers",
    author = "Lin, Weizhe  and
      Mei, Jingbiao  and
      Chen, Jinghong  and
      Byrne, Bill",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.289",
    pages = "5294--5316",
    abstract = "Large Multimodal Models (LMMs) excel in natural language and visual understanding but are challenged by exacting tasks such as Knowledge-based Visual Question Answering (KB-VQA) which involve the retrieval of relevant information from document collections to use in shaping answers to questions. We present an extensive training and evaluation framework, M2KR, for KB-VQA. M2KR contains a collection of vision and language tasks which we have incorporated into a single suite of benchmark tasks for training and evaluating general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a pre-trained version of the recently developed Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new state-of-the-art results across a range of tasks. We also present investigations into the scaling behaviors of PreFLMR intended to be useful in future developments in general-purpose multi-modal retrievers.",
}

flmr's People

Contributors

linweizhedragon, jingbiaomei, erichen0615


flmr's Issues

question about details of finetuning script

Hi Lin, I managed to write a finetuning script; could you help me check it? I am also confused about some details, listed below (and marked with NOTE in the code comments). Could you clarify them? Thanks!

  1. In PreFLMR finetuning (Section B.2 of the paper), Infoseek was finetuned on 4 GPUs with batch size 8 and gradient accumulation of 8 steps, so the effective batch size per step is 4 * 8 * 8 = 256. Infoseek was finetuned for 1k steps, which adds up to 256 * 1k = 256k examples.
    However, in the M2KR train datasheet, Infoseek has 100k examples (in the HF repo it is actually 600k).
    Are the 256k examples made up of multiple epochs over the 100k examples, or sampled from the 600k?
  2. In training, an example in the train dataset has multiple positive passages (stored in pos_item_contents). Is one sampled at random from pos_item_contents in dataset.__getitem__?
  3. In training, an example needs 4 negative passages. Are those sampled at random from the non-positive passages in the knowledge base?
  4. In collate_fn, should in_batch_negatives_from_all_gpus be True or False (it is False by default)?
import transformers
from transformers import TrainingArguments, Trainer, HfArgumentParser
from transformers import AutoImageProcessor
from datasets import load_dataset

import torch
import torch.nn.functional as F
import torch.distributed as dist
from torch.utils.data import Dataset

import random
import os
from PIL import Image
from pprint import pformat
from dataclasses import dataclass

from flmr import FLMRQueryEncoderTokenizer, FLMRContextEncoderTokenizer, FLMRModelForRetrieval


@dataclass
class MyArguments:
    model_name_or_path :str = "LinWeizheDragon/PreFLMR_ViT-G"
    image_processor_name :str = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
    dataset_name :str = "BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR"
    dataset_subset_name :str = "Infoseek"
    query_images_dir :str = "./Infoseek/train_images"
    num_negative_examples :int = 4


@dataclass
class PreFLMRTrainingArguments(TrainingArguments):
    # set according to PreFLMR paper 
    remove_unused_columns :bool = False
    per_device_train_batch_size :int = 8
    gradient_accumulation_steps :int = 8
    logging_steps :int = 1
    eval_strategy :str = 'no'
    save_strategy :str = 'steps'
    save_steps :int = 500
    max_steps :int = 1000
    save_only_model :bool = True
    save_total_limit :int = 5
    seed :int = 42
    # manually set optimizer later
    mapping_structure_lr :float = 1e-4
    non_mapping_structure_lr :float = 1e-5


class PreFLMRTrainer(Trainer):

    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs, return_dict=True)
        
        # replace with ib loss
        ib_loss = outputs["in_batch_negative_loss"]
        outputs["deprecated_loss"] = outputs["loss"]
        outputs["loss"] = ib_loss

        return (ib_loss, outputs) if return_outputs else ib_loss


class PreFLMRDataset(Dataset):
    
    def __init__(self,
                 args,
                 data_df, passages_df, 
                 query_tokenizer, context_tokenizer, image_processor):
        self.args = args
        self.data_df = data_df
        self.passages_df = passages_df
        self.query_tokenizer = query_tokenizer
        self.context_tokenizer = context_tokenizer
        self.image_processor = image_processor
        
        self.unique_passage_ids = set(self.passages_df.index)
    
    def __len__(self):
        return len(self.data_df)
    
    def __getitem__(self, idx):
        row = self.data_df.iloc[idx]
        # NOTE: the *Infoseek* subset happens to have instructions prepended to `question`.
        #   for other subsets, instructions are required.
        # query = instruction + row['question']
        assert ':' in row['question'], 'Only Infoseek has instruction prepended to question in `question` field.'
        query = row['question']
        
        pos_item_ids = row['pos_item_ids']
        pos_item_id = random.choice(pos_item_ids) # NOTE random choose here?
        pos_passage = self.passages_df.loc[pos_item_id]['passage_content']
        
        neg_item_ids = random.sample(list(self.unique_passage_ids - set(pos_item_ids)), 
                                     self.args.num_negative_examples)  # NOTE random choose here?
        neg_passages = [self.passages_df.loc[neg_item_id]['passage_content'] for neg_item_id in neg_item_ids]
        
        image_path = os.path.join(self.args.query_images_dir, row['img_path'])
        image = Image.open(image_path).convert('RGB')
        pixel_values = self.image_processor(image, return_tensors='pt')['pixel_values'] # [1, 3, 224, 224]
        
        return dict(
            query=query,
            pos_passage=pos_passage,
            neg_passages=neg_passages,
            pixel_values=pixel_values
        )
        
    def collate_fn(self, batch):
        queries = [ex['query'] for ex in batch]
        passages = [] # [pos, neg, neg, neg, pos, ...]
        for ex in batch:
            passages.append(ex['pos_passage'])
            passages.extend(ex['neg_passages'])

        Q_encoding = self.query_tokenizer(queries)
        Q_pixel_values = torch.cat([ex['pixel_values'] for ex in batch], dim=0)
        D_encoding = self.context_tokenizer(passages)
        
        # according to `modeling_flmr.py, FLMRModelForRetrieval.forward`
        inputs = dict(
            query_input_ids=Q_encoding['input_ids'],
            query_attention_mask=Q_encoding['attention_mask'],
            query_pixel_values=Q_pixel_values,
            context_input_ids=D_encoding['input_ids'],
            context_attention_mask=D_encoding['attention_mask'],
            use_in_batch_negatives=True,
            in_batch_negatives_from_all_gpus=False, # NOTE should be False here?
            num_negative_examples=self.args.num_negative_examples
        )
        return inputs


def main():
    parser = HfArgumentParser((MyArguments, PreFLMRTrainingArguments))
    my_args, training_args = parser.parse_args_into_dataclasses()
    
    ## setting up
    assert dist.get_world_size() == 4, 'The paper uses 4 gpus.'
    if dist.get_rank() == 0:
        print('## my_args: ', pformat(my_args))
        print('## training_args: ', pformat(training_args))
    
    ## setting up tokenizer
    query_tokenizer = FLMRQueryEncoderTokenizer.from_pretrained(
        my_args.model_name_or_path, subfolder="query_tokenizer")
    context_tokenizer = FLMRContextEncoderTokenizer.from_pretrained(
        my_args.model_name_or_path, subfolder="context_tokenizer")
    image_processor = AutoImageProcessor.from_pretrained(my_args.image_processor_name)

    ## setting up dataset
    data_df = load_dataset(my_args.dataset_name, f'{my_args.dataset_subset_name}_data')['train']\
        .to_pandas().set_index('question_id')
    passages_df = load_dataset(my_args.dataset_name, f'{my_args.dataset_subset_name}_passages')['train_passages']\
        .to_pandas().set_index('passage_id')
        
    dataset = PreFLMRDataset(args=my_args,
                             data_df=data_df, passages_df=passages_df,
                             query_tokenizer=query_tokenizer, 
                             context_tokenizer=context_tokenizer, 
                             image_processor=image_processor)
    
    if dist.get_rank() == 0:
        print(pformat(f'## dataset[0]: {dataset[0]}'))
    
    ## setting up model
    model = FLMRModelForRetrieval.from_pretrained(
        my_args.model_name_or_path,
        query_tokenizer=query_tokenizer,
        context_tokenizer=context_tokenizer)
    
    ## setting up training
    # build trainables    
    # PreFLMR consists of
    #   pretrained_structure_modules: pretrained text and vision encoder, remain frozen
    #   mapping_structure_modules: a 2-layer MLP (F_M^MLP) and a Transformer block (F_M^TR), lr = 1e-4
    #   non_mapping_structure_modules: remaining modules, mostly linears. lr = 1e-5
    pretrained_structure_modules = [
        model.query_text_encoder, # FLMRTextModel
        model.query_vision_encoder, # FLMRVisionModel
        model.context_text_encoder, # FLMRTextModel
        model.context_vision_encoder # FLMRVisionModel
    ]
    mapping_structure_modules = [
        model.query_vision_projection, # FLMRMultiLayerPerceptron
        model.context_vision_projection, # FLMRMultiLayerPerceptron
        model.transformer_mapping_network, # BertEncoder
    ]
    non_mapping_structure_modules = [
        model.query_text_encoder_linear, # Linear
        model.context_text_encoder_linear, # Linear
        model.transformer_mapping_input_linear, # Linear
        model.transformer_mapping_output_linear, # Linear
    ]
    # check that all parameters are included, nothing left out
    assert set(id(p) for p in model.parameters()) == set(id(p) \
        for module in pretrained_structure_modules + mapping_structure_modules + non_mapping_structure_modules
        for p in module.parameters())
    
    for module in pretrained_structure_modules:
        for p in module.parameters():
            p.requires_grad = False
    
    if dist.get_rank() == 0:
        trainables = [pn for pn, p in model.named_parameters() if p.requires_grad]
        n_trainables = sum(p.numel() for p in model.parameters() if p.requires_grad)
        print(pformat(f'## trainables: {trainables}'))
        print(pformat(f'## n_trainables: {n_trainables:,}'))
    
    # build optimizer and constant scheduler
    #   need to deduplicate modules (some modules may share parameters)
    optimizer_groups = []
    
    mapping_structure_modules_dedup = list({id(m) : m for m in mapping_structure_modules}.values())
    for module in mapping_structure_modules_dedup:
        optimizer_groups.append(dict(params=module.parameters(), lr=training_args.mapping_structure_lr))
    
    non_mapping_structure_modules_dedup = list({id(m) : m for m in (non_mapping_structure_modules)}.values())
    for module in non_mapping_structure_modules_dedup:
        optimizer_groups.append(dict(params=module.parameters(), lr=training_args.non_mapping_structure_lr))
    
    optimizer = torch.optim.Adam(optimizer_groups)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda epoch: 1.0)
    
    ## start training
    trainer = PreFLMRTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=dataset.collate_fn,
        optimizers=[optimizer, scheduler]
    )
    trainer.train()
    
    trainer.save_state()
    trainer.save_model()
    query_tokenizer.save_pretrained(os.path.join(training_args.output_dir, 'query_tokenizer'))
    context_tokenizer.save_pretrained(os.path.join(training_args.output_dir, 'context_tokenizer'))


if __name__ == '__main__':
    main()

thanks in advance for your generous help!

Results compared with InternViT

Hi, I would love to see some comparison to SOTA ViT models such as InternViT and SigLIP, especially for the Chinese version.

Problem with reproducing results

Hello. I ran example_use_preflmr.py for EVQA using your GitHub code and the Hugging Face checkpoint, but I could not reproduce the PR@5 = 73.1 reported in the paper. I used the parameters you mentioned on GitHub. What could be the reason? Do I need to change the code or the parameters? Below are my results and parameters.

EVQA /nas-alinlp/xiewen.xie/models/LinWeizheDragon/PreFLMR_ViT-G
Total number of questions: 3750
Pseudo Recall@1: 0.5096
Pseudo Recall@5: 0.7194666666666667
Pseudo Recall@10: 0.7861333333333334
Pseudo Recall@20: 0.8341333333333333
Pseudo Recall@50: 0.8848
Pseudo Recall@100: 0.9090666666666667
Pseudo Recall@500: 0.9349333333333333
Recall@1: 0.40266666666667
Recall@5: 0.624
Recall@10: 0.703733333333333333
Recall@20: 0.7805333333333333
Recall@50: 0.8616
Recall@100: 0.910933333333333334
Recall@500: 0.9621333333333333

source activate /nas-alinlp/xiewen.xie/envs/FLMR
python example_use_preflmr.py
--use_gpu --run_indexing
--index_root_path "."
--index_name EVQA_PreFLMR_ViT-G
--experiment_name EVQA
--indexing_batch_size 64
--image_root_dir /nas-alinlp/xiewen.xie/DATASETS/EVQA_img
--dataset_hf_path /nas-alinlp/xiewen.xie/DATASETS/BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR
--dataset EVQA
--use_split test
--nbits 8
--Ks 1 5 10 20 50 100 500
--checkpoint_path /nas-alinlp/xiewen.xie/models/LinWeizheDragon/PreFLMR_ViT-G
--image_processor_name /nas-alinlp/xiewen.xie/models/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
--query_batch_size 8
--compute_pseudo_recall \

How to obtain passage_image_feature in the index_custom_collection method

I noticed that the demo code in example_use_custom_functions.py uses passage_image_feature. In practice, how is this data obtained? And which gives better document retrieval, pre-extracted features or using the images directly? Can you give me some advice?

Can this model achieve retrieval from text to (image + text)

Can this model achieve retrieval from text to (image + text)? For example, I have a query (text) and a database that contains images and their corresponding descriptions. I want to retrieve the fused features of visual embeddings and text embeddings for each image in the database. If possible, how should I implement this? Thank you very much!

For a custom document, how can I support the input of multiple images?

I followed the instructions about custom documents in the README.
## Create document collections
num_items = 100
feature_dim = 1664

document_items = []
document_items.extend(query_memories)
document_items.extend(random.sample(query_scene_memory, 15))
document_items.extend(random.sample(memory_items, 80))
assert len(document_items) == num_items

passage_contents = []
image_paths = []
for i, document_item in enumerate(document_items):
    passage_contents.append(
        f"Instruction: {document_item['subtaskInstruction']}, captioning: {document_item['detail']['beginCaption']}")
    image_paths.append(document_item['detail']['beginPath'])
    print(f"{i}: {document_item}")

custom_collection = [
    (passage_content, None, image_path)
    for passage_content, image_path in zip(passage_contents, image_paths)
]

However, it would be more suitable for me if multiple images per document were supported.
For the documents, each item includes a text content and multiple images.
For the query, each item includes a text content and an image.
Is this possible? If so, how should it be modified?
Thank you sincerely!

I downloaded LinWeizheDragon/PreFLMR_ViT-B from Hugging Face, but loading the model following the example code raises an error

Following the example code:
import torch
import pandas as pd
from transformers import AutoImageProcessor, AutoModel

from flmr import index_custom_collection
from flmr import FLMRQueryEncoderTokenizer, FLMRContextEncoderTokenizer, FLMRModelForRetrieval

checkpoint_path = "/mnt/lustre/gaoyuan/models/LinWeizheDragon/PreFLMR_ViT-B"
image_processor_name = "/mnt/lustre/gaoyuan/models/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"

query_tokenizer = FLMRQueryEncoderTokenizer.from_pretrained(checkpoint_path, subfolder="query_tokenizer")
context_tokenizer = FLMRContextEncoderTokenizer.from_pretrained(
    checkpoint_path, subfolder="context_tokenizer"
)

model = AutoModel.from_pretrained(
    checkpoint_path,
    query_tokenizer=query_tokenizer,
    context_tokenizer=context_tokenizer,
    trust_remote_code=True,
)

The error is as follows:
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like bert-base-uncased is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Wikipedia Corpus used for OK-VQA task

Thanks for the great work!
In the FLMR paper, I saw there are two external datasets used for the OK-VQA task, the Google Search Corpus and the Wikipedia Corpus. I found the Google Search Corpus here, but did not find the Wikipedia Corpus. I am quite interested in the latter, so I wanted to ask whether that data is available somewhere. Thanks in advance!

Best,
Shuai

Error when loading the FLMR retrieval model

code

flmr_model = FLMRModelForRetrieval.from_pretrained(
        "LinWeizheDragon/FLMR",
        query_tokenizer=query_tokenizer,
        context_tokenizer=context_tokenizer,
    )

error

TypeError: Object of type FLMRTextConfig is not JSON serializable.

How to use the PreFLMR model for downstream tasks?

As the title says: when I have a custom dataset of images and documents, how should I fine-tune PreFLMR on it so that I can retrieve the relevant documents from an image, as I intend? Do I need to train for embedding alignment between the images and the documents? I would greatly appreciate an answer to my question.

definition of load_dataset

Hello, it seems that load_dataset is not defined in examples/example_use_flmr.py, and the other files do not include load_dataset either. Was anything missed during uploading? Thanks.

problem of finetuning

When running the finetuning program on downstream tasks, the program stops in the first epoch.

Something about pretraining

Hi, I have some questions about pretraining. It seems there is no code for the stage-1 pretraining. I would like to know more details about this stage. Thanks!

Failing to reproduce M2KR Infoseek subset results using the PreFLMR model

Hi Lin, sorry to bother you. I am trying to reproduce the Infoseek results using the PreFLMR model, but the numbers are not close.

According to Table 2 of the PreFLMR paper, the reported zero-shot R@5 on Infoseek for PreFLMR (the G B-v2 1.96B variant) is 59.6. However, following the "Reproduce PreFLMR results" guide, my results are as follows:

Map: 100%|██████████| 4708/4708 [06:31<00:00, 12.03 examples/s]
=============================
Inference summary:
=============================
Total number of questions: 4708
Recall@1:        0.22854715378079865
Recall@5:        0.42247238742565846
Recall@10:       0.5218776550552251
Recall@20:       0.6076890399320306
Recall@50:       0.7051826677994902
Recall@100:      0.764231096006797
=============================
Done! Program exiting...

My reproduction script is almost identical to the given one, except that I pull models and datasets beforehand due to network issues.

In example_use_preflmr.py I set index_custom_collection(n_ranks=8) on line 48 to accelerate index building, and deleted num_proc=16 in ds.map(tokenize_inputs) on line 240 because it got stuck at Map: 0/4708, but I believe neither change affects the results.

python FLMR/examples/example_use_preflmr.py \
    --use_gpu --run_indexing \
    --index_root_path "./preflmr_index" \
    --index_name Infoseek_PreFLMR_ViT-G \
    --experiment_name Infoseek \
    --indexing_batch_size 64 \
    --image_root_dir "./Infoseek/val_images" \
    --dataset_hf_path "./data/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR" \
    --dataset Infoseek \
    --use_split test \
    --nbits 8 \
    --Ks 1 5 10 20 50 100 \
    --checkpoint_path "./pretrained_models/PreFLMR_ViT-G" \
    --image_processor_name "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k" \
    --query_batch_size 8 

Besides, based on the retrieval results you sent me before (for which I am very grateful) in issue #37 of RA-VQA, I calculated the R@5 manually: it is 40.57, similar to my reproduced results but not close to the results reported in the paper.

In the infoseek_blip2.zip you sent me, the files are as follows

generate_test_index_test_InfoseekDatasetForDPR.valid_predictions_rank_0.pkl
generate_test_index_test_InfoseekDatasetForDPR.valid_predictions_rank_1.pkl
generate_test_index_test_InfoseekDatasetForDPR.valid_predictions_rank_2.pkl
generate_test_index_test_InfoseekDatasetForDPR.valid_predictions_rank_3.pkl
generate_train_index_test_InfoseekDatasetForDPR.train_predictions_rank_0.pkl
generate_train_index_test_InfoseekDatasetForDPR.train_predictions_rank_1.pkl
generate_train_index_test_InfoseekDatasetForDPR.train_predictions_rank_2.pkl
generate_train_index_test_InfoseekDatasetForDPR.train_predictions_rank_3.pkl
model_step_2000.ckpt

Here is how I calculated it:

from glob import glob
import pickle

import pandas as pd

## load results
test_results = []
for path in glob('./unzipped/generate_test_index_test_InfoseekDatasetForDPR*'):
    test_results.extend(pickle.load(open(path, 'rb'))['output'])
print('len(test_results)', len(test_results))

## load test passages(which is identical to train passages)
test_passages = pd.read_parquet('./data/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR/Infoseek_passages/test_passages-00000-of-00001.parquet')
test_passages.set_index('passage_id', inplace=True)

## load test set(which contains `pos_item_ids` that are golden labels)
test = pd.read_parquet('./data/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR/Infoseek_data/test-00000-of-00001.parquet')
test.set_index('question_id', inplace=True)

## calculate result
topk = 5

n_hit = 0
for result in test_results:
    question_id = result['question_id']
    pos_item_ids = test.loc[question_id]['pos_item_ids']
    
    top_ranking_passage_ids = [ obj['passage_id'] for obj in result['top_ranking_passages']]
    
    if any(pos_item_id in top_ranking_passage_ids[:topk] 
           for pos_item_id in pos_item_ids):
        n_hit += 1

r_at_n = n_hit / len(test_results)
print(f'R@{topk}:', r_at_n)

it returns

len(test_results) 4708
R@5: 0.4056924384027188

Did I miss something? Could you please give me a hint? Thanks!

Question about image processing in customized multi-modal documents

In the FLMR paper, the documents are plain text without images, but in the code of this repository, customized documents can be multi-modal. Could you please tell me how the image is integrated with the text in the actual implementation of a customized multi-modal document? Is it similar to the query, where the image is processed in two ways and then concatenated with the text?
