facebookresearch / atlas
Code repository for supporting the paper "Atlas: Few-shot Learning with Retrieval Augmented Language Models" (https://arxiv.org/abs/2208.03299)
License: Other
Hello, thank you for sharing the code with very clear explanations and example scripts.
I could reproduce the evaluation results of the provided NQ-finetuned ATLAS-large model checkpoint using the atlas/example_scripts/nq/evaluate.sh script.
However, when I tried to reproduce the NQ 64-shot fine-tuning experiment with the provided ATLAS-large model checkpoint (and corresponding indices) using the example script atlas/example_scripts/nq/train_fewshot.sh, the code failed with the following error message:
Traceback (most recent call last):
File "train.py", line 196, in <module>
model, optimizer, scheduler, retr_optimizer, retr_scheduler, opt, step = load_or_initialize_atlas_model(opt)
File "/home/work/atlas/atlas/src/model_io.py", line 193, in load_or_initialize_atlas_model
model, optimizer, scheduler, retr_optimizer, retr_scheduler, opt_checkpoint, loaded_step = load_atlas_model(
File "/home/work/atlas/atlas/src/model_io.py", line 153, in load_atlas_model
optimizer, scheduler, retr_optimizer, retr_scheduler = set_optim(opt, model)
File "/home/work/atlas/atlas/src/util.py", line 168, in set_optim
from src.AdamWFP32Copy import AdamWFP32Copy
File "/home/work/atlas/atlas/src/AdamWFP32Copy.py", line 11, in <module>
adamw = _adamw.F.adamw
AttributeError: module 'torch.optim.adamw' has no attribute 'F'
It seems that this error occurs when the code tries to load the model from model_path, which calls the set_optim function to set up the optimizer, where AdamWFP32Copy.py is imported.
I couldn't find any mention of an attribute 'F' in the original PyTorch documentation for adamw; however, I guess the intention of line 11, adamw = _adamw.F.adamw, is to call the adamw function in the original PyTorch implementation at line 160.
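As a possible workaround, I was thinking of something like the following (untested; it assumes newer PyTorch versions expose the functional adamw directly on the torch.optim.adamw module, while older versions keep it under the _functional alias F):

```python
# possible replacement for line 11 of src/AdamWFP32Copy.py (untested sketch)
import torch.optim.adamw as _adamw

# old PyTorch: torch.optim.adamw imports torch.optim._functional as F
# newer PyTorch (>= 1.12): the functional adamw lives directly on the module
adamw = getattr(_adamw, "F", _adamw).adamw
```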
Could you provide any hints to solve this issue?
To help reproduce this error, here is some info about my working environment:
- PyTorch version: 1.12.0 (I had the same error with 1.13.0)
- Hardware: 4 A100 GPUs (in a single node)
- CUDA version: 11.3
- NVIDIA driver version: 465.19.01
Thank you in advance!
Hi,
We are trying to reproduce the generated answers of fully and 64-shot fine-tuned ATLAS on different datasets (including HotpotQA, as in Table 10 of the paper). However, we could not find checkpoints of ATLAS fine-tuned on these datasets in the repository. Is there any way we can get access to them?
Hi,
I am trying to train the models on multiple nodes of a Slurm cluster. However, I get "RuntimeError: ProcessGroupNCCL does not support gather" at line 95 of dist_utils.py.
Do you have any suggestions for handling this issue, e.g., using all_gather to replace gather here? Or is ProcessGroupNCCL not expected here? I am quite confused while debugging this, and thanks for any help!
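In case it helps, this is the kind of replacement I had in mind (just a sketch; the function and variable names are mine, not from dist_utils.py):

```python
import torch
import torch.distributed as dist

def gather_to_list(tensor: torch.Tensor) -> list[torch.Tensor]:
    """Emulate dist.gather with all_gather, which the NCCL backend does support.
    Every rank ends up with the full list; ranks other than the intended
    destination can simply ignore the result."""
    output = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(output, tensor)
    return output
```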
Hi,
I got a 755GB index saved on disk after encoding the whole set of Wikipedia passages. The large index takes a huge amount of storage and a long time to load to the GPU. However, it requires less than 100GB after loading to the GPU, which could be the index compression mentioned in your paper. Is it possible to save and load the compressed index for better time and storage consumption?
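Even something along these lines would already help; this is just a sketch using plain faiss product quantization, not necessarily how the codebase actually stores its index:

```python
import faiss
import numpy as np

# toy stand-in for the passage embeddings
embeddings = np.random.rand(10_000, 768).astype("float32")

# product-quantized index: 64 sub-quantizers x 8 bits = 64 bytes per vector
# (L2 metric here for simplicity; the retriever actually scores by inner product)
index = faiss.index_factory(768, "PQ64")
index.train(embeddings)
index.add(embeddings)

faiss.write_index(index, "compressed.index")  # far smaller than raw fp32 vectors
index = faiss.read_index("compressed.index")
```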
When the code runs, the maximum passage length becomes the smaller of two variables, self.opt.text_maxlength and gpu_embedder_batch_size. By default, gpu_embedder_batch_size is set to 512, so if you run the code without modifying the default options, most BERT-style dual encoders will work without issues (see line 74).
However, if you reduce gpu_embedder_batch_size to conserve GPU memory, unexpected results can occur without warning.
(Lines 61 to 89 in f8bec5c.)
So, it is recommended to modify line 74 as follows (as done in other parts of the code):
min(self.opt.text_maxlength, BERT_MAX_SEQ_LENGTH),
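To illustrate the difference (a minimal sketch; BERT_MAX_SEQ_LENGTH would be a new constant, 512 for typical BERT-style encoders):

```python
BERT_MAX_SEQ_LENGTH = 512      # hard sequence-length limit of typical BERT-style encoders

text_maxlength = 512           # --text_maxlength
gpu_embedder_batch_size = 128  # reduced to conserve GPU memory

# current behaviour (roughly): passages are silently truncated to 128 tokens
print(min(text_maxlength, gpu_embedder_batch_size))  # 128

# suggested behaviour: cap by the encoder's maximum sequence length instead
print(min(text_maxlength, BERT_MAX_SEQ_LENGTH))      # 512
```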
Hi and thanks for sharing your great work!
I am trying to run some experiments using MSMarco-passage (v1) as a corpus, a standard ranking corpus consisting of 8.8M passages.
However, when the index is built with Contriever, the result is an enormous 1.5 TB on disk (the embeddings.pt files occupy most of the space). I think this is unreasonable and I am trying to understand what the reason behind it is.
As far as I understand, Contriever is just a dense retriever with better training, so its corpus embeddings should be similar in size to, e.g., DPR's.
AFAIR, DPR indexes of MSMarco-passage usually range between 30-100GB. In fact, indexing the same collection with ColBERT (a model that stores an embedding of size 128 for every token in the collection) only takes 880GB!
Even after doing the math, the index should be way smaller than what is actually created:
768 (dimensions) x 8.8M (passages) x 4 (bytes, corresponding to fp32 precision; although I think the retriever actually uses fp16) = ~27,000 MB = 27 GB
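A quick sanity check of that arithmetic (fp32 assumed; halve it for fp16):

```python
dim = 768               # Contriever embedding dimension
n_passages = 8_800_000  # MSMarco-passage v1
bytes_per_value = 4     # fp32 (2 if embeddings were stored as fp16)

size_gb = dim * n_passages * bytes_per_value / 1e9
print(f"{size_gb:.1f} GB")  # ~27.0 GB, nowhere near 1.5 TB
```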
I tried changing the arguments --index_mode {faiss|flat} --faiss_index_type {ivfpq, pq, flat} --faiss_code_size {None, 16, 192}, but none of them resulted in a smaller index size (testing on a sample of 100K docs results in 19GB of embedding files).
Interestingly, PQ (which I assume stands for product quantization) does not seem to have any effect on the size of the embedding files.
Am I missing something here? Any help would be greatly appreciated!
Hi,
Firstly, thank you for sharing the Atlas model in a very well-documented repo.
I'm currently working on an evaluation of Atlas on ParaRel (a dataset that tests for model consistency, with a corresponding paper). However, to properly evaluate the potential benefits of retrieval augmentation, I need to compare against a closed-book Atlas.
From what I can see, you have not shared your closed-book Atlas model weights. So I was wondering if there is a possibility of releasing these model weights as well? I could train my own attempt at a closed-book Atlas model, but it wouldn't be the same, since you haven't released your 350M training passages from Common Crawl.
Grateful for any help!
/Lovisa
The preprocessing/download_index.py and preprocessing/download_model.py files download data from https://dl.fbaipublicfiles.com/atlas/, which is currently giving a permission denied error.
Are the pretrained models for ATLAS available anywhere? Or were they withdrawn after the paper was published?
Hey! When setting up the large model quickly, with 1/10th of corpora/wiki/enwiki-dec2018 (and otherwise default settings), the quality of the outputs is very low:
question: who got the first nobel prize in physics answer: <extra_id_0> <pad><extra_id_0> mr. </s>
question: when is the next deadpool movie being released answer: <extra_id_0> <pad><extra_id_0> november 2020</s>
question: which mode is used for short wave broadcast service answer: <extra_id_0> <pad><extra_id_0> sms</s>
question: the south west wind blows across nigeria between answer: <extra_id_0> <pad><extra_id_0> a. </s>
question: what does hp mean in war and order answer: <extra_id_0> <pad><extra_id_0> hp </s>
In the first example, the answer is far too short ("mr. "), and manual inspection shows that the retrieved Wikipedia articles did include Nobel prize winners in physics. Any idea what I'm doing wrong?
How can I run this code on a normal personal machine? We don't have Slurm configured; can we run it by simply modifying the run script?
I'm trying to fine-tune ATLAS on HotpotQA with 4 80GB A100 GPUs. I'm wondering which script I should use to train the model.
Also, am I correct in understanding that if I set the refresh interval as instructed in the README, I will basically get the same performance as in the paper?
Hello, I'm using 2 GPUs on a single node to train Atlas.
However, even if I set local_rank to 0, training doesn't start.
It still requires MASTER_ADDR, MASTER_PORT, etc.
Is there any additional information I should be aware of?
Hi, for the KILT task, you mention in the README that "Train/validation/test data for this task should consist of jsonl files, which should be passed to train.py as --train_data train_file_1.jsonl train_file_2.jsonl, and --eval_data eval_file_1.jsonl eval_file_2.jsonl etc." and that "Atlas will automatically process these instances appropriately, into [Atlas] query inputs based on the input field and target generations based on the answer fields". Yet I didn't find dedicated KILT data preprocessing code in the project, and passing the jsonl files to train.py is not straightforward enough to follow.
Also, as you mentioned in #13, there won't be checkpoints provided for Atlas fine-tuned on KILT.
Could you please offer an example of a complete training script for fine-tuning Atlas on KILT? Thanks in advance.
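For what it's worth, this is my current guess at what a training jsonl line should look like, based only on the README's mention of a query input field and answer fields (the exact field names here are my assumption, not confirmed):

```python
import json

# hypothetical KILT-style training instance; field names are a guess
example = {
    "question": "which country is Paris the capital of?",  # query input
    "answers": ["France"],                                  # target generations
}

with open("train_file_1.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```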
Will there be any plans to release a smaller version of ATLAS?
Although 11B is relatively small when compared to the LLMs in the paper, it's still pretty large for ML practitioners with limited resources.
Thanks! James.
Hi,
In the blog and paper, it is mentioned that with faiss-pq and code size 64, the index needs as little as 2GB.
I keep getting CUDA out of memory on a 12GB GPU while trying to run finetune_qa with faiss-pq code size 64 and models/atlas_nq/base.
What is the minimum GPU memory requirement for running the Atlas model during QA fine-tuning and at inference time?
While following the instructions here, I saw that I need to have data/nq_data in my DATA_DIR, but I don't see it after following these steps:
# download the NQ data
python preprocessing/prepare_qa.py --output_directory ${DATA_DIR}
# download the Wikipedia 2018 corpus
python preprocessing/download_corpus.py --corpus corpora/wiki/enwiki-dec2018 --output_directory ${DATA_DIR}
# downloads pretrained Atlas-large
python preprocessing/download_model.py --model models/atlas/${SIZE} --output_directory ${DATA_DIR}
I only see the following files/folders:
Can you please help me figure out how I can download nq_data? Thanks!
Hello,
I was trying to pre-train the ATLAS model (base & large sizes) by running the provided example script atlas/example_scripts/mlm/train.sh with 4 40GB A100 GPUs, but then I got this error:
Traceback (most recent call last):
File "/home/work/atlas/atlas/train.py", line 223, in <module>
train(
File "/home/work/atlas/atlas/train.py", line 77, in train
reader_loss, retriever_loss = model(
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/work/atlas/atlas/src/atlas.py", line 432, in forward
passages, _ = self.retrieve(
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/work/atlas/atlas/src/atlas.py", line 181, in retrieve
passages, scores = retrieve_func(*args, **kwargs)[:2]
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/work/atlas/atlas/src/atlas.py", line 170, in retrieve_with_rerank
retriever_scores = torch.einsum("id, ijd->ij", [query_emb, passage_emb])
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/functional.py", line 328, in einsum
return einsum(equation, *_operands)
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/functional.py", line 330, in einsum
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
RuntimeError: einsum(): operands do not broadcast with remapped shapes [original->remapped]: [2, 768]->[2, 1, 768] [2, 100, 1536]->[2, 100, 1536]
srun: error: localhost: task 0: Exited with exit code 1
srun: error: localhost: task 3: Exited with exit code 1
srun: error: localhost: task 1: Exited with exit code 1
srun: error: localhost: task 2: Exited with exit code 1
I used the provided passages (Wikipedia Dec 2018 dump) and ran the script without any changes to the training arguments.
So the batch size per device was 2 and 100 documents were retrieved by the retriever, which matches the [2, 768]->[2, 1, 768] [2, 100, 1536]->[2, 100, 1536] shapes in the error message above.
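For context, the mismatch can be reproduced in isolation like this (shapes taken from the error message; the passage embeddings appear to be twice the width of the query embeddings):

```python
import torch

# shapes from the error message: per-device batch size 2,
# 100 reranked passages, query dim 768 vs passage dim 1536
query_emb = torch.randn(2, 768)
passage_emb = torch.randn(2, 100, 1536)

# fails: the "d" dimension must be the same size in both operands (768 != 1536)
scores = torch.einsum("id, ijd->ij", query_emb, passage_emb)
```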
In addition, I found that the script and the overall pre-training process worked well after removing the following line from the script, i.e., re-indexing all passages instead of re-ranking, although this resulted in lower few-shot performance compared to the scores reported in Table 19 of the paper. (However, I think the performance issue might be unrelated to removing this line.)
--retrieve_with_rerank --n_to_rerank_with_retrieve_with_rerank 100 \
Could you provide any hints to solve this issue? Thank you in advance!