facebookresearch / atlas
Code repository for supporting the paper "Atlas: Few-shot Learning with Retrieval Augmented Language Models" (https://arxiv.org/abs/2208.03299)
License: Other
Hello, thank you for sharing the code with very clear explanations and example scripts.
I could reproduce the evaluation results of the provided NQ-finetuned ATLAS-large model checkpoint using the atlas/example_scripts/nq/evaluate.sh script.
However, when I tried to reproduce the NQ 64-shot fine-tuning experiment with the provided ATLAS-large model checkpoint (and corresponding indices) using the example script atlas/example_scripts/nq/train_fewshot.sh, the code failed with the following error message:
Traceback (most recent call last):
File "train.py", line 196, in <module>
model, optimizer, scheduler, retr_optimizer, retr_scheduler, opt, step = load_or_initialize_atlas_model(opt)
File "/home/work/atlas/atlas/src/model_io.py", line 193, in load_or_initialize_atlas_model
model, optimizer, scheduler, retr_optimizer, retr_scheduler, opt_checkpoint, loaded_step = load_atlas_model(
File "/home/work/atlas/atlas/src/model_io.py", line 153, in load_atlas_model
optimizer, scheduler, retr_optimizer, retr_scheduler = set_optim(opt, model)
File "/home/work/atlas/atlas/src/util.py", line 168, in set_optim
from src.AdamWFP32Copy import AdamWFP32Copy
File "/home/work/atlas/atlas/src/AdamWFP32Copy.py", line 11, in <module>
adamw = _adamw.F.adamw
AttributeError: module 'torch.optim.adamw' has no attribute 'F'
It seems that this error occurs when the code tries to load the model from model_path, which calls the set_optim function to set up the optimizer, where AdamWFP32Copy.py is imported.
I couldn't find any mention of an attribute 'F' in the original PyTorch documentation for adamw; however, I guess the intention of line 11, adamw = _adamw.F.adamw, is to call the adamw function in the original PyTorch implementation at line 160.
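As a possible workaround, I was thinking of something like the following (untested; it assumes newer PyTorch versions expose the functional adamw directly on the torch.optim.adamw module, while older versions keep it under the _functional alias F):

```python
# possible replacement for line 11 of src/AdamWFP32Copy.py (untested sketch)
import torch.optim.adamw as _adamw

# old PyTorch: torch.optim.adamw imports torch.optim._functional as F
# newer PyTorch (>= 1.12): the functional adamw lives directly on the module
adamw = getattr(_adamw, "F", _adamw).adamw
```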
Could you provide any hints to solve this issue?
To help reproduce this error, here is some info about my working environment:
- PyTorch version: 1.12.0 (I had the same error with 1.13.0)
- Hardware: 4 A100 GPUs (in a single node)
- CUDA version: 11.3
- NVIDIA driver version: 465.19.01
Thank you in advance!
Hi,
We are trying to reproduce the generated answers of fully and 64-shot fine-tuned ATLAS on different datasets (including HotpotQA, as in Table 10 of the paper). However, we could not find checkpoints of ATLAS fine-tuned on these datasets in the repository. Is there any way we can get access to them?
Hi,
I am trying to train the models on multiple nodes of a Slurm cluster. However, I get "RuntimeError: ProcessGroupNCCL does not support gather" at line 95 of dist_utils.py.
Do you have any suggestions for handling this issue, e.g., using all_gather to replace gather here? Or is ProcessGroupNCCL not expected here? I am quite confused while debugging this, and thanks for any help!
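In case it helps, this is the kind of replacement I had in mind (just a sketch; the function and variable names are mine, not from dist_utils.py):

```python
import torch
import torch.distributed as dist

def gather_to_list(tensor: torch.Tensor) -> list[torch.Tensor]:
    """Emulate dist.gather with all_gather, which the NCCL backend does support.
    Every rank ends up with the full list; ranks other than the intended
    destination can simply ignore the result."""
    output = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(output, tensor)
    return output
```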
Hi,
I got a 755GB index saved on disk after encoding the whole set of Wikipedia passages. The large index takes a huge amount of storage and a long time to load to the GPU. However, it requires less than 100GB after loading to the GPU, which could be the index compression mentioned in your paper. Is it possible to save and load the compressed index for better time and storage consumption?
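Even something along these lines would already help; this is just a sketch using plain faiss product quantization, not necessarily how the codebase actually stores its index:

```python
import faiss
import numpy as np

# toy stand-in for the passage embeddings
embeddings = np.random.rand(10_000, 768).astype("float32")

# product-quantized index: 64 sub-quantizers x 8 bits = 64 bytes per vector
# (L2 metric here for simplicity; the retriever actually scores by inner product)
index = faiss.index_factory(768, "PQ64")
index.train(embeddings)
index.add(embeddings)

faiss.write_index(index, "compressed.index")  # far smaller than raw fp32 vectors
index = faiss.read_index("compressed.index")
```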
When the code runs, the maximum passage length becomes the smaller of two variables, self.opt.text_maxlength and gpu_embedder_batch_size. By default, gpu_embedder_batch_size is set to 512, so if you run the code without modifying the default options, most BERT-style dual encoders will work without issues (see line 74).
However, if you reduce gpu_embedder_batch_size to conserve GPU memory, unexpected results can occur without warning.
(Lines 61 to 89 in f8bec5c.)
So, it is recommended to modify line 74 as follows (as done in other parts of the code):
min(self.opt.text_maxlength, BERT_MAX_SEQ_LENGTH),
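To illustrate the difference (a minimal sketch; BERT_MAX_SEQ_LENGTH would be a new constant, 512 for typical BERT-style encoders):

```python
BERT_MAX_SEQ_LENGTH = 512      # hard sequence-length limit of typical BERT-style encoders

text_maxlength = 512           # --text_maxlength
gpu_embedder_batch_size = 128  # reduced to conserve GPU memory

# current behaviour (roughly): passages are silently truncated to 128 tokens
print(min(text_maxlength, gpu_embedder_batch_size))  # 128

# suggested behaviour: cap by the encoder's maximum sequence length instead
print(min(text_maxlength, BERT_MAX_SEQ_LENGTH))      # 512
```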
Hi and thanks for sharing your great work!
I am trying to run some experiments using MSMarco-passage (v1) as a corpus, a standard ranking corpus consisting of 8.8M passages.
However, when the index is built with Contriever, the result is an enormous 1.5 TB on disk (the embeddings.pt files occupy most of the space). I think this is unreasonable and I am trying to understand what the reason behind it is.
As far as I understand, Contriever is just a dense retriever with better training, so its corpus embeddings should be similar in size to, e.g., DPR's.
AFAIR, DPR indexes of MSMarco-passage usually range between 30-100GB. In fact, indexing the same collection with ColBERT (a model that stores an embedding of size 128 for every token in the collection) only takes 880GB!
Even after doing the math, the index should be way smaller than what is actually created:
768 (dimensions) x 8.8M (passages) x 4 (bytes, corresponding to fp32 precision; although I think the retriever actually uses fp16) = ~27,000 MB = 27 GB
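A quick sanity check of that arithmetic (fp32 assumed; halve it for fp16):

```python
dim = 768               # Contriever embedding dimension
n_passages = 8_800_000  # MSMarco-passage v1
bytes_per_value = 4     # fp32 (2 if embeddings were stored as fp16)

size_gb = dim * n_passages * bytes_per_value / 1e9
print(f"{size_gb:.1f} GB")  # ~27.0 GB, nowhere near 1.5 TB
```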
I tried changing the arguments --index_mode {faiss|flat} --faiss_index_type {ivfpq, pq, flat} --faiss_code_size {None, 16, 192}, but none of them resulted in a smaller index size (testing on a sample of 100K docs results in 19GB of embedding files).
Interestingly, PQ (which I assume stands for product quantization) does not seem to have any effect on the size of the embedding files.
Am I missing something here? Any help would be greatly appreciated!
Hi,
Firstly, thank you for sharing the Atlas model in a very well-documented repo.
I'm currently working on an evaluation of Atlas on ParaRel (a dataset that tests for model consistency, with a corresponding paper). However, to properly evaluate the potential benefits of retrieval augmentation, I need to compare against a closed-book Atlas.
From what I can see, you have not shared your closed-book Atlas model weights. So I was wondering if there is a possibility of releasing these model weights as well? I could train my own attempt at a closed-book Atlas model, but it wouldn't be the same, since you haven't released your 350M training passages from Common Crawl.
Grateful for any help!
/Lovisa
The preprocessing/download_index.py and preprocessing/download_model.py files download data from https://dl.fbaipublicfiles.com/atlas/, which is currently giving a permission denied error.
Are the pretrained models for ATLAS available anywhere? Or were they withdrawn after the paper was published?
Hey! When setting up the large model quickly, with 1/10th of corpora/wiki/enwiki-dec2018 (and otherwise default settings), the quality of the outputs is very low:
question: who got the first nobel prize in physics answer: <extra_id_0> <pad><extra_id_0> mr. </s>
question: when is the next deadpool movie being released answer: <extra_id_0> <pad><extra_id_0> november 2020</s>
question: which mode is used for short wave broadcast service answer: <extra_id_0> <pad><extra_id_0> sms</s>
question: the south west wind blows across nigeria between answer: <extra_id_0> <pad><extra_id_0> a. </s>
question: what does hp mean in war and order answer: <extra_id_0> <pad><extra_id_0> hp </s>
In the first example, the answer is far too short ("mr. "), and manual inspection shows that the retrieved Wikipedia articles did include Nobel prize winners in physics. Any idea what I'm doing wrong?
How can I run this code on a normal personal machine? We don't have Slurm configured; can we run it by simply modifying the run script?
I'm trying to fine-tune ATLAS on HotpotQA with 4 80GB A100 GPUs. I'm wondering which script I should use to train the model.
Also, am I correct in understanding that if I set the refresh interval as instructed in the README, I will basically get the same performance as in the paper?
Hello, I'm using 2 GPUs on a single node to train Atlas.
However, even if I set local_rank to 0, training doesn't start.
It still requires MASTER_ADDR, MASTER_PORT, etc.
Is there any additional information I should be aware of?
Hi, for the KILT task, you mention in the README that "Train/validation/test data for this task should consist of jsonl files, which should be passed to train.py as --train_data train_file_1.jsonl train_file_2.jsonl, and --eval_data eval_file_1.jsonl eval_file_2.jsonl etc." and that "Atlas will automatically process these instances appropriately, into [Atlas] query inputs based on the input field and target generations based on the answer fields". Yet I didn't find dedicated KILT data preprocessing code in the project, and passing the jsonl files to train.py is not straightforward enough to follow.
Also, as you mentioned in #13, there won't be checkpoints provided for Atlas fine-tuned on KILT.
Could you please offer an example of a complete training script for fine-tuning Atlas on KILT? Thanks in advance.
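For what it's worth, this is my current guess at what a training jsonl line should look like, based only on the README's mention of a query input field and answer fields (the exact field names here are my assumption, not confirmed):

```python
import json

# hypothetical KILT-style training instance; field names are a guess
example = {
    "question": "which country is Paris the capital of?",  # query input
    "answers": ["France"],                                  # target generations
}

with open("train_file_1.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```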
Will there be any plans to release a smaller version of ATLAS?
Although 11B is relatively small when compared to the LLMs in the paper, it's still pretty large for ML practitioners with limited resources.
Thanks! James.
Hi,
In the blog and paper, it is mentioned that with faiss-pq and code size 64, the index needs as little as 2GB.
I keep getting CUDA out of memory on a 12GB GPU while trying to run finetune_qa with faiss-pq code size 64 and models/atlas_nq/base.
What is the minimum GPU memory requirement for running the Atlas model during QA fine-tuning and at inference time?
While following the instructions here, I saw that I need to have data/nq_data in my DATA_DIR, but I don't see it after following these steps:
# download the NQ data
python preprocessing/prepare_qa.py --output_directory ${DATA_DIR}
# download the Wikipedia 2018 corpus
python preprocessing/download_corpus.py --corpus corpora/wiki/enwiki-dec2018 --output_directory ${DATA_DIR}
# downloads pretrained Atlas-large
python preprocessing/download_model.py --model models/atlas/${SIZE} --output_directory ${DATA_DIR}
I only see the following files/folders:
Can you please help me figure out how I can download nq_data? Thanks!
Hello,
I was trying to pre-train the ATLAS model (base & large sizes) by running the provided example script atlas/example_scripts/mlm/train.sh with 4 40GB A100 GPUs, but then I got this error:
Traceback (most recent call last):
File "/home/work/atlas/atlas/train.py", line 223, in <module>
train(
File "/home/work/atlas/atlas/train.py", line 77, in train
reader_loss, retriever_loss = model(
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/work/atlas/atlas/src/atlas.py", line 432, in forward
passages, _ = self.retrieve(
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/work/atlas/atlas/src/atlas.py", line 181, in retrieve
passages, scores = retrieve_func(*args, **kwargs)[:2]
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/work/atlas/atlas/src/atlas.py", line 170, in retrieve_with_rerank
retriever_scores = torch.einsum("id, ijd->ij", [query_emb, passage_emb])
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/functional.py", line 328, in einsum
return einsum(equation, *_operands)
File "/home/work/.conda/envs/atlas/lib/python3.10/site-packages/torch/functional.py", line 330, in einsum
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
RuntimeError: einsum(): operands do not broadcast with remapped shapes [original->remapped]: [2, 768]->[2, 1, 768] [2, 100, 1536]->[2, 100, 1536]
srun: error: localhost: task 0: Exited with exit code 1
srun: error: localhost: task 3: Exited with exit code 1
srun: error: localhost: task 1: Exited with exit code 1
srun: error: localhost: task 2: Exited with exit code 1
I used the provided passages (Wikipedia Dec 2018 dump) and ran the script without any changes to the training arguments.
So the batch size per device was 2 and 100 documents were retrieved by the retriever, which matches the [2, 768]->[2, 1, 768] [2, 100, 1536]->[2, 100, 1536] shapes in the error message above.
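For context, the mismatch can be reproduced in isolation like this (shapes taken from the error message; the passage embeddings appear to be twice the width of the query embeddings):

```python
import torch

# shapes from the error message: per-device batch size 2,
# 100 reranked passages, query dim 768 vs passage dim 1536
query_emb = torch.randn(2, 768)
passage_emb = torch.randn(2, 100, 1536)

# fails: the "d" dimension must be the same size in both operands (768 != 1536)
scores = torch.einsum("id, ijd->ij", query_emb, passage_emb)
```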
In addition, I found that the script and the overall pre-training process worked well after removing the following line from the script, i.e., re-indexing all passages instead of re-ranking, although this resulted in lower few-shot performance compared to the scores reported in Table 19 of the paper. (However, I think the performance issue might be unrelated to removing this line.)
--retrieve_with_rerank --n_to_rerank_with_retrieve_with_rerank 100 \
Could you provide any hints to solve this issue? Thank you in advance!