luyug / COIL
NAACL2021 - COIL Contextualized Lexical Retriever
License: Apache License 2.0
Hi :)
I am looking for the data needed to run the training script for the uniCOIL model. The links under the "Resource" section have expired or do not work.
Where can I find the data elsewhere?
I noticed that C-COIL is at the top of the "MS MARCO Passage Ranking Leaderboard" with results of "0.427 on eval" and "0.443 on dev", but the result reported in https://github.com/luyug/COIL/tree/main/examples/c-coil is "0.3734 on MARCO DEV".
Why is the gap so large? Is it because of ensembling? I didn't find any relevant explanation in the COIL paper.
Some of the encoding output directories are empty because of the error below, while the others are fine and contain the cls and token files.
Traceback (most recent call last):
File "run_marco.py", line 303, in <module>
main()
File "run_marco.py", line 217, in main
data_args.encode_in_path, tokenizer, p_max_len=data_args.p_max_len
File "/nfs/users/wangyile/coil/marco_datasets.py", line 126, in __init__
data_files=path_to_json,
File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/load.py", line 589, in load_dataset
path, script_version=script_version, download_config=download_config, download_mode=download_mode, dataset=True
File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/load.py", line 267, in prepare_module
local_path = cached_path(file_path, download_config=download_config)
File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/utils/file_utils.py", line 308, in cached_path
use_etag=download_config.use_etag,
File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/utils/file_utils.py", line 487, in get_from_cache
raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/json/json.py
With do_encode, the passages are padded to the maximum passage length.
The reps will then include the representation for the padding token, which is then added to the index.
The padding token in the index affects the score, because any query containing padding will now match that token.
Should the padding token 0 be removed from the index after sharding or during the sharding process?
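To make the question concrete, here is a minimal sketch of dropping padding positions before the token representations reach the index. It is not taken from the COIL code: the function name drop_padding and the array layout (one token id and one vector per position) are assumptions for illustration only.

```python
import numpy as np

PAD_TOKEN_ID = 0  # the padding token id mentioned above (assumed to be [PAD])


def drop_padding(token_ids, token_reps, pad_id=PAD_TOKEN_ID):
    """Keep only the (token id, vector) pairs whose token id is not padding.

    Hypothetical helper: token_ids has shape (n,), token_reps has shape (n, dim).
    """
    token_ids = np.asarray(token_ids)
    token_reps = np.asarray(token_reps)
    keep = token_ids != pad_id  # boolean mask over positions
    return token_ids[keep], token_reps[keep]


# Toy passage: three real tokens followed by two padding positions.
ids = np.array([101, 2023, 2003, 0, 0])
reps = np.arange(10, dtype=np.float32).reshape(5, 2)
kept_ids, kept_reps = drop_padding(ids, reps)
```

Filtering at this stage would mean the padding id never gets a posting list at all, so nothing has to be removed after sharding.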
ColBERT removes punctuation from documents because the authors consider it useless. Did you remove punctuation when computing the overlapping tokens between query and document?
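For illustration, this is roughly what excluding punctuation from the exact-match overlap could look like. It is only a sketch of the idea being asked about, not what COIL or ColBERT actually implement; the function overlapping_tokens is hypothetical and assumes each punctuation mark is its own token.

```python
import string

PUNCTUATION = set(string.punctuation)


def overlapping_tokens(query_tokens, doc_tokens, skip_punctuation=True):
    """Return document tokens that also occur in the query, in document order.

    Hypothetical helper: optionally ignores single-character punctuation tokens.
    """
    query_vocab = set(query_tokens)
    matches = []
    for tok in doc_tokens:
        if tok not in query_vocab:
            continue
        if skip_punctuation and tok in PUNCTUATION:
            continue
        matches.append(tok)
    return matches


query = ["what", "is", "coil", "?"]
doc = ["coil", "is", "a", "retriever", "?"]
without_punct = overlapping_tokens(query, doc)
with_punct = overlapping_tokens(query, doc, skip_punctuation=False)
```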
When I followed the usage instructions to train COIL, the following error occurred:
Traceback (most recent call last):
File "run_marco.py", line 301, in <module>
main()
File "run_marco.py", line 81, in main
parser = HfArgumentParser((ModelArguments, DataArguments, COILTrainingArguments))
File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 52, in __init__
self._add_dataclass_arguments(dtype)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 93, in _add_dataclass_arguments
elif hasattr(field.type, "__origin__") and issubclass(field.type.__origin__, List):
File "/root/miniconda3/lib/python3.8/typing.py", line 774, in __subclasscheck__
return issubclass(cls, self.__origin__)
TypeError: issubclass() arg 1 must be a class
I set up the environment as described in the README and don't know how to fix this error.
Your work is excellent!
I'm looking forward to your source code.
Is there any guide to building a search API with COIL?
Hello! In the paper, you report a dense retriever that you trained yourselves in Tables 1 and 2 ("Dense (our train)"). Is the code to reproduce this result in this repo? If so, do you have any pointers on how to train and evaluate one?
Thanks!
I ran these commands in Google Colab with a GPU:
!wget http://boston.lti.cs.cmu.edu/luyug/coil/msmarco-psg/psg-train.tar.gz
!tar xfz psg-train.tar.gz
!git clone https://github.com/luyug/COIL
!pip install transformers datasets
! cd COIL && python run_marco.py --output_dir model --model_name_or_path bert-base-uncased --do_train --save_steps 4000 --train_dir ../psg-train --q_max_len 16 --p_max_len 128 --fp16 --per_device_train_batch_size 8 --train_group_size 8 --cls_dim 768 --token_dim 32 --warmup_ratio 0.1 --learning_rate 5e-6 --num_train_epochs 5 --overwrite_output_dir --dataloader_num_workers 16 --no_sep --pooling max
This is the output I got:
fatal: destination path 'COIL' already exists and is not an empty directory.
Requirement already satisfied: transformers in /usr/local/lib/python3.7/dist-packages (4.10.2)
Requirement already satisfied: datasets in /usr/local/lib/python3.7/dist-packages (1.11.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers) (3.0.12)
Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.10.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from transformers) (21.0)
Requirement already satisfied: huggingface-hub>=0.0.12 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.0.16)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers) (4.62.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (1.19.5)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.7/dist-packages (from transformers) (0.0.45)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (2019.12.20)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (5.4.1)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers) (4.6.4)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from huggingface-hub>=0.0.12->transformers) (3.7.4.3)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->transformers) (2.4.7)
Requirement already satisfied: fsspec>=2021.05.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (2021.8.1)
Requirement already satisfied: pyarrow!=4.0.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (3.0.0)
Requirement already satisfied: multiprocess in /usr/local/lib/python3.7/dist-packages (from datasets) (0.70.12.2)
Requirement already satisfied: xxhash in /usr/local/lib/python3.7/dist-packages (from datasets) (2.0.2)
Requirement already satisfied: dill in /usr/local/lib/python3.7/dist-packages (from datasets) (0.3.4)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from datasets) (1.1.5)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2021.5.30)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers) (3.5.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (1.0.1)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (7.1.2)
09/12/2021 02:05:20 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True
09/12/2021 02:05:20 - INFO - __main__ - Training/evaluation parameters COILTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=16,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_encode=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=model/runs/Sep12_02-05-20_2992d74c8c9d,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=5.0,
output_dir=model,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=model,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=model,
save_on_each_node=False,
save_steps=4000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
)
09/12/2021 02:05:20 - INFO - __main__ - Model params ModelArguments(model_name_or_path='bert-base-uncased', config_name=None, tokenizer_name=None, cache_dir=None, token_dim=32, cls_dim=768, token_rep_relu=False, token_norm_after=False, cls_norm_after=False, x_device_negatives=False, pooling='max', no_sep=True, no_cls=False, cls_only=False)
09/12/2021 02:05:20 - INFO - filelock - Lock 140242889433168 acquired on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
Downloading: 100% 570/570 [00:00<00:00, 476kB/s]
09/12/2021 02:05:21 - INFO - filelock - Lock 140242889433168 released on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
09/12/2021 02:05:21 - INFO - filelock - Lock 140242889394384 acquired on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
Downloading: 100% 28.0/28.0 [00:00<00:00, 33.2kB/s]
09/12/2021 02:05:21 - INFO - filelock - Lock 140242889394384 released on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
09/12/2021 02:05:21 - INFO - filelock - Lock 140242889467600 acquired on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
Downloading: 100% 232k/232k [00:00<00:00, 1.10MB/s]
09/12/2021 02:05:22 - INFO - filelock - Lock 140242889467600 released on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
09/12/2021 02:05:22 - INFO - filelock - Lock 140242856865744 acquired on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
Downloading: 100% 466k/466k [00:00<00:00, 3.41MB/s]
09/12/2021 02:05:23 - INFO - filelock - Lock 140242856865744 released on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
09/12/2021 02:05:23 - INFO - filelock - Lock 140242835342416 acquired on /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock
Downloading: 100% 440M/440M [00:12<00:00, 35.5MB/s]
09/12/2021 02:05:36 - INFO - filelock - Lock 140242835342416 released on /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
09/12/2021 02:05:38 - WARNING - datasets.builder - Using custom data configuration default-ac64881b8f58639a
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-ac64881b8f58639a/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264...
Traceback (most recent call last):
File "run_marco.py", line 302, in <module>
main()
File "run_marco.py", line 146, in main
data_args, data_args.train_path, tokenizer=tokenizer,
File "/content/COIL/marco_datasets.py", line 37, in __init__
'passage': [datasets.Value('int32')],
File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 852, in load_dataset
use_auth_token=use_auth_token,
File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 616, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 693, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1163, in _prepare_split
generator, unit=" tables", leave=False, disable=bool(logging.get_verbosity() == logging.NOTSET)
File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1185, in __iter__
for obj in iterable:
File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/json/json.py", line 144, in _generate_tables
yield (file_idx, batch_idx), self._cast_classlabels(pa_table)
File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/json/json.py", line 76, in _cast_classlabels
[pa_table[name] for name in self.config.features], schema=self.config.schema
File "pyarrow/table.pxi", line 1515, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 553, in pyarrow.lib._sanitize_arrays
File "pyarrow/array.pxi", line 328, in pyarrow.lib.asarray
File "pyarrow/table.pxi", line 277, in pyarrow.lib.ChunkedArray.cast
File "/usr/local/lib/python3.7/dist-packages/pyarrow/compute.py", line 281, in cast
return call_function("cast", [arr], options)
File "pyarrow/_compute.pyx", line 465, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 294, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<qid: string, query: list<item: int64>> to struct using function cast_struct
Are there any pre-trained models available for the uniCOIL model described in https://github.com/luyug/COIL/tree/main/uniCOIL?
CC: @MXueguang
I get the following error while running run_msmarco.py:
Target schema's field names are not matching the table's field names: ['qry', 'pos', 'neg'], ['neg', 'pos', 'qry']
I have downloaded psg-train and extracted it. The path_to_tsv variable in GroupedMarcoTrainDataset contains all the .json files.
Hi, thank you for sharing this code.
I tested the latency of COIL using retriever-fast.py with one thread and one shard, and a batch size of one. The CPU is an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. However, the query latency is roughly 4 seconds, which is substantially larger than the 0.38s reported in the paper. Why does this happen? Does the paper use multiple threads to evaluate latency?
Running the command under the training section of the README, the program fails in the first optimization step with the following message:
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [535,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Which is thrown from the following:
...
File "/task_runtime/src/transformers-4.2.1/src/transformers/models/bert/modeling_bert.py", line 956, in forward
past_key_values_length=past_key_values_length,
...
File "/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 2043, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
In other words, the base model (bert-base-cased) encounters an input with a larger sequence length than what it can handle (535 > 512).
Given the above, how do you get around it, and apply your method to entire documents? (i.e., the MS MARCO Document Ranking table)
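One common workaround for inputs longer than the model limit is to split the token sequence into overlapping windows and encode each window separately. The sketch below shows only the windowing step. It is not claimed to be what the COIL authors did for the document ranking task, and the function name, max_len and stride values are assumptions (in practice max_len would also need to leave room for special tokens).

```python
def split_into_windows(token_ids, max_len=512, stride=256):
    """Split a token id list into overlapping windows of at most max_len tokens.

    Hypothetical helper: consecutive windows start `stride` tokens apart.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return windows


# A 535-token input no longer exceeds the limit once it is windowed.
tokens = list(range(535))
windows = split_into_windows(tokens, max_len=512, stride=256)
```

How the per-window scores are combined afterwards (for example, taking the maximum over windows) is a separate design choice that this sketch leaves open.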
Just wondering if you could add somewhere a description of your MS MARCO submission "C-COIL + RoBERTa" from 14/07/2021. Which modifications to COIL have you made and what was the motivation?
How is corpus-d2q prepared? With what p_max_len was castorini/unicoil-d2q-msmarco-passage trained? Can I set p_max_len to 512 and encode with it?
Awesome idea and exciting experimental results.
Still, I am confused about the implementation of COIL-full: when doing dense retrieval, can we use ANN search with FAISS to speed it up, or is it indeed brute-force search? What implementation was used in the paper's experiments?
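For reference, this is what the brute-force option means in practice: score every document vector against the query by inner product and keep the top k. The sketch uses plain NumPy and toy data. It does not reflect which method the paper's experiments used, and the function name brute_force_search is hypothetical.

```python
import numpy as np


def brute_force_search(query_vec, doc_matrix, k=10):
    """Exact top-k inner-product search over all document vectors.

    Hypothetical helper: doc_matrix has shape (num_docs, dim), query_vec (dim,).
    """
    scores = doc_matrix @ query_vec            # one score per document
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k candidates
    top = top[np.argsort(-scores[top])]        # order them by descending score
    return top, scores[top]


docs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]])
query = np.array([1.0, 0.0])
top_ids, top_scores = brute_force_search(query, docs, k=2)
```

An ANN index trades some of this exactness for speed, which is why the distinction matters when comparing latency numbers.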
Thank you for sharing the code. COIL achieves very impressive retrieval performance. How can I use a GPU for retrieval?
Just wanted to ping you to let you know that the hosted TSV files (at least the dev ones) for COIL aren't in the correct format for evaluation.
E.g.:
1048585 7187158 35.926036089658744
1048585 7187160 35.790479123592384
1048585 7187155 35.65535098314285
1048585 7187157 34.09628629684448
1048585 7617404 33.498324900865555
1048585 3856131 31.57883720099926
1048585 7617413 31.314840689301487
1048585 7187156 31.123393774032593
1048585 7617411 30.926150113344196
1048585 353739 30.901350378990173
I would expect:
1048585 7187158 1
1048585 7187160 2
1048585 7187155 3
1048585 7187157 4
1048585 7617404 5
1048585 3856131 6
1048585 7617413 7
1048585 7187156 8
1048585 7617411 9
1048585 353739 10
Clearly it's no real problem as it's easy to fix locally, but I'm not sure if this was intended or not.
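A minimal sketch of the local fix, assuming each line holds a query id, a passage id and a score separated by whitespace, as in the sample above. The function name scores_to_ranks is hypothetical and the output is tab-separated.

```python
from collections import defaultdict


def scores_to_ranks(lines):
    """Convert 'qid pid score' lines into 'qid<TAB>pid<TAB>rank' lines.

    Hypothetical helper: ranks start at 1 and follow descending score per query.
    """
    by_query = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        qid, pid, score = line.split()
        by_query[qid].append((float(score), pid))

    ranked = []
    for qid, scored in by_query.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for rank, (_, pid) in enumerate(scored, start=1):
            ranked.append(f"{qid}\t{pid}\t{rank}")
    return ranked


sample = [
    "1048585 7187160 35.790479123592384",
    "1048585 7187158 35.926036089658744",
    "1048585 7187155 35.65535098314285",
]
ranked = scores_to_ranks(sample)
```

Sorting by score rather than trusting the file order keeps the ranks correct even if the lines are not already sorted.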