luyug / coil


NAACL2021 - COIL Contextualized Lexical Retriever

License: Apache License 2.0

Python 100.00%

retrieval

coil's Issues

training data for UniCoil - links not working

Hi :)

I am looking for the data needed to run the training script for the uniCOIL model. The links under the "Resource" section have expired or do not work.

Where else can I find the data?

Question about the result on the "MS MARCO Passage Leaderboard"

I notice that C-Coil is at the top of the "MS MARCO Passage Ranking Leaderboard", the results are "0.427 on eval" and "0.443 on dev". But the result in https://github.com/luyug/COIL/tree/main/examples/c-coil is "0.3734 on MARCO DEV".

I wonder why the gap is so big. Is it because of an ensemble? I didn't find any relevant explanation in the COIL paper.

Dataset error when encoding document

Some of the encoding output directories are empty because of the error below, while the others are normal and contain the cls and token files.

Traceback (most recent call last):
File "run_marco.py", line 303, in <module>
main()
File "run_marco.py", line 217, in main
data_args.encode_in_path, tokenizer, p_max_len=data_args.p_max_len
File "/nfs/users/wangyile/coil/marco_datasets.py", line 126, in __init__
data_files=path_to_json,
File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/load.py", line 589, in load_dataset
path, script_version=script_version, download_config=download_config, download_mode=download_mode, dataset=True
File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/load.py", line 267, in prepare_module
local_path = cached_path(file_path, download_config=download_config)
File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/utils/file_utils.py", line 308, in cached_path
use_etag=download_config.use_etag,
File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/utils/file_utils.py", line 487, in get_from_cache
raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/json/json.py

Small typo and bug?

Hi @luyug, Great work!!!!

I am trying to replicate COIL and noticed a possible typo.

In the Encoding section of README.md:

    --token_dim 768 \  
    --cls_dim 32 \

Should these be --token_dim 32 and --cls_dim 768 instead?

COIL/trainer.py

Line 42 in 813a076

super().create_optimizer_and_scheduler(self.args.warmup_steps)

Shouldn't num_training_steps be passed to this function rather than warmup_steps?
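
To make the concern concrete, here is a minimal sketch of a linear schedule with warmup, written in plain Python. It assumes the scheduler is the linear type (the training log in another issue on this page reports lr_scheduler_type=SchedulerType.LINEAR); the function name and the step counts are hypothetical and are not taken from the COIL code.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier for a linear schedule with warmup (illustrative)."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Decay linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


warmup_steps = 100   # hypothetical value
total_steps = 1000   # hypothetical value

# Intended call: the schedule decays over the full training run.
intended = linear_warmup_multiplier(500, warmup_steps, total_steps)

# Suspected call: warmup_steps is passed where num_training_steps is expected,
# so the multiplier is already clamped to zero once warmup ends.
suspected = linear_warmup_multiplier(500, warmup_steps, warmup_steps)

print(intended, suspected)
```

Under these assumptions the second call yields a zero multiplier halfway through training, which is the kind of symptom the question above is asking about.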

Padding Tokens - in the inverted list index

do_encode pads the passages to the maximum passage length.
The reps therefore include representations for the padding token, which are then added to the index.
The padding token in the index affects the score, because any query that contains padding will now match that token.

Should the padding token (id 0) be removed from the index after sharding, or during the sharding process?
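
One way to picture the proposal is a filter applied while the per-token lists are built. This is only a sketch under stated assumptions: the helper name, the data layout, and the example ids are hypothetical, and it is not the repository's sharding code. It assumes a BERT-style vocabulary in which the padding token has id 0.

```python
from collections import defaultdict

PAD_TOKEN_ID = 0  # assumption: BERT-style vocabulary where [PAD] has id 0


def build_inverted_lists(doc_id, token_ids, token_reps, index=None):
    """Append (doc_id, vector) pairs to per-token lists, skipping padding."""
    if index is None:
        index = defaultdict(list)
    for tok, rep in zip(token_ids, token_reps):
        if tok == PAD_TOKEN_ID:
            continue  # padded positions never reach the index
        index[tok].append((doc_id, rep))
    return index


# A passage padded to length 6; the last two positions are padding.
token_ids = [101, 2054, 2003, 102, 0, 0]
token_reps = [[0.1], [0.2], [0.3], [0.4], [0.9], [0.9]]
index = build_inverted_lists("doc-1", token_ids, token_reps)
print(sorted(index))
```

Filtering at build time keeps the padding entries out of every shard, so no separate cleanup pass over the index would be needed.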

Did you remove punctuation before computing the document score?

ColBERT removes punctuation from documents because its authors consider it useless. I wonder whether you removed punctuation when computing the overlapping tokens between query and document?

Training error

When I followed the usage instructions to train COIL, an error occurred:
Traceback (most recent call last):
  File "run_marco.py", line 301, in <module>
    main()
  File "run_marco.py", line 81, in main
    parser = HfArgumentParser((ModelArguments, DataArguments, COILTrainingArguments))
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 52, in __init__
    self._add_dataclass_arguments(dtype)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 93, in _add_dataclass_arguments
    elif hasattr(field.type, "__origin__") and issubclass(field.type.__origin__, List):
  File "/root/miniconda3/lib/python3.8/typing.py", line 774, in __subclasscheck__
    return issubclass(cls, self.__origin__)
TypeError: issubclass() arg 1 must be a class
I followed the environment setup in the README and don't know how to fix this error.

When will you update the code?

Your work is excellent!

I'm looking forward to your source code.

How do I load a model saved by the uniCOIL training script with pyserini's UnicoilDocumentEncoder?

Guide to do search with COIL/uniCOIL

Is there any guide to building a search API with COIL?

Reproducing dense retriever results

Hello! In the paper, you report a dense retriever that you trained yourselves in Tables 1 and 2 ("Dense (our train)"). Is the code to reproduce this result in this repo? And if so, do you have any pointers on how to train and evaluate one?

Thanks!

pyarrow.lib.ArrowNotImplementedError during training phase

I ran these commands in Google Colab with a GPU:

!wget http://boston.lti.cs.cmu.edu/luyug/coil/msmarco-psg/psg-train.tar.gz
!tar xfz psg-train.tar.gz
!git clone https://github.com/luyug/COIL
!pip install transformers datasets

! cd COIL && python run_marco.py --output_dir model --model_name_or_path bert-base-uncased --do_train --save_steps 4000 --train_dir ../psg-train --q_max_len 16 --p_max_len 128 --fp16 --per_device_train_batch_size 8 --train_group_size 8 --cls_dim 768 --token_dim 32 --warmup_ratio 0.1 --learning_rate 5e-6 --num_train_epochs 5 --overwrite_output_dir --dataloader_num_workers 16 --no_sep --pooling max

This is the output I got:

fatal: destination path 'COIL' already exists and is not an empty directory.
Requirement already satisfied: transformers in /usr/local/lib/python3.7/dist-packages (4.10.2)
Requirement already satisfied: datasets in /usr/local/lib/python3.7/dist-packages (1.11.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers) (3.0.12)
Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.10.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from transformers) (21.0)
Requirement already satisfied: huggingface-hub>=0.0.12 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.0.16)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers) (4.62.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (1.19.5)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.7/dist-packages (from transformers) (0.0.45)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (2019.12.20)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (5.4.1)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers) (4.6.4)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from huggingface-hub>=0.0.12->transformers) (3.7.4.3)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->transformers) (2.4.7)
Requirement already satisfied: fsspec>=2021.05.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (2021.8.1)
Requirement already satisfied: pyarrow!=4.0.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (3.0.0)
Requirement already satisfied: multiprocess in /usr/local/lib/python3.7/dist-packages (from datasets) (0.70.12.2)
Requirement already satisfied: xxhash in /usr/local/lib/python3.7/dist-packages (from datasets) (2.0.2)
Requirement already satisfied: dill in /usr/local/lib/python3.7/dist-packages (from datasets) (0.3.4)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from datasets) (1.1.5)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2021.5.30)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers) (3.5.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (1.0.1)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (7.1.2)
09/12/2021 02:05:20 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True
09/12/2021 02:05:20 - INFO - __main__ -   Training/evaluation parameters COILTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=16,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_encode=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=model/runs/Sep12_02-05-20_2992d74c8c9d,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=5.0,
output_dir=model,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=model,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=model,
save_on_each_node=False,
save_steps=4000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
)
09/12/2021 02:05:20 - INFO - __main__ -   Model params ModelArguments(model_name_or_path='bert-base-uncased', config_name=None, tokenizer_name=None, cache_dir=None, token_dim=32, cls_dim=768, token_rep_relu=False, token_norm_after=False, cls_norm_after=False, x_device_negatives=False, pooling='max', no_sep=True, no_cls=False, cls_only=False)
09/12/2021 02:05:20 - INFO - filelock -   Lock 140242889433168 acquired on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
Downloading: 100% 570/570 [00:00<00:00, 476kB/s]
09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889433168 released on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889394384 acquired on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
Downloading: 100% 28.0/28.0 [00:00<00:00, 33.2kB/s]
09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889394384 released on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889467600 acquired on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
Downloading: 100% 232k/232k [00:00<00:00, 1.10MB/s]
09/12/2021 02:05:22 - INFO - filelock -   Lock 140242889467600 released on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
09/12/2021 02:05:22 - INFO - filelock -   Lock 140242856865744 acquired on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
Downloading: 100% 466k/466k [00:00<00:00, 3.41MB/s]
09/12/2021 02:05:23 - INFO - filelock -   Lock 140242856865744 released on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
09/12/2021 02:05:23 - INFO - filelock -   Lock 140242835342416 acquired on /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock
Downloading: 100% 440M/440M [00:12<00:00, 35.5MB/s]
09/12/2021 02:05:36 - INFO - filelock -   Lock 140242835342416 released on /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
09/12/2021 02:05:38 - WARNING - datasets.builder -   Using custom data configuration default-ac64881b8f58639a
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-ac64881b8f58639a/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264...
Traceback (most recent call last):
  File "run_marco.py", line 302, in <module>
    main()
  File "run_marco.py", line 146, in main
    data_args, data_args.train_path, tokenizer=tokenizer,
  File "/content/COIL/marco_datasets.py", line 37, in __init__
    'passage': [datasets.Value('int32')],
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 852, in load_dataset
    use_auth_token=use_auth_token,
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 616, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 693, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1163, in _prepare_split
    generator, unit=" tables", leave=False, disable=bool(logging.get_verbosity() == logging.NOTSET)
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1185, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/json/json.py", line 144, in _generate_tables
    yield (file_idx, batch_idx), self._cast_classlabels(pa_table)
  File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/json/json.py", line 76, in _cast_classlabels
    [pa_table[name] for name in self.config.features], schema=self.config.schema
  File "pyarrow/table.pxi", line 1515, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 553, in pyarrow.lib._sanitize_arrays
  File "pyarrow/array.pxi", line 328, in pyarrow.lib.asarray
  File "pyarrow/table.pxi", line 277, in pyarrow.lib.ChunkedArray.cast
  File "/usr/local/lib/python3.7/dist-packages/pyarrow/compute.py", line 281, in cast
    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 465, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 294, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<qid: string, query: list<item: int64>> to struct using function cast_struct

Pre-trained models for uniCOIL

Are there any pre-trained models available for the uniCOIL model described in https://github.com/luyug/COIL/tree/main/uniCOIL?

CC: @MXueguang

Error with loading dataset

This is the error I get while running run_msmarco.py:

Target schema's field names are not matching the table's field names: ['qry', 'pos', 'neg'], ['neg', 'pos', 'qry']

I have downloaded psg-train and extracted it. The path_to_tsv variable in GroupedMarcoTrainDataset contains all the .json files.
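
The message says the two field lists contain the same names in a different order. The sketch below only illustrates that mismatch: it rewrites one JSON record so its keys follow the order the target schema reports. The helper name and the sample record are hypothetical, and whether reordering the files actually resolves the error in a given datasets version is an assumption, not something confirmed in this thread.

```python
import json

# Field order reported as the target schema in the error message.
SCHEMA_ORDER = ["qry", "pos", "neg"]


def reorder_record(record, field_order=SCHEMA_ORDER):
    """Return a copy of the record with keys in the schema's order."""
    return {name: record[name] for name in field_order}


# Hypothetical training line whose keys are stored alphabetically.
line = '{"neg": [], "pos": [], "qry": {"qid": "1", "query": [2054]}}'
fixed = json.dumps(reorder_record(json.loads(line)))
print(fixed)
```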

Retrieval latency is very large with one thread

Hi, thank you for sharing this code.
I tested the latency of COIL using retriever-fast.py with one thread and one shard, and the batch size set to one. The CPU is an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. However, the query latency is roughly 4 seconds, which is substantially larger than the 0.38s reported in the paper. I wonder why this happens. Does the paper use multiple threads to evaluate the latency?

Default training command - Issues when encountering documents longer than 512

Running the command under the training section of the README, the program fails in the first optimization step with the following message:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [535,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Which is thrown from the following:

  ...                                                                                                                                        
  File "/task_runtime/src/transformers-4.2.1/src/transformers/models/bert/modeling_bert.py", line 956, in forward                                                                           
    past_key_values_length=past_key_values_length,   
  ...
  File "/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 2043, in embedding                                                                                                 
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)                                                                                            
       RuntimeError: CUDA error: device-side assert triggered

In other words, the base model (bert-base-cased) encounters an input with a larger sequence length than what it can handle (535 > 512).

Given the above, how do you get around it, and apply your method to entire documents? (i.e., the MS MARCO Document Ranking table)
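
For readers facing the same limit, one common workaround is to split a long token sequence into overlapping windows that each fit the model. The sketch below shows that generic technique only; it is not confirmed to be what the COIL authors did for the document ranking results. The function name and the window and stride values are hypothetical, and in practice max_len should leave room for any special tokens the tokenizer adds.

```python
def split_into_windows(token_ids, max_len=512, stride=256):
    """Split a token id list into overlapping windows of at most max_len."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return windows


# A hypothetical document of 535 tokens, slightly over the 512 limit.
doc = list(range(535))
windows = split_into_windows(doc)
print([len(w) for w in windows])
```

Each window can then be encoded separately; how the per-window scores are combined into a document score is a separate design choice that this sketch does not address.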

Describe C-COIL approach

Just wondering if you could add somewhere a description of your MS MARCO submission "C-COIL + RoBERTa" from 14/07/2021. Which modifications to COIL have you made and what was the motivation?

How is document expansion helpful if p_max_len=192 in unicoil training and encoding command? Most MSMARCO passages are over 192 tokens

How is corpus-d2q prepared? With what p_max_len was castorini/unicoil-d2q-msmarco-passage trained? Can I set p_max_len to 512 and encode with it?

1048585	7187158	35.926036089658744
1048585	7187160	35.790479123592384
1048585	7187155	35.65535098314285
1048585	7187157	34.09628629684448
1048585	7617404	33.498324900865555
1048585	3856131	31.57883720099926
1048585	7617413	31.314840689301487
1048585	7187156	31.123393774032593
1048585	7617411	30.926150113344196
1048585	353739	30.901350378990173

I would expect:

1048585	7187158	1
1048585	7187160	2
1048585	7187155	3
1048585	7187157	4
1048585	7617404	5
1048585	3856131	6
1048585	7617413	7
1048585	7187156	8
1048585	7617411	9
1048585	353739	10

Clearly it's no real problem as it's easy to fix locally, but I'm not sure if this was intended or not.
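
The local fix mentioned above could look like the following sketch, which converts lines of the form "qid, pid, score" into "qid, pid, rank". The function name is hypothetical, the input is assumed to be whitespace- or tab-separated as in the listing above, and the scores in the example are shortened for readability.

```python
from collections import defaultdict


def scores_to_ranks(lines):
    """Convert 'qid pid score' lines into tab-separated 'qid pid rank' lines."""
    by_query = defaultdict(list)
    for line in lines:
        qid, pid, score = line.split()
        by_query[qid].append((float(score), pid))
    ranked = []
    for qid, scored in by_query.items():
        # Highest score first; ranks start at 1.
        scored.sort(key=lambda item: item[0], reverse=True)
        for rank, (_, pid) in enumerate(scored, start=1):
            ranked.append(f"{qid}\t{pid}\t{rank}")
    return ranked


ranked = scores_to_ranks([
    "1048585\t7187160\t35.79",
    "1048585\t7187158\t35.93",
    "1048585\t7187155\t35.66",
])
print("\n".join(ranked))
```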
