
coil's People

Contributors

luyug, mxueguang


coil's Issues

training data for UniCoil - links not working

Hi :)

I am looking for the data needed to run the training script for the uniCoil model. The links under the "Resource" section are expired or no longer work.

Where can I find the data elsewhere?

Dataset error when encoding document

I find that some of the encoding output dirs are empty because the following error occurs, while the others are normal and filled with the cls and token files.

Traceback (most recent call last):
  File "run_marco.py", line 303, in <module>
    main()
  File "run_marco.py", line 217, in main
    data_args.encode_in_path, tokenizer, p_max_len=data_args.p_max_len
  File "/nfs/users/wangyile/coil/marco_datasets.py", line 126, in __init__
    data_files=path_to_json,
  File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/load.py", line 589, in load_dataset
    path, script_version=script_version, download_config=download_config, download_mode=download_mode, dataset=True
  File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/load.py", line 267, in prepare_module
    local_path = cached_path(file_path, download_config=download_config)
  File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/utils/file_utils.py", line 308, in cached_path
    use_etag=download_config.use_etag,
  File "/nfs/users/wangyile/anaconda3/envs/coil/lib/python3.7/site-packages/datasets/utils/file_utils.py", line 487, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/json/json.py
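For context, the failure is datasets 1.1.3 fetching its JSON loader script from GitHub at load time, which requires network access. A minimal sketch, assuming a newer datasets release where the JSON builder ships inside the package so no network call is needed (the data_files path is illustrative):

    from datasets import load_dataset

    # Newer `datasets` releases bundle the json builder locally, avoiding
    # the raw.githubusercontent.com fetch that fails above.
    ds = load_dataset('json', data_files='corpus/split00.json')
    print(ds)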

Small typo and bug?

Hi @luyug, Great work!!!!

I am trying to replicate COIL, and there are a couple of things I noticed.

  1. In the README.md Encoding section:

       --token_dim 768 \
       --cls_dim 32 \

     Should these be 32 and 768 instead?

  2. super().create_optimizer_and_scheduler(self.args.warmup_steps)

     Shouldn't num_training_steps be passed into the function rather than warmup_steps? (See the sketch after this list.)
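If the second point is right, a minimal sketch of the suggested fix, assuming the Hugging Face Trainer signature create_optimizer_and_scheduler(self, num_training_steps: int) and a hypothetical subclass name:

    from transformers import Trainer

    class COILTrainer(Trainer):  # illustrative subclass, not the repo's class
        def create_optimizer_and_scheduler(self, num_training_steps: int):
            # Forward the training-step count, matching the parent's
            # signature, instead of self.args.warmup_steps.
            super().create_optimizer_and_scheduler(num_training_steps)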

Padding Tokens - in the inverted list index

do_encode will pad the passages to the maximum passage length, so reps will include representations for the padding tokens, which then get added to the index. A padding token in the index affects the score, because any query that contains padding will now match that token.

Should the padding token 0 be removed from the index after sharding or during the sharding process?
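A minimal sketch of the filtering being asked about, assuming reps is a [batch, seq_len, dim] tensor and token_ids the matching [batch, seq_len] id tensor (names are illustrative, not the repo's):

    import torch

    def strip_padding(token_ids: torch.Tensor, reps: torch.Tensor):
        # 0 is BERT's [PAD] id; keep only real-token entries so padding
        # representations never enter the inverted lists.
        mask = token_ids != 0
        return token_ids[mask], reps[mask]  # [n_tokens], [n_tokens, dim]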

Training error

When I followed the usage instructions to train COIL, an error occurred:

Traceback (most recent call last):
  File "run_marco.py", line 301, in <module>
    main()
  File "run_marco.py", line 81, in main
    parser = HfArgumentParser((ModelArguments, DataArguments, COILTrainingArguments))
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 52, in __init__
    self._add_dataclass_arguments(dtype)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 93, in _add_dataclass_arguments
    elif hasattr(field.type, "__origin__") and issubclass(field.type.__origin__, List):
  File "/root/miniconda3/lib/python3.8/typing.py", line 774, in __subclasscheck__
    return issubclass(cls, self.__origin__)
TypeError: issubclass() arg 1 must be a class

I followed the environment in the README and don't know how to fix this error.
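For what it's worth, the failing check can be reproduced outside the parser: HfArgumentParser probes each field's typing metadata, and issubclass raises exactly this error when handed typing.Union, which is a special form rather than a class (a minimal reproduction sketch, not a confirmed diagnosis of the version mismatch):

    import typing

    try:
        # What the typing internals end up calling for Optional[...] fields.
        issubclass(typing.Union, list)
    except TypeError as e:
        print(e)  # issubclass() arg 1 must be a class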

Reproducing dense retriever results

Hello! In the paper, you report a dense retriever that you trained in Tables 1 and 2 ("Dense (our train)"). Is the code to reproduce this result included in this repo? And if so, do you have any pointers on how to train/evaluate it?

Thanks!

pyarrow.lib.ArrowNotImplementedError during training phase

I ran these commands in Google Colab with GPU

!wget http://boston.lti.cs.cmu.edu/luyug/coil/msmarco-psg/psg-train.tar.gz
!tar xfz psg-train.tar.gz
!git clone https://github.com/luyug/COIL
!pip install transformers datasets

! cd COIL && python run_marco.py --output_dir model --model_name_or_path bert-base-uncased --do_train --save_steps 4000 --train_dir ../psg-train --q_max_len 16 --p_max_len 128 --fp16 --per_device_train_batch_size 8 --train_group_size 8 --cls_dim 768 --token_dim 32 --warmup_ratio 0.1 --learning_rate 5e-6 --num_train_epochs 5 --overwrite_output_dir --dataloader_num_workers 16 --no_sep --pooling max 

This is the output I got:

fatal: destination path 'COIL' already exists and is not an empty directory.
Requirement already satisfied: transformers in /usr/local/lib/python3.7/dist-packages (4.10.2)
Requirement already satisfied: datasets in /usr/local/lib/python3.7/dist-packages (1.11.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers) (3.0.12)
Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.10.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from transformers) (21.0)
Requirement already satisfied: huggingface-hub>=0.0.12 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.0.16)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers) (4.62.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (1.19.5)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.7/dist-packages (from transformers) (0.0.45)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (2019.12.20)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (5.4.1)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers) (4.6.4)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from huggingface-hub>=0.0.12->transformers) (3.7.4.3)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->transformers) (2.4.7)
Requirement already satisfied: fsspec>=2021.05.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (2021.8.1)
Requirement already satisfied: pyarrow!=4.0.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (3.0.0)
Requirement already satisfied: multiprocess in /usr/local/lib/python3.7/dist-packages (from datasets) (0.70.12.2)
Requirement already satisfied: xxhash in /usr/local/lib/python3.7/dist-packages (from datasets) (2.0.2)
Requirement already satisfied: dill in /usr/local/lib/python3.7/dist-packages (from datasets) (0.3.4)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from datasets) (1.1.5)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2021.5.30)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers) (3.5.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (1.0.1)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (7.1.2)
09/12/2021 02:05:20 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True
09/12/2021 02:05:20 - INFO - __main__ -   Training/evaluation parameters COILTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=16,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_encode=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=model/runs/Sep12_02-05-20_2992d74c8c9d,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=5.0,
output_dir=model,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=model,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=model,
save_on_each_node=False,
save_steps=4000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
)
09/12/2021 02:05:20 - INFO - __main__ -   Model params ModelArguments(model_name_or_path='bert-base-uncased', config_name=None, tokenizer_name=None, cache_dir=None, token_dim=32, cls_dim=768, token_rep_relu=False, token_norm_after=False, cls_norm_after=False, x_device_negatives=False, pooling='max', no_sep=True, no_cls=False, cls_only=False)
09/12/2021 02:05:20 - INFO - filelock -   Lock 140242889433168 acquired on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
Downloading: 100% 570/570 [00:00<00:00, 476kB/s]
09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889433168 released on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889394384 acquired on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
Downloading: 100% 28.0/28.0 [00:00<00:00, 33.2kB/s]
09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889394384 released on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889467600 acquired on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
Downloading: 100% 232k/232k [00:00<00:00, 1.10MB/s]
09/12/2021 02:05:22 - INFO - filelock -   Lock 140242889467600 released on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
09/12/2021 02:05:22 - INFO - filelock -   Lock 140242856865744 acquired on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
Downloading: 100% 466k/466k [00:00<00:00, 3.41MB/s]
09/12/2021 02:05:23 - INFO - filelock -   Lock 140242856865744 released on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
09/12/2021 02:05:23 - INFO - filelock -   Lock 140242835342416 acquired on /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock
Downloading: 100% 440M/440M [00:12<00:00, 35.5MB/s]
09/12/2021 02:05:36 - INFO - filelock -   Lock 140242835342416 released on /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
09/12/2021 02:05:38 - WARNING - datasets.builder -   Using custom data configuration default-ac64881b8f58639a
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-ac64881b8f58639a/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264...
Traceback (most recent call last):
  File "run_marco.py", line 302, in <module>
    main()
  File "run_marco.py", line 146, in main
    data_args, data_args.train_path, tokenizer=tokenizer,
  File "/content/COIL/marco_datasets.py", line 37, in __init__
    'passage': [datasets.Value('int32')],
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 852, in load_dataset
    use_auth_token=use_auth_token,
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 616, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 693, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1163, in _prepare_split
    generator, unit=" tables", leave=False, disable=bool(logging.get_verbosity() == logging.NOTSET)
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1185, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/json/json.py", line 144, in _generate_tables
    yield (file_idx, batch_idx), self._cast_classlabels(pa_table)
  File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/json/json.py", line 76, in _cast_classlabels
    [pa_table[name] for name in self.config.features], schema=self.config.schema
  File "pyarrow/table.pxi", line 1515, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 553, in pyarrow.lib._sanitize_arrays
  File "pyarrow/array.pxi", line 328, in pyarrow.lib.asarray
  File "pyarrow/table.pxi", line 277, in pyarrow.lib.ChunkedArray.cast
  File "/usr/local/lib/python3.7/dist-packages/pyarrow/compute.py", line 281, in cast
    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 465, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 294, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<qid: string, query: list<item: int64>> to struct using function cast_struct
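A small diagnostic sketch for this kind of cast failure: inspect the schema pyarrow infers for one training file and compare it with the features marco_datasets.py declares (the file path is illustrative):

    from pyarrow import json as pa_json

    table = pa_json.read_json('psg-train/split00.json')
    print(table.schema)  # e.g. qry: struct<qid: string, query: list<int64>>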

Error with loading dataset

The error I get while running run_marco.py:

Target schema's field names are not matching the table's field names: ['qry', 'pos', 'neg'], ['neg', 'pos', 'qry']

I have downloaded psg-train and extracted it. The path_to_tsv variable in GroupedMarcoTrainDataset has all the .json files.
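One hedged local workaround, assuming the declared order really is ['qry', 'pos', 'neg'] as the error message suggests: rewrite each JSON line with its keys in that order before loading (paths are illustrative):

    import glob
    import json

    for path in glob.glob('psg-train/*.json'):
        with open(path) as f:
            rows = [json.loads(line) for line in f]
        with open(path, 'w') as f:
            for row in rows:
                # Emit keys in the order the target schema expects.
                f.write(json.dumps({k: row[k] for k in ('qry', 'pos', 'neg')}) + '\n')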

Retrieval latency is very large with one thread

Hi, thank you for sharing this code.
I tested the latency of COIL using retriever-fast.py with one thread and one shard, with the batch size set to one. The CPU is an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. However, the query latency is roughly 4 seconds, substantially larger than the 0.38 s reported in the paper. I wonder why this happens. Does the paper use multiple threads to evaluate latency?

Default training command - Issues when encountering documents longer than 512

Running the command under the training section of the README, the program fails in the first optimization step with the following message:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [535,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Which is thrown from the following:

  ...
  File "/task_runtime/src/transformers-4.2.1/src/transformers/models/bert/modeling_bert.py", line 956, in forward
    past_key_values_length=past_key_values_length,
  ...
  File "/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 2043, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

In other words, the base model (bert-base-cased) encounters an input with a sequence length larger than it can handle (535 > 512).

Given the above, how do you get around it, and apply your method to entire documents? (i.e., the MS MARCO Document Ranking table)
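A minimal sketch of the usual guard, assuming tokenization with a Hugging Face tokenizer: truncate to BERT's 512-position limit (or pre-split documents into passages and aggregate passage scores) so no input exceeds the model's maximum:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained('bert-base-uncased')
    text = 'a very long document ' * 1000
    enc = tok(text, truncation=True, max_length=512)  # hard cap at 512 positions
    print(len(enc['input_ids']))  # <= 512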

Describe C-COIL approach

Just wondering if you could add a description somewhere of your MS MARCO submission "C-COIL + RoBERTa" from 14/07/2021. Which modifications did you make to COIL, and what was the motivation?

Question about COIL-full

Awesome idea and exciting experimental results.
Still, I am confused about the implementation of COIL-full: when doing dense retrieval, can we speed it up with ANN search using FAISS, or is it really brute-force search? Which implementation was used in the paper's experiments?
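For reference, the ANN option raised here would look roughly like this for the dense [CLS] vectors (a sketch with placeholder vectors; whether the paper used this or brute force is exactly the open question):

    import numpy as np
    import faiss

    d = 768
    cls_reps = np.random.rand(10000, d).astype('float32')  # placeholder corpus vectors

    index = faiss.IndexFlatIP(d)  # exact inner-product (brute force); an IVF
                                  # index would trade exactness for speed
    index.add(cls_reps)
    scores, ids = index.search(cls_reps[:1], 10)  # top-10 for one query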

How to use GPU to retrieve?

Thank you for sharing the codes. COIL achieves very impressive retrieval performance. I wonder how to use GPU for retrieval.
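A generic sketch of the idea (not the repo's API): keep the representation matrices as CUDA tensors so the dot-product scoring runs on the GPU:

    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    p_reps = torch.randn(100000, 768, device=device)  # placeholder passage reps
    q_rep = torch.randn(1, 768, device=device)        # one query rep

    scores = q_rep @ p_reps.T                # [1, 100000], computed on the GPU
    top_scores, top_ids = torch.topk(scores, k=10)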

TSVs aren't eval compliant

Just wanted to ping you to let you know that the hosted TSV files (at least the dev ones) for COIL aren't in the correct format for evaluation.

E.g.:

1048585	7187158	35.926036089658744
1048585	7187160	35.790479123592384
1048585	7187155	35.65535098314285
1048585	7187157	34.09628629684448
1048585	7617404	33.498324900865555
1048585	3856131	31.57883720099926
1048585	7617413	31.314840689301487
1048585	7187156	31.123393774032593
1048585	7617411	30.926150113344196
1048585	353739	30.901350378990173

I would expect:

1048585	7187158	1
1048585	7187160	2
1048585	7187155	3
1048585	7187157	4
1048585	7617404	5
1048585	3856131	6
1048585	7617413	7
1048585	7187156	8
1048585	7617411	9
1048585	353739	10

Clearly it's no real problem as it's easy to fix locally, but I'm not sure if this was intended or not.
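The easy local fix mentioned above, as a sketch (file names illustrative; assumes each query's rows are already sorted by descending score):

    from collections import defaultdict

    seen = defaultdict(int)
    with open('dev.rank.tsv') as fin, open('dev.rank.eval.tsv', 'w') as fout:
        for line in fin:
            qid, pid, _score = line.rstrip('\n').split('\t')
            seen[qid] += 1  # 1-based rank within this query
            fout.write(f'{qid}\t{pid}\t{seen[qid]}\n')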
