
PARADE


This repository contains the code for our paper

  • PARADE: Passage Representation Aggregation for Document Reranking PDF

If you're interested in running PARADE for the TREC-COVID challenge (submitted under the tag mpiid5 from Round 2), please check out the covid branch.

If you find this paper/code useful, please cite:

@article{li2020parade,
  title={PARADE: Passage Representation Aggregation for Document Reranking},
  author={Li, Canjia and Yates, Andrew and MacAvaney, Sean and He, Ben and Sun, Yingfei},
  journal={arXiv preprint arXiv:2008.09093},
  year={2020}
}

Introduction

PARADE (PAssage Representation Aggregation for Document rE-ranking) is an end-to-end document reranking model built on pre-trained language models.

We support the following PARADE variants:

  • PARADE-Avg (named cls_avg in the code)
  • PARADE-Attn (named cls_attn in the code)
  • PARADE-Max (named cls_max in the code)
  • PARADE (named cls_transformer in the code)
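
As a rough numerical illustration (a NumPy stand-in, not the repository's TensorFlow code; the weight vector w below is a random placeholder for a learned parameter), the variants differ in how the per-passage [CLS] vectors are pooled:

```python
import numpy as np

rng = np.random.default_rng(0)
cls_vectors = rng.normal(size=(16, 768))  # 16 passages, hidden size 768

# PARADE-Avg (cls_avg): element-wise mean over passages
avg_repr = cls_vectors.mean(axis=0)

# PARADE-Max (cls_max): element-wise max over passages
max_repr = cls_vectors.max(axis=0)

# PARADE-Attn (cls_attn): softmax-weighted sum using a scoring vector w
w = rng.normal(size=768)            # placeholder for a learned weight
scores = cls_vectors @ w
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                # attention weights over passages
attn_repr = alpha @ cls_vectors

# PARADE (cls_transformer): a small transformer encoder over the sequence
# of [CLS] vectors plus an aggregation token; omitted here for brevity.
```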

We support two instantiations of pre-trained models:

  • BERT
  • ELECTRA

Getting Started

Running PARADE involves two steps. Below is a detailed example of running the code on the Robust04 dataset using title queries.

1. Data Preparation

To run 5-fold cross-validation, data for all 5 folds are required. The standard qrels, query, and trec_run files can be generated with Anserini; please check out their notebook for further details. You then need to split the documents into passages and write them into TFRecord files. The corpus file can also be extracted by Anserini to form docno \t content pairs. Then run

scripts/run.convert.data.sh

You should see 5 sub-folders generated in the output_dir folder, each containing a train file and a test file. Note that if you're going to run the code on a TPU, you need to upload the training/testing data to Google Cloud Storage (GCS). Everything is now prepared!
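
The passage-splitting step can be sketched as follows; the window and stride sizes here are illustrative assumptions, not necessarily the defaults used in generate_data.py:

```python
def split_into_passages(tokens, passage_len=225, stride=200):
    """Slide a window of passage_len tokens over the document with the
    given stride; the final (possibly shorter) tail window is kept."""
    passages = []
    start = 0
    while True:
        passages.append(tokens[start:start + passage_len])
        if start + passage_len >= len(tokens):
            break
        start += stride
    return passages

doc_tokens = [f"tok{i}" for i in range(500)]  # a toy 500-token document
passages = split_into_passages(doc_tokens)    # 3 overlapping passages
```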

2. Model Training and Evaluation

For all pre-trained models, we first fine-tune them on the MSMARCO passage collection. This is IMPORTANT, as it generally improves nDCG@20 by about 2 points. For details on how to do this, please check out dl4marco-bert. If you want to skip this fine-tuning step, check out these fine-tuned models on the MSMARCO passage ranking dataset. The fine-tuned model serves as the initialization for PARADE; just pass it to the BERT_ckpt argument in the following snippet.

Now train the model:

scripts/run.reranking.sh

The model's performance is printed automatically. When evaluating title queries on the Robust04 collection, it outputs

P_20                    all     0.4604
ndcg_cut_20             all     0.5399

Useful Resources

  • Fine-tuned models on the MSMARCO passage ranking dataset:
Model         L / H      MRR on MSMARCO DEV   Path
ELECTRA-Base  12 / 768   0.3698               Download
BERT-Base     12 / 768   0.3637               Download
\             10 / 768   0.3622               Download
\             8 / 768    0.3560               Download
BERT-Medium   8 / 512    0.3520               Download
BERT-Small    4 / 512    0.3427               Download
BERT-Mini     4 / 256    0.3247               Download
\             2 / 512    0.3160               Download
BERT-Tiny     2 / 128    0.2600               Download

(Config files and the vocabulary file are available here.)

  • Our run files on the Robust04 and GOV2 collections: Robust04, GOV2.

FAQ

  • How to get the raw text?

If you have trouble getting the raw text from Anserini, you can replace the anserini/src/main/java/io/anserini/index/IndexUtils.java file with the extra/IndexUtils.java file in this repo, then re-build Anserini (version 0.7.0). Below is how we fetch the raw text:

anserini_path="path_to_anserini"
index_path="path_to_index"
# say you're given a BM25 run file run.BM25.txt
cut -d ' ' -f3 run.BM25.txt | sort | uniq > docnolist
${anserini_path}/target/appassembler/bin/IndexUtils -dumpTransformedDocBatch docnolist -index ${index_path}

This produces the required raw text in the directory containing docnolist. Alternatively, you can refer to the search_pyserini.py file in the covid branch and fetch the documents using Pyserini.

  • How to run a significance test?

To run a significance test, first configure the trec_eval path in evaluation.py. Then run the following command; here we compare PARADE with BERT-MaxP:

python evaluation.py \
  --qrels /data/anserini/src/main/resources/topics-and-qrels/qrels.robust04.txt \
  --baselines /data2/robust04/reruns/title/bertmaxp.dai/bertbase_onMSMARCO/merge \
  --runs /data2/robust04/reruns/title/num-segment-16/electra_base_onMSMARCO_cls_transformer/merge_epoch3

then it outputs

OrderedDict([('P_20', '0.4277'), ('ndcg_cut_20', '0.4931')])
OrderedDict([('P_20', '0.4604'), ('ndcg_cut_20', '0.5399')])
OrderedDict([('P_20', 1.2993259211924425e-11), ('ndcg_cut_20', 8.306604295574242e-09)])

The first two lines are sanity checks of the two runs' performance values. The last line shows the p-values. PARADE achieves a significant improvement over BERT-MaxP (p < 0.01)!
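
For reference, the kind of paired significance test reported above can be sketched with a Fisher randomization test over per-query scores. The score lists below are toy values, not the runs above; evaluation.py itself relies on trec_eval output:

```python
import random

def randomization_test(baseline, treatment, trials=10000, seed=0):
    """Two-sided paired (Fisher) randomization test: randomly swap the
    two systems' per-query scores and count how often the absolute mean
    difference is at least as large as the observed one."""
    rng = random.Random(seed)
    n = len(baseline)
    observed = abs(sum(t - b for b, t in zip(baseline, treatment))) / n
    hits = 0
    for _ in range(trials):
        total = 0.0
        for b, t in zip(baseline, treatment):
            d = t - b
            total += d if rng.random() < 0.5 else -d
        if abs(total) / n >= observed:
            hits += 1
    return hits / trials

# Toy per-query nDCG@20 values for a baseline and a treatment system:
base = [0.42, 0.31, 0.55, 0.28, 0.47, 0.39, 0.51, 0.33]
treat = [0.50, 0.38, 0.61, 0.35, 0.52, 0.45, 0.58, 0.40]
p = randomization_test(base, treat)  # small p: treatment wins every query
```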

  • How to run knowledge distillation for PARADE?

Please follow the fine-tuning steps first. Then run the following command:

scripts/run.kd.sh

It outputs the following results for PARADE using the BERT-Small model (4 layers):

P_20                    all     0.4365
ndcg_cut_20             all     0.5098

Acknowledgement

Some code snippets are borrowed from NPRF, Capreolus, dl4marco-bert, and SIGIR19-BERT-IR.

parade's People

Contributors: canjiali, dependabot[bot]

parade's Issues

BERT_ckpt

Hi,

Would you please tell me which of these should be assigned to the exported BERT_ckpt variable:
model.ckpt-400000.data-00000-of-00001
model.ckpt-400000.index
model.ckpt-400000.meta

Maybe it should be ./bert_models_onMSMARCO/vanilla_bert_base_on_MSMARCO/model.ckpt-400000

Am I right?

Thanks in advance
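
For reference (an editorial note, not the maintainers' reply): TensorFlow 1.x tools address a checkpoint by the shared prefix of its three files, so the prefix form is the one to export:

```shell
# A TF1 checkpoint is three files sharing one prefix:
#   model.ckpt-400000.data-00000-of-00001
#   model.ckpt-400000.index
#   model.ckpt-400000.meta
# Tools expect the shared prefix, not any single file:
export BERT_ckpt=./bert_models_onMSMARCO/vanilla_bert_base_on_MSMARCO/model.ckpt-400000
```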

I cannot get even similar results

I was barely able to run the BERT-Mini model.
I ran it on an Indri 5.17 result list of 1000 documents per query.
I may be missing something, but the idea of reranking is to rerank the same documents.
Indri's results for Robust04 (which are about the same as Anserini's) are much lower than yours, and I cannot see how reranking 100 out of 1000 documents can produce such results.
These are Indri's results for 1000 docs per query, as obtained with query likelihood (QL):
P_5 all 0.4627
P_10 all 0.4048
P_15 all 0.3687
P_20 all 0.3438
P_30 all 0.2941
P_100 all 0.1710
P_200 all 0.1146
P_500 all 0.0610
P_1000 all 0.0361

Now, when I followed the steps, I got far worse results with the BERT-Mini model.
I am trying to find what is wrong here, since I have seen your run files.
I am using Indri without stemming, and for the documents I extracted Indri's document vectors.
So I would expect results somewhat below the best numbers, but what I get is that reranking just makes things worse.

I am missing some details, such as: how many documents per query do you retrieve from the collection with Anserini?
Do you use all the documents in the qrels for training?
Do you feed the convert scripts the raw text as-is? For GOV2 it contains a lot of HTML tags, etc.

I would appreciate your advice.

ClueWeb12B

Hi @canjiali, just wanted to ask if you tried running PARADE on the TREC queries of ClueWeb12-B13? Thank you

Generate Data

Hi,

Thanks for the brilliant work.
I want to run your code, and I have several questions about the data generation part:

  1. Are run.robust04.title.bm25.txt and run.robust04.title.bm25.txt.docno.uniq_rawdocs.txt generated with generate_data.py, or should I generate them manually?

If I should generate them manually:

  2. In https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/anserini_robust04_demo.ipynb the file run.robust04.bm25.txt is generated, and I do not know whether it is equivalent to run.robust04.title.bm25.txt in run.convert.data.sh.

  3. run.robust04.bm25.txt contains 242339 lines; I think maybe you sort them and keep just the first 1000 rows per query, don't you?

  4. I do not know what run.robust04.title.bm25.txt.docno.uniq_rawdocs.txt is or how it is generated. I think it may be a text file where each line is (doc number + \t + doc text), based on the unique document ids in run.robust04.title.bm25.txt, isn't it? Also, you said the corpus file can be extracted by Anserini to form docno \t content pairs. Would you please explain more about this?

  5. When you say "you need to split the documents into passages, write them into TFRecord files", does generate_data.py do this, or should we do something manually?

Thanks

PARADE-CNN

hi @canjiali,
Can you clarify what you meant in PARADE-CNN?
A window of size 2×d (PyTorch conv2d) transforms every two vectors into a scalar.
Maybe you meant a window size of 2×1 (with stride 2×1)?

Questions on Reproducing the Results

Hi, thank you for posting your great work!

I have several questions:
1. What k1 value did you use for BM25 on Robust04 description?
I followed the paper and set b=0.6, and I used Anserini's default k1 value 0.9. The results I got are P@20=0.3233 and nDCG@20=0.3902 (while in the paper P@20=0.3345, nDCG@20=0.4058).
I was able to get exactly the same results as reported using k1=0.9, b=0.6 using BM25+RM3 on Robust04 description.

2. Have you tried running the model on GPU?
I tried running the ELECTRA-Base model on a single 2080 Ti, but about 10 GB of GPU memory is taken up even before training starts. Therefore, I was only able to use batch_size=1 with max_seq_length=256, which I believe is too small to be stable.
Normally, the machine should be able to handle BERT-Base with max_seq_length=256 and batch_size=16.

3. Loading a BERT-Base checkpoint gave me the error "tensorflow.python.framework.errors_impl.NotFoundError: Key bert/embeddings/LayerNorm/beta not found in checkpoint".
This doesn't happen if I load an ELECTRA model. To load the BERT model, I modified run.reranking.sh, changing the pretrained_model flag to bert and the BERT_ckpt path to the downloaded BERT model. Is there anything I'm missing?

Some information of my environment:
python 3.7.9
tensorflow 1.15.0 (not sure if this will be an issue)

Thank you very much in advance!

Feasibility of Running Code on GPU

Hi,

Thanks for all of your complete answers in other issues. Also, thanks for your awesome work shared.

I used the following config:
number of epoch = 1
number of fold = 1
model checkpoint = bert_mini (uncased_L_4_H_256_A_4)
batch_size = 64

The memory consumption is as follows:
CPU RAM = 80G
GPU RAM = 6.5G

Your model had been training for about 10 hours for just 1 epoch and 1 fold, and it did not finish.

I changed batch_size to 128 and then the memory consumption is as follows:
CPU RAM = 120G
GPU RAM = 6.5G

Your model has now been training for about 14 hours for just 1 epoch and 1 fold and has still not finished.

Would you please tell me: are you sure I can train your code on a GPU? (See #14.)

Thanks in advance

How to get GOV2 datasets.

Hi, I'm confused about how to obtain the GOV2 dataset. I would appreciate it if you could walk me through the specific process. Thanks a lot.

Problems when trying to re-rank

While trying to re-rank on a GPU (not a TPU) I ran into OOM issues; the code seems to assume unlimited memory, or maybe there is a TensorFlow bug. I find it hard to believe you had access to a TPU with a terabyte of memory. I had to reduce the batch size to 4 (!) so that the code would not crash.
Several issues:
[1] Do you have this code for TensorFlow 2? Most things here are deprecated, and there are many issues. It seems like the GPU is not handled correctly, and the estimator is not working correctly either.
[2] Many things are not synchronized. For example, in the last part of the re-rank script, what exactly is local-dir? How do I obtain this information? What is it supposed to be (none of this is documented)?
[3] The models provided here do not contain the parameters necessary for running, so I had to combine files from several places. The pre-trained models contain neither the config JSON file nor the vocabulary, which is needed.

About issue #27

Hi,

Could you tell me whether you pay for it? (According to my information, it costs about $2.50 hourly.)
Also, could you tell me whether you used TPU v3 or TPU v2?
I have read about it, but I would appreciate any advice on using GCP and on the challenges you faced in running your code.

Thanks in advance,
Kind Regards

Preprocessing

Hi,

I used Pyserini to get the raw text from Robust04. An example is:

FBIS4-40260 BFN [Unattributed report: "President Takes Emergency Measures in Fight against Crime Orgy"] [Text] On 14 June, President B. Yeltsin signed the edict "On Urgent Measures to Protect the Population from Banditry and other Manifestations of Organized Crime." Bearing in mind the acuteness of this problem, we are publishing the document in its entirety. [Edict begins] For the purposes of protecting the lives and property interests of citizens, ensuring the security of society and the state, and to implement the Russian Federation Federal Program for Stepping Up the Fight Against Crime for 1994-1995 prior to the adoption by the Russian Federation Federal Assembly of legislative acts in the sphere of combating crime, I decree: 1. That a system of urgent measures shall be enacted to combat banditry and other serious crimes committed by organized criminal groups; --where there is sufficient evidence of an individual's involvement in a gang or other organized criminal group suspected of committing serious crimes and by agreement with the prosecutor's office prior to the institution of criminal proceedings, expert appraisals may be conducted, the results of which may be viewed as evidence in criminal cases of the given category and preliminary investigations may be authorized into the financial and economic activity and property and financial status not only of the individual in question but also his relatives or other persons residing with him in the previous five years and also in respect of physical and legal persons or public associations whose property, resources, or name may have been exploited or used by the suspect; --for the preparation and implementation of investigations and the prevention and detection of crimes, active use may be made of information resulting from operational investigations, which may be regarded in the prescribed manner as evidence in criminal cases of this category; --as preventive measures against those suspected or accused of the 
aforementioned crimes, the following shall not be valid: recognizance not to leave, personal guaranty of a defendant's appearance, guaranty provided by social organizations, or a bond, and the suspect may be detained for a period of up to 30 days; --bank and commercial confidentiality shall be no bar to the obtaining in the prescribed manner by the organs of the prosecutor's office, internal affairs, counterintelligence, or the tax police of information or documents relating to financial and economic activity, deposits, and dealings relating to the accounts of physical or legal persons involved in the commission of bandit attacks or other serious crimes carried out by organized criminal groups; --the authorized representatives of the internal affairs and counterintelligence organs shall have the right to inspect the buildings and premises of enterprises, institutions, and organizations irrespective of forms of ownership, acquaint themselves with documentation characterizing their activity, and also examine transport facilities and their drivers and passengers. 2. 
The leaders of the executive of the components of the Federation shall; --draw up a list of cities and individual localities which are to be kept under special surveillance in connection with the prevalence on those territories of cases of banditry and other manifestations of organized crime; --instruct the MVD [Ministry of Internal Affairs], the Internal Affairs administrations, the Federal Counterintelligence Service, [FCS], and FCS administrations to transfer the personnel of the internal affairs organs and the counterintelligence organs of those cities and individual localities to an intensified form of official operational activity for the performance of special operations in the fight against banditry and other manifestations of organized crime; --ensure the targeted use of financial resources as material incentives for staffers of the internal affairs and counterintelligence organs and servicemen of the internal troops participating in special operations to combat banditry and other manifestations of organized crime and also those working in the intensified form of operations in cities and individual localities; --elaborate and implement through the legislative organs of the components of the Federation additional measures of a legal character for the intensification of the struggle against banditry and other manifestations of organized crime. 3. The Russian Federation General Prosecutor's Office, the Russian Federation Ministry of Internal Affairs, and the Russian Federation FCS shall, within 10 days, elaborate and send to the leaders of subordinate organs a joint instruction on the procedure for the practical implementation of the norms of the present Edict. 4. The Russian Federation General Prosecutor's office is instructed to establish permanent prosecutor's oversight of the observance of legislation in the implementation of the present Edict. 5. 
The present Edict shall be sent to the Federation Council and the State Duma of the Russian Federation Federal Assembly. 6. The present shall go into force from the moment of its signing. [no signature as published]

I applied preprocessing steps such as lowercasing, removing special characters, removing numbers, and removing stop words and short words. Now I have:

FBIS4-40260 text bfn unattributed report president takes emergency measures fight crime orgy text june president yeltsin signed edict urgent measures protect population banditry manifestations organized crime bearing mind acuteness problem publishing document entirety edict begins purposes protecting lives property interests citizens ensuring security society state implement russian federation federal program stepping fight crime prior adoption russian federation federal assembly legislative acts sphere combating crime decree system urgent measures shall enacted combat banditry serious crimes committed organized criminal groups sufficient evidence individuals involvement gang organized criminal group suspected committing serious crimes agreement prosecutors office prior institution criminal proceedings expert appraisals may conducted results may viewed evidence criminal cases given category preliminary investigations may authorized financial economic activity property financial status individual question also relatives persons residing previous five years also respect physical legal persons public associations whose property resources name may exploited used suspect preparation implementation investigations prevention detection crimes active use may made information resulting operational investigations may regarded prescribed manner evidence criminal cases category preventive measures suspected accused aforementioned crimes following shall valid recognizance leave personal guaranty defendants appearance guaranty provided social organizations bond suspect may detained period days bank commercial confidentiality shall bar obtaining prescribed manner organs prosecutors office internal affairs counterintelligence tax police information documents relating financial economic activity deposits dealings relating accounts physical legal persons involved commission bandit attacks serious crimes carried organized criminal groups authorized representatives internal affairs 
counterintelligence organs shall right inspect buildings premises enterprises institutions organizations irrespective forms ownership acquaint documentation characterizing activity also examine transport facilities drivers passengers leaders executive components federation shall draw list cities individual localities kept special surveillance connection prevalence territories cases banditry manifestations organized crime instruct mvd ministry internal affairs internal affairs administrations federal counterintelligence service fcs fcs administrations transfer personnel internal affairs organs counterintelligence organs cities individual localities intensified form official operational activity performance special operations fight banditry manifestations organized crime ensure targeted use financial resources material incentives staffers internal affairs counterintelligence organs servicemen internal troops participating special operations combat banditry manifestations organized crime also working intensified form operations cities individual localities elaborate implement legislative organs components federation additional measures legal character intensification struggle banditry manifestations organized crime russian federation general prosecutors office russian federation ministry internal affairs russian federation fcs shall within days elaborate send leaders subordinate organs joint instruction procedure practical implementation norms present edict russian federation general prosecutors office instructed establish permanent prosecutors oversight observance legislation implementation present edict present edict shall sent federation council state duma russian federation federal assembly present shall force moment signing signature published text

Is this OK? Which preprocessing should I apply to match your inputs?

Thanks in advance
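
For reference, the cleaning pipeline described in this question can be sketched as below (the stopword list is a toy stand-in). Note that BERT-style models are typically applied to lightly processed text, so aggressive stopword and number removal may not match the inputs PARADE expects:

```python
import re

# Toy stopword list; a real pipeline would use a standard list (e.g. NLTK's).
STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for",
             "on", "with", "as", "by", "be", "at", "or", "from"}

def clean(text, min_len=3):
    """Lowercase, keep only alphabetic tokens (dropping numbers and
    punctuation), then remove stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens
                    if len(t) >= min_len and t not in STOPWORDS)

out = clean("On 14 June, President B. Yeltsin signed the edict!")
# out == "june president yeltsin signed edict"
```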

Memory

Hi,

I ran the first part of your code (generate_data) with 500 GB of CPU memory, but after generating TFRecords for 5 queries I received an out-of-memory error. How much memory does it need?

Thanks

Dev data in each fold

Hi

In https://github.com/canjiali/PARADE/blob/f330f12a0104c591d871aa54b3b0022dadaef512/generate_data.py#L322 through https://github.com/canjiali/PARADE/blob/f330f12a0104c591d871aa54b3b0022dadaef512/generate_data.py#L360 you only generate train and test data; there is no validation (dev) data.

Would you please tell me how many queries the train data contains and how many queries the test data contains? I want to know whether train and test data together cover all 250 queries of Robust04.

I am looking forward to hearing from you,
Thanks in advance

Run on TPU

Hi,

You told me "you should be able to run one fold within 12 hours on a TPU v2".

If I use the BERT-Base model, can each fold be run within 12 hours on a TPU v2?

Also, does your implementation generate test-data scores at the end of each fold? If so, which fold has the best results?

Thanks in advance,
Kind Regards

Run Time

Hi,

I have a Quadro RTX 6000 GPU. Would you please give me your estimate of the run time?

Thanks in advance

Pre-trained BERT

Hi,

export BERT_PRETRAINED_DIR=/content/drive/MyDrive/PARADE-master-2
export BERT_ckpt=${BERT_PRETRAINED_DIR}/vanilla_bert_tiny_on_MSMARCO/model.ckpt-1600000

I received this error:

Unsuccessful TensorSliceReader constructor: Failed to get matching files on /content/drive/MyDrive/PARADE-master-2/vanilla_bert_base_on_MSMARCO/model.ckpt-1600000: Unimplemented: File system scheme '[local]' not implemented (file: '/content/drive/MyDrive/PARADE-master-2/vanilla_bert_base_on_MSMARCO/model.ckpt-1600000')

Would you please tell me what the reason is?

Thanks in advance,
Kind regards

Smaller sizes ELECTRA models

Hi @canjiali,
Have you experimented with smaller sizes of the ELECTRA model?
If so, are the MS MARCO pre-trained models available to initialize PARADE?
Otherwise, could you point me to the code you used to fine-tune the ELECTRA model on MS MARCO? (I believe this is my answer.)
Thank you

About Issue #16

Hi,

Regarding issue #16: since I want to obtain scores for the training data, should train_qid_list be changed in these lines of code when I use unshuffled training data?

# training config

If yes, would you please tell me what I should change?

(I replied to the related question in issue #16 and would appreciate it if you could check it.)

Thanks,
Zahra

Multiple-GPU scenario

Hello and thank you @canjiali for sharing your code for the PARADE model.

I would like to extend your code to the multiple-GPU scenario. I have successfully run the whole pipeline (using the pretrained models) on a single GPU.

I'm new to the TensorFlow library, so any guidance will be much appreciated, specifically with modifying BERT's create_optimizer() function, which is where I'm stuck.

I believe what I'm looking for is an equivalent of tf.tpu.CrossShardOptimizer(optimizer) for the multiple-GPU scenario, but I was not able to find any straightforward replacement.

Thank you again.

num of tokens taken from a passage

@canjiali Hello,
I wanted to ask about the "convert_to_bert_input" function (bert/tokenization), and specifically, this line. When add_cls=True we do need to subtract 2 for both CLS and SEP, but when we use this function to convert a passage to tokens we call it with add_cls=False, so we actually need to subtract 1 for SEP only.

It is also visible in the produced tensors:
image

whereas if you fix this (by subtracting 1 only if add_cls=False):
image

notice that now another token is added before the SEP token (102).

What do you think?
Thanks
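
The token-budget arithmetic under discussion can be stated directly; this is a sketch of the reasoning, not the repository's tokenization code:

```python
def max_content_tokens(max_seq_length, add_cls):
    """Special tokens consume part of the sequence budget: [CLS] and
    [SEP] when add_cls=True, but only [SEP] when the passage is
    appended without its own [CLS]."""
    specials = 2 if add_cls else 1
    return max_seq_length - specials

# e.g. with a 64-token budget:
with_cls = max_content_tokens(64, add_cls=True)      # 62 content tokens
without_cls = max_content_tokens(64, add_cls=False)  # 63 content tokens
```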
