
JPQ

Repo for our CIKM'21 full paper, Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance. JPQ greatly improves the efficiency of Dense Retrieval: it compresses the index by 30x with negligible performance loss, and provides a 10x speedup in query latency on CPU and a 2x speedup on GPU.

Here is the effectiveness vs. index size (log-scale) trade-off on MS MARCO Passage Ranking. Instead of trading index size for ranking performance, JPQ achieves high ranking effectiveness with a tiny index.

Results at different trade-off settings are shown below.

[Figures: results on MS MARCO Passage Ranking and MS MARCO Document Ranking]

JPQ remains very effective even when the compression ratio exceeds 100x, and it outperforms the baselines across compression ratio settings. For more details, please refer to our paper. If you find this repo useful, please star it and cite our work:

@article{zhan2021jointly,
  title={Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance},
  author={Zhan, Jingtao and Mao, Jiaxin and Liu, Yiqun and Guo, Jiafeng and Zhang, Min and Ma, Shaoping},
  journal={arXiv preprint arXiv:2108.00644},
  year={2021}
}

Models and Indexes

You can download trained models and indexes from our Dropbox link. After opening the link in your browser, you will see two folders, doc and passage, which correspond to the MS MARCO document ranking and passage ranking tasks, respectively. Each of them contains two folders, trained_models and indexes. trained_models holds the trained query encoders, and indexes holds the trained PQ indexes. Note that the pid in an index is actually the row number of a passage in the collection.tsv file, not the official pid provided by MS MARCO. Different query encoders and indexes correspond to different compression ratios. For example, the query encoder named m32.tar.gz or the index named OPQ32,IVF1,PQ32x8.index uses 32 bytes per document, i.e., a 768*4/32 = 96x compression ratio.
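If you need to map the retrieved row-number pids back to the official MS MARCO pids, a minimal sketch follows; it assumes the official pid is the first tab-separated column of collection.tsv with one passage per line, and the function name is purely illustrative.

def load_rowid_to_pid(collection_path="collection.tsv"):
    # Row i of collection.tsv corresponds to internal pid i in the JPQ index.
    rowid_to_pid = []
    with open(collection_path, encoding="utf-8") as f:
        for line in f:
            rowid_to_pid.append(line.split("\t", 1)[0])
    return rowid_to_pid

# Usage: official_pid = load_rowid_to_pid()[retrieved_row_number]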

Requirements

To install requirements, run the following commands:

git clone git@github.com:jingtaozhan/JPQ.git
cd JPQ
python setup.py install

Preprocess

Here are the commands for preprocessing/tokenization.

If you do not have MS MARCO dataset, run the following command:

bash download_data.sh

Preprocessing (tokenizing) only requires a simple command:

python preprocess.py --data_type 0; python preprocess.py --data_type 1

It will create two directories, i.e., ./data/passage/preprocess and ./data/doc/preprocess. We map the original qids/pids to new ids, namely their row numbers in the file. The mappings are saved to pid2offset.pickle and qid2offset.pickle, and new qrel files (train/dev/test-qrel.tsv) are generated. The passages and queries are tokenized and saved in numpy memmap files.
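A small sketch for inspecting the id mapping, assuming pid2offset.pickle is a pickled dict from the original MS MARCO pid to the new row-number id and lives in the preprocess directory (the exact path may differ):

import pickle

with open("./data/passage/preprocess/pid2offset.pickle", "rb") as f:
    pid2offset = pickle.load(f)

# Reverse lookup, e.g. for translating internal ids back to official pids.
offset2pid = {offset: pid for pid, offset in pid2offset.items()}
print(len(pid2offset), "passages mapped")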

Note: JPQ, as well as our SIGIR'21 models, uses the Transformers 2.x version to tokenize text. However, when the Transformers library updated to 3.x and 4.x, the behavior of RobertaTokenizer changed. To ensure reproducibility, we copied the RobertaTokenizer source code from the 2.x version into star_tokenizer.py. During preprocessing, we use from star_tokenizer import RobertaTokenizer instead of from transformers import RobertaTokenizer. You also need to do this if you apply our JPQ model to other datasets.
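In practice the swap is a one-line import change; the snippet below is a hedged sketch, and the pretrained model name and encode arguments are illustrative rather than copied from preprocess.py:

# Use the bundled 2.x-era tokenizer for reproducibility instead of the one
# shipped with newer Transformers releases.
from star_tokenizer import RobertaTokenizer  # not: from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
token_ids = tokenizer.encode("example passage text", add_special_tokens=True, max_length=512)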

Retrieval

You can download the query encoders and indexes from our dropbox link and run the following command to efficiently retrieve documents:

python ./run_retrieval.py \
    --preprocess_dir ./data/doc/preprocess \
    --mode dev \
    --index_path PATH/TO/OPQ96,IVF1,PQ96x8.index \
    --query_encoder_dir PATH/TO/m96/ \
    --output_path ./data/doc/m96.dev.tsv \
    --batch_size 32 \
    --topk 100

It also has an option --gpu_search for fast GPU search.

Run the following command to evaluate the ranking results on the MS MARCO document ranking dataset.

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/m96.dev.tsv 100

You will get

Eval Started
#####################
MRR @100: 0.4008114949611788
QueriesRanked: 5193
#####################
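msmarco_eval.py implements the official computation; for reference, the MRR@100 reported above boils down to the following sketch, where qrels and run are illustrative in-memory structures (qid -> set of relevant pids, and qid -> pids ordered by rank):

def mrr_at_100(qrels, run):
    # Average, over ranked queries, the reciprocal rank of the first
    # relevant document within the top 100 results.
    total, n = 0.0, 0
    for qid, ranked_pids in run.items():
        n += 1
        for rank, pid in enumerate(ranked_pids[:100], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / max(n, 1)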

Training

JPQ is initialized by STAR. STAR trained on passage ranking is available here. STAR trained on document ranking is available here.

First, use STAR to encode the corpus and run OPQ to initialize the index. For example, on document ranking task, please run:

python ./run_init.py \
  --preprocess_dir ./data/doc/preprocess/ \
  --model_dir ./data/doc/star \
  --max_doc_length 512 \
  --output_dir ./data/doc/init \
  --subvector_num 96

On the passage ranking task, you can set max_doc_length to 256 for faster inference.
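For intuition, this initialization is conceptually similar to building an OPQ-rotated PQ index with faiss over the STAR corpus embeddings; run_init.py takes care of this for you, so the snippet below is only a hedged sketch, with the embedding matrix, metric, and dimension (768) as assumptions:

import faiss
import numpy as np

corpus_embeddings = np.random.rand(10000, 768).astype("float32")  # stand-in for STAR document embeddings

# OPQ rotation + a single IVF list + 96 sub-quantizers with 8-bit codes (96 bytes/doc).
index = faiss.index_factory(768, "OPQ96,IVF1,PQ96x8", faiss.METRIC_INNER_PRODUCT)
index.train(corpus_embeddings)   # learn the rotation and PQ codebooks
index.add(corpus_embeddings)     # encode each document into 96 one-byte codes
faiss.write_index(index, "OPQ96,IVF1,PQ96x8.index")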

Now you can train the query encoder and PQ index. For example, on the document ranking task, the command is:

python run_train.py \
    --preprocess_dir ./data/doc/preprocess \
    --model_save_dir ./data/doc/train/m96/models \
    --log_dir ./data/doc/train/m96/log \
    --init_index_path ./data/doc/init/OPQ96,IVF1,PQ96x8.index \
    --init_model_path ./data/doc/star \
    --lambda_cut 10 \
    --centroid_lr 1e-4 \
    --train_batch_size 32

--gpu_search is optional for fast GPU search during training. lambda_cut should be set to 200 for the passage ranking task. centroid_lr differs across compression ratios: let M be the number of subvectors; centroid_lr is 5e-6 for M = 16/24, 2e-5 for M = 32, and 1e-4 for M = 48/64/96. The number of training epochs is set to 6. In fact, the performance is already quite satisfying after 1 or 2 epochs. Each epoch costs less than 2 hours on our machine.
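As a quick sanity check of these settings, bytes per document equal the number of subvectors M, and the compression ratio for 768-dimensional float32 embeddings is 768*4/M; the snippet below simply restates the learning rates listed above:

# Bytes/doc, compression ratio, and centroid learning rate per subvector count M.
centroid_lr = {16: 5e-6, 24: 5e-6, 32: 2e-5, 48: 1e-4, 64: 1e-4, 96: 1e-4}
for M, lr in sorted(centroid_lr.items()):
    print(f"M={M:3d}  bytes/doc={M:3d}  compression={768 * 4 / M:.0f}x  centroid_lr={lr}")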
