Repo for our CIKM'21 Full paper, Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance. JPQ greatly improves the efficiency of Dense Retrieval: it compresses the index by 30x with negligible performance loss, and it provides a 10x speedup on CPU and a 2x speedup on GPU in query latency.
Here is the effectiveness vs. index size (log-scale) trade-off on MS MARCO Passage Ranking. Rather than trading ranking performance for a smaller index, JPQ achieves high ranking effectiveness with a tiny index.
Results at different trade-off settings are shown below.
| MS MARCO Passage Ranking | MS MARCO Document Ranking |
| --- | --- |
JPQ remains very effective even when the compression ratio exceeds 100x, and it outperforms baselines across different compression ratio settings. For more details, please refer to our paper. If you find this repo useful, please star it and cite our work:
@article{zhan2021jointly,
  title={Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance},
  author={Zhan, Jingtao and Mao, Jiaxin and Liu, Yiqun and Guo, Jiafeng and Zhang, Min and Ma, Shaoping},
  journal={arXiv preprint arXiv:2108.00644},
  year={2021}
}
You can download trained models and indexes from our dropbox link. After opening this link in your browser, you will see two folders, `doc` and `passage`, which correspond to MS MARCO document ranking and passage ranking, respectively. Each of them contains two folders, `trained_models` and `indexes`: `trained_models` holds the trained query encoders, and `indexes` holds the trained PQ indexes. Note that the `pid` in the index is actually the row number of a passage in the `collection.tsv` file, not the official pid provided by MS MARCO. Different query encoders and indexes correspond to different compression ratios. For example, the query encoder named `m32.tar.gz` or the index named `OPQ32,IVF1,PQ32x8.index` uses 32 bytes per doc, i.e., a 768*4/32 = 96x compression ratio.
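As a sanity check, here is a minimal sketch of where these ratios come from, assuming 768-dimensional float32 embeddings (the helper below is purely illustrative, not part of the repo):

```python
# Compression ratio of a PQ index vs. an uncompressed float32 index,
# assuming 768-dimensional embeddings (4 bytes per dimension).
def compression_ratio(bytes_per_doc: int, dim: int = 768) -> float:
    return (dim * 4) / bytes_per_doc

for m in (16, 24, 32, 48, 64, 96):
    print(f"M={m}: {compression_ratio(m):.0f}x")  # e.g., M=32 -> 96x
```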
To install requirements, run the following commands:
git clone [email protected]:jingtaozhan/JPQ.git
cd JPQ
python setup.py install
Here are the commands for preprocessing/tokenization.
If you do not have MS MARCO dataset, run the following command:
bash download_data.sh
Preprocessing (tokenizing) only requires a simple command:
python preprocess.py --data_type 0; python preprocess.py --data_type 1
It will create two directories, i.e., `./data/passage/preprocess` and `./data/doc/preprocess`. We map the original qid/pid to new ids, namely the row numbers in the file. The mappings are saved to `pid2offset.pickle` and `qid2offset.pickle`, and new qrel files (`train/dev/test-qrel.tsv`) are generated. The passages and queries are tokenized and saved in numpy memmap files.
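If you need to map the row-number ids back to the official MS MARCO ids, something like the following sketch should work, assuming `pid2offset.pickle` stores a dict from official pid to row number:

```python
import pickle

# Assumption: pid2offset.pickle maps the official MS MARCO pid to the
# row number used as the new id; inverting it recovers the official pid.
with open("./data/doc/preprocess/pid2offset.pickle", "rb") as f:
    pid2offset = pickle.load(f)

offset2pid = {offset: pid for pid, offset in pid2offset.items()}
print(offset2pid[0])  # official pid of the first row of collection.tsv
```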
Note: JPQ, like our SIGIR'21 models, uses the Transformers 2.x version to tokenize text. However, in Transformers 3.x and 4.x, RobertaTokenizer behaves differently. To ensure reproducibility, we copied the RobertaTokenizer source code from version 2.x into star_tokenizer.py. During preprocessing, we use `from star_tokenizer import RobertaTokenizer` instead of `from transformers import RobertaTokenizer`. You need to do the same if you use our JPQ model on other datasets.
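Concretely, the only change in your own preprocessing code is the import; for example (the `roberta-base` checkpoint below is just the usual default, adjust as needed):

```python
# Use the Transformers 2.x tokenizer copied into star_tokenizer.py,
# NOT the RobertaTokenizer shipped with Transformers 3.x/4.x.
from star_tokenizer import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
input_ids = tokenizer.encode("how to train a dense retriever", add_special_tokens=True)
```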
You can download the query encoders and indexes from our dropbox link and run the following command to efficiently retrieve documents:
python ./run_retrieval.py \
--preprocess_dir ./data/doc/preprocess \
--mode dev \
--index_path PATH/TO/OPQ96,IVF1,PQ96x8.index \
--query_encoder_dir PATH/TO/m96/ \
--output_path ./data/doc/m96.dev.tsv \
--batch_size 32 \
--topk 100
It also has an option `--gpu_search` for fast GPU search.
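For reference, `--gpu_search` corresponds to the usual Faiss pattern of moving the index onto the GPU before searching; a rough sketch of that pattern (not the exact code in `run_retrieval.py`):

```python
import faiss
import numpy as np

index = faiss.read_index("PATH/TO/OPQ96,IVF1,PQ96x8.index")

# What --gpu_search roughly enables: clone the CPU index onto GPU 0.
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)

queries = np.random.rand(32, 768).astype("float32")  # placeholder query embeddings
scores, doc_ids = gpu_index.search(queries, 100)     # top-100 per query
```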
Run the following command to evaluate the ranking results on the MS MARCO document dataset.
python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/m96.dev.tsv 100
You will get
Eval Started
#####################
MRR @100: 0.4008114949611788
QueriesRanked: 5193
#####################
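For intuition, MRR@100 is the mean over queries of the reciprocal rank of the first relevant document within the top 100. A compact sketch of the computation is below; use the official `msmarco_eval.py` for actual evaluation, and note that the file formats assumed here (TREC-style qrels, `qid \t pid \t rank` run lines) are our assumption:

```python
from collections import defaultdict

def mrr_at_k(qrels_path: str, run_path: str, k: int = 100) -> float:
    # qrels lines: qid \t 0 \t pid \t label ; run lines: qid \t pid \t rank
    relevant = defaultdict(set)
    with open(qrels_path) as f:
        for line in f:
            qid, _, pid, _ = line.split()
            relevant[qid].add(pid)
    best_rank = {}
    with open(run_path) as f:
        for line in f:
            qid, pid, rank = line.split()
            r = int(rank)
            if r <= k and pid in relevant[qid]:
                best_rank[qid] = min(best_rank.get(qid, r), r)
    return sum(1.0 / r for r in best_rank.values()) / len(relevant)
```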
JPQ is initialized by STAR. STAR trained on passage ranking is available here. STAR trained on document ranking is available here.
First, use STAR to encode the corpus and run OPQ to initialize the index. For example, on document ranking task, please run:
python ./run_init.py \
--preprocess_dir ./data/doc/preprocess/ \
--model_dir ./data/doc/star \
--max_doc_length 512 \
--output_dir ./data/doc/init \
--subvector_num 96
On the passage ranking task, you can set `max_doc_length` to 256 for faster inference.
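Conceptually, this initialization is standard OPQ training in Faiss; a sketch of the core steps under the hood (placeholder embeddings, and the inner-product metric is our assumption for dense retrieval):

```python
import faiss
import numpy as np

doc_embeddings = np.random.rand(10000, 768).astype("float32")  # placeholder: STAR corpus embeddings

# "OPQ96,IVF1,PQ96x8": a learned rotation (OPQ), a trivial single-cell IVF,
# and a product quantizer with 96 subvectors of 8 bits each.
index = faiss.index_factory(768, "OPQ96,IVF1,PQ96x8", faiss.METRIC_INNER_PRODUCT)
index.train(doc_embeddings)
index.add(doc_embeddings)
faiss.write_index(index, "./data/doc/init/OPQ96,IVF1,PQ96x8.index")
```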
Now you can train the query encoder and the PQ index. For example, on the document ranking task, the command is:
python run_train.py \
--preprocess_dir ./data/doc/preprocess \
--model_save_dir ./data/doc/train/m96/models \
--log_dir ./data/doc/train/m96/log \
--init_index_path ./data/doc/init/OPQ96,IVF1,PQ96x8.index \
--init_model_path ./data/doc/star \
--lambda_cut 10 \
--centroid_lr 1e-4 \
--train_batch_size 32
`--gpu_search` is optional for fast GPU search during training. `lambda_cut` should be set to 200 for the passage ranking task. `centroid_lr` differs across compression ratios: letting M be the number of subvectors, `centroid_lr` is 5e-6 for M = 16/24, 2e-5 for M = 32, and 1e-4 for M = 48/64/96. The number of training epochs is set to 6; in fact, the performance is already quite satisfying after 1 or 2 epochs. Each epoch takes less than 2 hours on our machine.
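If you script experiments over several compression ratios, a small lookup keeps these settings consistent (purely illustrative):

```python
# centroid_lr per number of subvectors M, as recommended above.
CENTROID_LR = {16: 5e-6, 24: 5e-6, 32: 2e-5, 48: 1e-4, 64: 1e-4, 96: 1e-4}

def centroid_lr_for(M: int) -> float:
    return CENTROID_LR[M]
```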