Official code of our work, Representation Learning for Resource-Constrained Keyphrase Generation.
This repository contains the following functionalities:
- Datasets used in our experiments: KP20k, Inspec, Krapivin, NUS, SemEval, KPTimes, and three 20k-document subsets of KP20k that we use for the low-resource finetuning experiments.
- Code for intermediate representation learning and finetuning with BART
- Code for evaluating on the benchmarks
- Our raw predictions on the scientific benchmarks
If you find our work useful, please consider citing:
@article{wu2022representation,
  title={Representation Learning for Resource-Constrained Keyphrase Generation},
  author={Wu, Di and Ahmad, Wasi Uddin and Dev, Sunipa and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2203.08118},
  year={2022}
}
- Install the requirements. We recommend installing them in a separate virtual environment as our experiments require a customized version of fairseq.
pip install -r requirements.txt
- Install the customized fairseq from source (see the sanity check after these setup steps).
git clone --branch model-experiment-0.10.2 https://github.com/xiaowu0162/fairseq.git
cd fairseq
pip install --editable ./
cd ..
- Prepare the datasets.
cd data/scikp
bash run.sh
cd ../kp20k-20k
bash run.sh
cd ../kptimes
bash run.sh
cd ../..
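After these setup steps, a quick sanity check (our suggestion, not part of the repository's scripts) is to confirm that pip resolves fairseq to the local editable checkout:

# the Location field should point at the cloned fairseq directory
pip show fairseq | grep -i location
python -c "import fairseq; print(fairseq.__version__)"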
To finetune BART on keyphrase generation, first run preprocessing:
cd finetuning_fairseq
bash preprocess.sh
Then, launch training with:
bash run_train.sh GPU_IDs DATASET_NAME [BART_PATH]
Note that
- The hyperparameter settings are for training on kp20k-20k-1/2/3 with a single GPU. If you use more than one GPU, please make sure to reduce `UPDATE_FREQ` to keep the same effective batch size.
- We recommend using an effective batch size of 64 or 32 for finetuning, where `effective batch size = NUM_GPUs * PER_DEVICE_BSZ * UPDATE_FREQ` (see the worked example after this list).
- The supported `DATASET_NAME` values are `kp20k`, `kptimes`, `kp20k-20k-1`, `kp20k-20k-2`, and `kp20k-20k-3`.
- `BART_PATH` will default to the pre-trained BART model. You can start from other BART checkpoints (e.g., the intermediate representations pretrained with TI or SSR) by providing the corresponding `checkpoint.pt` file as this parameter. To run a randomly initialized BART, remove the `--restore-file` flag in the script.
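For a concrete illustration (our example; the GPU id and the checkpoint path are placeholders, not values fixed by the repository):

# finetune on the first 20k-document subset with GPU 0, starting from an intermediate checkpoint
bash run_train.sh 0 kp20k-20k-1 /path/to/intermediate/checkpoint.pt

As an example of the batch size arithmetic, with 2 GPUs and PER_DEVICE_BSZ=8, setting UPDATE_FREQ=4 gives an effective batch size of 2 * 8 * 4 = 64.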
With a trained keyphrase generation model, you can run generation and evaluation with:
bash run_test.sh GPU_ID DATASET_NAME SAVE_DIR
- `DATASET_NAME`: use `kp20k` to evaluate on all five scientific datasets.
- `SAVE_DIR`: the path to the checkpoint (e.g., `checkpoint_best.pt`).
- Outputs: `dataset_hypotheses.txt` contains the model's raw predictions, `dataset_predictions.txt` contains the postprocessed predictions, and `results_log_dataset.txt` contains all the scores.
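For example (our illustration; the GPU id and checkpoint location are placeholders):

# decode and score a finetuned model on the five scientific benchmarks using GPU 0
bash run_test.sh 0 kp20k saved_models/kp20k-20k-1/checkpoint_best.pt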
We provide code to run the representation learning methods discussed in the paper. Note that all the hyperparameter settings are for a single GPU. We recommend running with multiple GPUs; in that case, please make sure to adjust `UPDATE_FREQ` accordingly to achieve the desired batch size (see the example after the title generation commands below).
Title generation:
cd intermediate_learning/title_generation
bash preprocess_titlegen.sh
bash run_train.sh GPU_IDs kp20k-titlegen
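For a multi-GPU run (our example; the GPU ids are placeholders, and we assume `GPU_IDs` is passed as a comma-separated list):

# with four GPUs, divide the single-GPU UPDATE_FREQ set in the script by four
# so that NUM_GPUs * PER_DEVICE_BSZ * UPDATE_FREQ keeps the intended effective batch size
bash run_train.sh 0,1,2,3 kp20k-titlegen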
Text infilling:
cd intermediate_learning/text_infilling
bash preprocess_text_infilling.sh
bash run_train.sh GPU_IDs kp20k-lm
SSR and SSP require offline span prediction as the first step:
cd intermediate_learning/salient_span_recovery/bm25-spans
bash find_spans_bm25.sh
Then, you can run SSR-M or SSR-D (i.e., `kp20k-ssr-m` or `kp20k-ssr-d`):
bash preprocess_ssr.sh
bash run_train.sh GPU_IDs kp20k-ssr-[m,d]
Or you can run SSP-M or SSP-D (i.e., `kp20k-ssp-m` or `kp20k-ssp-d`):
bash preprocess_ssr.sh
bash preprocess_ssp.sh
bash run_train.sh GPU_IDs kp20k-ssp-[m,d]
Warning: running BM25 retrieval takes a long time. We suggest first trying it on a small subset of KP20k and parallelizing the search manually when running on the entire KP20k training set (see the sketch below).
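One possible way to parallelize the search (a rough sketch only; the shard paths and the input/output arguments to find_spans_bm25.sh are hypothetical, so adapt it to how the script actually reads its data):

# split the KP20k training file (placeholder path) into 8 line-aligned shards
mkdir -p shards
split -n l/8 -d /path/to/kp20k_train.json shards/train.part.
# launch one BM25 search per shard in the background, then wait for all of them
for f in shards/train.part.*; do
    bash find_spans_bm25.sh "$f" "$f.spans" &
done
wait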
We share the predictions from finetuning our best SSR-D model on kp20k-20k here. After downloading and uncompressing the file, you can directly run `evaluate.sh` in the `BART-SSR-D-preds` folder to get the scores.
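For reference, the steps look like this (our sketch; the archive name is a placeholder for whatever the download link provides):

# uncompress the downloaded archive and run the bundled evaluation script
tar -xzvf BART-SSR-D-preds.tar.gz
cd BART-SSR-D-preds
bash evaluate.sh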