
WSDM Cup 2024

1st Solution For Conversational Multi-Doc QA Workshop & International Challenge @ WSDM'24 - Xiaohongshu.Inc

Introduction

This repo contains the source code for our solution to WSDM Cup 2024: Conversational Multi-Doc QA.

Please refer to our paper for details: The First Place Solution of WSDM Cup 2024: Leveraging Large Language Models for Conversational Multi-Doc QA

Method Overview

  • SOLAR-10.7B-Instruct backbone
  • Hybrid Training
  • Noisy Document Filter
  • Model Ensemble

Environment

  1. Follow the installation guide for modelscope/swift to install ms-swift.

  2. Install vllm

  3. Install deepspeed

  4. Install scikit-learn

  5. Install sentence-transformers

Alternatively, run the following (tested on a V100 32G with CUDA 11.8, Ubuntu 20.04.1):

conda create -n swift python=3.10
conda activate swift
pip install ms-swift[all] -U
pip install vllm==0.3.1
pip install deepspeed
pip install scikit-learn
pip install sentence_transformers

Main package versions:

python==3.10.13
ms-swift==1.6.1
scikit-learn==1.4.1.post1
sentence-transformers==2.3.1
torch==2.1.2
transformers==4.37.2
vllm==0.3.1

Data Processing

preprocess/data_format.py: Formats the data required for training and evaluation

preprocess/data_format_Pseudo.py: Formats the hybrid training data

preprocess/score_train_eval(test).py: Calculates scores for the noisy-document filter

preprocess/score_order.py: Interactive script for deleting noisy documents
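As a rough illustration of the noisy-document filter, each candidate document can be scored by its embedding similarity to the question, and low-scoring documents dropped. The helper below is a hypothetical pure-Python sketch (the actual logic lives in preprocess/score_train_eval(test).py and preprocess/score_order.py, which use nomic-embed-text-v1 embeddings); the function names and threshold are illustrative, not the repo's API:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def filter_noisy_docs(query_emb, doc_embs, threshold=0.5):
    """Keep documents whose cosine similarity to the query meets the threshold.

    query_emb: the question embedding (list of floats).
    doc_embs:  one embedding per candidate document.
    Returns (kept_indices, scores). Hypothetical helper, not the repo's code.
    """
    scores = [cosine(query_emb, d) for d in doc_embs]
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return kept, scores
```

In the repo, the filtering is done interactively (score_order.py) rather than with a fixed threshold, so treat the cutoff above purely as a placeholder.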

Training

We use the ms-swift LLM framework from ModelScope.

Finetuning

runsh/solar_instruct_sft_template.sh

Inference

runsh/solar_instruct_infer_template.sh

Ensemble learning

merge/calculate_score.py: Calculates embedding scores for ensemble learning

merge/merge_score.py: Merges the per-model results into the final ensemble output

Other

keyword: Experiments with generating keywords or answers directly with GPT

multi_stage: Multi-stage LLM attempt (did not work)

Reproduce results on the leaderboard

You can find all intermediate files in the result folder.

Prepare models

  1. Download the pretrained models from Hugging Face.

  2. Download our 8 fine-tuned LoRA adapters from our Hugging Face repository (0.03B each).

So our total model size is 10.7B + 0.14B + 0.03B × 8 = 11.08B parameters, well under the 14-billion (14B) parameter limit.
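The parameter arithmetic can be sanity-checked directly:

```python
backbone = 10.7      # SOLAR-10.7B-Instruct
embedder = 0.14      # nomic-embed-text-v1 embedding model
adapters = 8 * 0.03  # eight LoRA adapters, 0.03B each

total = backbone + embedder + adapters
print(round(total, 2))  # 11.08 (billions of parameters), under the 14B limit
```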

  3. Place them in the correct folders. The directory layout should look as follows:
└── checkpoints
    ├── v08-20240205-114459/
    ├── v10-20240205-114325/
    ├── v13-20240202-072530/
    ├── v13-20240206-111010/
    ├── v16-20240206-224659/
    ├── v27-20240209-133614/
    ├── v33-20240210-002918/
    └── v35-20240210-120550/
└── pretrained
    └── nomic-ai/nomic-embed-text-v1/
        ├── 1_Pooling/
        ├── config.json
        ├── config_sentence_transformers.json
        ├── configuration_hf_nomic_bert.py
        ├── .gitattributes
        ├── .locks/
        ├── modeling_hf_nomic_bert.py
        ├── model.safetensors
        ├── modules.json
        ├── onnx/
        ├── pytorch_model.bin
        ├── README.md
        ├── sentence_bert_config.json
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── vocab.txt
    └── upstage/SOLAR-10.7B-Instruct-v1.0/
        ├── config.json
        ├── generation_config.json
        ├── .gitattributes
        ├── .locks/
        ├── model-00001-of-00005.safetensors
        ├── model-00002-of-00005.safetensors
        ├── model-00003-of-00005.safetensors
        ├── model-00004-of-00005.safetensors
        ├── model-00005-of-00005.safetensors
        ├── model.safetensors.index.json
        ├── README.md
        ├── solar_logo.png
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── tokenizer.model

Inference Result

Run python data_format.py (in the preprocess folder) to preprocess the original test data.

Then run the shell scripts in the runsh folder:

bash runsh/v08-20240205-114459.sh
bash runsh/v10-20240205-114325.sh
bash runsh/v13-20240202-072530.sh
bash runsh/v13-20240206-111010.sh
bash runsh/v16-20240206-224659.sh
bash runsh/v27-20240209-133614.sh
bash runsh/v33-20240210-002918.sh
bash runsh/v35-20240210-120550.sh
  1. You can change the CUDA device at the beginning of each shell script via CUDA_VISIBLE_DEVICES=
  2. The result files are saved in the merge folder, which should look as follows:
└── merge
    ├── v08-20240205-114459.jsonl
    ├── v10-20240205-114325.jsonl
    ├── v13-20240202-072530.jsonl
    ├── v13-20240206-111010.jsonl
    ├── v16-20240206-224659.jsonl
    ├── v27-20240209-133614.jsonl
    ├── v33-20240210-002918.jsonl
    └── v35-20240210-120550.jsonl

The per-model results are as follows:

| File | Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
| --- | --- | --- | --- |
| v08-20240205-114459 | 0.45532953438881013 | 0.6143454883849857 | 0.6824189095928223 |
| v10-20240205-114325 | 0.456275615214309 | 0.6149276913541135 | 0.6817805383022769 |
| v13-20240202-072530 | 0.4554468517276402 | 0.6141346993379754 | 0.6827095609704305 |
| v13-20240206-111010 | 0.456388581088847 | 0.6149210447203279 | 0.6840088655306036 |
| v16-20240206-224659 | 0.45375515045837794 | 0.613359666771279 | 0.6879538939321544 |
| v27-20240209-133614 | 0.45574561117381773 | 0.6145520850027292 | 0.6826942984551678 |
| v33-20240210-002918 | 0.4559195951083145 | 0.6141543510329665 | 0.6865596963423041 |
| v35-20240210-120550 | 0.45573339341665703 | 0.614208192382808 | 0.6813332802463232 |

Even without ensembling, each individual model is still well ahead of the second-place team.

Ensemble

First, calculate the embedding scores:

python calculate_score.py

Note that this program is accelerated with torch.multiprocessing; you can adjust the number of processes near num_group = 16. (It works well on a V100 32G.)

Then generate the final result:

python merge_score.py

It will generate emb_a_s_8_0_1_2_3_4_5_6_7.zip in the root folder, which is our final result.

| Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
| --- | --- | --- |
| 0.465360141853671 | 0.6208371209722543 | 0.6953475871954128 |
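Conceptually, the ensemble selects, for each question, the candidate answer that is most similar on average to the other models' answers. The snippet below is a minimal pure-Python sketch of that selection rule, assuming the pairwise embedding similarities have already been computed by calculate_score.py; the function name and matrix layout are illustrative, not the exact code in merge/merge_score.py:

```python
def pick_consensus(answers, sim):
    """Return the candidate with the highest average similarity to all candidates.

    answers: list of N candidate answer strings (one per model).
    sim:     N x N matrix (list of lists) of pairwise embedding similarities.
    Hypothetical sketch of the selection rule, not the repo's exact code.
    """
    avg = [sum(row) / len(row) for row in sim]
    best = max(range(len(answers)), key=lambda i: avg[i])
    return answers[best]
```

For example, with three candidates whose middle answer agrees most with the other two, the middle answer wins the vote.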

Citation

If you find our work helpful, please consider citing the following paper:

@misc{li2024place,
      title={The First Place Solution of WSDM Cup 2024: Leveraging Large Language Models for Conversational Multi-Doc QA}, 
      author={Yiming Li and Zhao Zhang},
      year={2024},
      eprint={2402.18385},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contacts

Zhao Zhang: [email protected]

Yiming Li: [email protected]

