
WSDM Cup 2024

1st Solution For Conversational Multi-Doc QA Workshop & International Challenge @ WSDM'24 - Xiaohongshu.Inc

Introduction

This repo contains the source code for our solution to WSDM Cup 2024: Conversational Multi-Doc QA.

Please refer to our paper for details: The First Place Solution of WSDM Cup 2024: Leveraging Large Language Models for Conversational Multi-Doc QA

Method Overview

  • SOLAR-10.7B-Instruct backbone
  • Hybrid Training
  • Noisy Document Filter
  • Model Ensemble

Environment

  1. Follow the installation guide for modelscope/swift to install ms-swift.

  2. Install vllm

  3. Install deepspeed

  4. Install scikit-learn

  5. Install sentence-transformers

Alternatively, run the following (tested on a V100 32G with CUDA 11.8, Ubuntu 20.04.1):

conda create -n swift python=3.10
conda activate swift
pip install ms-swift[all] -U
pip install vllm==0.3.1
pip install deepspeed
pip install scikit-learn
pip install sentence_transformers

Main package versions:

python==3.10.13
ms-swift==1.6.1
scikit-learn==1.4.1.post1
sentence-transformers==2.3.1
torch==2.1.2
transformers==4.37.2
vllm==0.3.1

Data Processing

preprocess/data_format.py: Formats the data required for training and evaluation

preprocess/data_format_Pseudo.py: Formats the hybrid training data

preprocess/score_train_eval(test).py: Calculates scores for the noisy-document filter

preprocess/score_order.py: Interactive script for deleting noisy documents
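As a rough illustration of the noisy-document filter, each candidate document can be scored by its embedding similarity to the question, and low-scoring documents dropped. The helper below is a hypothetical pure-Python sketch (the actual logic lives in preprocess/score_train_eval(test).py and preprocess/score_order.py, which use nomic-embed-text-v1 embeddings); the function names and threshold are illustrative, not the repo's API:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def filter_noisy_docs(query_emb, doc_embs, threshold=0.5):
    """Keep documents whose cosine similarity to the query meets the threshold.

    query_emb: the question embedding (list of floats).
    doc_embs:  one embedding per candidate document.
    Returns (kept_indices, scores). Hypothetical helper, not the repo's code.
    """
    scores = [cosine(query_emb, d) for d in doc_embs]
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return kept, scores
```

In the repo, the filtering is done interactively (score_order.py) rather than with a fixed threshold, so treat the cutoff above purely as a placeholder.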

Training

We use the ms-swift LLM framework from ModelScope.

Finetuning

runsh/solar_instruct_sft_template.sh

Inference

runsh/solar_instruct_infer_template.sh

Ensemble learning

merge/calculate_score.py: Calculates embedding scores for ensemble learning

merge/merge_score.py: Merges the per-model results into the final ensemble output

Other

keyword: Experiments with generating keywords or answers directly with GPT

multi_stage: Multi-stage LLM attempt (did not work)

Reproduce results on the leaderboard

You can find all intermediate files in the result folder.

Prepare models

  1. Download the pretrained models from Hugging Face.

  2. Download our 8 fine-tuned LoRA adapters from our Hugging Face repository (0.03B each).

So our total model size is 10.7B + 0.14B + 0.03B × 8 = 11.08B parameters, well under the 14-billion (14B) parameter limit.
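The parameter arithmetic can be sanity-checked directly:

```python
backbone = 10.7      # SOLAR-10.7B-Instruct
embedder = 0.14      # nomic-embed-text-v1 embedding model
adapters = 8 * 0.03  # eight LoRA adapters, 0.03B each

total = backbone + embedder + adapters
print(round(total, 2))  # 11.08 (billions of parameters), under the 14B limit
```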

  3. Place them in the correct folders. The directory layout should look as follows:
└── checkpoints
    ├── v08-20240205-114459/
    ├── v10-20240205-114325/
    ├── v13-20240202-072530/
    ├── v13-20240206-111010/
    ├── v16-20240206-224659/
    ├── v27-20240209-133614/
    ├── v33-20240210-002918/
    └── v35-20240210-120550/
└── pretrained
    └── nomic-ai/nomic-embed-text-v1/
        ├── 1_Pooling/
        ├── config.json
        ├── config_sentence_transformers.json
        ├── configuration_hf_nomic_bert.py
        ├── .gitattributes
        ├── .locks/
        ├── modeling_hf_nomic_bert.py
        ├── model.safetensors
        ├── modules.json
        ├── onnx/
        ├── pytorch_model.bin
        ├── README.md
        ├── sentence_bert_config.json
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── vocab.txt
    └── upstage/SOLAR-10.7B-Instruct-v1.0/
        ├── config.json
        ├── generation_config.json
        ├── .gitattributes
        ├── .locks/
        ├── model-00001-of-00005.safetensors
        ├── model-00002-of-00005.safetensors
        ├── model-00003-of-00005.safetensors
        ├── model-00004-of-00005.safetensors
        ├── model-00005-of-00005.safetensors
        ├── model.safetensors.index.json
        ├── README.md
        ├── solar_logo.png
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── tokenizer.model

Inference Result

Run python data_format.py (in the preprocess folder) to preprocess the original test data.

Then run the shell scripts in the runsh folder:

bash runsh/v08-20240205-114459.sh
bash runsh/v10-20240205-114325.sh
bash runsh/v13-20240202-072530.sh
bash runsh/v13-20240206-111010.sh
bash runsh/v16-20240206-224659.sh
bash runsh/v27-20240209-133614.sh
bash runsh/v33-20240210-002918.sh
bash runsh/v35-20240210-120550.sh
  1. You can change the CUDA device at the beginning of each shell script via CUDA_VISIBLE_DEVICES=
  2. The result files are saved in the merge folder, which should look as follows:
└── merge
    ├── v08-20240205-114459.jsonl
    ├── v10-20240205-114325.jsonl
    ├── v13-20240202-072530.jsonl
    ├── v13-20240206-111010.jsonl
    ├── v16-20240206-224659.jsonl
    ├── v27-20240209-133614.jsonl
    ├── v33-20240210-002918.jsonl
    └── v35-20240210-120550.jsonl

The per-model results are as follows:

| File | Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
| --- | --- | --- | --- |
| v08-20240205-114459 | 0.45532953438881013 | 0.6143454883849857 | 0.6824189095928223 |
| v10-20240205-114325 | 0.456275615214309 | 0.6149276913541135 | 0.6817805383022769 |
| v13-20240202-072530 | 0.4554468517276402 | 0.6141346993379754 | 0.6827095609704305 |
| v13-20240206-111010 | 0.456388581088847 | 0.6149210447203279 | 0.6840088655306036 |
| v16-20240206-224659 | 0.45375515045837794 | 0.613359666771279 | 0.6879538939321544 |
| v27-20240209-133614 | 0.45574561117381773 | 0.6145520850027292 | 0.6826942984551678 |
| v33-20240210-002918 | 0.4559195951083145 | 0.6141543510329665 | 0.6865596963423041 |
| v35-20240210-120550 | 0.45573339341665703 | 0.614208192382808 | 0.6813332802463232 |

Even without ensembling, each individual model is still well ahead of the second-place team.

Ensemble

First, calculate the embedding scores:

python calculate_score.py

Note that this program is accelerated with torch.multiprocessing; you can adjust the number of processes near num_group = 16. (It works well on a V100 32G.)

Then generate the final result:

python merge_score.py

It will generate emb_a_s_8_0_1_2_3_4_5_6_7.zip in the root folder, which is our final result.

| Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
| --- | --- | --- |
| 0.465360141853671 | 0.6208371209722543 | 0.6953475871954128 |
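Conceptually, the ensemble selects, for each question, the candidate answer that is most similar on average to the other models' answers. The snippet below is a minimal pure-Python sketch of that selection rule, assuming the pairwise embedding similarities have already been computed by calculate_score.py; the function name and matrix layout are illustrative, not the exact code in merge/merge_score.py:

```python
def pick_consensus(answers, sim):
    """Return the candidate with the highest average similarity to all candidates.

    answers: list of N candidate answer strings (one per model).
    sim:     N x N matrix (list of lists) of pairwise embedding similarities.
    Hypothetical sketch of the selection rule, not the repo's exact code.
    """
    avg = [sum(row) / len(row) for row in sim]
    best = max(range(len(answers)), key=lambda i: avg[i])
    return answers[best]
```

For example, with three candidates whose middle answer agrees most with the other two, the middle answer wins the vote.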

Citation

If you find our work helpful, please consider citing the following paper:

@misc{li2024place,
      title={The First Place Solution of WSDM Cup 2024: Leveraging Large Language Models for Conversational Multi-Doc QA}, 
      author={Yiming Li and Zhao Zhang},
      year={2024},
      eprint={2402.18385},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contacts

Zhao Zhang: [email protected]

Yiming Li: [email protected]

