thuir_wsdm_cup's Introduction

WSDM Cup 2023 -- THUIR

This codebase contains the source code we used to participate in the WSDM Cup 2023.

Features

The final features we use are:

| Feature ID | Feature Name | Feature Description |
| --- | --- | --- |
| 1 | cross_encoder | Pre-trained transformer model fine-tuned for 200 epochs on annotation data using BCE loss |
| 2 | bm25 | BM25 score of title+content using Pyserini (k1=1.6, b=0.87, tuned on the fine-tune data) |
| 3 | query_length | Length of the query |
| 4 | title_length | Length of the title |
| 5 | content_length | Length of the content |
| 6 | query_freq | Frequency bucket of the query |
| 7 | ql | Query likelihood score of title+content |
| 8 | prox-1 | Averaged proximity score of query terms in title+content |
| 9 | prox-2 | Averaged position of query terms appearing in title+content |
| 10 | prox-3 | Number of query term pairs appearing in title+content within a distance of 5 |
| 11 | prox-4 | Number of query term pairs appearing in title+content within a distance of 10 |
| 12 | prox-1-nonstop | PROX-1 score of title+content after stopword removal |
| 13 | prox-2-nonstop | PROX-2 score of title+content after stopword removal |
| 14 | prox-3-nonstop | PROX-3 score of title+content after stopword removal |
| 15 | prox-4-nonstop | PROX-4 score of title+content after stopword removal |
| 16 | tf-idf | TF-IDF score of title+content w.r.t. the query |
| 17 | tf | TF score of title+content w.r.t. the query |
| 18 | idf | IDF score of title+content |
| 19 | bm25_title | BM25 score of title using Pyserini (k1=1.6, b=0.87) |
| 20 | bm25_content | BM25 score of content using Pyserini (k1=1.6, b=0.87) |
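
To make two of the feature families above more concrete, here is a minimal pure-Python sketch of a BM25 score and of a prox-3/prox-4 style pair count. It is an illustration under assumptions, not the code used for the submission: the BM25 features in the table are computed with Pyserini, whose scoring may differ in detail, and the function names, the tokenized inputs, and the collection statistics (`doc_freq`, `num_docs`, `avg_doc_len`) are hypothetical. Only `k1=1.6` and `b=0.87` come from the table.

```python
import math
from collections import Counter
from itertools import combinations

K1, B = 1.6, 0.87  # BM25 parameters quoted in the feature table


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=K1, b=B):
    """Textbook BM25 for one tokenized query/document pair.

    doc_freq maps a term to the number of documents containing it
    (hypothetical collection statistics supplied by the caller).
    """
    tf = Counter(doc_terms)
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * len(doc_terms) / avg_doc_len)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf[term] * (k1 + 1.0) / (tf[term] + norm)
    return score


def pairs_within_distance(query_terms, doc_terms, max_dist):
    """Count distinct query-term pairs that co-occur within max_dist positions."""
    positions = {}
    for i, tok in enumerate(doc_terms):
        positions.setdefault(tok, []).append(i)
    count = 0
    for t1, t2 in combinations(sorted(set(query_terms)), 2):
        if any(abs(i - j) <= max_dist
               for i in positions.get(t1, [])
               for j in positions.get(t2, [])):
            count += 1
    return count


query = ["web", "search", "engine"]
doc = "web x x search x x x x x x x x engine".split()
print(pairs_within_distance(query, doc, 5))   # prox-3 style window
print(pairs_within_distance(query, doc, 10))  # prox-4 style window
```

Passing a stopword-filtered token list as `doc_terms` would give the corresponding `-nonstop` variant of the pair count under the same assumptions.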

Results

For Task 2: Pretraining for Web Search, we used all the aforementioned features except feature 14 (prox-3-nonstop) and achieved DCG=10.04097 on the leaderboard.
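
For readers unfamiliar with the metric, the following is a minimal sketch of how a DCG score can be computed for one ranked list. The exponential gain `2^rel - 1`, the `log2` discount, and the cutoff of 10 are assumptions chosen for illustration; the source does not state the exact variant the leaderboard uses, and `dcg_at_k` is a hypothetical helper name.

```python
import math


def dcg_at_k(relevances, k=10):
    """DCG over the top-k graded relevance labels, given in ranked order.

    Assumes gain 2^rel - 1 and discount log2(rank + 1) with 1-based ranks.
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based here
        for rank, rel in enumerate(relevances[:k])
    )


# Labels of the documents in the order the ranker returned them.
print(dcg_at_k([3, 2, 0, 1]))
```

A leaderboard score would then typically be this value averaged over all test queries, which is again an assumption about the evaluation protocol.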

Download

You can download our best checkpoints through the following entries:
Best checkpoint after pre-training (ctr+mlm loss): save_steps27000_6.31586.model
Best checkpoint after fine-tuning (human label bce loss): save_steps143000_10.08166.model

More

More details of our experiments will come soon in our competition papers. Please stay tuned.

thuir_wsdm_cup's People

Contributors

cshaitao, oneal2000, xuanyuan14

thuir_wsdm_cup's Issues

Release of Feature Dataset?

Dear THUIR team,

First of all, thank you all for releasing the code of your WSDM Cup submission!

A colleague and I at the University of Amsterdam were wondering if you could make the datasets of your computed LTR features publicly available. They would be of great help to us as a strong baseline in a new project. Would that by any chance be possible?

Kindest regards and all the best!
Philipp
