Git Product home page Git Product logo

openliveq's Introduction

openliveq

This is a python package for NTCIR-13 OpenLiveQ (http://www.openliveq.net/), and provides the following utilities:

  • Utility classes for handling the OpenLiveQ data
  • Feature extraction (e.g. TF-IDF, BM25, and language model)
  • Some tools for learning to rank by RankLib

Requirements

  • Python 3
  • MeCab

Installation

$ git clone https://github.com/mpkato/openliveq.git
$ cd openliveq
$ python setup.py install

MeCab Installation

MeCab is required to process Japanese texts.

Ubuntu

sudo aptitude install -y mecab libmecab-dev mecab-ipadic-utf8
pip install mecab-python3

An additional dictionary (mecab-ipadic-neologd) can be installed as follows:

git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd && sudo ./bin/install-mecab-ipadic-neologd -y
sudo sed -i 's/^dicdir.*/dicdir=\/usr\/lib\/mecab\/dic\/mecab-ipadic-neologd/g' /etc/mecabrc

Example

This example extracts features from each query-question pair, and applies learning to rank based on relevance estimated by a simple click model.

# install
$ git clone https://github.com/mpkato/openliveq.git
$ cd openliveq
$ python setup.py install

# prepare data (should be provided by the OpenLiveQ organizers)
$ mkdir data
### copy files ###
$ ls data
OpenLiveQ-clickthrough.tsv    OpenLiveQ-queries-train.tsv
OpenLiveQ-questions-test.tsv  OpenLiveQ-queries-test.tsv    OpenLiveQ-question-data.tsv   OpenLiveQ-questions-train.tsv

# load the data into a SQLite3 database
$ openliveq load data/OpenLiveQ-question-data.tsv \
> data/OpenLiveQ-clickthrough.tsv

    question_file:     data/OpenLiveQ-question-data.tsv
    clickthrough_file: data/OpenLiveQ-clickthrough.tsv

1967274 questions loaded
440163 clickthroughs loaded

# data validation
$ openliveq valdb
DB Validation
OK

# file validation
$ openliveq valfiles data
File Validation
data/OpenLiveQ-queries-train.tsv
data/OpenLiveQ-queries-test.tsv
data/OpenLiveQ-questions-train.tsv
data/OpenLiveQ-questions-test.tsv
OK

# parse the entire collection to obtain some statistics such as DF
$ openliveq parse data/OpenLiveQ-collection.json

output_file:         collection.json

...

The entire collection has been parsed
	The number of documents: 1967274
	The number of unique words: 1114773
	The number of words: 250871848    

# extract features from query-question pairs
$ openliveq feature data/OpenLiveQ-queries-train.tsv \
> data/OpenLiveQ-questions-train.tsv \
> data/OpenLiveQ-collection.json \
> data/OpenLiveQ-features-train.tsv

query_file:          data/OpenLiveQ-questions-train.tsv
query_question_file: data/OpenLiveQ-questions-train.tsv
collection_file:     data/OpenLiveQ-collection.json
output_file:         data/OpenLiveQ-features-train.tsv

Loading queries and questions ...

The collection statistics:
	The number of documents: 1967274
	The number of unique words: 1114773
	The number of words: 250871848

Extracting features ...

$ openliveq feature data/OpenLiveQ-queries-test.tsv \
> data/OpenLiveQ-questions-test.tsv \
> data/OpenLiveQ-collection.json \
> data/OpenLiveQ-features-test.tsv

query_file:          data/OpenLiveQ-questions-test.tsv
query_question_file: data/OpenLiveQ-questions-test.tsv
collection_file:     data/OpenLiveQ-collection.json
output_file:         data/OpenLiveQ-features-test.tsv

Loading queries and questions ...

The collection statistics:
	The number of documents: 1967274
	The number of unique words: 1114773
	The number of words: 250871848

Extracting features ...

# estimate relevance based on clickthrough data
$ openliveq relevance data/OpenLiveQ-relevances.tsv

output_file: data/OpenLiveQ-relevances.tsv
sigma:       10.0
max_grade:   4
topk:        10

# integrate relevance scores and features
$ openliveq judge data/OpenLiveQ-features-train.tsv \
> data/OpenLiveQ-relevances.tsv \
> data/OpenLiveQ-features-train-rel.tsv

# use RankLib for learning
$ wget https://sourceforge.net/projects/lemur/files/lemur/RankLib-2.1/RankLib-2.1-patched.jar/download -O RankLib.jar
$ java -jar RankLib.jar \
> -train data/OpenLiveQ-features-train-rel.tsv \
> -save data/OpenLiveQ-model.dat

# use RankLib for ranking
$ java -jar RankLib.jar \
> -load data/OpenLiveQ-model.dat \
> -rank data/OpenLiveQ-features-test.tsv \
> -score data/OpenLiveQ-scores-test.tsv

$ openliveq rank data/OpenLiveQ-features-test.tsv \
> data/OpenLiveQ-scores-test.tsv \
> data/OpenLiveQ-run.tsv

# results
$ cat data/OpenLiveQ-run.tsv
OLQ-9999    1167627151
OLQ-9999    1328077703
...

Tools

openliveq command is available after installation.

load

Usage: openliveq load [OPTIONS] QUESTION_FILE CLICKTHROUGH_FILE

  Load data into a SQLite3 database

  Arguments:
      QUESTION_FILE:     path to the question file
      CLICKTHROUGH_FILE: path to the clickthrough file

Options:
  -v, --verbose  increase verbosity.
  --help         Show this message and exit.

This command stores question and clickthrough data into a SQLite database at openliveq/db.sqlite3. See our homepage for the file formats: NTCIR-13 OpenLiveQ.

This step is necesaary before running the other commands, but only once.

valdb

Usage: openliveq valdb [OPTIONS]

  DB validation of the OpenLiveQ data

Options:
  --help  Show this message and exit.

This command validates question and clickthrough data stored in the SQLite database. This command is optional.

valfiles

Usage: openliveq valfiles [OPTIONS] DATA_DIR

  File validation of the OpenLiveQ data

  Arguments:
      DATA_DIR:          path to the OpenLiveQ data directory

Options:
  --help  Show this message and exit.

This command validates query and question files in a directory. This command is optional.

parse

Usage: openliveq parse [OPTIONS] OUTPUT_FILE

  Parse the entire corpus

  Arguments:
      OUTPUT_FILE:    path to the output file

Options:
  -v, --verbose  increase verbosity.
  --help         Show this message and exit.

This command parses the entire collection to obtain some statistics such as DF, and store them into OUTPUT_FILE. This command should be executed before feature command, and OUTPUT_FILE should be used as an argument for feature command.

feature

Usage: openliveq feature [OPTIONS] QUERY_FILE QUERY_QUESTION_FILE OUTPUT_FILE

  Feature extraction from query-question pairs

  Arguments:
      QUERY_FILE:          path to the query file
      QUERY_QUESTION_FILE: path to the file of query and question IDs
      COLLECTION_FILE:     path to the output file of the 'parse' command
      OUTPUT_FILE:         path to the output file

Options:
  -v, --verbose  increase verbosity.
  --help         Show this message and exit.

This command extracts features from query-question pairs described in QUERY_QUESTION_FILE and output the features in OUTPUT_FILE in the RankLib format. See the RankLib website for the file format: RankLib.

This package uses features 1-30 listed in Table 3, Tao Qin, Tie-Yan Liu, Jun Xu, Hang Li. LETOR: A benchmark collection for research on learning to rank for information retrieval, Information Retrieval, Volume 13, Issue 4, pp. 346-374, 2010.. In addition, features include the number of answers given, the number of page views, etc. See openliveq/features for details.

This command does not provide any relevance score for each query-question pair. Use relevance and judge commands to add relevance.

relevance

Usage: openliveq relevance [OPTIONS] OUTPUT_FILE

  Output relevance scores based on a very simple click model

  Arguments:
      OUTPUT_FILE: path to the output file

Options:
  --sigma FLOAT        used for estimating the examination probability based
                       on the rank.
  --max_grade INTEGER  maximum relevance grade (scores are squashed into [0,
                       max_grade])
  --topk INTEGER       only topk results are used (the default value 10 is
                       highly recommended).
  -v, --verbose        increase verbosity.
  --help               Show this message and exit.

This command estimates the relevance of each question based on the clickthrough data and output the relevance in OUTPUT_FILE. The click model used for the relevance estimation is a simplified position-based model (rf. Click Models for Web Search). Relevance of each question is estimated as follows:

Relevance(d_qr) = CTR_qr / exp(- r / sigma)

where d_qr is the r-th ranked document for query q, CTR_qr is the clickthrough rate of d_qr, and sigma is a parameter. This model assumes that the examination probability only depends on the rank.

Relevance(d_qr) is normalized so that the maximum score for a query becomes scale option value (default: 4).

The format of each line in the output is:

[Query ID]\t[Question ID]\t[Relevance grade]

judge

Usage: openliveq judge [OPTIONS] FEATURE_FILE RELEVANCE_FILE OUTPUT_FILE

  Generating training data with feature and relevance files

  Arguments:
      FEATURE_FILE:   path to the file generated by 'feature'
      RELEVANCE_FILE: path to the file generated by 'relevance'
      OUTPUT_FILE:    path to the output file

Options:
  --help  Show this message and exit.

This command concatenates two files generated by feature and relevance commands. More specifically, this adds the relevance scores in RELEVANCE_FILE to each query-question pair in FEATURE_FILE, and stores the results in OUTPUT_FILE.

rank

Usage: openliveq rank [OPTIONS] FEATURE_FILE SCORE_FILE OUTPUT_FILE

  Ranking test data by scores given by RankLib

  Arguments:
      FEATURE_FILE: path to the feature file for test data
      SCORE_FILE:   path to the score file generated by RankLib
      OUTPUT_FILE:  path to the output file

Options:
  --help  Show this message and exit.

This command ranks questions for each query in FEATURE_FILE based on scores in SCORE_FILE, and outputs the results in OUTPUT_FILE.

SCORE_FILE is typically a file generated by RankLib with options '-load', '-score', and '-save'. The format of each line in SCORE_FILE must be:

[X]\t[Y]\t[Score]

where [X] and [Y] are not used (assigned by RankLib). [Score] in i-th line is simply applied to a query-question pair in i-th line, FEATURE_FILE.

The format of each line in OUTPUT_FILE is:

[Query ID]\t[Question ID]

where the order represents the ranks of questions for each query. This format follows the submission format of NTCIR-13 OpenLiveQ, but note that a description line should be added to the top. For example, use the following commands before submission:

echo "Description of your system. Change me!" > your_result.tsv
cat YOUR_OUTPUT_FILE >> your_result.tsv

openliveq's People

Contributors

tmanabe avatar

Stargazers

Takumi ITO avatar  avatar Masao Takaku avatar  avatar Makoto P. Kato avatar  avatar

Watchers

Masao Takaku avatar Takehiro Yamamoto avatar Alex Soares avatar Makoto P. Kato avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.