
jforests's Introduction

jforests is a Java library that implements many tree-based learning algorithms.

jforests can be used for regression, classification and ranking problems. The latest release can be downloaded from https://github.com/yasserg/jforests/releases

The following tutorial shows how jforests can be used for learning a ranking model using the LambdaMART algorithm.

Learning to Rank with LambdaMART

Data Sets Format

jforests uses the following format for its input data sets (same as the one used in SVMLight):

<line> .=. <relevance> qid:<qid> <feature>:<value> ... <feature>:<value> 
<relevance> .=. <integer>
<qid> .=. <positive integer>
<feature> .=. <positive integer>
<value> .=. <float>

For this tutorial, we will use the sample data set which is available here.
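For illustration, a few lines in this format might look as follows. The relevance labels, query ids, and feature values here are invented for this example and are not taken from the sample data set:

```
2 qid:1 1:0.32 2:0.15 3:0.80
0 qid:1 1:0.10 2:0.22 3:0.05
1 qid:2 1:0.55 2:0.08 3:0.41
```

Each line describes one document: its graded relevance label, the id of the query it belongs to, and a sparse list of feature:value pairs.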

Converting Data Sets to Binary Format

In order to speed up the computations, jforests converts its input data sets to binary format. We assume that you have unzipped the above sample data set into a folder and are currently in that folder. You should also have downloaded the latest jforests jar file, renamed it to 'jforests.jar', and placed it in the same folder.

The following command can be used for converting data sets to binary format:

java -jar jforests.jar --cmd=generate-bin --ranking --folder . --file train.txt --file valid.txt --file test.txt

As this command shows, we convert 'train.txt', 'valid.txt', and 'test.txt' to binary format. As a result, 'train.bin', 'valid.bin', and 'test.bin' are generated.

Learning the Ranking Model

Once the input data sets are converted to the binary format, a ranking model can be trained on them.

First you need to specify the parameters of your machine learning algorithm. The following is a sample set of parameters for the LambdaMART algorithm:

trees.num-leaves=7
trees.min-instance-percentage-per-leaf=0.25
boosting.learning-rate=0.05
boosting.sub-sampling=0.3
trees.feature-sampling=0.3

boosting.num-trees=2000
learning.algorithm=LambdaMART-RegressionTree
learning.evaluation-metric=NDCG

params.print-intermediate-valid-measurements=true

Create a 'ranking.properties' file in the current folder and save the above config in it.

Then the following command can be used for training a LambdaMART ensemble and storing it in the 'ensemble.txt' file:

java -jar jforests.jar --cmd=train --ranking --config-file ranking.properties --train-file train.bin --validation-file valid.bin --output-model ensemble.txt

Predicting Scores of Documents

Once you have the LambdaMART ensemble, you can use it for predicting scores of test documents. The following command performs this step and stores the results in the 'predictions.txt' file.

java -jar jforests.jar --cmd=predict --ranking --model-file ensemble.txt --tree-type RegressionTree --test-file test.bin --output-file predictions.txt

Scores can then be used for measuring NDCG or other information retrieval measures.
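The NDCG computation itself can be done with any evaluation tool. As an illustration, the following standalone Java sketch computes NDCG@k under one common formulation (gain 2^rel - 1 with a log2 position discount); this formulation is an assumption for illustration, and jforests' internal NDCG implementation may differ in details such as the gain function or tie handling.

```java
import java.util.Arrays;

public class NdcgDemo {

    // DCG@k for relevance labels listed in predicted-rank order.
    // Uses the common gain 2^rel - 1 and a log2(position + 1) discount
    // (an assumption for illustration; other formulations exist).
    static double dcg(int[] rels, int k) {
        double sum = 0.0;
        for (int i = 0; i < Math.min(k, rels.length); i++) {
            double gain = Math.pow(2, rels[i]) - 1;
            double discount = Math.log(i + 2) / Math.log(2); // log2(rank + 1), rank is 1-based
            sum += gain / discount;
        }
        return sum;
    }

    // NDCG@k: DCG of the predicted ranking divided by DCG of the ideal
    // (descending-relevance) ranking of the same documents.
    static double ndcg(int[] rankedRels, int k) {
        int[] ideal = rankedRels.clone();
        Arrays.sort(ideal);
        // Reverse into descending order.
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int t = ideal[i]; ideal[i] = ideal[j]; ideal[j] = t;
        }
        double idealDcg = dcg(ideal, k);
        return idealDcg == 0 ? 0 : dcg(rankedRels, k) / idealDcg;
    }

    public static void main(String[] args) {
        // Relevance labels of one query's documents, in the order induced
        // by the predicted scores.
        System.out.println(ndcg(new int[]{2, 3, 0, 1}, 4));
    }
}
```

In practice, you would sort each query's documents by the scores in 'predictions.txt' and pass their relevance labels, in that order, to ndcg.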

Advanced Ranking Options

The evaluation measure that LambdaMART optimizes can be changed via the learning.evaluation-metric entry in the ranking.properties file. Currently, NDCG is supported, as well as risk-sensitive evaluation measures such as URisk and TRisk; see RiskSensitiveLambdaMART.

Source Code

Source code is available from the GitHub repository: https://github.com/yasserg/jforests

Citation Policy

If you use jforests for research purposes, please use the following citation:

Y. Ganjisaffar, R. Caruana, C. Lopes, Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models, in SIGIR 2011, Beijing, China.

BibTeX:

@inproceedings{Ganji:2011:SIGIR,
	author = {Yasser Ganjisaffar and Rich Caruana and Cristina Lopes},
	title = {Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models},
	booktitle = {Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval},
	series = {SIGIR '11},
	year = {2011},
	isbn = {978-1-4503-0757-4},
	location = {Beijing, China},
	pages = {85--94},
	numpages = {10},
	doi = {http://doi.acm.org/10.1145/2009916.2009932},
	acmid = {2009932},
	publisher = {ACM},
	address = {New York, NY, USA},
}

If you use risk-sensitive learning to rank, please see RiskSensitiveLambdaMART for citation information.

License

Copyright (c) 2011-2015 Yasser Ganjisaffar

Published under Apache License 2.0

jforests's People

Contributors

cmacdonald, heinrichreimer, yasserg


jforests's Issues

NullPointerException in subLearner.setTreeWeight when running gradient boosting

Can someone provide a sample config file for gradient boosting? Or better yet, a test set that works?

I’ve been trying to run gradient boosting and think that either:

  1. I don’t have it configured properly, or
  2. there is a bug

I keep getting the following error:

Finished loading datasets.
java.lang.NullPointerException
at edu.uci.jforests.learning.boosting.GradientBoosting.learn(GradientBoosting.java:105)
at edu.uci.jforests.applications.ClassificationApp.run(ClassificationApp.java:244)
at edu.uci.jforests.applications.Runner.train(Runner.java:103)
at edu.uci.jforests.applications.Runner.main(Runner.java:222)

The error seems to be from trying to set the tree weight with a null.

jforests crashed with sample-training-data

What steps will reproduce the problem?
1. convert sample-ranking-data to binary format
2. train with sample-ranking-data using sample-ranking-config.properties

What is the expected output? What do you see instead?
jforests should work with the given sample-training-data, but it crashed instead 
(see output below)

What version of the product are you using? On what operating system?
jforests-0.3.jar on centOS6.4

Please provide any additional information below.

[resources]$ java -jar jforests-0.3.jar --cmd=generate-bin --ranking --folder . 
--file train.txt
Generating binary files for ranking data sets...
Processing: ./train.txt
10000
20000
30000
40000
Loading values...  [Done in: 0 seconds.]
Making distributions...  [Done in: 0 seconds.]
Making bins...
Feature: 0, type: SHORT
Feature: 1, type: SHORT
Feature: 2, type: BYTE
Feature: 3, type: BYTE
Feature: 4, type: SHORT
Feature: 5, type: NULL
Feature: 6, type: NULL
Feature: 7, type: NULL
Feature: 8, type: NULL
Feature: 9, type: NULL
Feature: 10, type: SHORT
Feature: 11, type: SHORT
Feature: 12, type: SHORT
Feature: 13, type: SHORT
Feature: 14, type: SHORT
Feature: 15, type: SHORT
Feature: 16, type: SHORT
Feature: 17, type: SHORT
Feature: 18, type: SHORT
Feature: 19, type: SHORT
Feature: 20, type: SHORT
Feature: 21, type: SHORT
Feature: 22, type: SHORT
Feature: 23, type: SHORT
Feature: 24, type: SHORT
Feature: 25, type: SHORT
Feature: 26, type: SHORT
Feature: 27, type: SHORT
Feature: 28, type: SHORT
Feature: 29, type: SHORT
Feature: 30, type: SHORT
Feature: 31, type: SHORT
Feature: 32, type: SHORT
Feature: 33, type: SHORT
Feature: 34, type: SHORT
Feature: 35, type: SHORT
Feature: 36, type: SHORT
Feature: 37, type: SHORT
Feature: 38, type: SHORT
Feature: 39, type: SHORT
Feature: 40, type: SHORT
Feature: 41, type: SHORT
Feature: 42, type: SHORT
Feature: 43, type: BYTE
Feature: 44, type: SHORT
Feature: 45, type: BIT
  [Done in: 0 seconds.]
Making features...  [Done in: 0 seconds.]
Creating bin file...  [Done in: 0 seconds.]
[resources]$ java -jar ~/rank/jforests-0.3.jar --cmd=train --ranking 
--config-file sample-ranking-config.properties --train-file train.bin 
--output-model ensemble.txt
Loading datasets...
Finished loading datasets.
java.lang.NullPointerException
    at edu.uci.jforests.learning.boosting.LambdaMART.preprocess(LambdaMART.java:122)
    at edu.uci.jforests.learning.boosting.GradientBoosting.learn(GradientBoosting.java:97)
    at edu.uci.jforests.applications.ClassificationApp.run(ClassificationApp.java:244)
    at edu.uci.jforests.applications.Runner.train(Runner.java:100)
    at edu.uci.jforests.applications.Runner.main(Runner.java:247)
[resources]$ java -version
java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
[resources]$ uname -a
Linux centos 2.6.32-358.23.2.el6.x86_64 #1 SMP Wed Oct 16 18:37:12 UTC 2013 
x86_64 x86_64 x86_64 GNU/Linux

Original issue reported on code.google.com by [email protected] on 5 Nov 2013 at 2:28

ranking.valid-ndcg-truncation

Hi,
Just to make sure: does ranking.valid-ndcg-truncation actually reflect the 'k' of NDCG@k used for optimization on the training data?

Thanks

Unsupported major.minor version

What steps will reproduce the problem?

Running this command on any train.txt:

java -jar jforests.jar --cmd=generate-bin --ranking --folder . --file train.txt 

What is the expected output? What do you see instead?

Expected output=bin file

Error reported:

Exception in thread "main" java.lang.UnsupportedClassVersionError: 
edu/uci/jforests/applications/Runner : Unsupported major.minor version 51.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

What version of the product are you using? On what operating system?

0.3

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 2 Mar 2012 at 7:56

Support floats/doubles as labels

Currently, you cannot target a plain regression task, as binary file generation assumes that labels are integers. This is inconsistent with the rest of the jforests codebase.

license not specified

The old Google Code web site, https://code.google.com/p/jforests/ , says that this project is licensed under the Apache 2 license, but the project has no license file and the GitHub repository makes no mention of the license. The lack of a license likely discourages use of this library.

Error with RandomForest

I can build a model using RandomForest, but I can't get the scores because I get some errors.
If I run with --tree-type=RegressionTree:

Exception in thread "main" java.lang.NumberFormatException: For input string: "0.0"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at edu.uci.jforests.util.ArraysUtil.loadDoubleArrayFromLine(ArraysUtil.java:224)
at edu.uci.jforests.learning.trees.regression.RegressionTree.loadCustomData(RegressionTree.java:206)
at edu.uci.jforests.learning.trees.Ensemble.loadFromFile(Ensemble.java:133)
at edu.uci.jforests.applications.Runner.predict(Runner.java:142)
at edu.uci.jforests.applications.Runner.main(Runner.java:227)

and if I run with --tree-type=DecisionTree:

Exception in thread "main" java.lang.Exception: Invalid input.
at edu.uci.jforests.util.ArraysUtil.loadDoubleMatrixFromLine(ArraysUtil.java:232)
at edu.uci.jforests.learning.trees.decision.DecisionTree.loadCustomData(DecisionTree.java:112)
at edu.uci.jforests.learning.trees.Ensemble.loadFromFile(Ensemble.java:133)
at edu.uci.jforests.applications.Runner.predict(Runner.java:144)
at edu.uci.jforests.applications.Runner.main(Runner.java:227)

Avoid generate-bin step

It would be good if training or ranking could be done directly, without first generating the binary representations.

Original issue reported on code.google.com by [email protected] on 23 Apr 2014 at 8:24
