sofia-ml's Introduction

sofia-ml

Project homepage: http://code.google.com/p/sofia-ml/

==Introduction==

The suite of fast incremental algorithms for machine learning (sofia-ml) can be used for training models for classification or ranking, using several different techniques. This release is intended to aid researchers and practitioners who require fast methods for classification and ranking on large, sparse data sets.

Supported learners include:

    * Pegasos SVM
    * Stochastic Gradient Descent (SGD) SVM
    * Passive-Aggressive Perceptron
    * Perceptron with Margins
    * ROMMA 

These learners can be configured for classification and ranking, with several sampling methods available.

This implementation gives very fast training times. For example, 100,000 Pegasos SVM training iterations can be performed on data from the CCAT task of the RCV1 benchmark data set (roughly 780,000 examples) in 0.1 CPU seconds on an ordinary 2.4GHz laptop, with no loss in classification performance compared with other SVM methods. On the LETOR learning-to-rank benchmark tasks, 100,000 Pegasos SVM rank steps complete in 0.2 CPU seconds on an ordinary laptop.

The primary computational bottleneck is actually reading the data off disk; sofia-ml reads and parses data from disk substantially faster than other SVM packages we tested. For example, sofia-ml can read and parse data nearly 10 times faster than the reference Pegasos implementation by Shalev-Shwartz, and nearly 3 times faster than svm_perf by Joachims.

This package provides a commandline utility for training models and using them to predict on new data, and also exposes an API for model training and prediction. The underlying libraries for data sets, weight vectors, and example vectors are also provided for researchers wishing to use these classes to implement other algorithms.

==Quick Start==

These quick-start instructions assume the use of the unix/linux commandline, with g++ installed. There are no external code dependencies.

Step 1 Check out the code:

> svn checkout http://sofia-ml.googlecode.com/svn/trunk/sofia-ml sofia-ml-read-only

Step 2 Compile the code:

> cd sofia-ml-read-only/src/
> make
> ls ../sofia-ml
# Executable should be in main sofia-ml-read-only directory.

# If the above did not succeed, run the unit tests to help locate the problem:
> make clean
> make all_test

Step 3 Test the code:

> cd ..
> ./sofia-ml
# This should display the set of commandline flags and descriptions.

# Train a model on the demo training data.
> ./sofia-ml --learner_type pegasos --loop_type stochastic --lambda 0.1 --iterations 100000 --dimensionality 150000 --training_file demo/demo.train --model_out demo/model
# This should display something like the following:
Reading training data from: demo/demo.train
Time to read training data: 0.056134
Time to complete training: 0.075364
Writing model to: demo/model
   Done.

# Test the model on the demo data.
> ./sofia-ml --model_in demo/model --test_file demo/demo.train --results_file demo/results.txt
# Should display the following:
Reading model from: demo/model
   Done.
Reading test data from: demo/demo.train
Time to read test data: 0.046729
Time to make test prediction results: 0.000844
Writing test results to: demo/results.txt
   Done.

# Examine a few results in the results file:
> head -5 demo/results.txt
# Format is: <prediction value>\t<label from test file>.  Each line in the results
# file corresponds to the same line (in order) in the test file.
1.02114 1
1.18046 1
-1.24609        -1
-1.12822        -1
-1.41046        -1
# Note that exact results may vary slightly because these algorithms train
# by randomly sampling one example at a time.

# Evaluate the results:
> perl eval.pl demo/results.txt
# Should display something like:

Results for demo/results.txt: 

Accuracy  0.9880  (using threshold 0.00) (988/1000)
Precision 0.9719  (using threshold 0.00) (311/320)
Recall    0.9904  (using threshold 0.00) (311/314)
ROC area: 0.999406 

Total of 1000 trials. 

# Note that this evaluation script has limited functionality.  For more
# options, we recommend using the perf software by Rich Caruana (developed for
# the KDD Cup 2004), available at: http://kodiak.cs.cornell.edu/kddcup/software.html
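
For reference, the accuracy line above can be reproduced with a few lines of standalone code (an illustrative C++ sketch, not part of the package; the hard-coded default path is an assumption):

#include <cstdio>

// Score a results file of "<prediction>\t<label>" lines at threshold 0.
// Illustrative only; the bundled eval.pl computes this and more.
int main(int argc, char** argv) {
  std::FILE* f = std::fopen(argc > 1 ? argv[1] : "demo/results.txt", "r");
  if (f == NULL) return 1;
  float prediction, label;
  int correct = 0, total = 0;
  while (std::fscanf(f, "%f %f", &prediction, &label) == 2) {
    ++total;
    // Predictions at or above 0 count as positive, matching "threshold 0.00".
    if ((prediction >= 0.0f) == (label > 0.0f)) ++correct;
  }
  std::fclose(f);
  std::printf("Accuracy %.4f (%d/%d)\n",
              total > 0 ? float(correct) / total : 0.0f, correct, total);
  return 0;
}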

==Data Format==

This package uses the popular SVM-light sparse data format.

<class-label> <feature-id>:<feature-value> ... <feature-id>:<feature-value>\n
<class-label> qid:<optional-query-id> <feature-id>:<feature-value> ... <feature-id>:<feature-value>\n
<class-label> <feature-id>:<feature-value> ... <feature-id>:<feature-value># Optional comment or extra data, following the optional "#" symbol.\n

The feature ids are expected to be in ascending numerical order. The lowest allowable feature id is 1 (0 is reserved internally for the bias term). Any feature not specified is assumed to have value 0, allowing a sparse representation.

The class label for test data is required but not used; it's okay to put in a dummy placeholder value such as 0 for test data. For binary classification problems, the training labels should be 1 or -1. For ranking problems, the labels may be any numeric value, with higher values judged as "more preferred".

Currently, the comment string is not used. However, it is available for use in other algorithms, and can also be useful to aid in bookkeeping of data files.

Examples:

# Class label is 1, feature 1 has value 1.2, feature 2 (not listed) has value 0,
# and feature 3 has value -0.5.
1 1:1.2 3:-0.5

# Class label is -1, belongs to qid 3, and all feature values are zero except
# for feature 5011 with value 1.2.
-1 qid:3 5011:1.2

# Class label is -1, feature 1 has value 7, feature 3 has value -0.5, and the
# comment string is "This example is especially interesting."
-1 1:7 3:-0.5#This example is especially interesting.
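
To make the format concrete, here is a minimal sketch of parsing one such line (illustrative C++ only; sofia-ml's actual parser lives in sf-sparse-vector.cc, and the struct and function names here are hypothetical):

#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for one parsed example.
struct ParsedExample {
  float label;
  std::string qid;                               // empty if no qid was given
  std::vector<std::pair<int, float> > features;  // ascending ids, each >= 1
  std::string comment;                           // text after the optional '#'
};

bool ParseLine(std::string line, ParsedExample* out) {
  // Split off the optional trailing comment first.
  const std::size_t hash_pos = line.find('#');
  if (hash_pos != std::string::npos) {
    out->comment = line.substr(hash_pos + 1);
    line = line.substr(0, hash_pos);
  }
  // The first token is the class label.
  int offset = 0, consumed = 0;
  if (std::sscanf(line.c_str(), "%f%n", &out->label, &consumed) != 1)
    return false;
  offset = consumed;
  // Remaining tokens are "qid:<query-id>" or "<feature-id>:<feature-value>".
  char token[1024];
  while (std::sscanf(line.c_str() + offset, "%1023s%n", token, &consumed) == 1) {
    offset += consumed;
    int id;
    float value;
    if (std::strncmp(token, "qid:", 4) == 0) {
      out->qid = token + 4;
    } else if (std::sscanf(token, "%d:%f", &id, &value) == 2) {
      out->features.push_back(std::make_pair(id, value));
    } else {
      return false;
    }
  }
  return true;
}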

==Commandline Details==

File Input and Output

--model_in
    * Read in a model from this file before doing any training or testing. 

--model_out
    * Write the model to this file when done with everything. 

--training_file
    * File to be used for training. When set, causes model training to occur. 

--test_file
    * File to be used for testing. When set, causes the current model (either loaded from --model_in or trained from --training_file) to be tested on the test data. 

--results_file
    * File to which to write predictions, when --test_file is used. Results for each line are in the format <prediction>\t<label from test file>\n and correspond line-by-line with the examples from the --test_file. 

Learning Options

--learner_type
    * Type of learner to use.
    * Options are:
          o pegasos
              Use the Pegasos SVM learning algorithm. --lambda sets the regularization parameter, with values closer to zero giving less regularization. Note that Pegasos enforces a hard constraint that the model weight vector must lie within an L2 ball of radius at most 1/sqrt(lambda). Also relies on --eta_type. (A sketch of one Pegasos update step appears after this list.)
          o sgd-svm
              Use the SGD-SVM learning algorithm. --lambda sets the regularization parameter, with values closer to zero giving less regularization. Also relies on --eta_type.
          o passive-aggressive
              Use the Passive-Aggressive Perceptron learning algorithm. --passive_aggressive_c sets the largest step size to be taken on any update step; this operates as a capacity term, with values closer to zero encouraging simpler models. --passive_aggressive_lambda forces the model weight vector to lie within an L2 ball of radius 1/sqrt(passive_aggressive_lambda).
          o margin-perceptron
              Use the Perceptron with Margins algorithm. --perceptron_margin_size sets the update margin. When set to 0, this is exactly equivalent to the classical Perceptron by Rosenblatt. When set to 1, this is equivalent to optimizing SVM hinge-loss without regularization. Increasing values may give additional tolerance to noise. Also relies on --eta_type.
          o romma
              Use the ROMMA algorithm. No parameters to set.
          o logreg-pegasos
              Use Logistic Regression with Pegasos updates; this optimizes logistic loss and enforces Pegasos-style regularization and constraints, with --lambda as the regularization parameter. Also relies on --eta_type.
    * Default: pegasos 
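
As referenced above, here is a minimal sketch of one Pegasos update step on a dense weight vector, assuming a label y in {-1, +1} (illustrative C++ only; the actual implementation in sofia-ml-methods.cc works on sparse vectors, and all names here are hypothetical):

#include <cmath>
#include <cstddef>
#include <vector>

// One Pegasos step on example (x, y) at 1-based iteration i.
void PegasosStep(std::vector<float>& w, const std::vector<float>& x,
                 float y, float lambda, int i) {
  // Compute the margin with the current weights.
  float dot = 0.0f;
  for (std::size_t j = 0; j < w.size(); ++j) dot += w[j] * x[j];
  const float eta = 1.0f / (lambda * i);  // the --eta_type pegasos schedule
  // Regularization shrink: w <- (1 - eta * lambda) * w.
  for (std::size_t j = 0; j < w.size(); ++j) w[j] *= (1.0f - eta * lambda);
  // Subgradient step on the hinge loss when the example is inside the margin.
  if (y * dot < 1.0f) {
    for (std::size_t j = 0; j < w.size(); ++j) w[j] += eta * y * x[j];
  }
  // Hard constraint: project w back into the L2 ball of radius 1/sqrt(lambda).
  float norm_sq = 0.0f;
  for (std::size_t j = 0; j < w.size(); ++j) norm_sq += w[j] * w[j];
  const float max_norm = 1.0f / std::sqrt(lambda);
  if (norm_sq > max_norm * max_norm) {
    const float scale = max_norm / std::sqrt(norm_sq);
    for (std::size_t j = 0; j < w.size(); ++j) w[j] *= scale;
  }
}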

--loop_type
    * Type of sampling loop to use for training, controlling how examples are selected.
    * Options are:
          o stochastic   
              Perform normal stochastic sampling for stochastic gradient descent, for training binary classifiers. On each iteration, pick a new example uniformly at random from the data set.
          o balanced-stochastic   
              Perform balanced sampling from the positives and negatives in the data set. On each iteration, sample one positive example uniformly at random from the set of all positives, and one negative example uniformly at random from the set of all negatives. This can be useful for training binary classifiers on data with a rare minority class. (A sketch of these sampling strategies appears after this list.)
          o rank 
              Perform indexed sampling of candidate pairs for pairwise learning to rank. Useful when there are examples from several different qid groups.
          o roc   
              Perform indexed sampling to optimize ROC Area.
          o query-norm-rank 
              Perform sampling of candidate pairs, giving equal weight to each qid group regardless of its size. Currently this is implemented with rejection sampling rather than indexed sampling, so this may run more slowly. 
    * Default: stochastic 
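
As a rough illustration of the first two loop types (hypothetical helper names; the actual sampling loops live in sofia-ml-methods.cc):

#include <cstdlib>
#include <utility>
#include <vector>

// Assumed helper: a uniform random integer in [0, n).
int RandInt(int n) { return std::rand() % n; }

// --loop_type stochastic: one example uniformly at random per iteration.
int SampleStochastic(int num_examples) { return RandInt(num_examples); }

// --loop_type balanced-stochastic: one positive and one negative per
// iteration, each sampled uniformly within its own class.
std::pair<int, int> SampleBalanced(const std::vector<int>& positive_indices,
                                   const std::vector<int>& negative_indices) {
  return std::make_pair(
      positive_indices[RandInt(static_cast<int>(positive_indices.size()))],
      negative_indices[RandInt(static_cast<int>(negative_indices.size()))]);
}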

--eta_type
    * Type of learning rate update to use (summarized in the sketch after this list).
    * Options are:
          o basic
              On the i-th iteration, the learning rate eta is set to: 1000 / (i + 1000)
          o pegasos
              On the i-th iteration, the learning rate eta is set to: 1 / (i * lambda)
          o constant
              Always use learning rate eta of 0.02. 
    * Default: pegasos 
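
The three schedules can be summarized in a few lines (a hypothetical helper, not sofia-ml's API; i is the 1-based iteration number):

#include <string>

// Learning rate eta on the i-th iteration for each --eta_type option.
float LearningRate(const std::string& eta_type, int i, float lambda) {
  if (eta_type == "basic")   return 1000.0f / (i + 1000.0f);
  if (eta_type == "pegasos") return 1.0f / (i * lambda);
  return 0.02f;              // "constant"
}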

--dimensionality
    * The largest feature id in the training data set, plus one.
    * Default: 2^17 = 131072 

--iterations
    * Number of stochastic gradient steps to take.
    * Default: 100000 

--lambda
    * Value of lambda for SVM regularization, used by both Pegasos SVM and SGD-SVM.
    * Default: 0.1 

--passive_aggressive_c
    * Maximum size of any step taken in a single passive-aggressive update. 

--passive_aggressive_lambda
    * Lambda for pegasos-style projection for passive-aggressive update.
    * When set to 0 (default) no projection is performed. 

--perceptron_margin_size
    * Width of margin for perceptron with margins.
    * The default of 1 is equivalent to optimizing unregularized SVM hinge-loss. 

--hash_mask_bits
    * When set to a non-zero value, causes the use of a hashed weight vector with hashed cross-product features. This allows learning on conjunctions of features, at some increase in computational cost; see the sketch below. Note that this flag must be set for both training and testing to function properly.
    * The size of the hash table is set to 2^--hash_mask_bits.
    * The default value of 0 means that hashed cross-product features are not used. 
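
A rough sketch of how a cross-product feature can be hashed into such a table (illustrative only; sofia-ml's actual hash function lives in sf-hash-inline.cc, and the mixing constant below is an arbitrary choice):

#include <cstddef>

// Map a pair of feature ids to a bucket in a weight vector of size
// 2^hash_mask_bits. With --hash_mask_bits 8 the mask is 255.
std::size_t CrossFeatureBucket(int id_a, int id_b, int hash_mask_bits) {
  const std::size_t hash_mask = (std::size_t(1) << hash_mask_bits) - 1;
  const std::size_t mixed = std::size_t(id_a) * 0x9E3779B9u + std::size_t(id_b);
  return mixed & hash_mask;
}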

Other Options

--random_seed
    * When set to a non-zero value, use this seed instead of the seed from the system clock.
    * This can be useful in testing and in parameter tuning.
    * Default: 0 

--training_objective
    * Compute value of objective function on training data, after training.
    * Default is not to do this. 

==References==

If you use this source code for scientific research, please cite the following:

    * D. Sculley. Large Scale Learning to Rank. NIPS Workshop on Advances in Ranking, 2009. Presents the indexed sampling methods used for learning to rank, including the rank and roc loops. 

Additional reading and references:

    * K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. J. Mach. Learn. Res., 7, 2006. Presents the Passive-Aggressive Perceptron algorithm. 

    * T. Joachims. Optimizing search engines using clickthrough data. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002. Presents the RankSVM objective function, a pairwise objective function used by the rank loop method in sofia-ml. 

    * Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Mach. Learn., 46(1-3), 2002. Presents the ROMMA algorithm. 

    * S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th international conference on Machine learning, 2007. Presents the Pegasos algorithm. 

sofia-ml's People

Contributors

glycerine, sculleyd


sofia-ml's Issues

Multi-Label Passive-Aggressive

Hello D.

I've started to work on the multi-label branch. I have made the following 
changes:

- Parse comma-separated list of labels.

- Add a MultiplePassOuterLoop routine: it shuffles the dataset and makes 
several passes over it. It's more intuitive to determine a number of passes and 
results can sometimes be more stable on some datasets.

- Add a MultiLabelWeightVector. It is compatible with other weight classes 
(both API-wise and file-wise). It also has a bunch of additional methods such 
as "SelectLabel".

- Add Multi-Label Passive-Aggressive. Strictly speaking, the learner optimizes 
a label ranking (relevant labels should be ranked higher than irrelevant 
labels). On the 20 Newsgroups dataset, it gives 82% accuracy (liblinear gave 
85%). (I didn't optimize the hyperparameters, though.)

- Add a "--prediction_type multi-label" option.

- Infer the number of dimensions from the training dataset when --dimensionality 
is set to 0.


I wanted to add one-vs-all but unfortunately, the fact that the labels are 
attached to the vectors makes it hard (or inefficient): I need to be able to 
pass +1 or -1 instead of the real label to the update function.

Possible short-term plans could include optimizing the multi-class hinge loss 
and the multinomial logistic loss by SGD.

Original issue reported on code.google.com by [email protected] on 28 Apr 2011 at 8:38

Issues with dimensionality off-by-one

What steps will reproduce the problem?
1. Create this training file:

======= train.txt =======
1 1:1 2:.1 3:.1 200:1
1 1:1.2 2:.01 3:.01 200:1
1 1:3 2:.2 3:.41 200:1
-1 3:4 200:1
-1 2:3 200:1
-1 1:.1 2:3 3:2 200:1
====================
2. ./sofia-ml-read-only/sofia-ml --learner_type pegasos --loop_type stochastic 
--lambda 0.1 --iterations 100000 --dimensionality 200 --training_file train.txt 
--model_out debug-model.txt                                                     


3. debug-model.txt has:
-5.01486 -0.169397 -10.0628 -10.0518 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0\
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

The model should spit out 201 terms, the first being the bias term. Instead 
it spits out 200, and clips off the last weight. When I set dimensionality to 
201, I get what I would expect:

0.263645 0.561799 -0.509116 -0.382012 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 \
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 \
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0.263645  

This was compiled from source a couple weeks ago. The program should probably 
crash if you say dimensionality is 200 and there is a "200:x" term in the 
sparse vector representation, unless the no-bias flag is set.

Original issue reported on code.google.com by [email protected] on 26 Feb 2013 at 3:24

sf-sparse-vector.cc bug, in Init function

Run make all_test, and an error occurs while testing sf-sparse-vector_test: 
the assertion assert(x1.GetGroupId() == "2"); fails at line 27 of file 
sf-sparse-vector_test.cc.

Solution: add the line "group_id_c_string[end - position] = 0;" at 
sf-sparse-vector.cc line 145, because a string generated by strncpy is not 
always '\0'-terminated.
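
A minimal illustration of the underlying C pitfall (a hypothetical buffer, not the actual sf-sparse-vector.cc code):

#include <cassert>
#include <cstring>

int main() {
  char buffer[4];
  // strncpy does NOT write a terminating '\0' when the source is at least
  // as long as the count, so buffer is not yet a valid C string here.
  std::strncpy(buffer, "2:1.0", 3);
  buffer[3] = '\0';  // the fix: terminate explicitly after the copy
  assert(std::strcmp(buffer, "2:1") == 0);
  return 0;
}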


Original issue reported on code.google.com by [email protected] on 5 Nov 2012 at 4:22

k-means question

For the k-means training, does label (in my case, face label) have an influence 
on the clustering?


Original issue reported on code.google.com by [email protected] on 24 Jun 2013 at 1:01

sofia doesn't work on sparse datasets containing lines in which all features are 0

Hi there.

What steps will reproduce the problem?
./sofia-ml --learner_type pegasos --loop_type stochastic --lambda 0.1 
--iterations 10000 --dimensionality 450000 --training_file ../data/m256 
--model_out demo/model


What is the expected output?

What do you see instead?
Reading training data from: ../data/final/catted/train/m256
Segmentation fault (core dumped)

What version of the product are you using? On what operating system?
Ubuntu 13.10 64bit

Please provide any additional information below.

I guess it is because my training data (attached) is so sparse that in some 
lines all features are zero. Can sofia-ml support such a dataset? Thank you!

Original issue reported on code.google.com by [email protected] on 18 Mar 2014 at 3:34

Attachments:

sofia-kmeans diverging with increasing number of iterations?

What steps will reproduce the problem?
1. Create 2-dimensional data drawn from 2-dim multivariate Gaussian 
distributions with different means and variance = 1, e.g. 21 different 
distributions with, say, 1000 draws each, for a total of 21,000 points. (I have 
tried many different variations, with no positive effect on the reported 
issue.)

2. Train sofia-kmeans with any batch size (tested 500:500:5000) and with any 
number of k clusters (tested 64 128 256) using mini_batch_kmeans with fixed 
random seed.

command line: sofia-kmeans --k 64 --dimensionality 3 --random_seed 124 
--init_type random --opt_type mini_batch_kmeans --mini_batch_size 500 
--iterations 10 --objective_after_init --objective_after_training 
--training_file traindatafile.svmlight --model_out modelfile.sofia

3. Calculate the training error
command line: sofia-kmeans --model_in modelfile.sofia --test_file 
traindatafile.svmlight --objective_on_test --cluster_assignments_out 
trainingassignments.sofia

4. Run this in a loop as a function of the number of iterations. I ran [1, 10, 
100e3, 500e3, and 1000e3].

What is the expected output? What do you see instead?
I expect the training error to fall as a function of the number of iterations 
used. Since the seed is fixed, the random initialization is the same each time. 
The error does fall until 100e3 iterations, but then it starts to diverge, i.e. 
the training error starts increasing dramatically, becoming even larger than at 
the random initialization. This is very puzzling to me.

What version of the product are you using? On what operating system?
svn checkout http://sofia-ml.googlecode.com/svn/trunk/sofia-ml 
sofia-ml-read-only, performed 10 Mar 2015.
OS: Ubuntu 14.04

Please provide any additional information below.
Attached are the commands and output from sofia-kmeans (sofia_kmeans.txt); 
furthermore, all model, assignment, and data files are provided to reproduce 
these findings (tmp.zip).


Original issue reported on code.google.com by [email protected] on 11 Mar 2015 at 12:36

Attachments:

lambda parameter not passed into SvmObjective correctly

in sofia-ml.cc

337       float objective = sofia_ml::SvmObjective(training_data,
338                                          *w,
339                                           CMD_LINE_BOOLS["--lambda"]);

Note that lambda is passed in from CMD_LINE_BOOLS, not CMD_LINE_FLOATS, which 
results in lambda = 0. In TrainModel the correct value of lambda is used:

176   float lambda = CMD_LINE_FLOATS["--lambda"];

Original issue reported on code.google.com by [email protected] on 9 May 2013 at 1:20

malloc error with --hash_mask_bits

Download & build, run the demo commands adding --hash_mask_bits to the 
arguments.  Training proceeds fine, but testing of the model gives the malloc 
error:

$ ./sofia-ml --learner_type pegasos --loop_type stochastic --lambda 0.1 
--iterations 100000 --dimensionality 150000 --training_file demo/demo.train 
--model_out demo/model --hash_mask_bits 8
hash_mask_ 255
Reading training data from: demo/demo.train
Time to read training data: 0.061278
Time to complete training: 52.3639
Writing model to: demo/model
   Done.


$ ./sofia-ml --model_in demo/model --test_file demo/demo.train --results_file 
demo/results.txt --hash_mask_bits 8
hash_mask_ 255
sofia-ml(6235) malloc: *** error for object 0x800000: pointer being freed was 
not allocated
*** set a breakpoint in malloc_error_break to debug
Reading model from: demo/model
   Done.
Reading test data from: demo/demo.train
Time to read test data: 0.06114
Time to make test prediction results: 0.008274
Writing test results to: demo/results.txt
   Done.


========

$ g++ --version
i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5659)





Original issue reported on code.google.com by [email protected] on 18 Jun 2010 at 6:43

sf-weight-vector fails unit test

What steps will reproduce the problem?
1. make all_tests

What is the expected output?

PASS.

What do you see instead?

sf-weight-vector_test: sf-weight-vector_test.cc:95: int main(int, char**): 
Assertion `w_6.ValueOf(3) == 1' failed.

What version of the product are you using? On what operating system?

Latest sofia-ml from svn, Debian 5, GCC 4.3.2.

Original issue reported on code.google.com by [email protected] on 14 Feb 2010 at 3:22

Assertion '!cluster_centers_.empty()' fails and crashes program

What steps will reproduce the problem?
cd "sofia-ml-read-only"

./sofia-kmeans --k 100 --init_type random --opt_type mini_batch_kmeans 
--mini_batch_size 100 --iterations 1000 --cluster_mapping_type rbf_kernel 
--test_file <test file location goes here> --cluster_mapping_out <cluster 
mapping output location goes here>

What is the expected output? What do you see instead?

The expected output is a cluster mapping text file. Instead, I see:

cd "sofia-ml-read-only"

sofia-kmeans: sf-cluster-centers.cc:93: float 
SfClusterCenters::SqDistanceToClosestCenter(const SfSparseVector&, int*) const: 
Assertion `!cluster_centers_.empty()' failed.

What version of the product are you using? On what operating system?

I don't know where to find the product version. The most recent version is the 
one I have been using.
Operating system: Ubuntu 12.04.5 LTS

Please provide any additional information below.

N/A

Original issue reported on code.google.com by [email protected] on 23 Sep 2014 at 6:46

problems building sofia-ml

What steps will reproduce the problem?
1. run make with gcc version 4.4.3 20100127 (Red Hat 4.4.3-4) (GCC)

What is the expected output? What do you see instead?

a proper build

What version of the product are you using? On what operating system?

trunk on 2010-03-30 15:52

Please provide any additional information below.

gcc output:

:sofia-ml-read-only/src$ make
g++ -O3 -lm -Wall -o sofia-ml sofia-ml.cc sofia-ml-methods.cc
sf-weight-vector.cc sf-sparse-vector.cc sf-data-set.cc
sf-hash-weight-vector.cc sf-hash-inline.cc
sf-sparse-vector.cc: In member function 'void SfSparseVector::Init(const char*)':
sf-sparse-vector.cc:132: error: 'sscanf' was not declared in this scope
sf-hash-weight-vector.cc: In constructor 'SfHashWeightVector::SfHashWeightVector(int)':
sf-hash-weight-vector.cc:40: error: 'exit' was not declared in this scope
sf-hash-weight-vector.cc: In constructor 'SfHashWeightVector::SfHashWeightVector(int, const std::string&)':
sf-hash-weight-vector.cc:54: error: 'exit' was not declared in this scope
sf-hash-weight-vector.cc: In member function 'virtual void SfHashWeightVector::AddVector(const SfSparseVector&, float)':
sf-hash-weight-vector.cc:96: error: 'exit' was not declared in this scope
sf-hash-weight-vector.cc:111: error: 'exit' was not declared in this scope
make: *** [sofia-ml] Error 1





Original issue reported on code.google.com by [email protected] on 30 Mar 2010 at 1:55

Various errors in source code

There is a problem with the source code. Many files forget to include standard 
libraries, and some of the assertions in the Unit tests fail.

What steps will reproduce the problem?
1. Follow the instructions on https://code.google.com/p/sofia-ml/
2. Run make all_test in src/

What is the expected output? What do you see instead?
I see lots of compile time errors.

What version of the product are you using? On what operating system?
Ubuntu 14.04, G++ 4.7, sofia-ml

Please provide any additional information below.
The following updates fixed everything for me:
sf-sparse-vector_test.cc
l27   //assert(x1.GetGroupId() == "2");
l75   //assert(x6.GetGroupId() == "3");

simple-cmd-line-helper.h
l68 #include <cstdlib>
l69 #include <stdio.h>

sofia-ml-methods_test.cc
l19 #include <cstdlib>


Original issue reported on code.google.com by [email protected] on 16 Jul 2014 at 4:55

Assertion failure in sf-kmeans-methods_test

What steps will reproduce the problem?
1. Grab source from SVN.
2. cd cluster-src/
3. make all_test

What is the expected output? What do you see instead?

Test fails with:

    sf-kmeans-methods_test: sf-kmeans-methods_test.cc:50: int main(int, char**):
    Assertion `cluster_centers_3->ClusterCenter(0).ValueOf(1) == 1.0' failed.

Adding some debug output just before the assert failure resulted in:

    cluster_centers_3->ClusterCenter(0).ValueOf(1) : 0 (should be 1.0)

What version of the product are you using? On what operating system?

SVN version:

    r25 | [email protected] | 2010-04-28 04:52:54 +1000 (Wed, 28 Apr 2010) | 1 line

Running on x86_64 linux with gcc version 4.7.2 (Debian 4.7.2-5).

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 17 Feb 2013 at 12:28

make all_test : Assertion `x1.GetGroupId() == "2"' failed

What steps will reproduce the problem?

Follow the instructions in the README "Quick Start" section (on Ubuntu 14.04):
1. svn checkout http://sofia-ml.googlecode.com/svn/trunk/ sofia-ml-read-only
2. cd sofia-ml-read-only/src
3. make clean
4. make all_test

What is the expected output? What do you see instead?

Expecting success in all tests.

Instead, the test fails immediately with:

g++ -O3 -lm -Wall -o sf-sparse-vector_test sf-sparse-vector_test.cc 
sf-sparse-vector.cc
./sf-sparse-vector_test
sf-sparse-vector_test: sf-sparse-vector_test.cc:27: int main(int, char**): 
Assertion `x1.GetGroupId() == "2"' failed.
make: *** [sf-sparse-vector_test] Aborted (core dumped)
make: *** Deleting file `sf-sparse-vector_test'

What version of the product are you using? On what operating system?

Latest source:
  r31 | [email protected] | 2010-07-26 14:17:11 -0700 (Mon, 26 Jul 2010) | 1 line

On Ubuntu 14.04 (LTS)

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 4 May 2015 at 9:22

simple fix for gcc 4.3

What steps will reproduce the problem?
1. Run make in the src folder with gcc version 4.3.

Adding
#include <cstring>
#include <cstdlib>
to the top of the sf-sparse-vector.cc file fixed this problem for me.

Go Jumbos.

Original issue reported on code.google.com by [email protected] on 24 Jan 2010 at 11:42

Training Data Format and Class Label for kmeans

Hi,

I have changed my training data into the sparse data format you mentioned.
./sofia-kmeans --k 1000 --init_type random --opt_type batch_kmeans --iterations 
1000 --objective_after_init --training_file demo/SMLFAutoTrain1s512val.txt 
--model_out demo/CSMLFAutoTrain1s512val.txt
However, I am getting the following errors:
Reading data from: demo/SMLFAutoTrain1s512val.txt
Error reading file demo/SMLFAutoTrain1s512val.txt
I opened your demo.train and saw that you have a square box at the end of every 
vector. How can I change my data format to yours, since the square box at the 
end may not be the only difference? I tried to read your demo.train file in 
Matlab, and it doesn't let me do that either.

For the example of kmeans:
> ./sofia-kmeans --k 5 --init_type random --opt_type mini_batch_kmeans 
--mini_batch_size 100 --iterations 500 --objective_after_init 
--objective_after_training --training_file demo/demo.train --model_out 
demo/clusters.txt
the above command will return the five centroid locations, right?
In this case, since it only produces the 5 cluster center locations, the class 
label in the training data (demo.train) can be assigned any value, right? 
Of course, I chose, say, all 1 from among these values: 1, 0, -1.

I look forward to your clarification. 

Thank you,


Fred

Original issue reported on code.google.com by [email protected] on 23 Sep 2011 at 3:56

Attachments:
