openml / openml-r

R package to interface with OpenML

Home Page: http://openml.github.io/openml-r/

License: Other

openml-r's Introduction

R interface to OpenML.org

OpenML.org is an online machine learning platform where researchers can access open data, download and upload data sets, share their machine learning tasks and experiments, and organize them online to work and collaborate with other researchers. The R interface lets you query for data sets with specific properties, and download and upload data sets, tasks, flows and runs.

For more information, have a look at our

Deprecated

This package relies on the mlr framework, which is now retired in favor of the newer mlr3 framework. While you can still use this package with mlr or to access information from OpenML, we recommend transitioning to the mlr3 framework and using the related mlr3oml package.

How to cite

To cite the OpenML R package in publications, please use our paper entitled OpenML: An R Package to Connect to the Machine Learning Platform OpenML [BibTex]

See also here how to cite the OpenML project itself.

Installation of the package

Install the stable version from CRAN

install.packages("OpenML")

Install the development version from GitHub (using devtools)

devtools::install_github("openml/openml-r")

Furthermore, you need farff installed to process ARFF files:

install.packages("farff")

Alternatively you can make use of the RWeka R package to process ARFF files. However, in particular for larger ARFF files, farff is considerably faster than RWeka.

Contact

Found some nasty bugs? Please use the issue tracker to report bugs or missing features. Take care to explain the problem as clearly as possible (ideally with a traceback() result and a sessionInfo()). A reproducible example is also desirable.

openml-r's Issues

Encoding problem when downloading OpenML data set

I wanted to create a mlr task from a data set that I just uploaded to OpenML. Unfortunately, there seems to be a problem with some characters in the data set's description!?

Do I have to change the description or is there another way to fix this?

R Log:
Downloading data set 'libras_move' from OpenML repository.
Intermediate files (XML and ARFF) will be stored in : C:\Users\tob\AppData\Local\Temp\RtmpU183Lr
Downloading file: C:\Users\tob\AppData\Local\Temp\RtmpU183Lr/data_set_desc_libras_move_v1.xml from:
http://openml.org/api/?f=openml.data.description&data.id=299
Input is not proper UTF-8, indicate encoding !
Bytes: 0xED 0x73 0x63 0x61
Error : 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xED 0x73 0x63 0x61

Error in parseXMLResponse(file, "Getting data set description", "data_set_description") :
Error in parsing XML for type data_set_description in file: C:\Users\tob\AppData\Local\Temp\RtmpU183Lr/data_set_desc_libras_move_v1.xml
All intermediate XML and ARFF files are now removed.

traceback()
5: stop(obj)
4: stopf("Error in parsing XML for type %s in file: %s", type, file)
3: parseXMLResponse(file, "Getting data set description", "data_set_description")
2: parseOpenMLDataSetDescription(file = fn.data.set.desc)
1: downloadOpenMLDataAsMlrTask("libras_move", clean.up = TRUE)

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] ROCR_1.0-5 gplots_2.13.0 plyr_1.8.1 cluster_1.15.2 BatchJobs_1.4 R.matlab_3.0.1
[7] fail_1.2 OpenML_1.0 DMwR_0.4.1 lattice_0.20-29 mlr_2.2 BBmisc_1.7
[13] ParamHelpers_1.3 devtools_1.5

loaded via a namespace (and not attached):
[1] abind_1.4-0 bitops_1.0-6 brew_1.0-6 caTools_1.17 checkmate_1.2
[6] class_7.3-10 codetools_0.2-8 DBI_0.2-7 digest_0.6.4 evaluate_0.5.5
[11] gdata_2.13.3 gtools_3.4.1 httr_0.3 KernSmooth_2.23-12 memoise_0.2.1
[16] parallel_3.1.0 parallelMap_1.1 quantmod_0.4-0 R.methodsS3_1.6.1 R.oo_1.18.0
[21] R.utils_1.32.4 Rcpp_0.11.2 RCurl_1.95-4.1 rJava_0.9-6 rjson_0.2.14
[26] rpart_4.1-8 RSQLite_0.11.4 RWeka_0.4-23 RWekajars_3.7.11-1 sendmailR_1.1-2
[31] splines_3.1.0 stringr_0.6.2 survival_2.37-7 tools_3.1.0 whisker_0.3-2
[36] XML_3.98-1.1 xts_0.9-7 zoo_1.7-11
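
A minimal sketch of what the parser error is pointing at, assuming the description was saved in Latin-1 (or Windows-1252) rather than UTF-8; that encoding is a guess based on the reported bytes and the German Windows locale, not something the log confirms:

```r
# Bytes reported in the parser error above: 0xED 0x73 0x63 0x61
bytes <- as.raw(c(0xED, 0x73, 0x63, 0x61))
txt <- rawToChar(bytes)

# 0xED must be followed by UTF-8 continuation bytes; 0x73 is plain ASCII,
# so the string is rejected as UTF-8
is_utf8 <- validUTF8(txt)

# Assumption: the text is Latin-1, so re-encode it to UTF-8
fixed <- iconv(txt, from = "latin1", to = "UTF-8")
fixed_is_utf8 <- validUTF8(fixed)
```

If the guess is right, re-encoding the description before upload (or declaring the encoding in the XML) should avoid the error.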

Add Travis CI support

Travis is a Continuous Integration platform. It's quite handy for automatic build testing. More information at https://travis-ci.org/

Use knitr for tutorial.

Talk to J.R. for this.

check that regr. data sets are displayed in reasonable way in getDataQualities

Fill "dependencies" field when uploading an implementation

Include the mlr version that was used to generate the implementation.

My suggestion would be to include sessionInfo() output, so all the packages and their versions are included and therefore facilitates reproducibility.
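
One way this suggestion could look, as a rough sketch: collect the R version and the attached packages into a single string. The `name_version` format and the variable name `dependencies` are assumptions for illustration, not the format the server expects.

```r
# Build "name_version" strings for the attached non-base packages
si <- sessionInfo()
pkgs <- si$otherPkgs
deps <- if (is.null(pkgs)) {
  character(0)
} else {
  vapply(pkgs, function(p) paste(p$Package, p$Version, sep = "_"), character(1))
}

# Prepend the R version and collapse everything into one field value
dependencies <- paste(c(paste0("R_", getRversion()), unname(deps)), collapse = ", ")
```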

read.arff dislikes some openml datasets

At least these ones do not work:

blacklist = c("baseball","cmc","hypothyroid","mfeat-factors","sick","spambase","mushroom","page-blocks")

Maybe change foreign::read.arff() to RWeka::read.arff()

check makeRunParameterList

DataQualities: NumberOfSymbolicFeatures is always NA

I guess this is a problem on the R side. Need to check.

have you got this error before?

openML.impl <- createOpenMLImplementationForMLRLearner(learner)

I got:

Error in createOpenMLImplementationForMLRLearner(learner) :
could not find function "OpenMLImplementation"

How to deal with server interacting R code in the tutorial?

We'd like to run code chunks and print the output in the tutorial, so the user sees what happens. However, most of the code interacts with the server. Therefore, a valid hash must be supplied. How can we solve this problem? Right now, the chunks are not evaluated, so we have no output.

Maybe we should add a special openML-account for this (e.g., for Mr. "John Sample")? But then, users will know its login data and could "abuse" the account.

are downloading / cachedir threadsafe?

Assume you download stuff on a cluster. If the cachedir of different R processes points to the same physical directory, we might have a problem, right?

I guess we don't want to implement locks?
Or does R support that?

Does that imply parallel jobs can't use the same cache?

This seems like a complicated topic, whatever we decide we should document it carefully....
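
A lock-free option worth considering, sketched under assumptions: write each download to a temporary file in the cache directory and rename it into place, so other processes never read a half-written file. The helper name `write_cache_atomic` is hypothetical, and the rename is only a single step on POSIX file systems; behaviour on Windows or network file systems may differ.

```r
# Hypothetical helper: publish a cache file without exposing partial writes
write_cache_atomic <- function(lines, dest) {
  # Keep the temp file in the destination directory so the rename
  # stays on the same file system
  tmp <- tempfile(pattern = "partial_", tmpdir = dirname(dest))
  writeLines(lines, tmp)
  ok <- file.rename(tmp, dest)
  if (!ok) unlink(tmp)  # clean up if the rename failed
  invisible(ok)
}

cache.dir <- tempfile("cache_")
dir.create(cache.dir)
dest <- file.path(cache.dir, "data_set_description.xml")
write_cache_atomic("<oml/>", dest)
```

Two processes downloading the same file would both succeed; the later rename simply replaces the earlier copy with identical content.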

Modifying the `config` file interactively (within a R session)?

setOMLConfig just sets the configuration for the current R session and does not modify the config file. Wouldn't it be better to replace the config file with the parameters set by setOMLConfig for a global configuration? Or is setOMLConfig intended only as a session-specific configuration?
This is less important, but after modifying the config file manually, the package has to be detached and loaded again to pick up the changes. What if we extend setOMLConfig so that the default call setOMLConfig(conf=list()) always reloads the config file and applies the configuration specified there?

RWeka::read.arff broken

If you provide a path relative to your home ("~/.openml/cache/..."), you get a nice java IO error...
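
A possible workaround, assuming the Java side simply does not understand `~`: expand the path in R before handing it over. The cache path below is only an example.

```r
# "~" is a shell/R convention; expand it before passing the path to Java
rel.path <- file.path("~", ".openml", "cache")
expanded <- path.expand(rel.path)
```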

zero based predictions

The predictions upload should be one based as it is when the user downloads a task.

slot and data frame names of objects

Names should be the same as on the server. the only difference should be . instead of _.

exceptions only in VERY RARE cases

check current objects for that rule.

Better password handling

At the moment, the password in the config file or console is in plain text. Find a better solution. Maybe we should consider a password prompt as shown here:
http://stackoverflow.com/questions/13033573/how-to-handle-db-passwords-in-r-connection-strings
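
A minimal sketch of one alternative to storing the password in the config file: read it from an environment variable and fall back to a prompt. The function name `get_password` and the variable name `OPENML_PASSWORD` are assumptions for illustration. Note that `readline()` still echoes the typed text, so this only keeps the password out of files, not off the screen.

```r
# Hypothetical helper: look up the password without storing it in a config file
get_password <- function(env.var = "OPENML_PASSWORD") {
  pw <- Sys.getenv(env.var, unset = NA)
  if (!is.na(pw) && nzchar(pw)) return(pw)
  # Fall back to a prompt when running interactively
  if (interactive()) return(readline("OpenML password: "))
  stop("No password available: set ", env.var, " or run interactively.")
}

# Usage with a throwaway value
Sys.setenv(OPENML_PASSWORD = "not-a-real-password")
pw <- get_password()
```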

`getOMLRun` returns `NA`s for the `did` column in the `output.data` slot

I am not sure why we need the did column for the evaluations in the output.data slot of an OMLRun object:

run = getOMLRun(run.id = 1L)
run$output.data$evaluations$did
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

I have checked the did column for the first 100 run.ids and each of them returns a vector of NAs, so do we really need this column?

I made a similar observation for run$output.data$evaluations$label, run$output.data$evaluations$sample.size and run$output.data$evaluations$stdev. Can we remove these columns when they contain only NAs? Or can we even remove them completely?
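
If the columns are kept on the server side, dropping the all-NA ones after download could look roughly like this. The data frame is made up for illustration and only mimics the shape of the evaluations table.

```r
# Toy stand-in for run$output.data$evaluations
evals <- data.frame(
  name = c("area_under_roc_curve", "predictive_accuracy"),
  value = c(0.90, 0.85),
  did = NA,
  stdev = NA_real_,
  stringsAsFactors = FALSE
)

# Keep only columns that contain at least one non-NA value
keep <- vapply(evals, function(col) !all(is.na(col)), logical(1))
evals.clean <- evals[, keep, drop = FALSE]
```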

Wrong error message when authenticating with a wrong password

Authenticating user at server: [email protected]
[1] "http://www.openml.org/api/?f=openml.authenticate"
Error: Internal Server Error

We need an informative message.

getMetaLearningFeatures not working: getDataQualities not found

> getMetaLearningFeatures()
SQL was processed: 28 rows selected. 
Error in getMetaLearningFeatures() : 
  could not find function "getDataQualities"

see here

Use makeR.

`getOMLConfig` repeated class assignment

The class "OMLConfig" is assigned each time when one runs getOMLConfig().

Example: Running this code for (i in 1:10) cat(length(class(getOMLConfig())), fill=TRUE) should print always the same length, currently the length increases.
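
A sketch of an idempotent class assignment that would keep the length constant. The helper `add_class_once` is hypothetical and the list is only a stand-in for the real config object.

```r
# Hypothetical helper: prepend a class only if it is not already present
add_class_once <- function(x, cl) {
  class(x) <- union(cl, class(x))
  x
}

conf <- list(verbosity = 1L)
for (i in 1:10) conf <- add_class_once(conf, "OMLConfig")
n.classes <- length(class(conf))
```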

Tutorial

Extend the current pages somewhat.
Add some general info on OpenML, but always link if info is already there.
Show downloading of performance info.

downloadOpenMLDataAsMlrTask

Needs to be changed, there is some ::: stuff in there.
Do we really want OpenML data sets as mlr tasks? OpenML tasks should become mlr tasks!
Document

Make the code work for now, but discuss whether this is reasonable then!

bug in data.desc printer

ot = downloadOpenMLTask(3008)
Downloading task 3008 from OpenML repository.
Intermediate files (XML and ARFF) will be stored in : /tmp/Rtmpm70wOk
Downloading file: /tmp/Rtmpm70wOk/task.xml from:
http://openml.org/api/?f=openml.tasks.search&task_id=3008
Downloading file: /tmp/Rtmpm70wOk/data_set_description.xml from:
http://openml.org/api/?f=openml.data.description&data.id=299
Downloading file: /tmp/Rtmpm70wOk/data_set.ARFF from:
http://openml.org/files/download/52200/libras_move.arff
Downloading file: /tmp/Rtmpm70wOk/data_splits.ARFF from:
http://openml.org//api_splits/get/3008/Task_3008_splits.arff
All intermediate XML and ARFF files are now removed.
ot$
ot$id ot$pars ot$data.desc.id ot$estimation.procedure ot$evaluation.measures
ot$type ot$target.features ot$data.desc ot$preds
ot$data.desc

Dataset libras_move :: (openML ID = 299, version = 1)

    Collection Date  : 2009
    Upload Date      : 2014-08-20 20:56:22
    Licence          : Public
    URL              : http://openml.org/files/download/52200/libras_move.arff

Error in if (x$language != "") catf("\tLanguage : %s", x$language) :
missing value where TRUE/FALSE needed
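
The error arises because `x$language` is `NA`, and `NA != ""` is `NA`, which `if` cannot handle. A guarded check along these lines would avoid it; the helper name `has_value` is made up for illustration.

```r
# Hypothetical helper: TRUE only for a single, non-missing, non-empty string
has_value <- function(s) length(s) == 1L && !is.na(s) && nzchar(s)

x <- list(language = NA_character_)
# Prints nothing here because the field is missing
if (has_value(x$language)) cat(sprintf("\tLanguage : %s\n", x$language))
```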

Add function to simply download a data set

Like downloadOpenMLDataAsMlrTask, but do not convert to mlr, but return the raw results.

Then remove downloadOpenMLDataAsMlrTask, and write a converter for that result object to mlr.

measure time and also use rscimark to upload info of pc

measure time of each train test split, for training and prediction separately.

also do one final model fit for the whole data, predict on whole data. throw results away.

depending on how large the model is, upload this as well. also in human readable form.

also upload rscimark vector, once per machine, cache this.
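
Timing training and prediction separately can be sketched with base R's `system.time`. The model and data below (`lm` on `mtcars`) are placeholders, not what the package would actually run.

```r
set.seed(1)
test.idx <- sample(nrow(mtcars), size = 10)
train <- mtcars[-test.idx, ]
test <- mtcars[test.idx, ]

# Measure the training step and the prediction step separately
train.time <- system.time(model <- lm(mpg ~ wt + hp, data = train))[["elapsed"]]
predict.time <- system.time(preds <- predict(model, newdata = test))[["elapsed"]]

timings <- c(train = train.time, predict = predict.time)
```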

There is no README.md

and thus the tutorial is somewhat hard to find.

for list* functions we should subset to status "active" by default

This should be an option to subset the results to the given "stati".

The default should be "active". In 99% of cases you don't want the other crap.
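
Roughly what such an option could look like; the function name `filter_status`, the column name `status` and the example values are assumptions, not the package's actual interface.

```r
# Hypothetical helper: keep only rows whose status is in the requested set
filter_status <- function(df, status = "active") {
  df[df$status %in% status, , drop = FALSE]
}

listing <- data.frame(
  did = 1:4,
  status = c("active", "deactivated", "active", "in_preparation"),
  stringsAsFactors = FALSE
)

active.only <- filter_status(listing)
two.stati <- filter_status(listing, status = c("active", "deactivated"))
```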

"character vectors are no longer accepted by unserialize()"

We use unserialize to retrieve complex mlr learner objects from source files of OpenML flows. This is not working anymore. Try the following for example:

flow = getOMLFlow(1057)
source(flow$source.path)
sourcedFlow(task.id = 1L)
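
A sketch of a round trip that still works, assuming the object was stored as the text produced by `serialize(..., ascii = TRUE)`; how the flow source files actually store it is not shown here, so treat this as an illustration only. The list is a stand-in for a real learner object.

```r
# Stand-in for a complex learner object
obj <- list(learner = "classif.rpart", par.vals = list(cp = 0.01))

# Serialize to an ASCII representation and keep it as a single string
txt <- rawToChar(serialize(obj, connection = NULL, ascii = TRUE))

# unserialize() wants a raw vector (or connection), so convert the string back first
restored <- unserialize(charToRaw(txt))
```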

use new roxygen structure for test dir

How to upload a dataset?

I wasn't able to find a specific function to upload datasets to OpenML. Is there one available, or do I need to make a query?

our wiki is gone!

@dominikkirchhoff @giuseppec

Please sort this out ASAP.

Also please write down how one changes the wiki and works on it, so new people like @giuseppec know how to work on it.

Giuseppe is emailing tomorrow.

Configuration not documented

Link in ?configuration points to BatchJobs. We also need some documentation in the package, i.e. location and format of the configuration file.

Options set in Rprofile get overwritten on package load

java problem

not sure why, but I'm having a problem with java. On
devtools::install_github("openml/r")
I get

installing source package ‘OpenML’ ...
** R
** inst
** tests
** byte-compile and prepare package for lazy loading
No Java runtime present, requesting install.

I just installed Java (v8), and
from the terminal it seems OK:

stephens$ java -version
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

any ideas?

Include stacked learners

In uploadMlrLearner, createOpenMLImplementationForMlrLearner, (etc.)

document the OMLDataSet Class

Do an extra roxygen section / file (without real code) OMLDataSet.R

Document everything which is in the class. For the "data" and "desc" try to link the global API Docs.
(otherwise too much to write down)

toMLR: Why do we ignore the given splitting?

Here we create a resample description although an openMLtask already contains a given Data Split so we should create a resample instance directly.

> task.openML$estimation.procedure

Estimation Method :: holdout
Parameters         ::
    number_repeats = 1
    number_folds = NA
    percentage = 33
    stratified_sampling = true
Data Splits        :: 
'data.frame':   2000 obs. of  4 variables:
 $ type : Factor w/ 2 levels "TRAIN","TEST": 2 2 2 2 2 2 2 2 2 2 ...
 $ rowid: int  475 904 1484 751 1253 1024 1356 1376 389 10 ...
 $ rep  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ fold : num  1 1 1 1 1 1 1 1 1 1 ...
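
To illustrate the point, the given split table can be turned into fixed train and test index sets per repetition and fold with base R. The small data frame below is invented and only mirrors the column layout shown above.

```r
# Toy split table with the same columns as the Data Splits above
splits <- data.frame(
  type = factor(c("TEST", "TEST", "TRAIN", "TRAIN", "TRAIN"),
                levels = c("TRAIN", "TEST")),
  rowid = c(475L, 904L, 10L, 11L, 12L),
  rep = 1,
  fold = 1
)

# One group per (rep, fold) combination, named like "1.1"
iter <- interaction(splits$rep, splits$fold, drop = TRUE)
is.train <- splits$type == "TRAIN"

train.inds <- split(splits$rowid[is.train], iter[is.train])
test.inds <- split(splits$rowid[!is.train], iter[!is.train])
```

These index lists could then be used to build a resample instance directly instead of drawing a new random split.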

dots in feature names

Hi, I uploaded this dataset: http://openml.liacs.nl/d/345
and noticed there are periods in the feature names. Is this still an issue in R? As far as I can remember, we agreed that periods are quite common and will be handled on the R side?

Just checking :)

Show how a basic ML research question can be answered with OML and mlr

Brainstorm a few ideas.

Here is a simple one: Bagging and random forests differ in
a) when you randomly select the features: in the tree nodes (RF) or for the whole tree (B)
b) whether you prune the tree (B: maybe) or not (RF)

Now is there a big difference in performance? If yes, which of the 2 is more important?
Yes, one could probably also look at Breiman's paper for this, but why not demonstrate it via a nice OMLR study?

There are probably dozens of other interesting similar questions.

batch experiments errors

Resampling of the first 100 tasks with all adequate mlr learners.
Submitted: 1710 (100%)
Done: 1692 (98.95%)
Expired: 18 (1.05%)

There were 295 errors (17%). Here are some common error messages:

[114 times] Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Quite clearly, this is due to "pseudo factors" that have only one level. I guess, we should remove those automatically in runTask().

[24 times] Error in knn.cv(x, cl, k) : NA/NaN/Inf in foreign function call (arg 6)
[21 times] Error in scale.default(newdata[, scaling(object)$scaled, drop = FALSE], : length of 'center' must equal the number of columns of 'x'
[19 times] Error in qda.default(x, grouping, ...) : rank deficiency in group [...].

If you want me to submit some of the rarer errors, too, or post more details on those above, please tell me.
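
For the most frequent error (factors with only one level), an automatic clean-up step could look roughly like this. The helper name `drop_constant_factors` is hypothetical and the data frame is made up.

```r
# Hypothetical helper: drop factor columns with fewer than 2 used levels
drop_constant_factors <- function(df) {
  keep <- vapply(
    df,
    function(col) !is.factor(col) || nlevels(droplevels(col)) >= 2L,
    logical(1)
  )
  df[, keep, drop = FALSE]
}

d <- data.frame(
  x = c(1.5, 2.0, 3.5),
  pseudo = factor(c("a", "a", "a")),
  real = factor(c("a", "b", "a"))
)
d.clean <- drop_constant_factors(d)
```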

dependencies could be handled better

> library(openML)
> data.chars <- getDataCharacteristics()
Error in openMLSQLQuery(SQL.query) : 
  could not find function "str_replace_all"
> library(stringr)
> data.chars <- getDataCharacteristics()
Error in openMLSQLQuery(SQL.query) : 
  could not find function "fromJSON"
> library(rjson)
> data.chars <- getDataCharacteristics()
SQL was processed: 130 rows selected.

downloading of source/binary files

Hey,

at the moment, in getOMLFlow we do not download source/binary files at all. How should the function behave?

where to save the files (cache?)
always download all, or filter by file format or something like that?

As an "only R"-User, I'd not be interested in anything other than R-files, so a filter would be nice, I guess.

Authentication troubles

it is currently required for all actions to authenticate, this is nonsense?
the session hash should be stored in and retrieved from options (if not directly provided) after first call to authenticateUser.
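
A minimal sketch of the proposed behaviour: look up the hash in the session options unless one is passed explicitly. The option name `OpenML.session.hash` and the helper `get_session_hash` are assumptions for illustration.

```r
# Hypothetical helper: prefer an explicit hash, else fall back to the stored option
get_session_hash <- function(session.hash = NULL) {
  if (!is.null(session.hash)) return(session.hash)
  hash <- getOption("OpenML.session.hash")
  if (is.null(hash)) stop("Not authenticated: call authenticateUser first.")
  hash
}

# After a successful authentication the hash would be stored once
options(OpenML.session.hash = "0123abcd")
stored <- get_session_hash()
explicit <- get_session_hash("ffff0000")
```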

S4 classes are not properly exported

library(openML)
OpenMLImplementation()
Error: object 'OpenMLImplementation' not found

create multiple mlr tasks from OpenML data sets

It seems that downloadOpenMLDataAsMlrTask() can only be used once per R Session, because some files in the temporary data folder can't be overwritten.

Here's what I've done:
task1 = downloadOpenMLDataAsMlrTask("ozone_level")
-> works fine

task2 = downloadOpenMLDataAsMlrTask("abalone")
-> Error

Error Message:
Downloading data set 'abalone' from OpenML repository.
Intermediate files (XML and ARFF) will be stored in : C:\Users\tob\AppData\Local\Temp\Rtmp6D2vf4
Downloading file: C:\Users\tob\AppData\Local\Temp\Rtmp6D2vf4/data_set_description.xml from:
http://openml.org/api/?f=openml.data.description&data.id=183
Error in downloadOpenMLDataAsMlrTask("abalone") :
Assertion on 'file' failed: File at path already exists: 'C:\Users\tob\AppData\Local\Temp\Rtmp6D2vf4\data_set.arff'
All intermediate XML and ARFF files are now removed.

Of course, it's possible to change the directory via the dir-option, but for the download of multiple data sets it would be nice just to use the same temporary path over and over.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] OpenML_1.0 devtools_1.5
loaded via a namespace (and not attached):
[1] BBmisc_1.7 checkmate_1.2 codetools_0.2-8 digest_0.6.4 evaluate_0.5.5
[6] grid_3.1.0 httr_0.3 memoise_0.2.1 mlr_2.1 parallel_3.1.0
[11] parallelMap_1.1 ParamHelpers_1.3 plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1
[16] rJava_0.9-6 rjson_0.2.14 RWeka_0.4-23 RWekajars_3.7.11-1 splines_3.1.0
[21] stringr_0.6.2 survival_2.37-7 tools_3.1.0 whisker_0.3-2 XML_3.98-1.1
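
One way to reuse the same temporary directory for several downloads, sketched under the assumption that the clash comes from the fixed file name `data_set.arff`: let `tempfile()` generate a unique name for each download.

```r
# Same directory for every download, but a fresh file name each time
download.dir <- tempdir()
f1 <- tempfile(pattern = "data_set_", tmpdir = download.dir, fileext = ".arff")
f2 <- tempfile(pattern = "data_set_", tmpdir = download.dir, fileext = ".arff")

# Neither path exists yet, so an "already exists" assertion would pass
already.there <- file.exists(c(f1, f2))
```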