openml / openml-r

R package to interface with OpenML

Home Page: http://openml.github.io/openml-r/

License: Other

openml-r's Introduction

R interface to OpenML.org

OpenML.org is an online machine learning platform where researchers can access open data, download and upload data sets, share their machine learning tasks and experiments, and organize them online to work and collaborate with other researchers. The R interface lets you query for data sets with specific properties and download and upload data sets, tasks, flows and runs.

For more information, have a look at our

Deprecated

This package relies on the mlr framework, which is now retired in favor of the newer mlr3 framework. While you can still use this package with mlr or to access information from OpenML, we recommend transitioning to the mlr3 framework and using the related mlr3oml package.

How to cite

To cite the OpenML R package in publications, please use our paper entitled OpenML: An R Package to Connect to the Machine Learning Platform OpenML [BibTex]

See also here how to cite the OpenML project itself.

Installation of the package

  • Install the stable version from CRAN
install.packages("OpenML")

or

  • Install the development version from GitHub (using devtools)
devtools::install_github("openml/openml-r")

Furthermore, you need farff installed to process ARFF files:

install.packages("farff")

Alternatively you can make use of the RWeka R package to process ARFF files. However, in particular for larger ARFF files, farff is considerably faster than RWeka.

Contact

Found some nasty bugs? Please use the issue tracker to report bugs or missing features. Take care to explain the problem as clearly as possible (ideally with the output of traceback() and sessionInfo()). A reproducible example is also desirable.

openml-r's People

Contributors

bakopyan, berndbischl, dominikkirchhoff, gitter-badger, giuseppec, hfrick, hofnerb, ja-thomas, jakob-r, jakobbossek, joaquinvanschoren, jonmcalder, kerschke, mixacom, mllg, quepas, sebffischer

openml-r's Issues

Encoding problem when downloading OpenML data set

I wanted to create an mlr task from a data set that I just uploaded to OpenML. Unfortunately, there seems to be a problem with some characters in the data set's description.

Do I have to change the description or is there another way to fix this?

R Log:
Downloading data set 'libras_move' from OpenML repository.
Intermediate files (XML and ARFF) will be stored in : C:\Users\tob\AppData\Local\Temp\RtmpU183Lr
Downloading file: C:\Users\tob\AppData\Local\Temp\RtmpU183Lr/data_set_desc_libras_move_v1.xml from:
http://openml.org/api/?f=openml.data.description&data.id=299
Input is not proper UTF-8, indicate encoding !
Bytes: 0xED 0x73 0x63 0x61
Error : 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xED 0x73 0x63 0x61

Error in parseXMLResponse(file, "Getting data set description", "data_set_description") :
Error in parsing XML for type data_set_description in file: C:\Users\tob\AppData\Local\Temp\RtmpU183Lr/data_set_desc_libras_move_v1.xml
All intermediate XML and ARFF files are now removed.

traceback()
5: stop(obj)
4: stopf("Error in parsing XML for type %s in file: %s", type, file)
3: parseXMLResponse(file, "Getting data set description", "data_set_description")
2: parseOpenMLDataSetDescription(file = fn.data.set.desc)
1: downloadOpenMLDataAsMlrTask("libras_move", clean.up = TRUE)

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] ROCR_1.0-5 gplots_2.13.0 plyr_1.8.1 cluster_1.15.2 BatchJobs_1.4 R.matlab_3.0.1
[7] fail_1.2 OpenML_1.0 DMwR_0.4.1 lattice_0.20-29 mlr_2.2 BBmisc_1.7
[13] ParamHelpers_1.3 devtools_1.5

loaded via a namespace (and not attached):
[1] abind_1.4-0 bitops_1.0-6 brew_1.0-6 caTools_1.17 checkmate_1.2
[6] class_7.3-10 codetools_0.2-8 DBI_0.2-7 digest_0.6.4 evaluate_0.5.5
[11] gdata_2.13.3 gtools_3.4.1 httr_0.3 KernSmooth_2.23-12 memoise_0.2.1
[16] parallel_3.1.0 parallelMap_1.1 quantmod_0.4-0 R.methodsS3_1.6.1 R.oo_1.18.0
[21] R.utils_1.32.4 Rcpp_0.11.2 RCurl_1.95-4.1 rJava_0.9-6 rjson_0.2.14
[26] rpart_4.1-8 RSQLite_0.11.4 RWeka_0.4-23 RWekajars_3.7.11-1 sendmailR_1.1-2
[31] splines_3.1.0 stringr_0.6.2 survival_2.37-7 tools_3.1.0 whisker_0.3-2
[36] XML_3.98-1.1 xts_0.9-7 zoo_1.7-11
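The bytes reported in the log above (0xED 0x73 0x63 0x61) are not valid UTF-8, but they would be readable as Latin-1 text. The following is a minimal base R sketch of re-encoding such a string before it is parsed. It assumes the description was written in Latin-1, which the issue does not confirm, and `toUtf8` is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper: re-encode strings that are not valid UTF-8,
# ASSUMING they were originally written in Latin-1.
toUtf8 = function(x, from = "latin1") {
  ok = validUTF8(x)
  x[!ok] = iconv(x[!ok], from = from, to = "UTF-8")
  x
}

# The four bytes from the error message, turned into a string.
bad = rawToChar(as.raw(c(0xED, 0x73, 0x63, 0x61)))
fixed = toUtf8(bad)

print(validUTF8(bad))
print(validUTF8(fixed))
```

Strings that are already valid UTF-8 pass through unchanged, so the helper can be applied to a whole character vector.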

read.arff dislikes some openml datasets

At least these ones do not work:

blacklist = c("baseball","cmc","hypothyroid","mfeat-factors","sick","spambase","mushroom","page-blocks")

Maybe change foreign::read.arff() to RWeka::read.arff()?

have you got this error before?

openML.impl <- createOpenMLImplementationForMLRLearner(learner)

I got:

Error in createOpenMLImplementationForMLRLearner(learner) :
could not find function "OpenMLImplementation"

How to deal with server interacting R code in the tutorial?

We'd like to run code chunks and print the output in the tutorial, so the user sees what happens. However, most of the code interacts with the server. Therefore, a valid hash must be supplied. How can we solve this problem? Right now, the chunks are not evaluated, so we have no output.

Maybe we should add a special openML-account for this (e.g., for Mr. "John Sample")? But then, users will know its login data and could "abuse" the account.

are downloading / cachedir threadsafe?

Assume you download stuff on a cluster. If the cachedir of different R processes points to the same physical directory, we might have a problem, right?

I guess we don't want to implement locks?
Or does R support that?

Does that imply parallel jobs can't use the same cache?

This seems like a complicated topic; whatever we decide, we should document it carefully.
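One lock-free option for the shared cache question above is to write each file under a temporary name in the cache directory and then rename it into place. The sketch below uses only base R. `writeCacheFile` and the file names are hypothetical, and whether the rename is atomic depends on the operating system and file system (network file systems in particular may behave differently), so this is an illustration rather than a guarantee.

```r
# Hypothetical helper: write a cache entry so that readers do not see a
# half-written file. The content goes to a temporary file in the same
# directory first and is then renamed to its final name.
writeCacheFile = function(lines, path) {
  tmp = tempfile(pattern = ".partial-", tmpdir = dirname(path))
  writeLines(lines, tmp)
  # If another process already created 'path', keep that copy and
  # discard ours; also clean up if the rename itself fails.
  if (file.exists(path) || !file.rename(tmp, path)) {
    unlink(tmp)
  }
  invisible(path)
}

cache.dir = tempfile("omlcache")
dir.create(cache.dir)
target = file.path(cache.dir, "description.xml")

writeCacheFile("<data_set_description/>", target)
writeCacheFile("<data_set_description/>", target)  # second writer: no-op

print(readLines(target))
print(list.files(cache.dir, all.files = TRUE, no.. = TRUE))
```

This avoids explicit locks, at the cost of occasionally downloading the same file twice when two processes race for it.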

Modifying the `config` file interactively (within an R session)?

  1. setOMLConfig only sets the configuration for the current R session and does not modify the config file. Wouldn't it be better to overwrite the config file with the parameters set by setOMLConfig, to get a global configuration? Or is setOMLConfig only intended for session-specific configuration?

  2. This is less important, but after modifying the config file manually, the package has to be detached and loaded again to pick up the changes. What if we extend setOMLConfig so that the default call setOMLConfig(conf=list()) always reloads the config file and applies the configuration specified there?

RWeka::read.arff broken

If you provide a path relative to your home ("~/.openml/cache/..."), you get a nice java IO error...

slot and data frame names of objects

Names should be the same as on the server. The only difference should be . instead of _.

Exceptions only in VERY RARE cases.

Check current objects against that rule.
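The naming rule above can be sketched in one line of base R. `toRName` is a hypothetical helper and the server-side names below are only illustrative examples, not a list taken from the package.

```r
# Hypothetical helper: map server-side names (underscores) to R-side
# names (dots). 'fixed = TRUE' treats "_" and "." as literal characters.
toRName = function(x) gsub("_", ".", x, fixed = TRUE)

# Illustrative server-side names.
server.names = c("data_id", "upload_date", "default_target_attribute")
r.names = toRName(server.names)
print(r.names)
```

Checking existing objects against the rule then amounts to comparing their slot or column names with the converted server names.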

`getOMLRun` returns `NA`s for the `did` column in the `output.data` slot

I am not sure why we need the did column for the evaluations in the output.data slot of an OMLRun object:

run = getOMLRun(run.id = 1L)
run$output.data$evaluations$did
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

I have checked the did column for the first 100 run.ids and each of them returns a vector of NAs, so do we really need this column?

I made a similar observation for run$output.data$evaluations$label, run$output.data$evaluations$sample.size and run$output.data$evaluations$stdev. Can we remove these columns when they contain only NAs? Or can we even remove them completely?
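If the columns are only dropped when they are entirely NA, that could look roughly like the base R sketch below. `dropAllNaColumns` is a hypothetical helper, and the small data frame is a made-up stand-in for run$output.data$evaluations, not real server output.

```r
# Hypothetical helper: drop columns that contain only NA values.
dropAllNaColumns = function(df) {
  keep = vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

# Made-up stand-in for the evaluations data frame.
evals = data.frame(
  name = c("measure_a", "measure_b"),
  value = c(0.97, 0.93),
  did = NA,
  stdev = NA_real_,
  stringsAsFactors = FALSE
)

cleaned = dropAllNaColumns(evals)
print(names(cleaned))
```

A drawback of this approach is that the set of columns then varies between runs, which may be an argument for removing the columns completely instead.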

`getOMLConfig` repeated class assignment

The class "OMLConfig" is assigned each time when one runs getOMLConfig().

Example: Running for (i in 1:10) cat(length(class(getOMLConfig())), fill=TRUE) should always print the same length, but currently the length increases with each call.
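The growing class vector can be reproduced outside the package with a plain list, as in the sketch below. Both functions are hypothetical illustrations of the pattern, not the package's actual code: the first prepends the class unconditionally, the second only when it is missing.

```r
# Pattern that accumulates: prepends the class on every call.
addClassNaive = function(x) {
  class(x) = c("OMLConfig", class(x))
  x
}

# Idempotent variant: only add the class if the object lacks it.
addClassOnce = function(x) {
  if (!inherits(x, "OMLConfig")) class(x) = c("OMLConfig", class(x))
  x
}

# Made-up configuration object.
naive = list(verbosity = 1L)
once = list(verbosity = 1L)
for (i in 1:3) {
  naive = addClassNaive(naive)
  once = addClassOnce(once)
}

print(length(class(naive)))  # 4
print(length(class(once)))   # 2
```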

Tutorial

  1. Extend the current pages somewhat.

  2. Add some general info on OpenML, but always link if info is already there.

  3. Show downloading of performance info.

downloadOpenMLDataAsMlrTask

  • Needs to be changed, there is some ::: stuff in there.
  • Do we really want OpenML data sets as mlr tasks? OpenML tasks should become mlr tasks!
  • Document

Make the code work for now, but discuss whether this is reasonable then!

bug in data.desc printer

ot = downloadOpenMLTask(3008)
Downloading task 3008 from OpenML repository.
Intermediate files (XML and ARFF) will be stored in : /tmp/Rtmpm70wOk
Downloading file: /tmp/Rtmpm70wOk/task.xml from:
http://openml.org/api/?f=openml.tasks.search&task_id=3008
Downloading file: /tmp/Rtmpm70wOk/data_set_description.xml from:
http://openml.org/api/?f=openml.data.description&data.id=299
Downloading file: /tmp/Rtmpm70wOk/data_set.ARFF from:
http://openml.org/files/download/52200/libras_move.arff
Downloading file: /tmp/Rtmpm70wOk/data_splits.ARFF from:
http://openml.org//api_splits/get/3008/Task_3008_splits.arff
All intermediate XML and ARFF files are now removed.
ot$
ot$id ot$pars ot$data.desc.id ot$estimation.procedure ot$evaluation.measures
ot$type ot$target.features ot$data.desc ot$preds
ot$data.desc

Dataset libras_move :: (openML ID = 299, version = 1)

    Collection Date  : 2009
    Upload Date      : 2014-08-20 20:56:22
    Licence          : Public
    URL              : http://openml.org/files/download/52200/libras_move.arff

Error in if (x$language != "") catf("\tLanguage : %s", x$language) :
missing value where TRUE/FALSE needed
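The error above is what R raises when the condition of an `if` is NA, which happens here if x$language is NA rather than an empty string. A possible guard is sketched below. `hasText` is a hypothetical helper, and the list is a made-up stand-in for the data set description object.

```r
# Hypothetical helper: TRUE only for a single, non-missing, non-empty string.
hasText = function(s) {
  length(s) == 1L && !is.na(s) && nzchar(s)
}

# Made-up stand-in for a description whose language field is missing.
x = list(name = "libras_move", language = NA_character_)

# Only fields that carry text are printed; the NA field is skipped
# instead of triggering "missing value where TRUE/FALSE needed".
if (hasText(x$name)) cat(sprintf("\tName     : %s\n", x$name))
if (hasText(x$language)) cat(sprintf("\tLanguage : %s\n", x$language))
```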

Add function to simply download a data set

Like downloadOpenMLDataAsMlrTask, but return the raw results instead of converting them to mlr.

Then remove downloadOpenMLDataAsMlrTask, and write a converter from that result object to mlr.

Measure time and also use rscimark to upload info about the PC

Measure the time of each train/test split, separately for training and prediction.

Also do one final model fit on the whole data and predict on the whole data. Throw the results away.

Depending on how large the model is, upload it as well, also in human-readable form.

Also upload the rscimark vector, once per machine, and cache it.

How to upload a dataset?

I wasn't able to find a specific function to upload data sets to OpenML. Is there one available, or do I need to make a query?

Configuration not documented

Link in ?configuration points to BatchJobs. We also need some documentation in the package, i.e. location and format of the configuration file.

java problem

not sure why, but I'm having a problem with java. On
devtools::install_github("openml/r")
I get

  • installing source package ‘OpenML’ ...
    ** R
    ** inst
    ** tests
    ** byte-compile and prepare package for lazy loading
    No Java runtime present, requesting install.

I just installed Java (v8), and
from the terminal it seems OK:

stephens$ java -version
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

any ideas?

document the OMLDataSet Class

Do an extra roxygen section / file (without real code) OMLDataSet.R

Document everything which is in the class. For the "data" and "desc" try to link the global API Docs.
(otherwise too much to write down)

toMLR: Why do we ignore the given splitting?

Here we create a resample description, although an OpenML task already contains a given data split, so we should create a resample instance directly.

> task.openML$estimation.procedure

Estimation Method :: holdout
Parameters         ::
    number_repeats = 1
    number_folds = NA
    percentage = 33
    stratified_sampling = true
Data Splits        :: 
'data.frame':   2000 obs. of  4 variables:
 $ type : Factor w/ 2 levels "TRAIN","TEST": 2 2 2 2 2 2 2 2 2 2 ...
 $ rowid: int  475 904 1484 751 1253 1024 1356 1376 389 10 ...
 $ rep  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ fold : num  1 1 1 1 1 1 1 1 1 1 ...
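Turning a data split table like the one above into explicit train and test index sets could look roughly like this in base R. The data frame is a small made-up stand-in with the same four columns, `splitIndices` is a hypothetical helper, and the sketch assumes that rowid can be used directly as a row index (whether the server counts rows from 0 or 1 would need to be checked).

```r
# Made-up stand-in for the "Data Splits" data frame.
splits = data.frame(
  type = factor(c("TEST", "TEST", "TRAIN", "TRAIN", "TRAIN", "TRAIN"),
    levels = c("TRAIN", "TEST")),
  rowid = c(475L, 904L, 10L, 389L, 751L, 1024L),
  rep = 1,
  fold = 1
)

# Hypothetical helper: one index vector per (rep, fold) combination.
splitIndices = function(splits, type) {
  sub = splits[splits$type == type, , drop = FALSE]
  split(sub$rowid, list(rep = sub$rep, fold = sub$fold), drop = TRUE)
}

train.inds = splitIndices(splits, "TRAIN")
test.inds = splitIndices(splits, "TEST")
print(train.inds)
print(test.inds)
```

Each list element is named "rep.fold" (here "1.1"), and the two lists could then be used to build a fixed resample instance instead of a freshly sampled one.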

dots in feature names

Hi, I uploaded this dataset: http://openml.liacs.nl/d/345
and noticed there are periods in the feature names. Is this still an issue in R? As far as I can remember, we agreed that periods are quite common and will be handled on the R side?

Just checking :)

Show how a basic ML research question can be answered with OML and mlr

Brainstorm a few ideas.

Here is a simple one: bagging and random forests differ in
a) when you randomly select the features: in the tree nodes (RF) or for the whole tree (B)
b) whether you prune the tree (B: maybe) or not (RF)

Now, is there a big difference in performance? If yes, which of the two is more important?
Yes, one could probably also look at Breiman's paper for this, but why not demonstrate it via a nice OMLR study?

There are probably dozens of other interesting questions like this.

batch experiments errors

Resampling of the first 100 tasks with all adequate mlr learners.
Submitted: 1710 (100%)
Done: 1692 (98.95%)
Expired: 18 (1.05%)

There were 295 errors (17%). Here are some common error messages:

  • [114 times] Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Quite clearly, this is due to "pseudo factors" that have only one level. I guess, we should remove those automatically in runTask().

  • [24 times] Error in knn.cv(x, cl, k) : NA/NaN/Inf in foreign function call (arg 6)
  • [21 times] Error in scale.default(newdata[, scaling(object)$scaled, drop = FALSE], : length of 'center' must equal the number of columns of 'x'
  • [19 times] Error in qda.default(x, grouping, ...) : rank deficiency in group [...].

If you want me to submit some of the rarer errors, too, or post more details on those above, please tell me.
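Removing the single-level "pseudo factors" behind the most common error above could be sketched as follows in base R. `dropConstantFactors` is a hypothetical helper and the data frame is made up. A real implementation would also need to leave the target column untouched.

```r
# Hypothetical helper: drop factor columns with fewer than two observed
# levels. droplevels() ignores levels that never occur in the data.
dropConstantFactors = function(df) {
  constant = vapply(df, function(col) {
    is.factor(col) && nlevels(droplevels(col)) < 2L
  }, logical(1))
  df[, !constant, drop = FALSE]
}

# Made-up data: 'pseudo' has only one observed level.
d = data.frame(
  x = c(1.5, 2.0, 3.5),
  pseudo = factor(c("a", "a", "a")),
  class = factor(c("yes", "no", "yes"))
)

reduced = dropConstantFactors(d)
print(names(reduced))
```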

dependencies could be handled better

> library(openML)
> data.chars <- getDataCharacteristics()
Error in openMLSQLQuery(SQL.query) : 
  could not find function "str_replace_all"
> library(stringr)
> data.chars <- getDataCharacteristics()
Error in openMLSQLQuery(SQL.query) : 
  could not find function "fromJSON"
> library(rjson)
> data.chars <- getDataCharacteristics()
SQL was processed: 130 rows selected. 

downloading of source/binary files

Hey,

at the moment, in getOMLFlow we do not download source/binary files at all. How should the function behave?

  1. where to save the files (cache?)
  2. always download all, or filter by file format or something like that?

As an R-only user, I wouldn't be interested in anything other than R files, so a filter would be nice, I guess.

Authentication troubles

  • Authentication is currently required for all actions; this seems unnecessary.
  • The session hash should be stored in and retrieved from options (if not directly provided) after the first call to authenticateUser.

create multiple mlr tasks from OpenML data sets

It seems that downloadOpenMLDataAsMlrTask() can only be used once per R Session, because some files in the temporary data folder can't be overwritten.

Here's what I've done:
task1 = downloadOpenMLDataAsMlrTask("ozone_level")
-> works fine

task2 = downloadOpenMLDataAsMlrTask("abalone")
-> Error

Error Message:
Downloading data set 'abalone' from OpenML repository.
Intermediate files (XML and ARFF) will be stored in : C:\Users\tob\AppData\Local\Temp\Rtmp6D2vf4
Downloading file: C:\Users\tob\AppData\Local\Temp\Rtmp6D2vf4/data_set_description.xml from:
http://openml.org/api/?f=openml.data.description&data.id=183
Error in downloadOpenMLDataAsMlrTask("abalone") :
Assertion on 'file' failed: File at path already exists: 'C:\Users\tob\AppData\Local\Temp\Rtmp6D2vf4\data_set.arff'
All intermediate XML and ARFF files are now removed.

Of course, it's possible to change the directory via the dir-option, but for the download of multiple data sets it would be nice just to use the same temporary path over and over.


sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] OpenML_1.0 devtools_1.5
loaded via a namespace (and not attached):
[1] BBmisc_1.7 checkmate_1.2 codetools_0.2-8 digest_0.6.4 evaluate_0.5.5
[6] grid_3.1.0 httr_0.3 memoise_0.2.1 mlr_2.1 parallel_3.1.0
[11] parallelMap_1.1 ParamHelpers_1.3 plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1
[16] rJava_0.9-6 rjson_0.2.14 RWeka_0.4-23 RWekajars_3.7.11-1 splines_3.1.0
[21] stringr_0.6.2 survival_2.37-7 tools_3.1.0 whisker_0.3-2 XML_3.98-1.1

Create fewer files

xmlParse accepts character(n) as input if asText == TRUE. We do not need to write everything to temp files first.
