
azureml's Issues

publish/consume data frame input example fails

This example fails, and I'm not sure yet why:

  f <- function(a,b,c,d) list(sum = a+b+c+d, prod = a*b*c*d)
  ep <-  publishWebService(ws,
                           f,
                           name = "rowSums",
                           inputSchema = list(
                             a="numeric",
                             b="numeric",
                             c="numeric",
                             d="numeric"
                           ),
                           outputSchema = list(
                             sum ="numeric",
                             prod = "numeric")
  )
  x <- head(iris[,1:4])  # First four columns of iris

  # Note the following will FAIL because of a name mismatch in the arguments
  # (with an informative error):
  consume(ep, x, retryDelay=1)
  # We need the columns of the data frame to match the inputSchema:
  names(x) <- letters[1:4]
  # Now we can evaluate all the rows of the data frame in one call:
  consume(ep, x)
  # output should look like:
  #    sum    prod
  #1 10.2   4.998
  #2  9.5   4.116
  #3  9.4  3.9104
  #4  9.4   4.278
  #5 10.2    5.04
  #6 11.4 14.3208

Add mechanism for rate limiting

As noted in #44, if requests are sent too frequently to the AzureML API, it is very easy to exceed the rate limit.

One possibility is to add multiple retries, possibly with exponential back-off (#48), to all functions that call the API.
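A minimal sketch of such a retry wrapper, assuming every API-calling function can be wrapped in a closure (`try_request` is a hypothetical stand-in, not an existing function in the package):

```r
# Sketch only: retry a failing API call with exponential back-off.
# 'try_request' is any zero-argument function that signals an error on failure.
with_retries <- function(try_request, max_tries = 5, base_delay = 1) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(try_request(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (i == max_tries) stop(result)
    Sys.sleep(base_delay * 2 ^ (i - 1))  # 1s, 2s, 4s, ...
  }
}
```

The back-off doubles on each failure, which spreads out bursts of requests and gives the rate limiter time to recover.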

experiments() throws POSIXct warnings

Sample code:

ws <- workspace()
experiments(ws)

Result:

Warning messages:
1: In strptime(xx, f <- "%Y-%m-%d %H:%M:%OS", tz = tz) :
  unable to identify current timezone 'C':
please set environment variable 'TZ'
2: In strptime(xx, f <- "%Y-%m-%d %H:%M:%OS", tz = tz) :
  unknown timezone 'localtime'
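A possible workaround until the package handles this itself: set the `TZ` environment variable before any date-time parsing happens, e.g. at the top of the session or in `.Rprofile`:

```r
# Workaround sketch: give strptime() an explicit timezone so it stops warning.
Sys.setenv(TZ = "UTC")   # or your local zone, e.g. "America/Chicago"
x <- strptime("2015-09-01 12:00:00", "%Y-%m-%d %H:%M:%OS")
```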

Warn if trying to download a zipped dataset

Right now, download.dataset() throws an error when trying to download a Zip file. However, this error only happens after the API call returns.

We have two options:

  1. Display a helpful message and stop before making the API call
  2. Add ability to download Zip to local file system.

We should definitely do 1.

Should we also support the ability to download a Zip file?
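A sketch of option 1, assuming the dataset metadata exposes a field identifying Zip datasets (`data_type_id` here is an illustrative name, not a confirmed field):

```r
# Sketch: fail fast with a helpful message before making the API call.
download.dataset <- function(dataset) {
  if (identical(dataset$data_type_id, "Zip")) {
    stop("Zip datasets are not supported by download.dataset(). ",
         "Consider downloading the file manually from ML Studio.")
  }
  # ... proceed with the API call ...
}
```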

License & copyright

What are the appropriate statements? Where do we need to place them? Thanks
David, please answer or re-assign as appropriate. I don't have access to OSS training.

Specify geolocation url in workspace object

At the moment, the user can specify the URL in each function.

It would be a cleaner implementation to specify this in the workspace object, and then each function reads the url from the workspace object.
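A minimal sketch of that design; the field names and default endpoint are illustrative, not the package's actual implementation:

```r
# Sketch: the workspace object carries the endpoint once; every function
# derives its URL from the workspace object instead of taking a url argument.
workspace <- function(id, auth,
                      api_endpoint = "https://studioapi.azureml.net") {
  structure(list(id = id, auth = auth, api_endpoint = api_endpoint),
            class = "Workspace")
}

datasets_url <- function(ws) {
  # no per-call url parameter needed; path shown here is illustrative
  paste0(ws$api_endpoint, "/workspaces/", ws$id, "/datasources")
}
```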

In callAPI() break immediately for certain types of errors

Not all errors returned with a 400 response are because the service is unavailable.

Sometimes the error code contains helpful debugging information.

In these cases, callAPI() should break immediately, rather than retrying several times.
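One way to sketch the decision; the set of non-retryable error codes below is illustrative (only `LibraryExecutionError` appears in the reports above):

```r
# Sketch: classify errors before retrying, so informative failures
# surface immediately instead of being retried several times.
non_retryable <- c("LibraryExecutionError", "BadArgument")  # illustrative list

should_retry <- function(status, error_code) {
  if (error_code %in% non_retryable) return(FALSE)  # fail fast with the message
  status %in% c(429, 503) || status >= 500          # transient conditions
}
```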

publishWebService() should throw an error if fun is not a function

I just spent several hours tracing some unexpected behaviour.

If the fun argument to publishWebService() is a character string (allowed in the previous incarnation of AzureML), then consume() always throws an error.

This throws an error:

ws <- workspace()
api <- publishWebService(
  ws,
  fun = "add", 
  name = "add",
  inputSchema = list(
    x = "numeric", 
    y = "numeric"
  ), 
  outputSchema = list(
    ans = "numeric"
  )
)

consume(api, df, retryDelay = 0.1)

Error:

> consume(api, df, retryDelay = 0.1)
Warning: Request failed with status 400. Retrying request...
List of 1
 $ error:List of 3
  ..$ code   : chr "LibraryExecutionError"
  ..$ message: chr "Module execution encountered an internal library error."
  ..$ details:'data.frame': 1 obs. of  3 variables:
  .. ..$ code   : chr "FailedToEvaluateRScript"
  .. ..$ target : chr "Execute R Script Piped (RPackage)"
  .. ..$ message: chr "The following error occurred during evaluation of R script: R_tryEval: return error: Error in do.call(\"..fun\", as.list(inputD"| __truncated__
NULL
 Error in callAPI(apiKey, requestUrl, requestsLists, globalParam, retryDelay) : 
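A guard at the top of publishWebService() would catch this at publish time instead. A sketch, with the argument list abbreviated:

```r
# Sketch of the proposed guard; not the package's current implementation.
publishWebService <- function(ws, fun, name, inputSchema, outputSchema, ...) {
  if (!is.function(fun)) {
    stop("'fun' must be a function, not a ", class(fun)[1],
         ". Did you pass the function's name as a string?")
  }
  # ... existing publication logic ...
}
```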

Give more descriptive error message when zip isn't available

User report:

I’m trying this on Windows… when I try to use publishWebService, I get:

Error in publishWebService(ws, fun = add, name = "AzureML-vignette-add",  : 
  Requires external zip utility. Please install zip and try again.

But Windows already has a zip utility, doesn’t it? I am able to zip and unzip files from the File Explorer.

I installed http://www.7-zip.org/ but that didn’t work either. Any ideas what zip utility must be installed and how to make it work?


Since the zip utility must also be on the PATH, we can make this message clearer and more explicit.
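A sketch of the check and a more explicit message. Note that R's `zip()` needs a command-line `zip` executable on the PATH; the GUI zip support in File Explorer or 7-Zip does not provide one under that name (on Windows, Rtools ships a command-line zip):

```r
# Sketch: detect a command-line zip up front and explain how to fix it.
zip_available <- function() nzchar(Sys.which("zip"))

require_zip <- function() {
  if (!zip_available()) {
    stop("Requires the external 'zip' command-line utility on the PATH.\n",
         "GUI zip tools (File Explorer, 7-Zip) are not sufficient; install\n",
         "a command-line zip (e.g. from Rtools on Windows), add its\n",
         "directory to the PATH, and restart R.")
  }
  invisible(TRUE)
}
```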

Add MIT license text to the top of each file

As required by Microsoft policy. The text is:

------------------------------------------ START OF LICENSE -----------------------------------------
RA-Internal-azureml ver. 0.1
Copyright (c) Microsoft Corporation
All rights reserved. 
MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ""Software""), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
----------------------------------------------- END OF LICENSE ------------------------------------------

Automatic discovery of data frame classes in publishWebService() throws an error

Example:

ws <- workspace()

library(lme4)
set.seed(1)
model <- lmer(Reaction ~ Days + (Days | Subject), data=sleepstudy[sample(nrow(sleepstudy), 120),])
p <- function(newdata){
  predict(model, newdata=newdata)
}


ep <- publishWebService(ws, fun = p, name = "sleepy lmer 2",
                        inputSchema = sleepstudy,
                        data.frame=TRUE
)

Result:

 Error: sum(nb) == q is not TRUE 

Use roxygen for help documentation

Andrie said:

Do you think it’s possible to use roxygen to document the package? I think this is easier in the long run, but I have no idea how this will work with R6 functions.

Antonio said:

I think it would clash with the export stuff, but I am not an R6 expert, so I am not sure. The problem is that roxygen doesn't work with R unless you use a subset of it. Since I use higher order functions, and there is no way of documenting a dynamically generated one, it can't be used with most of my code, azureml being no exception. The whole design of roxygen is anti-modular and inspired by an inferior static language that forces people to use preprocessors. I am just waiting patiently for the moment Roxygen fades away.

Andrie says: I think there is a way to get the best of both worlds. Not having to learn latex must be a benefit. I'll do some experiments to see if I can convert your .rd files into sensible roxygen.

Use public workspace in examples and tests

API URL:

https://ussouthcentral.services.azureml.net/workspaces/f5e8e9bc4eed4034b78567449cfca779/services/b4caf0f0ebfd451bbc187741894e213b/execute?api-version=2.0&details=true

scheme <- "https"
host <- "ussouthcentral.services.azureml.net"
endpointId <- "b4caf0f0ebfd451bbc187741894e213b"
workspaceId <- "f5e8e9bc4eed4034b78567449cfca779"

Add minimal unit test framework

I propose a method like the following to include tests in the package:

  • Create a config.txt file that contains placeholder values for the API keys. (Or just instructions on setting up the keys)
  • Add config.R to .gitignore
  • The test user has to manually create config.R on his own machine. (The .gitignore ensures this never gets uploaded to github.)
  • The unit tests read values from config.R – if this file doesn’t exist, then no tests are run.
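The gating logic could be sketched as a small helper (the file name and the variables it defines are the proposed, not existing, conventions):

```r
# Sketch: run live API tests only when the developer has created config.R.
run_live_tests <- function(config = "config.R") {
  if (!file.exists(config)) {
    message("'", config, "' not found; skipping live API tests")
    return(FALSE)
  }
  # In a real test file this would source into the test environment;
  # config.R is expected to define e.g. workspace_id and authorization_token.
  source(config, local = TRUE)
  TRUE
}
```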

Remove dependency on purrr ?

Andrie said:

I notice that your toolchain depends on the packages R6 and purrr. The purrr dependency worries me a little bit. The package isn’t on CRAN yet, so presumably still unstable? (minor concern)

Antonio said:

It seems stable enough to me but I am worried about distribution. What will happen when people try to install azureml from CRAN? I don't think there's a way the dependency will be fulfilled automatically. Let's think for a moment if there is an alternative to dropping the dependency. It improves code quality by quite a bit. It's a great package.

Update vignette

The vignette still uses the original publishWebService() API.

Use idiomatic R for API

Andrie said

Are you open to suggestions for what the API looks like?

For example, your current code has this:

irisaz1 = ws$datasets$get.item("Iris Two Class Data")
irisaz1$as.data.frame()

Perhaps we could define S3 methods to simplify the user experience to something like:

irisaz1 = ws$datasets["Iris Two Class Data"]
as.data.frame(irisaz1)

Antonio responded:

Absolutely. The goal right now is python parity, and being aligned with the equivalent python package has cut dev time and it's going to be easier to track what they do. But as far as the API, you noticed everything is public. That can't stay that way. It's more than accepting suggestions. Let's write an API

I just found out that s3 dispatch works on r6 classes with little fuss (http://stackoverflow.com/questions/28117585/proper-way-to-implement-s3-dispatch-on-r6-classes) so this is possible. As far as complexity:

irisaz1$as.data.frame()
        # VS
as.data.frame(irisaz1)

I have seen code complexity. That is not it.

Maybe more idiomatic R, yes. But that was a method that clearly lent itself to be associated with a known and loved generic. What should we do with get.item, datasets etc? Maybe there is a mapping, but it's not so clear to me. Using some generics and some methods, maybe, let me think about that. As far as following python, I think we can keep mimicking the object structure there with R6 objects and then layer generics on top of them as in

as.data.frame.Dataset = function(data) data$as.data.frame()

It is that simple, courtesy Winston Chang. Let's just keep these generics in a separate file because even the file structure follows the python original. You even find python in the comments (gradually being removed). Then when we restrict the exports, we can point mostly at the generic functions. Right now my focus is still on getting the functionality in, so if you want to lay it out, even just the names, please do; I have to focus on uploads.
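The pattern Antonio describes (an S3 generic layered over an object's own method) can be sketched without pulling in R6 at all, since an R6 object is essentially an environment with a class attribute; `Dataset` and `new_dataset` here are minimal stand-ins, not the package's classes:

```r
# Sketch: S3 dispatch layered over an R6-style object.
new_dataset <- function(df) {
  self <- new.env()
  self$df <- df
  self$as.data.frame <- function() self$df  # the object's own method
  class(self) <- "Dataset"
  self
}

# One-line S3 wrapper, as in the comment above:
as.data.frame.Dataset <- function(x, ...) x$as.data.frame()
```

With this in place, both `d$as.data.frame()` and `as.data.frame(d)` work on the same object.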

Read workspace data from json file by default

The data is:

{"workspace":{
  "id":"test_id",
  "authorization_token": "test_token",
  "api_endpoint":"api_endpoint",
  "management_endpoint":"management_endpoint"
}}

The idea here is that if workspace() is called with no parameters we should try and read ~/.azureml/settings.json and get the workspace id/authorization token/end points and use the information there. That way we can keep the tokens out of the notebooks.
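A sketch of the reader, using jsonlite (an assumption; any JSON parser would do) and the field names from the JSON above:

```r
# Sketch: default workspace() settings from ~/.azureml/settings.json.
read_settings <- function(path = "~/.azureml/settings.json") {
  if (!file.exists(path)) return(NULL)
  jsonlite::fromJSON(path)$workspace
}
```

`workspace()` could then fall back to `read_settings()` whenever `id` or the token is missing, keeping credentials out of notebooks.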

read settings file

equivalent to get_workspace_info in python. Not user accessible, provides default for Workspace constructor

Add ability to read .dataset formats (deserialization)

The .dataset format is used as the output of most modules in ML Studio (intermediate datasets). For example, the Split module results are in that format.

Studio currently disables the Generate Data Access Code and Open in Notebook features on those output nodes due to lack of deserialization support for that format in Python.

To access those intermediate datasets from Python code, the user needs to insert a Convert to CSV module. Note that this conversion loses some metadata, such as column type information. Pandas can infer the types most of the time, but sometimes it requires user post-processing.

Modify mechanism to read JSON file in unit tests

The current mechanism is:

keyfile <- system.file("tests/testthat/config.json", package = "AzureML")

This is flawed since this json file can't be found during package build.

The fix might be to put this json file in /inst/...
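With the file shipped under `inst/`, installation places it at the package root and the lookup drops the `tests/` prefix; a sketch:

```r
# Sketch of the fixed lookup once the file lives in inst/config.json.
keyfile <- system.file("config.json", package = "AzureML")
have_config <- nzchar(keyfile)  # system.file() returns "" when not found
```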

Use exponential backoff when encountering 503 errors in consumeWebservice

A response of 503 means either:

  1. The webservice is receiving more requests than it is configured for (steady-state error)
  2. The webservice is initializing containers and AzureML is serializing the initialization (because initializing new containers takes time). This happens when there is a spike.

If a request returns 503, the client should retry with exponential back-off (up to n times, per your business logic).

Note that the time it takes to initialize a container is dependent on the nature of the model that you are uploading.
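The retry schedule itself is easy to compute; a sketch with illustrative parameters (base delay, retry count, and cap are business-logic choices):

```r
# Sketch: delays for successive 503 retries, doubling up to a cap.
backoff_delays <- function(n, base = 1, cap = 60) {
  pmin(base * 2 ^ (seq_len(n) - 1), cap)
}

backoff_delays(6)  # 1 2 4 8 16 32
```

Because container initialization time depends on the model, the cap should be generous enough to cover a cold start.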
