
Comments (45)

leeper avatar leeper commented on August 27, 2024 2

This is a slim and easy-to-use package that works exactly as described. The documentation is generally complete and all examples in docs and README run as expected. I think it's really close to being ready for acceptance, but there are a couple of issues and I have some suggestions on usability.

  1. I get an error when building vignettes from GitHub (build_vignettes = TRUE):

     Quitting from lines 35-41 (monkeylearn_intro.Rmd) 
     Error: processing vignette 'monkeylearn_intro.Rmd' failed with diagnostics:
     HTTP failure: 401
     Invalid token header. No credentials provided.
     Execution halted

    This seems to be due to requiring an API key for examples to run. For CRAN release, the code may need to be pre-executed so the vignettes can build. This issue also emerges on R CMD check in a few places:

    1. examples, which should probably be wrapped in \dontrun{}
    2. test suite, which probably needs to be run conditionally using whichever CRAN-identifying method is appropriate (version number, testthat's check, a simple if() based on API key availability, etc.); see the sketch after this list.
  2. I also get a NAMESPACE note in R CMD check:

     * checking R code for possible problems ... NOTE
     monkeylearn_check: no visible binding for global variable 'text'
     monkeylearn_parse: no visible global function definition for 'is'
     Undefined global functions or variables:
       is text
     Consider adding
       importFrom("graphics", "text")
       importFrom("methods", "is")
     to your NAMESPACE file (and ensure that your DESCRIPTION Imports field
     contains 'methods').
    

    That is all pretty minor.

  3. The code itself also looks good and is very straightforward, seeing as there are only two functions: monkeylearn_extract() and monkeylearn_classify(). For both functions, the output is a list of two tbl_dfs. Only the first one, "results", is really useful. The second, "headers", is basically request metadata. I think it would be more useful to return a single tbl_df and leave the "headers" information as attributes to that, in the form of a list. This will improve efficiency, because the headers won't be run through both as.data.frame() and tbl_df(), and it will make the results easier to access.

  4. The only purpose for the headers response seems to be to track the x.query.limit.remaining HTTP header. I would handle this in two ways, both different from current behavior:

    1. Report this value when verbose = TRUE.
    2. Store the value in a non-exported package environment and have the internal functions monkeylearn_get_extractor() and monkeylearn_get_classify() update it after each request, then issue a warning when the value gets low.
  5. The functions should use message() rather than print() for diagnostics when verbose = TRUE. This occurs in several places.

  6. The only major substantive suggestion I have for the package is to perhaps create some functions or variables to represent the classifier ID and extractor ID values for some common methods. For example, the README and examples require the user to input something like "ex_isnnZRbS" as the value of extractor_id, à la monkeylearn_extract(request = text, extractor_id = "ex_y7BPYzNG"). If this could instead be something like monkeylearn_extract(request = text, extractor_id = .entity), monkeylearn_classify(request = text, classifier_id = .profanity), etc., it would greatly simplify the use of the most commonly needed methods.

  7. One really small code style comment: on if (), while (), etc. statements, I think Hadley, Google, and most style guides recommend a space after the keyword and before the opening brace, so that:

     while(class(output) == "try-error" && try_number < 6){
     ...
     }

    becomes

     while (class(output) == "try-error" && try_number < 6) {
     ...
     }
  8. It might be useful to note in the README that MonkeyLearn supports registration through GitHub, which makes the registration process really easy.

  9. It might make the README a little flashier if it had some graphs. That's totally not necessary at all, but it seems like that captures people's attention better than pure text.
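
Regarding point 1, here is a minimal sketch (not the package's actual test file) of how a test could be skipped on CRAN and whenever no API key is available; the environment variable name MONKEYLEARN_KEY is an assumption, while the testthat helpers themselves are real:

     library("testthat")

     test_that("monkeylearn_extract() returns something", {
       skip_on_cran()  # never hit the API on CRAN
       # skip when no API key is available; "MONKEYLEARN_KEY" is an assumed variable name
       skip_if_not(nzchar(Sys.getenv("MONKEYLEARN_KEY")),
                   "No MonkeyLearn API key available")

       output <- monkeylearn_extract(request = "In the eyes of the world",
                                     extractor_id = "ex_isnnZRbS")
       expect_false(is.null(output))
     })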

Overall, really easy to use and really nicely done.


noamross avatar noamross commented on August 27, 2024 1

Thank you for the review, @leeper!

@masalmon, you may yet receive another review, but go ahead and proceed with changes in the meantime.


leeper avatar leeper commented on August 27, 2024 1

Yup, that works. Then there's a consistent structure regardless of whether it's one call or many.


leeper avatar leeper commented on August 27, 2024 1

I'd say write an example of how to check rate limiting and if it's really complicated, perhaps write it into the package. If it turns out to be simple, then just leave it as an example and/or note it in the README.


maelle avatar maelle commented on August 27, 2024 1

I feel I can now answer the review comments. Thanks a lot, @leeper! And thanks @noamross for your contributions to the discussion!

Something I did independently of this feedback was to add a small vignette with a short analysis of the book Little Women.

  1. Great point! I have added a chunk at the beginning of both vignettes now.
  2. Thanks, now there is no NOTE.
  3. Now the output is a data.frame with a data.frame as an attribute representing the headers.
  4. In the documentation of monkeylearn_classify and monkeylearn_extract, and in the README and vignette, I explain how to get the remaining number of calls, so users can use it as they wish.
  5. Thanks, I replaced print with message.
  6. Thanks, this is a great suggestion. This is what I have done:
  • For classifiers, when I asked again for a list of all modules, MonkeyLearn was kind enough to add an endpoint (they have not done so for extractors because there are fewer of them). I added the function monkeylearn_classifiers, which returns a data.frame of all classifiers, or only of private ones, with their IDs, names and other information such as the language. Users can use this table to look up the IDs of the classifiers they want to use (see the sketch after this list).
  • For extractors, the README and vignette only mention that one can browse existing extractors on the website and then save their IDs as environment variables. I did not ship a data.frame of extractors myself (as .RData) because I would have to choose which characteristics of the extractors to include, because different applications call for different extractors (e.g. sentiment analysis trained on tweets vs. on hotel reviews), and because my list would soon be out of date (which, by the way, is bad for the README and vignette too)... I also not-so-secretly hope there will be an endpoint for getting all of them, which might happen once extractors can be customized. Is it ok if I simply wait for a future version of the API before making the extractor IDs easier to find?
  7. I added the missing spaces.
  8. I added the sentence "Note that MonkeyLearn supports registration through GitHub, which makes the registration process really easy."
  9. As soon as I find a really good idea, I'll add that (probably as an advertisement for a supplementary vignette about one use case)! My current idea is to use the R interface to the Urban Dictionary API, get random definitions, run profanity and abuse detection on them, and then plot how many of the definitions get a high score. I am open to any cooler suggestion! In the meantime, is it ok if I don't have any graph in the README?
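
For point 6, a sketch of how the new monkeylearn_classifiers() helper could be used to look up an ID; the column names and the filtering idea are assumptions on my side, since the text above only says the table contains IDs, names, the language and other information:

     classifiers <- monkeylearn_classifiers()  # assumed to return all public classifiers by default
     # assumed columns: classifier_id, name, language
     subset(classifiers,
            language == "en" & grepl("sentiment", name, ignore.case = TRUE))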

Thanks again for the feedback, I'm really happy to be able to improve the package!


leeper avatar leeper commented on August 27, 2024 1

Nice. Looks good to me!


leeper avatar leeper commented on August 27, 2024 1

Sure thing.


noamross avatar noamross commented on August 27, 2024 1

Accepted! Thank you @masalmon and @leeper for your review and follow-up. @masalmon, I believe you know the drill from here.


maelle avatar maelle commented on August 27, 2024 1

Wow, this was quick, thanks a lot @sckott!


mdsumner avatar mdsumner commented on August 27, 2024 1

Well done!

On Sun, 11 Sep 2016, 17:35 Maëlle Salmon [email protected] wrote:

For info the package is on CRAN now. Thanks again for the review!


Dr. Michael Sumner
Software and Database Engineer
Australian Antarctic Division
203 Channel Highway
Kingston Tasmania 7050 Australia


sckott avatar sckott commented on August 27, 2024

Thanks for your submission @masalmon - We'll get back to you soon


noamross avatar noamross commented on August 27, 2024

Hi @masalmon, I'll be your handling editor and am currently seeking reviewers. FYI, this submission prompted discussion among the editors on two topics:

  • Fit: Previous API wrappers we've accepted had fit under accessing and depositing data in databases, geospatial processing, and plotting services. This package fits none of those, but clearly fits in with a growing theme: extracting data from unstructured sources (pdftools and tabulizer being examples). We've added a new category to our policies document to explicitly include packages in this category, API wrappers or not, without expanding to "any API wrapper".
  • API Stability: With so many processing-as-a-service startups in the market, it's inevitable that APIs will go through periods of instability and maturation, and many may be short-lived, so package maintenance requirements may be high or packages may become obsolete. We came to no conclusion on this, but we will be keeping an eye on API wrapper packages to see how our system copes with this churn.


maelle avatar maelle commented on August 27, 2024

Cool, thanks. Actually geoparser also extracts data from unstructured text & depends on a processing-as-a-service startup, although it's a geospatial thing too.

I know I'm creating work for myself later with API packages. ropenaq also breaks when the API changes... I hope I'll be able to cope & to find replacements for the APIs that disappear 😸


noamross avatar noamross commented on August 27, 2024

Reviewers: @hilaryparker @leeper
Due date: 2016-06-15


sckott avatar sckott commented on August 27, 2024

@hilaryparker
Due date: 2016-06-15 - hey there, it's been 21 days, please get your review in soon, thanks 😺 (ropensci-bot)


ropenscibot avatar ropenscibot commented on August 27, 2024

@hilaryparker
Due date: 2016-06-15 - hey there, it's been 29 days, please get your review in soon, thanks 😺 (ropensci-bot)


ropenscibot avatar ropenscibot commented on August 27, 2024

@hilaryparker
Due date: 2016-06-15 - hey there, it's been 45 days, please get your review in soon, thanks 😺 (ropensci-bot)


noamross avatar noamross commented on August 27, 2024

Additional reviewer: @leeper


noamross avatar noamross commented on August 27, 2024

One suggestion regarding outputting tbl_df()'s from your functions. Per the update to tibble and this comment from Hadley, it may make sense to replace tbl_df() (as well as as.data.frame()) with tibble::as_tibble() to anticipate future changes in the tidyverse. This won't increase the package weight as dplyr already depends on tibble. If you do so, specify tibble (>= 1.1) in Imports:.
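
A minimal sketch of the suggested swap, assuming the functions currently build their results with as.data.frame() followed by dplyr::tbl_df() (the parsed object below is a stand-in for the parsed API response):

     # before (assumed current pattern)
     results <- dplyr::tbl_df(as.data.frame(parsed, stringsAsFactors = FALSE))

     # after, with tibble (>= 1.1) in Imports:
     results <- tibble::as_tibble(as.data.frame(parsed, stringsAsFactors = FALSE))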


maelle avatar maelle commented on August 27, 2024

Many thanks!! 👍 I am on holidays this week so will implement the suggested changes next week. ☺


maelle avatar maelle commented on August 27, 2024

Hi @leeper, thanks again, this will be very useful!

I only have questions about two points:

  • I'm not sure I understand what you mean by "I think it would be more useful to return a single tbl_df and leave the "headers" information as attributes to that, in the form of a list". Do you mean adding the information from the request as columns to the results data frame instead of having them separately, so that there would be a column "x.query.limit.request.queries" with the same value for the whole data frame? But you say "list", so I'm pretty sure I'm misunderstanding your suggestion.
  • And does the following point mean you think that the remaining number of queries should not be part of the output, or that there should be an additional warning?

I had chosen to output a list of data frames because I did it this way for ropenaq but obviously I'm open to changes. 😀


noamross avatar noamross commented on August 27, 2024

@masalmon R objects have an "attributes" component that stores metadata such as dimensions and names, but can also hold arbitrary R objects as metadata. See this section in Advanced R, the R Manual section, ?attributes and ?attr. So you can attach the headers to your output and access them with attributes(output)$headers or attr(output, "headers"). I believe @leeper is suggesting that since this header data isn't inherently tabular and consists of a few values, just leaving it as a list makes more sense than converting it to a data frame/tibble.
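
A tiny base R illustration of the pattern, with toy values (the header names are taken from the package's output shown later in this thread):

     results <- data.frame(tag = c("Paris", "Louvre"), probability = c(0.97, 0.88))
     headers <- list(x.query.limit.limit = "50000", x.query.limit.remaining = "49101")

     output <- structure(results, headers = headers)  # attach the metadata as an attribute

     attr(output, "headers")$x.query.limit.remaining  # "49101"
     # equivalently: attributes(output)$headers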


maelle avatar maelle commented on August 27, 2024

@noamross cool, thank you!


maelle avatar maelle commented on August 27, 2024

Okay, so I now see why it (maybe) made sense to have headers as a tibble: when the input is a vector of texts (and not a single text), the headers object is a table with one line per original text. It would be harder to represent this if headers were a list. I had the same strategy in geoparser.

I'd tend to keep this because I don't know any better solution for this case, and if I keep it, I'd replace the current "text index" in the headers and results tables with an MD5 checksum, as I currently do in geoparser.

Any thoughts?


maelle avatar maelle commented on August 27, 2024

Reg. "The only major substantive suggestion I have for the package is to perhaps create some functions or variables to represent the classifier ID and extractor ID values for some common methods.", I'm trying to get a table of all public modules with their ID and name, if Monkeylearn can send me one (I already know it cannot be obtained automatically). If they can't, I'll do a small table myself and use it.


maelle avatar maelle commented on August 27, 2024

MonkeyLearn has actually started working on an API endpoint for finding public modules, so I'll wait for it to be ready before updating the package 😊


leeper avatar leeper commented on August 27, 2024

Sorry for being incommunicado on this. What I meant in terms of headers is, as @noamross describes, that I would have the functions return something like:

structure(data.frame(...), headers = list(...))

rather than:

structure(list(results = data.frame(...), headers = data.frame(...)))

It seems like the headers are secondary information in most cases, so this simplifies accessing the results. Whether results are a df or tibble...either way seems fine.

Sounds good on waiting to implement a list of modules!


maelle avatar maelle commented on August 27, 2024

No problem & thanks for your answer!

But in the case where the output data.frame is actually obtained by binding results from several calls to the API, would you expect the headers to be a list of lists?


noamross avatar noamross commented on August 27, 2024

How about having the headers as a df/tibble, but still as an attribute of the results?

On Tue, Jul 26, 2016 at 9:21 AM Maëlle Salmon [email protected] wrote:

No problem & thanks for your answer!

But in the case where the output data.frame is actually obtained from
binding results from several calls to the API, would you expect the headers
to be a list of list?



maelle avatar maelle commented on August 27, 2024

That sounds like a good compromise!


maelle avatar maelle commented on August 27, 2024

The function now outputs a data.frame with headers as a df attribute. 😄

I am still not sure about the following part of my code in both functions:

    output <- try(monkeylearn_get_extractor(request_part, key, extractor_id), silent = TRUE)
    # for the case when the server returns nothing
    # try 5 times, not more
    try_number <- 1
    while (class(output) == "try-error" && try_number < 6) {
      message(paste0("Server returned nothing, trying again, try number ", try_number))
      Sys.sleep(2^try_number)
      output <- try(monkeylearn_get_extractor(request_part, key, extractor_id), silent = TRUE)
      try_number <- try_number + 1
    }

    # check the output -- if it is 429 try again (throttle limit)
    while (!monkeylearn_check(output)) {
      output <- monkeylearn_get_extractor(request_part, key, extractor_id)
    }

So if the server returns nothing I retry up to 5 times, waiting longer each time. And if the response from the server has a 429 code, which indicates the throttle limit was hit, I wait 60 seconds before trying again.

I feel this is quite clumsy, maybe I could make this better? Any suggestion would be welcome.

Note that I read https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html: "For many APIs, the common approach is to retry API calls that return something in the 500 range. However, when doing this, it's extremely important to make sure to do this with some form of exponential backoff: if something's wrong on the server-side, hammering the server with retries may make things worse, and may lead to you exhausting quota (or hitting other sorts of rate limits). A common policy is to retry up to 5 times, starting at 1s, and each time doubling and adding a small amount of jitter (plus or minus up to, say, 5% of the current wait time)."
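
For what it's worth, here is a rough sketch of that policy applied to the snippet above (exponential backoff starting at 1s, with roughly +/- 5% jitter); it reuses monkeylearn_get_extractor() and its arguments from the code above and is only meant as an illustration:

     output <- try(monkeylearn_get_extractor(request_part, key, extractor_id), silent = TRUE)
     try_number <- 1
     while (inherits(output, "try-error") && try_number <= 5) {
       wait <- 2^(try_number - 1) * runif(1, 0.95, 1.05)  # 1s, 2s, 4s, ... plus jitter
       message("Server returned nothing, trying again in ", round(wait, 1),
               "s, try number ", try_number)
       Sys.sleep(wait)
       output <- try(monkeylearn_get_extractor(request_part, key, extractor_id), silent = TRUE)
       try_number <- try_number + 1
     }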

Now another question. @leeper, now that headers are part of the output as an attribute of the data.frame, do you still think reporting the x.query.limit.remaining HTTP header when verbose = TRUE makes sense? Now I feel that users could implement something in their own code based on x.query.limit.limit and x.query.limit.remaining, which I will make more explicit in the documentation.
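
For example, a user-side check could look something like this (a sketch; text stands for a character vector of texts, and the headers attribute and its column names come from the output described above):

     output <- monkeylearn_extract(request = text, extractor_id = "ex_isnnZRbS")
     headers <- attr(output, "headers")
     remaining <- as.numeric(as.character(headers$x.query.limit.remaining))
     remaining <- tail(remaining, 1)  # value after the most recent call
     if (remaining < 1000) warning("Only ", remaining, " MonkeyLearn queries left")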


maelle avatar maelle commented on August 27, 2024

@leeper I had added you to the DESCRIPTION file without asking first, I'm sorry. Are you ok with being in the DESCRIPTION?


noamross avatar noamross commented on August 27, 2024

I was running final checks and I sometimes get an error when running line 94 of the gutenbergr vignette. (It looks like the vignettes don't get tested on Travis?)

> keywords <- monkeylearn_extract(request = little_women_chapters$text,
                                  extractor_id = "ex_y7BPYzNG",
                                  params = list(max_keywords = 3))
Error in match.names(clabs, names(xi)) : 
  names do not match previous names

This appears to arise in extractor.R#97: headers <- rbind(headers, output$headers). When I dive in using options(error=recover) I find that these two objects have different names:

Browse[1]> str(headers)
'data.frame':   1 obs. of  11 variables:
 $ allow                        : Factor w/ 1 level "POST, OPTIONS": 1
 $ content.type                 : Factor w/ 1 level "application/json": 1
 $ date                         : Factor w/ 1 level "Tue, 02 Aug 2016 22:38:54 GMT": 1
 $ server                       : Factor w/ 1 level "nginx/1.8.0": 1
 $ vary                         : Factor w/ 1 level "Accept, Cookie": 1
 $ x.query.limit.limit          : Factor w/ 1 level "50000": 1
 $ x.query.limit.remaining      : Factor w/ 1 level "49101": 1
 $ x.query.limit.request.queries: Factor w/ 1 level "20": 1
 $ content.length               : Factor w/ 1 level "11274": 1
 $ connection                   : Factor w/ 1 level "keep-alive": 1
 $ text_md5                     :List of 1
  ..$ : chr  "ff08da4717ca6fadcc5496aa51aad2ea" "0c368463a9e7135209a3ac1eaf3e6120" "1edee1e8747131a45212df1bc007001d" "b765601afb831ea7adb5a89b2d3f5803" ...
Browse[1]> str(output$headers)
'data.frame':   1 obs. of  11 variables:
 $ allow                        : Factor w/ 1 level "POST, OPTIONS": 1
 $ content.type                 : Factor w/ 1 level "application/json": 1
 $ date                         : Factor w/ 1 level "Tue, 02 Aug 2016 22:48:50 GMT": 1
 $ server                       : Factor w/ 1 level "nginx/1.8.0": 1
 $ vary                         : Factor w/ 1 level "Accept, Cookie": 1
 $ x.query.limit.limit          : Factor w/ 1 level "50000": 1
 $ x.query.limit.remaining      : Factor w/ 1 level "49081": 1
 $ x.query.limit.request.queries: Factor w/ 1 level "20": 1
 $ transfer.encoding            : Factor w/ 1 level "chunked": 1
 $ connection                   : Factor w/ 1 level "keep-alive": 1
 $ text_md5                     :List of 1
  ..$ : chr  "144b8166ed539c3f712583460143d534" "c562796c5666bb759bbf97b9bc2963c9" "55b5bd0e6b4cd459e574db7e86b88407" "69c7ebf70662c78722ba1a7fccea1de2" ...

One has transfer.encoding and one has content.length, which carry different information. I don't know why these differ between calls. If they can legitimately differ, dplyr::bind_rows() can handle this and just fill in NAs for the missing columns.
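
A quick illustration of that behaviour, with toy one-row data frames standing in for headers and output$headers:

     a <- data.frame(server = "nginx/1.8.0", content.length = "11274",
                     stringsAsFactors = FALSE)
     b <- data.frame(server = "nginx/1.8.0", transfer.encoding = "chunked",
                     stringsAsFactors = FALSE)

     # rbind(a, b) fails with "names do not match previous names"
     dplyr::bind_rows(a, b)  # keeps all columns and fills the missing ones with NA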


maelle avatar maelle commented on August 27, 2024

Oh thanks a lot @noamross, fixing this now. Actually the vignettes do seem to get tested on Travis: https://travis-ci.org/masalmon/monkeylearn/builds/148549225

@feconroses is it normal for headers to sometimes contain transfer.encoding and sometimes content.length?


feconroses avatar feconroses commented on August 27, 2024

@masalmon in all honesty, these headers aren't really relevant from a MonkeyLearn point of view. These headers (transfer.encoding and content.length) are generated automatically by nginx; they aren't relevant for the text analysis made by MonkeyLearn or its process.


maelle avatar maelle commented on August 27, 2024

@feconroses ok thanks a lot!

@noamross has my commit solved the bug on your PC?


noamross avatar noamross commented on August 27, 2024

Yes, that has solved it, all checks now passing!

I hope you don't mind one more tweak at the end of a long review. We're starting to use Gábor Csárdi's goodpractice package to check for antipatterns in code. Running goodpractice::gp() yielded two small suggestions. After these you should be good to go.

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data.
    Consider using vapply() instead.

    R/utils.R:97:44
    R/utils.R:105:25
    R/utils.R:112:28

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are error
    prone and result 1:0 if the expression on the right hand side is zero. Use seq_len() or seq_along()
    instead.

    R/classify.R:45:12
    R/extractor.R:70:12
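
For reference, the kind of replacement goodpractice is asking for (generic examples, not the package's actual lines):

     x <- list("a", "bb", "ccc")

     # instead of sapply(x, nchar):
     vapply(x, nchar, FUN.VALUE = integer(1))  # always returns an integer vector

     # instead of for (i in 1:length(x)), which gives 1:0 when x is empty:
     for (i in seq_along(x)) {
       message(i, ": ", x[[i]])
     }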


maelle avatar maelle commented on August 27, 2024

Oh nice, I didn't know about this package! I'll make the corresponding corrections soon. Thank you!


maelle avatar maelle commented on August 27, 2024

@noamross now done! :-)


maelle avatar maelle commented on August 27, 2024

Thanks for your patience and help, @noamross, now I really got rid of sapply.


maelle avatar maelle commented on August 27, 2024

Awesome! Yes I do. Thanks again @leeper and @noamross !


maelle avatar maelle commented on August 27, 2024

@sckott could you please create the monkeylearn project in the ropenscilabs AppVeyor account and add me as a user? Thank you!


feconroses avatar feconroses commented on August 27, 2024

Thanks @masalmon, you are awesome and thanks for making this happen! @leeper, @noamross and @sckott thanks for your time and help :)


sckott avatar sckott commented on August 27, 2024

@masalmon the AppVeyor project is set up


maelle avatar maelle commented on August 27, 2024

For info the package is on CRAN now. Thanks again for the review!

