
polite's Introduction

polite


The goal of polite is to promote responsible web etiquette.

“bow and scrape” (verb):

  1. To make a deep bow with the right leg drawn back (thus scraping the floor), left hand pressed across the abdomen, right arm held aside.

  2. (idiomatic, by extension) To behave in a servile, obsequious, or excessively polite manner. [1]
    Source: Wiktionary, The free dictionary

The package’s two main functions bow and scrape define and realize a web harvesting session. bow is used to introduce the client to the host and ask for permission to scrape (by inquiring against the host’s robots.txt file), while scrape is the main function for retrieving data from the remote server. Once the connection is established, there’s no need to bow again. Rather, in order to adjust a scraping URL the user can simply nod to the new path, which updates the session’s URL, making sure that the new location can be negotiated against robots.txt.
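
A minimal sketch of that workflow, reusing the cheese.com pages that appear in the examples below:

library(polite)

session <- bow("https://www.cheese.com/by_type")               # introduce yourself and consult robots.txt
page    <- scrape(session, query = list(t = "semi-soft"))      # rate-limited, cached, permission-checked request
session <- nod(session, "https://www.cheese.com/alphabetical") # adjust the URL; re-checked against robots.txt
page2   <- scrape(session)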

The three pillars of a polite session are seeking permission, taking slowly and never asking twice.

The package builds on awesome toolkits for defining and managing http sessions (httr and rvest), declaring the user agent string and investigating site policies (robotstxt), and utilizing rate-limiting and response caching (ratelimitr and memoise).

Installation

You can install polite from CRAN with:

install.packages("polite")

The development version of the package can be installed from GitHub with:

install.packages("remotes")
remotes::install_github("dmi3kno/polite")

Basic Example

This is a basic example which shows how to retrieve the list of semi-soft cheeses from www.cheese.com. Here, we establish a session and then scrape the page with the specified parameters. Behind the scenes, polite retrieves robots.txt, checks the URL and user agent string against it, caches the calls to robots.txt and to the web page, and enforces rate limiting.

library(polite)
library(rvest)

session <- bow("https://www.cheese.com/by_type", force = TRUE)
result <- scrape(session, query=list(t="semi-soft", per_page=100)) %>%
  html_node("#main-body") %>% 
  html_nodes("h3") %>% 
  html_text()
head(result)
#> [1] "3-Cheese Italian Blend"  "Abbaye de Citeaux"      
#> [3] "Abbaye du Mont des Cats" "Adelost"                
#> [5] "ADL Brick Cheese"        "Ailsa Craig"

Extended Example

You can build your own functions that incorporate bow, scrape (and, if required, nod). Here we will extend our inquiry into cheeses and will download all cheese names and URLs to their information pages. Let’s retrieve the number of pages per letter in the alphabetical list, keeping the number of results per page at 100 to minimize the number of web requests.

library(polite)
library(rvest)
library(purrr)
library(dplyr)

session <- bow("https://www.cheese.com/alphabetical")

# this is only to illustrate the example.
letters <- letters[1:3] # delete this line to scrape all letters

responses <- map(letters, ~scrape(session, query = list(per_page=100,i=.x)) )
results <- map(responses, ~html_nodes(.x, "#id_page li") %>% 
                           html_text(trim = TRUE) %>% 
                           as.numeric() %>%
                           tail(1) ) %>% 
           map(~pluck(.x, 1, .default=1))
pages_df <- tibble(letter = rep.int(letters, times=unlist(results)),
                   pages = unlist(map(results, ~seq.int(from=1, to=.x))))
pages_df
#> # A tibble: 6 × 2
#>   letter pages
#>   <chr>  <int>
#> 1 a          1
#> 2 b          1
#> 3 b          2
#> 4 c          1
#> 5 c          2
#> 6 c          3

Now that we know how many pages to retrieve for each letter, let’s iterate over the letter pages and retrieve cheese names and the underlying links to cheese details. We will need to write a helper function. Our session is still valid and we don’t need to nod again, because we will not be modifying the page URL, only its query parameters (note that no url argument is passed to scrape).

get_cheese_page <- function(letter, pages){
  lnks <- scrape(session, query=list(per_page=100, i=letter, page=pages)) %>% 
    html_nodes("h3 a")
  tibble(name=lnks %>% html_text(),
         link=lnks %>% html_attr("href"))
}

df <- pages_df %>% pmap_df(get_cheese_page)
df
#> # A tibble: 518 × 2
#>    name                    link                     
#>    <chr>                   <chr>                    
#>  1 Abbaye de Belloc        /abbaye-de-belloc/       
#>  2 Abbaye de Belval        /abbaye-de-belval/       
#>  3 Abbaye de Citeaux       /abbaye-de-citeaux/      
#>  4 Abbaye de Tamié         /tamie/                  
#>  5 Abbaye de Timadeuc      /abbaye-de-timadeuc/     
#>  6 Abbaye du Mont des Cats /abbaye-du-mont-des-cats/
#>  7 Abbot’s Gold            /abbots-gold/            
#>  8 Abertam                 /abertam/                
#>  9 Abondance               /abondance/              
#> 10 Acapella                /acapella/               
#> # … with 508 more rows

Another example

Bob Rudis is one of the vocal proponents of online etiquette in the R community. If you have never seen his robots.txt file, you should definitely check it out! Let’s look at his blog. We don’t know how many pages the gallery will return, so we keep going until there’s no more “Older posts” button. Note that I first bow to the host and then simply nod to the current scraping page inside the while loop.

    library(polite)
    library(rvest)
    
    hrbrmstr_posts <- data.frame()
    url <- "https://rud.is/b/"
    session <- bow(url)
    
    while(!is.na(url)){
      # make it verbose
      message("Scraping ", url)
      # nod and scrape
      current_page <- nod(session, url) %>% 
        scrape(verbose=TRUE)
      # extract post titles
      hrbrmstr_posts <- current_page %>% 
        html_nodes(".entry-title a") %>% 
        polite::html_attrs_dfr() %>% 
        rbind(hrbrmstr_posts)
      # see if there's "Older posts" button
      url <- current_page %>% 
        html_node(".nav-previous a") %>% 
        html_attr("href")
    } # end while loop
    
    tibble::as_tibble(hrbrmstr_posts)
    #> # A tibble: 578 x3

We organize the data into a tidy format and append it to the data frame on each iteration. At the end we discover that Bob has written over 570 blog articles, which I very much recommend checking out.

Polite for package developers

If you are developing a package which accesses the web, polite can be used either as a template, or as a backend for your polite web session.

Polite template

Just before its ascension to CRAN, the package acquired new functionality to help package developers get started on creating polite web tools for their users. Any modern package developer is probably familiar with the excellent usethis package by the RStudio team. usethis is a collection of scripts for automating the package development workflow. Many usethis functions that automate repetitive tasks start with the prefix use_, indicating that what follows will be adopted and “used” by the package the user develops. For details about the use_ family of functions, see the package documentation.

{polite} has one usethis-like function called polite::use_manners().

polite::use_manners()

When called within the analysis (or package) directory, it creates a new file called R/polite-scrape.R (creating the R directory if necessary) and populates it with template functions for creating a polite web-scraping session. The functions provided by polite::use_manners() are drop-in replacements for two of the most popular tools in the web-accessing R ecosystem: read_html() and download.file(). The only difference is that these functions have a polite_ prefix. In all other respects they should have the look and feel of the originals, i.e. in most cases you should be able to simply replace calls to read_html() with polite_read_html() and download.file() with polite_download_file(), and your code should work (provided you scrape from a URL, which is the first required argument in both functions).
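
For illustration, a minimal before/after sketch, assuming polite::use_manners() has been run and the generated R/polite-scrape.R has been sourced (the URLs are placeholders):

library(rvest)

# page <- read_html("https://example.com/catalogue")        # before
page <- polite_read_html("https://example.com/catalogue")   # after

# likewise, download.file("https://example.com/data.csv", "data.csv")
# would become polite_download_file("https://example.com/data.csv", "data.csv")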

Polite backend

A recent addition to the polite package is the purrr-like adverb politely(), which can make any web-accessing function “polite” by wrapping it with code that delivers on the four pillars of a polite session:

Introduce Yourself, Seek Permission, Take Slowly and Never Ask Twice.

Adverbs can be useful when a user (package developer) wants to “delegate” polite session handling to an external package without modifying the existing code. The only thing the user needs to do is wrap the existing verb with politely() and use the new function instead of the original.

Let’s say you wanted to use httr::GET for accessing a certain API, such as MusicBrainz, and extracting data from the deeply nested list returned by the server. Your original code might look like this:

library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
library(httr)
library(xml2)
library(purrr)

beatles_res <- GET("https://musicbrainz.org/ws/2/artist/", 
                   query=list(query="Beatles", limit=10),
                   httr::accept("application/json")) 
if(!is.null(beatles_res)) beatles_lst <- httr::content(beatles_res, type = "application/json")

str(beatles_lst, max.level = 2)
#> List of 4
#>  $ created: chr "2022-08-03T01:25:54.433Z"
#>  $ count  : int 169
#>  $ offset : int 0
#>  $ artists:List of 10
#>   ..$ :List of 12
#>   ..$ :List of 12
#>   ..$ :List of 10
#>   ..$ :List of 7
#>   ..$ :List of 8
#>   ..$ :List of 6
#>   ..$ :List of 9
#>   ..$ :List of 5
#>   ..$ :List of 6
#>   ..$ :List of 6

This code does not comply with polite principles. It does not provide a human-readable user agent string, and it does not consult robots.txt about permissions. It is possible to run this code in a loop and (accidentally) overwhelm the server with requests. It does not cache the results, so if this code is re-run, the data will be re-queried.

You could write your own infrastructure for handling the user agent, robots.txt, rate limiting and memoisation, or you could simply use the adverb politely(), which does all of these things for you.
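
Both examples below follow the same pattern: wrap the verb once, then call the wrapped function wherever the original was used.

polite_GET <- politely(httr::GET)   # wrap once...
# ...then call polite_GET() everywhere httr::GET() was called before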

Querying colormind.io with polite backend

Here’s an example using the colormind.io API. We will need a couple of service functions to convert colors between HEX and RGB and to prepare the JSON payload required by the service.

rgba2hex <- function(r,g,b,a) {grDevices::rgb(r, g, b, a, maxColorValue = 255)}

hex2rgba <- function(x, alpha=TRUE){t(grDevices::col2rgb(x, alpha = alpha))}

prepare_colormind_query <- function(x, model){
  lst <- list(model=model)

  if(!is.null(x)){
    x <- utils::head(c(x, rep(NA_character_, times=4)), 5) # pad it with NAs
    x_mat <- hex2rgba(x)
    x_lst <- lapply(seq_len(nrow(x_mat)), function(i) if(x_mat[i,4]==0) "N" else x_mat[i,1:3])
    lst <- c(list(input=x_lst), lst)
  }
  jsonlite::toJSON(lst, auto_unbox = TRUE)
}

Now all we have to do is “wrap” the existing function in the politely adverb and then call the new function instead of the original. You don’t need to change anything other than the function name.

polite_GET <- politely(httr::GET, verbose=TRUE) 

#res <- httr::GET("http://colormind.io/list") # was
res <- polite_GET("http://colormind.io/list") # now
#> Fetching robots.txt
#> rt_robotstxt_http_getter: normal http get
#> Warning in request_handler_handler(request = request, handler = on_not_found, :
#> Event: on_not_found
#> Warning in request_handler_handler(request = request, handler =
#> on_file_type_mismatch, : Event: on_file_type_mismatch
#> Warning in request_handler_handler(request = request, handler =
#> on_suspect_content, : Event: on_suspect_content
#> 
#> New copy robots.txt was fetched from http://colormind.io/robots.txt
#> Total of 0 crawl delay rule(s) defined for this host.
#> Your rate will be set to 1 request every 5 second(s).
#> Pausing...
#> Scraping: http://colormind.io/list
#> Setting useragent: polite R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu) bot
jsonlite::fromJSON(httr::content(res, as = "text"))$result
#> [1] "ui"                  "default"             "the_wind_rises"     
#> [4] "lego_movie"          "stellar_photography" "game_of_thrones"

The backend functionality of polite can be used with any function, as long as it has a url argument (or its first argument is a URL). Here’s an example of a polite POST created with the adverb politely.

polite_POST <- politely(POST, verbose=TRUE) 

clue_colors <-c(NA, "lightseagreen", NA, "coral", NA)

req <- prepare_colormind_query(clue_colors, "default")

#res <- httr::POST(url='http://colormind.io/api/', body = req) #was
res <- polite_POST(url='http://colormind.io/api/', body = req) #now
#> Fetching robots.txt
#> rt_robotstxt_http_getter: cached http get
#> Warning in request_handler_handler(request = request, handler = on_not_found, :
#> Event: on_not_found
#> Warning in request_handler_handler(request = request, handler =
#> on_file_type_mismatch, : Event: on_file_type_mismatch
#> Warning in request_handler_handler(request = request, handler =
#> on_suspect_content, : Event: on_suspect_content
#> 
#> Found the cached version of robots.txt for http://colormind.io/robots.txt
#> Total of 0 crawl delay rule(s) defined for this host.
#> Your rate will be set to 1 request every 5 second(s).
#> Pausing...
#> Scraping: http://colormind.io/api/
#> Setting useragent: polite R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu) bot
res_json <- httr::content(res, as = "text")
res_mcol <- jsonlite::fromJSON(res_json)$result
colrs <- rgba2hex(res_mcol)
scales::show_col(colrs, ncol = 5)

Querying musicbrainz API with polite backend

The MusicBrainz API allows querying data on artists, releases, labels and all things music. The API endpoint, unfortunately, is disallowed in robots.txt, but it is completely legal to access for small requests. Mass querying is easier using the data dumps that MusicBrainz publishes periodically. We can create a polite GET and turn off robots.txt validation.

library(polite)
polite_GET_nrt <- politely(GET, verbose=TRUE, robots = FALSE) # turn off robotstxt checking

beatles_lst <- polite_GET_nrt("https://musicbrainz.org/ws/2/artist/", 
                   query=list(query="Beatles", limit=10),
                   httr::accept("application/json")) %>% 
  httr::content(type = "application/json")
#> Pausing...
#> Scraping: https://musicbrainz.org/ws/2/artist/
#> Setting useragent: polite R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu) bot
str(beatles_lst, max.level = 2)
#> List of 4
#>  $ created: chr "2022-08-03T01:25:54.433Z"
#>  $ count  : int 169
#>  $ offset : int 0
#>  $ artists:List of 10
#>   ..$ :List of 12
#>   ..$ :List of 12
#>   ..$ :List of 10
#>   ..$ :List of 7
#>   ..$ :List of 8
#>   ..$ :List of 6
#>   ..$ :List of 9
#>   ..$ :List of 5
#>   ..$ :List of 6
#>   ..$ :List of 6

Let’s parse the response:

options(knitr.kable.NA = '')
beatles_lst %>%   
  extract2("artists") %>% 
  {tibble::tibble(id=map_chr(.,"id", .default=NA_character_),
                  match_pct=map_int(.,"score", .default=NA_character_),
                  type=map_chr(.,"type", .default=NA_character_),
                  name=map_chr(., "name", .default=NA_character_),
                  country=map_chr(., "country", .default=NA_character_),
                  lifespan_begin=map_chr(., c("life-span", "begin"),.default=NA_character_),
                  lifespan_end=map_chr(., c("life-span", "end"),.default=NA_character_)
                  )
    } %>% knitr::kable(col.names = c(id="Musicbrainz ID", match_pct="Match, %", 
                                     type="Type", name="Name of artist",
                                     country="Country", lifespan_begin="Career begun",
                                     lifespan_end="Career ended")) 
Musicbrainz ID                        Match, %  Type   Name of artist        Country  Career begun  Career ended
b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d       100  Group  The Beatles                    1957-03       1970-04-10
5e685f9e-83bb-423c-acfa-487e34f15ffd        78  Group  The Tape-beatles      US       1986-12
1019b551-eba7-4e7c-bc7d-eb427ef54df2        75  Group  Blues Beatles         BR
5a45e8c5-e8e5-4f05-9429-6dd00f0ab50b        75  Group  Instrumental Beatles
e897e5fc-2707-49c8-8605-be82b4664dc5        74  Group  Sex Beatles
74e70126-def2-4b76-a001-ed3b96080e24        74         Powdered Beatles
bc569a61-dd62-4758-86c6-e99dcb1fdda6        74         Tokyo Beatles         JP
3133aeb8-9982-4e11-a8ff-5477996a80bf        74         Beatles Chillout
5d25dbfb-7558-45dc-83dd-6d1176090974        74         Daft Beatles
bdf09e36-2b82-44ef-8402-35c1250d81e0        74         Zyklon Beatles

Learn more

Ethical webscraper manifesto

Package logo uses elements of a free image by pngtree.com

[1] Wiktionary (2018), The free dictionary, retrieved from https://en.wiktionary.org/wiki/bow_and_scrape

polite's People

Contributors

bisaloo, dmi3kno, gregrs-uk, jesse-ross, rtaph, sowla


polite's Issues

More control over handling non-200 responses when scraping

I feel like this is a rather vague feature request, but hopefully the example below will help to illustrate my point. I think polite is a great project and I'd like to see it used more widely.

With httr you can ask for the response code from a GET request to a URL, and then choose what action to take if, for example, the code is != 200. polite::scrape uses httr, I believe, but handles the response internally, choosing to return NULL from a 404, for example. I'm wondering if it could be made less opinionated.

Here's a scraping script I wrote the other day, using purrr::map_dfr to combine responses into a single tibble. But if one of a list of URLs returns a 404 then the NULL value breaks the whole thing. I can get round this by rewriting the script (ex 2 below), or by using purrr::possibly (ex 3 below) or maybe by just using map with a reduce(bind_rows) ... but it might be good if polite gave the user more freedom internally as to how it should handle missing or invalid URLs rather than necessarily returning NULL.

I hope that makes sense. Here's my examples:

library(dplyr)
library(polite)
library(purrr)
library(rvest)
library(stringr)

url_root <- "https://www.ongelukvandaag.nl/archief/"

# create three URLs to test
urls <- paste0(url_root, 10:12, "-01-2015") # second URL returns 404

session <- polite::bow(
  url = url_root,
  user_agent = "Francis Barton [email protected]",
  delay = 3
)

function 1

scrape_page <- function(url) {
  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html", verbose = TRUE)

  headings <- page_text %>%
    rvest::html_nodes("h2") %>%
    rvest::html_text()

  dates <- page_text %>%
    rvest::html_nodes(".text-muted") %>%
    rvest::html_text() %>%
    stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

  dplyr::tibble(headings = headings, dates = dates)
}

# run function 1: breaks due to NULL return
purrr::map_dfr(urls, scrape_page)
#> Attempt number 2.
#> Attempt number 3.This is the last attempt, if it fails will return NULL
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> Error in UseMethod("xml_find_all"): no applicable method for 'xml_find_all' applied to an object of class "NULL"

function 2 - includes failsafe for 404s/NULL returns

scrape_page_safe <- function(url) {
  failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)

  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html")

  if (is.null(page_text)) {
    failsafe_tbl
  } else {
    headings <- page_text %>%
      rvest::html_nodes("h2") %>%
      rvest::html_text()

    dates <- page_text %>%
      rvest::html_nodes(".text-muted") %>%
      rvest::html_text() %>%
      stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

    dplyr::tibble(headings = headings, dates = dates)
  }
}

# run function 2: succeeds
purrr::map_dfr(urls, scrape_page_safe)
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~

function 3 - uses purrr::possibly with function 1 to handle errors

failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)
purrr::map_dfr(urls,
  possibly(          # return a failsafe on error
    scrape_page,
    otherwise = failsafe_tbl
  )
)
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~

Created on 2020-09-30 by the reprex package (v0.3.0)

default rate limiting

I'm slowly starting to use scrape and was wondering whether the default rate should be a bit lower, e.g. 1 call every 5 seconds?

politely should use on.exit to restore HTTPUserAgent

It looks like politely calls options to set the user agent prior to running the wrapped function and then calls options again afterward to restore the previous value. However, if the wrapped function throws an error, the second call to options never happens and the user agent is not restored. I believe the correct way to handle this is with on.exit, e.g.:

old_ua <- getOption("HTTPUserAgent")
on.exit(options("HTTPUserAgent"= old_ua))
options("HTTPUserAgent"= user_agent)
res <- mem_fun(...)

bow returns an error because of a conflict with tidyverse

Hi,

I work on the same subject (but not on cheese.com :p).
I was working on an implementation of R.cache (memoise seems better in this case) before I saw your package :D

I have an error using bow, please have a look at this:

library(polite)
bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases")
#> $handle
#> Host: https://en.wikipedia.org/wiki/List_of_cognitive_biases <NA>
#> 
#> $config
#> <request>
#> Options:
#> * autoreferer: 1
#> Headers:
#> * user-agent: polite R package - https://github.com/dmi3kno/polite
#> 
#> $url
#> [1] "https://en.wikipedia.org/wiki/List_of_cognitive_biases"
#> 
#> $back
#> character(0)
#> 
#> $forward
#> character(0)
#> 
#> $response
#> NULL
#> 
#> $html
#> <environment: 0x0000000017360310>
#> 
#> $user_agent
#> [1] "polite R package - https://github.com/dmi3kno/polite"
#> 
#> $domain
#> [1] "en.wikipedia.org"
#> 
#> $robotstxt
#> $domain
#> [1] "en.wikipedia.org"
#> 
#> $text
#> [1] "# robots.txt for http://www.wikipedia.org/ and friends\n#\n# Please note: There are a lot of pages on this site, and there are\n# some misbehaved spiders out there that go _way_ too fast. If you're\n# irresponsible, your access to the site may be blocked.\n#\n\n# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN\n# and ignoring 429 ratelimit responses, claims to respect robots:\n# http://mj12bot.com/\n\n[... 689 lines omitted ...]"
#> 
#> $bots
#> [1] "MJ12bot"                     "Mediapartners-Google*"      
#> [3] "IsraBot"                     "Orthogaffe"                 
#> [5] "UbiCrawler"                  "DOC"                        
#> [7] ""                            "[...  29 items omitted ...]"
#> 
#> $comments
#>   line
#> 1    1
#> 2    2
#> 3    3
#> 4    4
#> 5    5
#> 6    6
#> 7     
#> 8     
#>                                                                 comment
#> 1                # robots.txt for http://www.wikipedia.org/ and friends
#> 2                                                                     #
#> 3   # Please note: There are a lot of pages on this site, and there are
#> 4 # some misbehaved spiders out there that go _way_ too fast. If you're
#> 5              # irresponsible, your access to the site may be blocked.
#> 6                                                                     #
#> 7                                                                      
#> 8                                          [...  177 items omitted ...]
#> 
#> $permissions
#>                          field             useragent value
#> 1                     Disallow               MJ12bot     /
#> 2                     Disallow Mediapartners-Google*     /
#> 3                     Disallow               IsraBot      
#> 4                     Disallow            Orthogaffe      
#> 5                     Disallow            UbiCrawler     /
#> 6                     Disallow                   DOC     /
#> 7                                                         
#> 8 [...  443 items omitted ...]                            
#> 
#> $crawl_delay
#> [1] field     useragent value    
#> <0 lignes> (ou 'row.names' de longueur nulle)
#> 
#> $host
#> [1] field     useragent value    
#> <0 lignes> (ou 'row.names' de longueur nulle)
#> 
#> $sitemap
#> [1] field     useragent value    
#> <0 lignes> (ou 'row.names' de longueur nulle)
#> 
#> $other
#> [1] field     useragent value    
#> <0 lignes> (ou 'row.names' de longueur nulle)
#> 
#> $robexclobj
#> <Robots Exclusion Protocol Object>
#> $check
#> function (paths = "/", bot = "*") 
#> {
#>     spiderbar::can_fetch(obj = self$robexclobj, path = paths, 
#>         user_agent = bot)
#> }
#> <environment: 0x0000000017af4858>
#> 
#> attr(,"class")
#> [1] "robotstxt"
#> 
#> attr(,"class")
#> [1] "polite"  "session"
library(tidyverse)
#> Warning: le package 'tidyverse' a été compilé avec la version R 3.4.4
#> -- Attaching packages ----------- tidyverse 1.2.1 --
#> v ggplot2 3.0.0     v purrr   0.2.5
#> v tibble  1.4.2     v dplyr   0.7.5
#> v tidyr   0.8.1     v stringr 1.3.1
#> v readr   1.1.1     v forcats 0.3.0
#> Warning: le package 'ggplot2' a été compilé avec la version R 3.4.4
#> Warning: le package 'tidyr' a été compilé avec la version R 3.4.4
#> Warning: le package 'readr' a été compilé avec la version R 3.4.4
#> Warning: le package 'purrr' a été compilé avec la version R 3.4.4
#> Warning: le package 'dplyr' a été compilé avec la version R 3.4.4
#> Warning: le package 'stringr' a été compilé avec la version R 3.4.4
#> Warning: le package 'forcats' a été compilé avec la version R 3.4.4
#> -- Conflicts -------------- tidyverse_conflicts() --
#> x purrr::%||%()   masks polite::%||%()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
# library(rvest)

bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases")
#> <session> https://en.wikipedia.org/wiki/List_of_cognitive_biases
#> Error in UseMethod("status_code"): pas de méthode pour 'status_code' applicable pour un objet de classe "NULL"
Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  French_France.1252          
#>  tz       Europe/Paris                
#>  date     2018-07-28
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source                         
#>  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.3)                 
#>  backports    1.1.2      2017-12-13 CRAN (R 3.4.3)                 
#>  base       * 3.4.3      2017-12-06 local                          
#>  bindr        0.1.1      2018-03-13 CRAN (R 3.4.3)                 
#>  bindrcpp     0.2.2      2018-03-29 CRAN (R 3.4.4)                 
#>  broom        0.4.4      2018-03-29 CRAN (R 3.4.4)                 
#>  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.3)                 
#>  cli          1.0.0      2017-11-05 CRAN (R 3.4.3)                 
#>  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.3)                 
#>  compiler     3.4.3      2017-12-06 local                          
#>  crayon       1.3.4      2017-09-16 CRAN (R 3.4.4)                 
#>  curl         3.2        2018-03-28 CRAN (R 3.4.4)                 
#>  datasets   * 3.4.3      2017-12-06 local                          
#>  devtools     1.13.5     2018-02-18 CRAN (R 3.4.3)                 
#>  digest       0.6.15     2018-01-28 CRAN (R 3.4.3)                 
#>  dplyr      * 0.7.5      2018-05-19 CRAN (R 3.4.4)                 
#>  evaluate     0.10.1     2017-06-24 CRAN (R 3.4.4)                 
#>  forcats    * 0.3.0      2018-02-19 CRAN (R 3.4.4)                 
#>  foreign      0.8-70     2018-04-23 CRAN (R 3.4.4)                 
#>  ggplot2    * 3.0.0      2018-07-03 CRAN (R 3.4.4)                 
#>  glue         1.2.0      2017-10-29 CRAN (R 3.4.4)                 
#>  graphics   * 3.4.3      2017-12-06 local                          
#>  grDevices  * 3.4.3      2017-12-06 local                          
#>  grid         3.4.3      2017-12-06 local                          
#>  gtable       0.2.0      2016-02-26 CRAN (R 3.4.3)                 
#>  haven        1.1.1      2018-01-18 CRAN (R 3.4.3)                 
#>  hms          0.4.2      2018-03-10 CRAN (R 3.4.3)                 
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.3)                 
#>  httr         1.3.1      2017-08-20 CRAN (R 3.4.3)                 
#>  jsonlite     1.5        2017-06-01 CRAN (R 3.4.3)                 
#>  knitr        1.20       2018-02-20 CRAN (R 3.4.4)                 
#>  lattice      0.20-35    2017-03-25 CRAN (R 3.4.3)                 
#>  lazyeval     0.2.1      2017-10-29 CRAN (R 3.4.3)                 
#>  lubridate    1.7.4      2018-04-11 CRAN (R 3.4.4)                 
#>  magrittr     1.5        2014-11-22 CRAN (R 3.4.4)                 
#>  memoise      1.1.0      2017-04-21 CRAN (R 3.4.3)                 
#>  methods    * 3.4.3      2017-12-06 local                          
#>  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.1)                 
#>  modelr       0.1.2      2018-05-11 CRAN (R 3.4.4)                 
#>  munsell      0.4.3      2016-02-13 CRAN (R 3.4.3)                 
#>  nlme         3.1-137    2018-04-07 CRAN (R 3.4.4)                 
#>  parallel     3.4.3      2017-12-06 local                          
#>  pillar       1.2.2      2018-04-26 CRAN (R 3.4.4)                 
#>  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.3)                 
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.4.3)                 
#>  polite     * 0.0.0.9001 2018-07-28 Github (dmi3kno/polite@34ad64f)
#>  psych        1.8.4      2018-05-06 CRAN (R 3.4.4)                 
#>  purrr      * 0.2.5      2018-05-29 CRAN (R 3.4.4)                 
#>  R6           2.2.2      2017-06-17 CRAN (R 3.4.3)                 
#>  ratelimitr   0.3.9      2017-06-02 CRAN (R 3.4.4)                 
#>  Rcpp         0.12.17    2018-05-18 CRAN (R 3.4.4)                 
#>  readr      * 1.1.1      2017-05-16 CRAN (R 3.4.4)                 
#>  readxl       1.1.0      2018-04-20 CRAN (R 3.4.4)                 
#>  reshape2     1.4.3      2017-12-11 CRAN (R 3.4.3)                 
#>  rlang        0.2.1      2018-05-30 CRAN (R 3.4.4)                 
#>  rmarkdown    1.9        2018-03-01 CRAN (R 3.4.4)                 
#>  robotstxt    0.6.2      2018-07-18 CRAN (R 3.4.4)                 
#>  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.3)                 
#>  rstudioapi   0.7        2017-09-07 CRAN (R 3.4.3)                 
#>  rvest        0.3.2      2016-06-17 CRAN (R 3.4.3)                 
#>  scales       0.5.0      2017-08-24 CRAN (R 3.4.3)                 
#>  spiderbar    0.2.1      2017-11-17 CRAN (R 3.4.4)                 
#>  stats      * 3.4.3      2017-12-06 local                          
#>  stringi      1.1.7      2018-03-12 CRAN (R 3.4.4)                 
#>  stringr    * 1.3.1      2018-05-10 CRAN (R 3.4.4)                 
#>  tibble     * 1.4.2      2018-01-22 CRAN (R 3.4.3)                 
#>  tidyr      * 0.8.1      2018-05-18 CRAN (R 3.4.4)                 
#>  tidyselect   0.2.4      2018-02-26 CRAN (R 3.4.3)                 
#>  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.4.4)                 
#>  tools        3.4.3      2017-12-06 local                          
#>  triebeard    0.3.0      2016-08-04 CRAN (R 3.4.4)                 
#>  urltools     1.7.0      2018-01-20 CRAN (R 3.4.4)                 
#>  utils      * 3.4.3      2017-12-06 local                          
#>  withr        2.1.2      2018-03-15 CRAN (R 3.4.4)                 
#>  xml2         1.2.0      2018-01-24 CRAN (R 3.4.3)                 
#>  yaml         2.1.19     2018-05-01 CRAN (R 3.4.4)

Issue w/ Encoding while using Fedora

Received a warning while trying to scrape text data from various websites.

Warning in rt_request_handler(request = request, on_redirect = on_redirect, :
input string '^()(\s)#' cannot be translated to UTF-8, is it valid in
'ANSI_X3.4-1968'?

This error happens for any subsequent events after the first:

Warning in rt_request_handler(request = request, on_redirect = on_redirect, :
restarting interrupted promise evaluation

jacobin_pull <- function(hyperlink) {
  session <- polite::bow(hyperlink)
  temp <- polite::scrape(session)
  text_data <-
    temp |>
    rvest::html_element(css = "#post-content") |>
    rvest::html_nodes("p") |>
    rvest::html_text2() |>
    dplyr::as_tibble() |>
    dplyr::rename(text = value)
  return(text_data)
}

jacobin_pull_try <- function(hyperlink) {
  tryCatch(
    expr = {
      message(paste("Trying", hyperlink))
      jacobin_pull(hyperlink)
    },
    error = function(cond) {
      message(paste("This URL has caused an error:", hyperlink))
      message(cond)
    },
    warning = function(cond) {
      message(paste("URL has a warning:", hyperlink))
      message(cond)
    },
    finally = {
      message(paste("Processed URL:", hyperlink))
    }
  )
}

jacobin_test_link <- "https://jacobin.com/2022/07/we-still-have-to-take-donald-trump-seriously"

jacobin_test_link_2 <- "https://jacobin.com/2022/07/ukraine-russia-war-debt-forgiveness-us-eu"

jacobin_test_3 <- "https://jacobin.com/2022/06/american-exceptionalism-off-the-rails"

jac_test <- jacobin_pull_try(jacobin_test_link)
jac_test_2 <- jacobin_pull_try(jacobin_test_link_2)
jac_test_3 <- jacobin_pull_try(jacobin_test_3)

Sys Environment:

R version 4.1.3 (2022-03-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 36 (Xfce)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.2

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

I tried to look into the source code to discover the issue but it's outside of my current understanding.

rvest::read_html() does not trigger the same error.

EDIT: Forgot to mention, I ran the same code on Windows and did not have the same issue.

first argument in nod

Maybe the first argument in nod() should accept either a polite session object or a URL. nod would create/modify the polite session, and then you could do something like:

# don't run
url_list %>% 
   walk(~nod(.x) %>% rip()) #or scrape()

bow() missing default error in robotstxt

When running polite::bow(), I receive the following error:

Error in null_to_defeault(request$headers$`content-type`) : 
  argument "d" is missing, with no default

Reprex:
ons_bow <- polite::bow('https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland')

Sessioninfo:

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
 [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_0.3.5  xml2_1.3.2   polite_0.1.1

loaded via a namespace (and not attached):
 [1] here_0.1         digest_0.6.25    rprojroot_1.3-2  assertthat_0.2.1 R6_2.4.1         backports_1.1.7  magrittr_1.5    
 [8] httr_1.4.1       rlang_0.4.6      fs_1.4.1         robotstxt_0.7.4  ratelimitr_0.4.1 tools_3.6.3      glue_1.4.1      
[15] compiler_3.6.3   memoise_1.1.0    usethis_1.6.1   

bow() timing out on several sites?

So all the examples on your webpage seem to work, but I have a list of sites to scrape, and it seems that the first 5 do fail, although the pages do seem to be scrapeable. Check the reprex:

x <- "www.csa.fr"
polite::bow(x,verbose = TRUE)
#> Error in curl::curl_fetch_memory(url, handle = handle): Failed to connect to  port 80: Timed out


xml2::read_html(curl::curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0")))
#> {html_document}
#> <html lang="fr">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body class="Body " style="background-image: url(/bundles/csasite/images/ ...


library(robotstxt)
paths_allowed(x)
#>  www.csa.fr
#> [1] TRUE

So the site DOES appear to be scrapeable, but the bow fails?

I've just installed the dev version of this package and all dependencies. My full session:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] robotstxt_0.7.13

loaded via a namespace (and not attached):
 [1] xfun_0.18          remotes_2.2.0      listenv_0.8.0     
 [4] vctrs_0.3.4        testthat_2.3.2     usethis_1.6.3     
 [7] htmltools_0.5.0    yaml_2.2.1         rlang_0.4.8       
[10] pkgbuild_1.1.0     pillar_1.4.6       glue_1.4.2        
[13] withr_2.3.0        sessioninfo_1.1.1  lifecycle_0.2.0   
[16] stringr_1.4.0      polite_0.1.1.9020  rvest_0.3.6       
[19] future_1.19.1      devtools_2.3.1     codetools_0.2-16  
[22] memoise_1.1.0      evaluate_0.14      knitr_1.29        
[25] callr_3.4.4        spiderbar_0.2.3    ps_1.3.4          
[28] curl_4.3           parallel_4.0.2     fansi_0.4.1       
[31] Rcpp_1.0.5         clipr_0.7.1        backports_1.1.10  
[34] desc_1.2.0         pkgload_1.1.0      fs_1.5.0          
[37] digest_0.6.25      stringi_1.5.3      processx_3.4.4    
[40] rprojroot_1.3-2    cli_2.0.2          tools_4.0.2       
[43] magrittr_1.5       tibble_3.0.3       crayon_1.3.4      
[46] whisker_0.4        future.apply_1.6.0 pkgconfig_2.0.3   
[49] ellipsis_0.3.1     xml2_1.3.2         prettyunits_1.1.1 
[52] reprex_0.3.0       assertthat_0.2.1   rmarkdown_2.3     
[55] httr_1.4.2         rstudioapi_0.11    ratelimitr_0.4.1  
[58] R6_2.4.1           globals_0.13.0     compiler_4.0.2    

Set number of retry attempts

Is there any way to set the number of retry attempts? I'd like to be able to search a bunch of URLs to see if any return 404 errors. If they do, I don't need to retry the attempt at all. It would be totally fine just to move on but it seems like 3 attempts are always made even when I am expecting a failure.

Pathways not scraping / updating with nod()

I am trying to grab search results from coded form submissions for work. I am attempting the process on widely usable sites before application to our local library search page.

My question is how to handle the search results. For example, let's say I need to reference the moon for some reason:
Generate query results

library(rvest)
library(polite)  
library(tidyverse)

#establish the search engine where the form is located
gBow <-   bow("https://www.google.com/")

#fill out form
gSearchForm <- scrape(gBow) %>%
  html_node("form") %>%
  html_form() %>%
  set_values(q = "Moon")

#get results of query
results <- submit_form(gBow, gSearchForm, submit = "btnG")

Format and display query results

#display in browser works fine
resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()

#scrape results throws an error "No scraping allowed here"
scrape(results)

#Nod does not allow "access" to results page, which was my first thought
gSearchNod <- nod(gBow, resultsPath)
#Resulting session URL is still www.google.com, not the updated URL.
scrape(gSearchNod)

#And yet I can still navigate the results page with the rvest commands just fine
results %>% 
  follow_link("Moon - Wikipedia") %>% 
  html_node(".infobox") %>% 
  html_table(fill=T) %>% 
  select("Stat"=X1, "Dist"=X2) %>% 
  filter(Stat %in% c("Perigee", "Apogee"))

So, now let's try to eliminate a step and query Wikipedia directly. Note the difference in behaviour at the display-URL step, and the otherwise overall similarity to the problem above.

#establish the search engine where the form is located
wikiBow <- bow("https://en.wikipedia.org/wiki/Main_Page")

#fill out form
wikiSearchForm <- scrape(wikiBow) %>%
  html_node("form") %>%
  html_form() %>%
  set_values(search = "Moon")

#get results of query
results <- submit_form(wikiBow, wikiSearchForm, submit = "fulltext")

Displaying it in the browser takes you to the internal search page, and does not forward you to the page named "wiki/Moon".

This isn't actually a problem, as long as I can parse this page just fine.

resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()

#scrape results throws an error "No scraping allowed here", same error
scrape(results)

#Nod does not allow "access" to results page, which was my first thought
wSearchNod <- nod(wikiBow, resultsPath)
scrape(wSearchNod)

#And yet I can still navigate the results page with the rvest commands just fine
results %>% 
  follow_link("Moon") %>% 
  html_node(".infobox") %>% 
  html_table(fill=T) %>% 
  select("Stat"=X1, "Dist"=X2) %>% 
  filter(Stat %in% c("Perigee", "Apogee"))

I am reproducing the error reliably across search engines, including the internal one I am designing this for. I assume the error is related to permissions of results pages, but I can't seem to find a workaround that isn't hard-coding URLs directly.

#the goal (functionally)

searchFunction <- function(searchTerm){
s <- bow("www.internalLibrary.com")

scrape(s) %>% 
html_node("form") %>% 
html_form() %>% 
set_values(q = searchTerm) %>%
submit_form(s, .) %>%
??????????????????????????????????????
nod() %>%
scrape() %>%
consider_life_choices_that_led_me_here() %>%
??????????????????????????????????????
html_node("#results") %>%
html_table() %>%
select("FY19" = cost, "Date" = approval) %>%
data.frame() %>%
return()
}

Any input would be appreciated.

Override robots.txt crawl-delay

Thanks for this package, it's much easier than following a best-practice guide.

Having said that, would you consider allowing best-practice to be overridden when the user-agent isn't dmi3kno?

I'm scraping GOV.UK, which has a robots.txt that only sets crawl-delay for a specific user agent, but polite applies the limit anyway. If you think this is a bug in the robotstxt then apologies, let me know and I'll report it there.

User-agent: *
Disallow: /*/print$
# Don't allow indexing of user needs pages
Disallow: /info/*
Sitemap: https://www.gov.uk/sitemap.xml
# https://ahrefs.com/robot/ crawls the site frequently
User-agent: AhrefsBot
Crawl-delay: 10
# https://www.deepcrawl.com/bot/ makes lots of requests. Ideally
# we'd slow it down rather than blocking it but it doesn't mention
# whether or not it supports crawl-delay.
User-agent: deepcrawl
Disallow: /
# Complaints of 429 'Too many requests' seem to be coming from SharePoint servers
# (https://social.msdn.microsoft.com/Forums/en-US/3ea268ed-58a6-4166-ab40-d3f4fc55fef4)
# The robot doesn't recognise its User-Agent string, see the MS support article:
# https://support.microsoft.com/en-us/help/3019711/the-sharepoint-server-crawler-ignores-directives-in-robots-txt
User-agent: MS Search 6.0 Robot
Disallow: /

Add re-try by default?

When status isn't 200, would it make sense to add re-try by default? E.g. up to 5 times with exponentially increasing waiting time?

no encoding supplied message

When I run bow() I get the following message:

No encoding supplied: defaulting to UTF-8.

It's not clear from the documentation if there is a way to suppress this message or to supply an encoding.

polite file download

either an adverb politely or a wrapper function for polite download of files (checking if the file can be downloaded, rate-limiting consecutive attempts, etc. )

discrepancy between rvest and bow + scrape

Hi there!

Great package! I'm just about to teach it in a class (hooray for responsible web scraping! :) ). I've run into this problem - I'm wondering if you happen to have a solution, or if you would prefer that I posted this on Stack Overflow?

Details below - let me know if you have any questions - and thanks again for this great package! 🎉

library(polite)
library(rvest)
#> Loading required package: xml2
site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print"
check_site <- bow(site, force = TRUE)
check_site
#> <polite session> http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print
#>      User-agent: polite R package - https://github.com/dmi3kno/polite
#>      robots.txt: 6 rules are defined for 1 bots
#>     Crawl delay: 15 sec
#>   The path is scrapable for this user-agent
scraped_site <- scrape(check_site)
#> Warning: Client error: (400) Bad Request http://
#> stats.espncricinfo.com/ci/engine/stats/index.html?
#> class=10%3Bpage%3D1%3Bteam%3D289%3Btemplate%3Dresults%3Btype%3Dbatting%3Bwrappertype%3Dprint
scraped_site
#> NULL

rvest_site <- read_html(site)
rvest_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...

Created on 2019-08-29 by the reprex package (v0.3.0)

Cannot install package

Hello, is this still available? I cannot install it; I'm getting "package not found" on CRAN, and when I try to install from source I get an error: ERROR: dependencies ‘here’, ‘ratelimitr’, ‘robotstxt’ are not available for package ‘polite’

Future ideas

  • Function for creating turn-key scraping bots within docker container: accepts url, how often it should be monitored, where to save results (DB, s3bucket).
  • “polite POST”?
  • can it be extended to enforce polite behavior on the server side, i.e. running alongside, say, plumber or shiny (other R-operated web services)?
  • could there be a Python project based on requests, ratelimit, and memoization with functools.lru_cache?
  • use_manners for API wrappers
  • memoize to disk
  • credentials management helpers, if any?

nod calls bow with wrong arguments

nod calls bow with the wrong arguments when the URL subdomain changes, causing it to throw an error.

Here's a short reproducible example:

url1 = 'https://essd.copernicus.org/articles/search.html'
url2 = 'https://seeker.copernicus.org/search.php?abstract=atmospheric+chemistry&startYear=2008&endYear=2020&paperVersion=final&journal=386&page=1'
bow(url1) %>% nod(url2)

polite isn't polite enough?

It seems my IP got banned for running the following code, even though I didn't see any contraindications in the bow()?

session <- bow("https://www.azlyrics.com/b/beatles.html", force = TRUE)
session

result <- scrape(session) 

mainPage <- result %>%
  html_nodes(".album , #listAlbum a")

df <- tibble(text = mainPage %>% html_text(),
       link = mainPage %>% html_attr("href")) %>% 
  ## albumnames don't have links, let's use this:
  mutate(album = ifelse(is.na(link),text, NA)) %>% 
  ## drag down from above:
  fill(album) %>% 
  ## and finally remove entries w/out link since we already have the album
  filter(!is.na(link)) %>% 
  ## repair the link
  mutate(link = gsub("\\.\\.", "https://www.azlyrics.com/", link))

lyricsGetter <- function(x){
  print(x)
  Sys.sleep(5)
  x %>% bow %>% scrape %>% html_nodes("br+ div") %>% 
  ## only need first row
  head(1) %>% html_text
}

sample_n(df, 200) %>% pull(link) %>% map_chr(lyricsGetter)

Did I mess something up? I'm even waiting 5 seconds as per the bow() output...

here's my sess:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rvest_0.3.2     xml2_1.2.0      polite_0.1.0    forcats_0.3.0   stringr_1.3.1  
 [6] dplyr_0.8.0.1   purrr_0.3.1     readr_1.3.1     tidyr_0.8.3     tibble_2.1.3   
[11] ggplot2_3.1.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        lubridate_1.7.4   here_0.1          lattice_0.20-38  
 [5] textshape_1.6.0   assertthat_0.2.0  rprojroot_1.3-2   digest_0.6.18    
 [9] utf8_1.1.4        R6_2.3.0          cellranger_1.1.0  plyr_1.8.4       
[13] backports_1.1.2   evaluate_0.10.1   httr_1.3.1        blogdown_0.8     
[17] pillar_1.3.1      rlang_0.4.0       curl_3.2          lazyeval_0.2.1   
[21] readxl_1.1.0      rstudioapi_0.10   data.table_1.11.8 Matrix_1.2-15    
[25] rmarkdown_1.11    tidytext_0.2.0    munsell_0.5.0     broom_0.5.2      
[29] compiler_3.5.2    janeaustenr_0.1.5 modelr_0.1.1      xfun_0.3         
[33] pkgconfig_2.0.2   htmltools_0.3.6   tidyselect_0.2.5  bookdown_0.7     
[37] fansi_0.4.0       crayon_1.3.4      withr_2.1.2       SnowballC_0.6.0  
[41] grid_3.5.2        nlme_3.1-137      jsonlite_1.6      gtable_0.2.0     
[45] magrittr_1.5      scales_1.0.0      tokenizers_0.2.1  cli_1.0.1        
[49] stringi_1.3.1     fs_1.2.6          robotstxt_0.6.2   syuzhet_1.0.4    
[53] ratelimitr_0.4.1  generics_0.0.2    tools_3.5.2       glue_1.3.0       
[57] hms_0.4.2         yaml_2.1.19       colorspace_1.3-2  memoise_1.1.0    
[61] knitr_1.20        haven_1.1.1       usethis_1.4.0    

error downloading multiple pages

I am trying to scrape some pages with the help of polite and map(), but I am getting the following error:

[[1]]
{xml_document}
Error in nchar(desc) : invalid multibyte string, element 2

And instead of scraping all pages in the given range, it only scrapes the first page over and over for the entire loop.

library(polite)
library(rvest)
library(purrr)

dawnsession <- bow("https://www.dawn.com")

dawnsession

dates <- seq(as.Date("2019-04-01"), as.Date("2019-04-30"), by="days")

fulllinks <- map(dates, ~scrape(dawnsession, params = paste0("archive/",.x)) )

links <- map(fulllinks, ~html_nodes(.x, ".mb-4") %>%
               html_nodes(".story__link") %>%
               html_attr("href"))

Release polite 0.1.0

Prepare for release:

  • Check that description is informative
  • Check licensing of included files
  • usethis::use_cran_comments()
  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • Polish pkgdown reference index
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('major')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news()
  • Update install instructions in README
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

scrape not returning entire page like read_html

I apologize if this is just my lack of understanding and not an actual issue with polite. I previously wrote a script using rvest to scrape information from https://www.kcrw.com/music/shows/morning-becomes-eclectic and I wanted to adapt it to use polite to learn how it works. When I convert my code to use polite it doesn't work as I would expect.

This works when I use rvest::read_html()

library(rvest)
library(tidyverse)
A <- read_html("https://www.kcrw.com/music/shows/morning-becomes-eclectic/@@all-episodes?start=1")
tibble(dates = xml_nodes(A, ".date") %>% html_text(),
       subtitle = xml_nodes(A, ".subtitle a") %>% html_text())

But if I attempt to do the same thing with polite, then dates and subtitle are suddenly not the same length.

library(polite)
session <- bow("https://www.kcrw.com/music/shows/morning-becomes-eclectic/")
B <- scrape(session, params = "@@all-episodes?start=1")
tibble(dates = xml_nodes(B, ".date") %>% html_text(),
       subtitle = xml_nodes(B, ".subtitle a") %>% html_text())

I would have expected these to both work, and to work identically. Any idea what's going on?

randomized scrape delay

I'm working on a package and other scraping functions (not using polite) use randomized delays like Sys.sleep(rgamma(1, shape = 5, scale = 1/10)) and I'm wondering if it's possible to replicate that with polite. Is it possible to use set_scrape_delay() to set a randomized delay?
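
One hypothetical workaround (not part of the polite API, just a sketch) would be to layer the random pause on top of the fixed delay that scrape() already enforces:

# hypothetical helper: adds random jitter before each polite scrape
random_scrape <- function(session, ...) {
  Sys.sleep(rgamma(1, shape = 5, scale = 1/10))  # roughly 0.5 s extra on average
  polite::scrape(session, ...)
}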

is_url function does not recognise urls with localhost

Hello everyone,

thanks for the great package! I stumbled into a little bug (or desired behavior that isn't documented): if you specify a URL with localhost, you will get an error.

Maybe the regex in the is_url function needs to be refitted? This comparison of url regexes might help https://mathiasbynens.be/demo/url-regex

Cheers!

library(httr)
library(polite)

polite_POST <- politely(httr::POST, verbose=TRUE) 
polite_POST(url = "http://localhost:3001/api/v1/document")

Error: I can't find an argument containing url. Aborting.

polite_0.1.2
httr_1.4.5
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

No scraping allowed here... where to output that?

I've tested the bow and scrape functions on an URL I knew didn't allow webscraping (I knew it because I had used robotstxt on it two days ago) and I was wondering whether it'd make sense for bow to already output a message or warning when scraping is not allowed, maybe with a verbose argument? I'm wondering about it because I expected bow to tell me "Go away!". 😹

notok <- polite::bow("https://www.biodiversitylibrary.org/pageimage")
#> No encoding supplied: defaulting to UTF-8.
polite::scrape(notok)
#> No scraping allowed here!
#> NULL

Created on 2018-07-28 by the reprex package (v0.2.0).

Release polite 0.1.2

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push

Error in polite_download_file

Error appears in polite_download_file:

#> Error in destfile && !overwrite : invalid 'x' type in 'x && y'

Solution
Line in polite_download_file:

# today
  if(file.exists(destfile && !overwrite)){
# should be
  if(file.exists(destfile) && !overwrite){

Allow to pass encoding argument to httr::content?

See below:

home_url <- "https://www.jiscmail.ac.uk/cgi-bin/webadmin"

session <- polite::bow(home_url)



params <- "?A1=ind0703&L=ALLSTAT&F=&S=&O=T&H=0&D=0&T=0"

polite::scrape(session, params)
#> Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : Input is not proper UTF-8, indicate encoding !
#> Bytes: 0xA3 0x33 0x31 0x20 [9]

page <- httr::GET("https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind0703&L=ALLSTAT&F=&S=&O=T&H=0&D=0&T=0")
httr::content(page)
#> Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : Input is not proper UTF-8, indicate encoding !
#> Bytes: 0xA3 0x33 0x31 0x20 [9]
httr::content(page, encoding = "latin1")
#> {xml_document}
#> <html>
#> [1] <head>\n<title>JISCMail - ALLSTAT Archives - March 2007</title>\n<me ...
#> [2] <body onload="menuInitPosition()" onresize="menuInitPosition()" onmo ...

Created on 2018-07-31 by the reprex package (v0.2.0).

How about CRAN ?

Hi,

this 📦 is really useful. Do you plan to publish to CRAN or is it out of scope?

Thanks !

Polite doesn't seem to work

When I try to run the example code:

session <- bow("https://www.cheese.com/by_type", force = TRUE) result <- scrape(session, query=list(t="semi-soft", per_page=100)) %>% html_node("#main-body") %>% html_nodes("h3") %>% html_text()

I get this error:

Error in encl$_hash(c(encl$_f_hash, args, lapply(encl$_additional, :
oggetto "rlang_hash" non trovato
