
rvest's Introduction


Overview

rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like beautiful soup and RoboBrowser.

If you’re scraping multiple pages, I highly recommend using rvest in concert with polite. The polite package ensures that you’re respecting the robots.txt and not hammering the site with too many requests.
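For instance, a minimal sketch of the polite workflow (the URL is just an example):

library(polite)

# bow() consults robots.txt and sets a polite crawl delay;
# scrape() is then the well-mannered equivalent of read_html()
session <- bow("https://rvest.tidyverse.org/")
page <- scrape(session)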

Installation

# The easiest way to get rvest is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just rvest:
install.packages("rvest")

Usage

library(rvest)

# Start by reading an HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a css selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film
films <- starwars %>% html_elements("section")
films
#> {xml_nodeset (7)}
#> [1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1999 ...
#> [2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: 20 ...
#> [3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased: 200 ...
#> [4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-25\n ...
#> [5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleased: ...
#> [6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1983 ...
#> [7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 2015- ...

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>
title <- films %>% 
  html_element("h2") %>% 
  html_text2()
title
#> [1] "The Phantom Menace"      "Attack of the Clones"   
#> [3] "Revenge of the Sith"     "A New Hope"             
#> [5] "The Empire Strikes Back" "Return of the Jedi"     
#> [7] "The Force Awakens"

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films %>% 
  html_element("h2") %>% 
  html_attr("data-id") %>% 
  readr::parse_integer()
episode
#> [1] 1 2 3 4 5 6 7
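
Not part of the original example, but once you have these parallel vectors it is natural to combine them into a single data frame:

# One row per film
films_df <- data.frame(episode = episode, title = title)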

If the page contains tabular data you can convert it directly to a data frame with html_table():

html <- read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")

html %>% 
  html_element(".tracklist") %>% 
  html_table()
#> # A tibble: 29 × 4
#>    No.   Title                       `Performer(s)`                       Length
#>    <chr> <chr>                       <chr>                                <chr> 
#>  1 1.    "\"Everything Is Awesome\"" "Tegan and Sara featuring The Lonel… 2:43  
#>  2 2.    "\"Prologue\""              ""                                   2:28  
#>  3 3.    "\"Emmett's Morning\""      ""                                   2:00  
#>  4 4.    "\"Emmett Falls in Love\""  ""                                   1:11  
#>  5 5.    "\"Escape\""                ""                                   3:26  
#>  6 6.    "\"Into the Old West\""     ""                                   1:00  
#>  7 7.    "\"Wyldstyle Explains\""    ""                                   1:21  
#>  8 8.    "\"Emmett's Mind\""         ""                                   2:17  
#>  9 9.    "\"The Transformation\""    ""                                   1:46  
#> 10 10.   "\"Saloons and Wagons\""    ""                                   3:38  
#> # ℹ 19 more rows

rvest's People

Contributors

batpigandme, bbrewington, craigcitro, cwickham, david-jankoski, dholstius, dmi3kno, dpprdan, earino, epiben, hadley, hmalmedal, jimhester, jjchern, jl5000, johncollins, joshualeond, jrnold, marcinkosinski, mattcowgill, michaelchirico, moodymudskipper, renkun-ken, rentrop, rjpat, sfirke, vtroost, wildoane, yutannihilation, zwael


rvest's Issues

why doesn't this work?

library(httr)   # for POST()
library(rvest)

senator.site <- POST(
  "http://bioguide.congress.gov/biosearch/biosearch1.asp",
  body = list(
    lastname = "",
    firstname = "",
    position = "Senator",
    state = "",
    party = "",
    congress = "111"
  ),
  multipart = FALSE
)

senators <- html(senator.site) %>% html_nodes("table") %>% html_table() %>% .[[2]]

This works. But:

url <- "http://bioguide.congress.gov/biosearch/biosearch.asp"
s  <- html_session(url)
f0 <- html_form(s)
f1 <- set_values(f0[[1]], position = "Senator", congress = 111)
test <- submit_form(s, f1)
Error in vapply(elements, encode, character(1)) :
values must be length 1,
but FUN(X[[18]]) result is length 0

Demo "zillow" fails if package tidyr is missing

This is probably documented somewhere, but just in case it's not: the zillow demo will fail if package tidyr is not available:

Error in loadNamespace(name) : there is no package called ‘tidyr’

Very cool package, by the way!

R: harvesting data with rvest fails - consecutive HTML forms

The advanced search on a railway timetable database fails. It is only available after a preceding form submission.

url     <- "http://mobile.bahn.de/bin/mobil/query.exe/dox?country=DEU&rt=1&use_realtime_filter=1&webview=&searchMode=NORMAL"
von     <- "HH"
nach    <- "F"
sitzung <- html_session(url)
p1.form <- html_form(sitzung)[[1]]
p2      <- submit_form(sitzung, p1.form, submit='advancedProductMode')
p2.form <- html_form(p2)[[1]]
form.mod<- set_values( p2.form
                      ,REQ0JourneyStopsS0G     = von
                      ,REQ0JourneyStopsZ0G     = nach
                      )
final   <- submit_form(p2, form.mod, submit='start')

Error in vapply(elements, encode, character(1)) :
values must be length 1,
but FUN(X[[18]]) result is length 0

Same result with:

submit_form(p2, form.mod, submit='start')

See also:

http://stackoverflow.com/posts/27251705/revisions

Resolving form URLs

The code for resolving URLs does not seem to be working, specifically the code in submit_form. There's the line

url <- XML::getRelativeURL(session$url, form$url)

but the signature for XML::getRelativeURL is

getRelativeURL(u, baseURL, sep = "/", addBase = TRUE, simplify = TRUE)

so the "base" URL should go second which means that the code should be

url <- XML::getRelativeURL(form$url, session$url)

Compare

XML::getRelativeURL("http://homepage.com", "form.php")
# [1] "http://homepage.com"
XML::getRelativeURL("form.php", "http://homepage.com")
# [1] "http://homepage.com/form.php"

Getting started with scraping CJSC stats

I have been trying to figure out a way to submit forms to

http://oag.ca.gov/crime/cjsc/stats/arrests

using RCurl and postForm but have been unsuccessful. This evening I stumbled upon rvest and since I am a big fan of plyr I thought I would give it a shot.

Unfortunately, since the package is relatively new there are not a ton of resources out there to work off of. If you take a look at the page source you can see that the second form is significantly more complicated than the examples I have been able to find online. Just curious if you might have any advice or thoughts on getting started.

Mass downloading in a loop: how to overcome bad links in a vector of possible links?

I am using your package in a loop of sorts, and in this large group of URLs there are one or more that return Bad Request. Is there something I can do to override or skip a missing URL:

My warning is UseMethod("html_nodes") :
no applicable method for 'html_nodes' applied to an object of class "NULL"
In addition: Warning message:
In request_GET(session, url) : client error: (400) Bad Request
Called from top level.

Here is what I am using it for: in the topics line below, with [51:100], there is one link among the 500 pages that does not work. How do I override that? I know the method works using [1:50]. As there is no documentation for the session, any help would be great. Thanks
library(rvest)
library(magrittr)
library(reshape2)

# Get the topics that are available for downloading

url_topics = "www.brainyquote.com/quotes/topics.html"
page_topics <- html_session(url_topics)

# Set the range at the end of the next line

topics <- page_topics %>% html_nodes(xpath='//div[@Class="bqLn"]/a') %>% html_text %>% .[51:100]

# Number the topics, since the pages are indexed topic1, topic2, topic3, etc.

topics <- lapply(topics, function(src) {
  paste0(src, seq(1:10))
})
topics <- as.vector(unlist(topics))

# Build a loop to download 25 quotes for each topic, looping through pages 1..10

base <- 'http://www.brainyquote.com/quotes/topics/topic_'
dfs2 <- lapply(topics, function(src) {
  top <- rep(paste0(src), times = 25)
  page <- html_session(paste0(base, src, '.html'))
  list(
    quote  = page %>% html_nodes(xpath = '//span[@Class="bqQuoteLink"]/a') %>% html_text() %>% .[1:25],
    author = page %>% html_nodes(xpath = '//div[@Class="bq-aut"]') %>% html_text() %>% .[1:25],
    topic  = top[1:25]
  )
})

# Combine each result into a master file

quotes <- do.call(rbind, dfs2)
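
Not from the original thread, but a common workaround is to wrap the body of the loop in tryCatch(), so a bad link yields NULL instead of aborting everything (a sketch reusing the names above):

dfs2 <- lapply(topics, function(src) {
  tryCatch({
    page <- html_session(paste0(base, src, '.html'))
    list(
      quote  = page %>% html_nodes(xpath = '//span[@Class="bqQuoteLink"]/a') %>% html_text() %>% .[1:25],
      author = page %>% html_nodes(xpath = '//div[@Class="bq-aut"]') %>% html_text() %>% .[1:25],
      topic  = rep(src, times = 25)
    )
  }, error = function(e) NULL)   # a failing URL becomes NULL instead of an error
})
dfs2 <- Filter(Negate(is.null), dfs2)  # drop the failed pages
quotes <- do.call(rbind, dfs2)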

Using a CSS selector in follow_link() does not work

Seems like so far you only implemented cases where i is either a numerical position index or a string that is matched against the href description/text, but not the case when it is a CSS selector or XPath statement:

s <- html_session("http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
  httr::user_agent("Mozilla/5.0"))

## `i` is text that should be matched:
s %>% follow_link("Prime")

## `i` is valid selector:
try(s %>% follow_link("#result_0 .a-spacing-base+ .a-spacing-mini .a-spacing-none"))

## `i` is invalid selector:
s %>% follow_link("abcd")

delete file R/scrape-package.r?

I am using rvest for a package now and was looking through the function list via RStudio and saw this:

scrape {rvest}  R Documentation
scrape.

Description

scrape.

It confused me. I looked at the code and tracked it back to R/scrape-package.r. Perhaps this file is deprecated and should be deleted now?

Wrong encoding detection

Hi.

rvest::html("http://winrus.com/cpage_r.htm")

returns broken symbols, but

XML::htmlParse("http://winrus.com/cpage_r.htm")

works without any additional actions.

Complex CSS selectors fail

Complex CSS selectors (as identified by SelectorGadget) often fail:

test_that("complex CSS", {
  expect_is(s <- html_session("http://testing-ground.scraping.pro/"), "session")
  expect_is(s %>% follow_link(".caseblock:nth-child(1) a", "css"), "session")
})

testing shiny apps

I have been trying out rvest. Very nice! Question: Could rvest (e.g., set_values) be used to test shiny apps? Could it click a button and change tabs? As an example, could it change tabs from data > manage to data > explore @ http://vnijs.rady.ucsd.edu:3838/quant/

Support form finding/submission by fields

One feature of Perl's Mechanize module I have always appreciated is the ability to find and submit forms using only the form's fields.

I think it would be pretty simple to implement a filter function to do this.

Thanks for making this, I have done scraping with R a few times and doing some things (class selections) with only XPath is a bit clunky.
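
A sketch of that idea on top of the current API (find_form_with_fields is a hypothetical helper, not part of rvest):

# Return the first form in a session that contains all of the given fields
find_form_with_fields <- function(session, wanted) {
  forms <- html_form(session)
  hits <- Filter(function(f) all(wanted %in% names(f$fields)), forms)
  if (length(hits) == 0) stop("no form with fields: ", paste(wanted, collapse = ", "))
  hits[[1]]
}

# e.g. login <- find_form_with_fields(s, c("username", "password"))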

user agent string

Hi Hadley,

I was just wondering which user agent string the html function passes to the scraped website. Is there a way to set the user agent string to look like a real browser?

Thanks,

Alban
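
For reference, html_session() accepts httr config objects, so a browser-like user agent can be supplied directly (the UA string below is only an example):

s <- html_session(
  "http://example.com",
  httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.1.25")
)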

Fatal Error in html_table()

The title is pretty self-explanatory.

Just downloaded the package tonight, so it's the current version of rvest, on R 3.0.3. All dependencies up-to-date, AFAIK.

The error emerges when running the "bonds" example in the RDocs, and in my attempted usage.

The error is Error in FUN(X[[1L]], ...) : unused argument (trim = TRUE).

adding loop to get data from serialized pages

I used the code in a previous issue:

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
html %>%
html_nodes("div.video-summary-data")

column <- function(x) nodes %>% html_node(xpath = x) %>% html_text()

df <- data.frame(
title = column("div[1]//a"),
author = column("div[3]//a"),
date = column("div[4]//small[1]"),
language = column("div[4]//small[2]"),
description = column("div[5]//p"),
stringsAsFactors = FALSE
)

and it worked great

But I need to get the data from a series of pages with URLs http://domain.com/post/X,
where X is a number from 1 to 1000.
How can I make a loop to get the data from these pages, with each page as a new row?
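
A sketch of one approach, reusing the column() pattern above (the URL base and selectors are placeholders for the real page structure):

parse_page <- function(i) {
  nodes <- paste0("http://domain.com/post/", i) %>%
    html() %>%
    html_nodes("div.video-summary-data")   # placeholder selector
  column <- function(x) nodes %>% html_node(xpath = x) %>% html_text()
  data.frame(title = column("div[1]//a"), stringsAsFactors = FALSE)
}

df <- do.call(rbind, lapply(1:1000, parse_page))  # one block of rows per page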

Complex XPath expressions fail

Complex XPath expressions (as generated by SelectorGadget) often fail:

test_that("Complex Xpath", {
  expect_is(s <- html_session("http://testing-ground.scraping.pro/"), "session")
  expect_is(s %>% follow_link(
    '//*[contains(concat( " ", @class, " " ), concat( " ", "caseblock", " " )) and (((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a',
      "xpath"), "session")
})

Error when no matching nodes

library("rvest")

myhtml <- html("http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE21610")
myhtml %>% 
  html_nodes("tr:nth-child(23) .eye-protector-processed a")

Getting this error a bunch now: Error in stri_locate_first_regex(string, pattern, opts_regex = attr(pattern, : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

Was testing this toy example but have gotten it a bunch lately when using html_nodes. It seems like this may be coming from a dependency, but I'm hoping someone knows of a fix. Here is the toy example, not working with the newest rvest:

library(rvest) # devtools::install_github("rvest","hadley")
library(stringr) # install.packages("stringr")

url <- "http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data/page+%d"

trimNewline <- function(p) str_replace(p, "\n","")
asInteger <- function(x) as.integer(str_replace(x,",",""))
fromPercent <- function(x) as.numeric(str_replace(x, "%", ""))/100

table <- 1:3 %>%
  lapply(function(page) {
    nodes <- sprintf(url, page) %>%
      html() %>%
      html_nodes("table tbody tr")

    column <- function(xpath) nodes %>% html_node(xpath = xpath) %>% html_text(trim = TRUE)

    data.frame(
      rank = column("td[1]/div[1]/span") %>%
        str_replace("#(\\d+)(Tie)?", "\\1") %>%
        as.integer,
      score = column("td[1]/span[1]/span") %>%
        str_replace("(\\d+) out of 100.", "\\1") %>%
        asInteger,
      name = column("td[2]/a"),
      location = column("td[2]/p/text()[1]"),
      tuitionAndFees = column("td[3]/text()[1]") %>% trimNewline,
      totalEnrollment = column("td[4]/text()[1]") %>% asInteger,
      fall2013AcceptanceRate = column("td[5]/text()[1]") %>% fromPercent,
      averageFreshmanRetentionRate = column("td[6]/text()[1]") %>% fromPercent,
      sixYearGraduationRate = column("td[7]/text()[1]") %>% fromPercent,
      stringsAsFactors = FALSE
    )
  }) %>%
  do.call(rbind, .)

Make example more robust

Version: 0.1.0
Check: examples, Result: ERROR
  Running examples in 'rvest-Ex.R' failed
  The error most likely occurred in:

  > ### Name: html_table
  > ### Title: Parse an html table into a data frame.
  > ### Aliases: html_table
  >
  > ### ** Examples
  >
  > bonds <- html("http://finance.yahoo.com/bonds/composite_bond_rates")
  Error in html.response(r, encoding = encoding) :
    server error: (502) Bad Gateway
  Calls: html ... html.character -> html -> html.response -> <Anonymous>
  Execution halted
See: <http://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-clang/rvest-00check.html>

Version: 0.1.0
Check: examples, Result: ERROR
  Running examples in 'rvest-Ex.R' failed
  The error most likely occurred in:

  > ### Name: html_table
  > ### Title: Parse an html table into a data frame.
  > ### Aliases: html_table
  >
  > ### ** Examples
  >
  > bonds <- html("http://finance.yahoo.com/bonds/composite_bond_rates")
  Error in html.response(r, encoding = encoding) :
    server error: (500) Internal Server Error
  Calls: html ... html.character -> html -> html.response -> <Anonymous>
  Execution halted
See: <http://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-gcc/rvest-00check.html>,
     <http://www.r-project.org/nosvn/R.check/r-release-windows-ix86+x86_64/rvest-00check.html>

Easier way to scrape regular html data to data frame

Consider a function, say html_df(), which creates a data frame directly by specifying each column with a CSS selector or an XPath query.

The function could be roughly written like this (not fully implemented yet):

html_df <- function(x, columns, ...) {
  coldata <- lapply(columns, function(col) {
    nodes <- html_node(x, "css | xpath") ### not implemented
    html_text(nodes)
  })
  data.frame(coldata,...)
}

An example for http://pyvideo.org/category/50/pycon-us-2014:

library(rvest)
html("http://pyvideo.org/category/50/pycon-us-2014") %>%
  html_node("div.video-summary-data") %>%
  html_df(list(
    title = xpath("div[1]//a//text()"),
    author = xpath("div[3]//a//text()"),
    date = xpath("div[4]//small[1]//text()"),
    language = xpath("div[4]//small[2]//text()"),
    description = xpath("div[5]//p//text()")),
    stringsAsFactors = FALSE)

Here the result should be a data.frame in which each column is specified by a selector, either CSS or XPath, so that a single step can create a data frame from web data that is regular enough.

An example of scraping house listing data from Zillow

I created an example of scraping houselisting data from Zillow. The R code is here.

There is one place where I ran into a bit of trouble (described in lines 28-37 of the code). Sometimes a particular CSS class is not present for all nodes, but I would still like to extract all nodes, with NA for the non-existing entries. In this instance, there are 25 house listings but only 24 have a lot area. To enable combining the different attributes into a single data frame, it would be nice to output 25 lot areas (with NA for the non-existing one). I have a workaround in the code above. I would be interested to know if there is a better way of accomplishing this.
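
In current rvest this is exactly what html_element() is for: called on a set of parent nodes, it returns one result per node, and nodes without a match surface as NA. A sketch with hypothetical selectors:

listings <- page %>% html_elements(".listing")   # hypothetical: one node per house
lot_area <- listings %>%
  html_element(".lot-area") %>%                  # hypothetical selector; one match per listing
  html_text2()                                   # missing matches come back as NA, keeping length 25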

submit_form error

Calling submit_form on a form whose submit button looks like <button type='submit'>login</button> causes the error: Error in names(submits)[[1]] : subscript out of bounds.

A normal submit button looks like <input type="submit" />.

The bad thing is that this submit button doesn't have a name attribute, so I can't specify a name in the submit_form function.

any idea?

submit_form produces <url> malformed error

I had the problem that submit_form always produced the following error when trying to enter a specific web page:

Submitting with 'login'
Error in function (type, msg, asError = TRUE)  : <url> malformed

A couple of days ago someone posted the same issue on SO and the answer given by MrFlick solved my issue:

Before submitting the form you have to explicitly set the url of the login form.

It seems that rvest has some problems when interpreting absolute URLs without the server name.

Reproducible example (The other one can be found on SO):

library(rvest)
library(magrittr)

my_url = "https://www.openair.com/index.pl"
openair <- html_session(my_url)

login <-  html_form(openair) %>%
  extract2(1) %>%
  set_values(
    account_nickname = "does_not_matter_here",
    user_nickname = "does_not_matter_here",
    password = "does_not_matter_here"
  )

openair %<>% submit_form(login)

The code above produces the described error. Taking a look at the beginning of login:

<form> 'login_page' (POST /index.pl)
<input hidden> '_form_has_changed': 0
...

However, adding login$url <- 'https://www.openair.com/index.pl' before submitting the form solves it.

In this case the start of login looks like this:

<form> 'login_page' (POST https://www.openair.com/index.pl)
<input hidden> '_form_has_changed': 0

Rules to automatically distinguish xpath query from css selector

It is nice that rvest supports both CSS selectors and XPath queries in html_node().

Sometimes I must use XPath to query more complicated web pages. Currently I need to set xpath = TRUE to turn on an XPath query.

I wonder whether it would be possible for the function to use some smart rules to easily distinguish an XPath query from a CSS selector, since the two look very different.

Are there such rules?
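
There is no bulletproof rule, but a rough heuristic is possible (a sketch, not behavior rvest implements): XPath queries almost always start with /, //, ( or ., while CSS selectors almost never do.

looks_like_xpath <- function(x) {
  x <- trimws(x)
  grepl("^(/|\\(|\\.\\.?(/|$))", x)   # leading /, //, ( or a . path step
}

looks_like_xpath("//div[@id = 'a']")  # TRUE
looks_like_xpath("div#a > span")      # FALSE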

Client error: (406) Not Acceptable

Hi,
I'm having problems parsing the following:

url <- 'http://www.cccc.cat/base-de-dades'
html(url)

raises a:

Error en html.response(r, encoding = encoding) : 
  client error: (406) Not Acceptable

I've checked the encoding with:

> guess_encoding(url)
    encoding language confidence
1 ISO-8859-1       es       0.65
2 ISO-8859-2       ro       0.56
3 ISO-8859-9       tr       0.37
4      UTF-8                0.10
5  Shift_JIS       ja       0.10
6    GB18030       zh       0.10
7     EUC-JP       ja       0.10
8     EUC-KR       ko       0.10
9       Big5       zh       0.10

but adding an encoding (any of them) explicitly does not work either:

html(url, encoding = 'ISO-8859-1')

However, using the XML package,

htmlParse(url)

does work. Any ideas? I'd rather use rvest, since I need html_session() to work with forms, and the same error persists.
Cheers,
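
A 406 usually means the server rejected the request's Accept or User-Agent headers rather than anything encoding-related, so one workaround (a sketch, untested against this site) is to fetch the page with httr using browser-like headers and parse the response:

library(httr)

resp <- GET(url, user_agent("Mozilla/5.0"), accept("text/html"))
page <- content(resp)   # for a text/html response this returns a parsed document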

install_github fails

The Travis CI icon lists the build as passing, but running install_github fails for me in the vignette code:

install_github("hadley/rvest")
Installing github repo rvest/master from hadley
Downloading master.zip from https://github.com/hadley/rvest/archive/master.zip
Installing package from /var/folders/19/tp3bd8zj3nn559nhlllgsyc457_rdd/T//RtmpVgtQSV/master.zip
arguments 'minimized' and 'invisible' are for Windows only
Installing rvest
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD build
'/private/var/folders/19/tp3bd8zj3nn559nhlllgsyc457_rdd/T/RtmpVgtQSV/devtools124fb58396f88/rvest-master' --no-manual --no-resave-data

  • checking for file '/private/var/folders/19/tp3bd8zj3nn559nhlllgsyc457_rdd/T/RtmpVgtQSV/devtools124fb58396f88/rvest-master/DESCRIPTION' ... OK
  • preparing 'rvest':
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
  • creating vignettes ... ERROR
    Error: processing vignette 'selectorgadget.Rmd' failed with diagnostics:
    unrecognized fields specified in html_dependency: attachment
    Execution halted
    Error: Command failed (1)

When I run update.packages() all packages are up to date. Here is my output from R.Version():

R.Version()
$platform
[1] "x86_64-apple-darwin10.8.0"

$arch
[1] "x86_64"

$os
[1] "darwin10.8.0"

$system
[1] "x86_64, darwin10.8.0"

$status
[1] ""

$major
[1] "3"

$minor
[1] "0.2"

$year
[1] "2013"

$month
[1] "09"

$day
[1] "25"

$svn rev
[1] "63987"

$language
[1] "R"

$version.string
[1] "R version 3.0.2 (2013-09-25)"

$nickname
[1] "Frisbee Sailing"

html_form / parse_options error

Hey Hadley,

I noticed that the html_form function in some cases returns the error:
Error in as.vector(x, "list") :
cannot coerce type 'externalptr' to vector of type 'list'

I started debugging a bit and I noticed that the error is returned during the evaluation of:

parsed <- lapply(options, parse_option)

inside the parse_options() function.

The error happens when the scraped web page contains a drop-down menu.
For instance, running html_form on the web page http://www.echoecho.com/htmlforms11.htm will return the error because of the html form:

<select name="dropdownmenu" size="1">
<option value="Butter">Butter</option>
<option value="Cheese">Cheese</option>
<option value="Milk">Milk</option>
</select>

I fixed the error by changing

parsed <- lapply(options, parse_option)

into

if (length(options) == 1) {
  parsed <- list(parse_option(options))
} else {
  parsed <- lapply(options, parse_option)
}

but this is not very elegant

Cheers
Lorenzo

demo/tripadvisor.R code line 33 question

I just started working through the demo examples. First of all, thank you for starting to create this package. It looks awesome since the syntax is easy to understand and use.

In demo/tripadvisor.R code line 33 is:

review <- reviews %>%
  html_node(".partial_entry") %>%
  html_text()

Sometimes the .partial_entry node has both the review and a response from the manager. So although the number of reviews in this case is 10, it extracts 19 entries. The following code works:

review <- reviews %>% html_node("//p[@class='partial_entry'][1]",xpath=TRUE)%>%html_text()

Is there a way to write the above line without using xpath.

Regards,
Shankar
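
One untested possibility is a CSS structural pseudo-class, which rvest's selector engine (selectr) translates to XPath; this assumes the manager's response is a later sibling <p>:

review <- reviews %>%
  html_node("p.partial_entry:first-of-type") %>%
  html_text()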

Recv failure: Connection reset by peer

trying to scrape http://www.nytimes.com/

html_session('http://www.nytimes.com/', verbose(), user_agent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.1.25 (KHTML, like Gecko) Version/8.0 Safari/600.1.25'))

but got the error

Error in function (type, msg, asError = TRUE)  : 
  Recv failure: Connection reset by peer

I checked this issue, http://stackoverflow.com/questions/10285700/curl-error-recv-failure-connection-reset-by-peer-php-curl, but it seems unrelated.

Significant memory leak on Windows

Hi Hadley,

I'm perfectly aware that this is caused by neither httr nor rvest, but by the XML package. Nonetheless I thought I'd try my luck raising your attention, as this bug has been around for years now and it also affects your packages: omegahat/XML#4.

Thanks a lot should you consider looking into this!

Error with html_node vs html_nodes

When using rvest to scrape data from a web page, sometimes it is useful to use a number of selectors to get at specific data. For example, given the page:

http://www.sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva

I want to extract the href attributes of the anchors in the table cells of the fightfinder_result table. After discussing it with Hadley on Twitter, he suggested:

html("http://sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva")  %>% html_nodes(".fightfinder_result tr") %>% html_node("a") %>% html_attr("href")

However, this failed with:

Error in UseMethod("xmlAttrs", node) : 
   no applicable method for 'xmlAttrs' applied to an object of class "NULL"

When I replaced html_node("a") with html_nodes("a"), it gave the proper results:

> html("http://sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva") %>%     html_nodes(".fightfinder_result tr") %>% html_nodes("a") %>% html_attr("href")
 [1] "/fighter/Anderson-Silva-1356"                   
 [2] "/fighter/Wanderson-Silva-13585"                 
 [3] "/fighter/Anderson-da-Silva-132861"              
 [4] "/fighter/Anderson-Pires-da-Silva-22925"         
 [5] "/fighter/Anderson-Silva-136541"                 
 [6] "/fighter/Janderson-Rodrigues-Silva-141103"      
 [7] "/fighter/Wanderson-Michel-de-Jesus-Silva-149637"
 [8] "/fighter/Anderson-Silva-167443"                 
 [9] "/fighter/Anderson-Silva-169715"                 
[10] "/fighter/Wanderson-Pantoja-da-Silva-172721"     

When I look at the str() of both of these, they are both lists with attr(*, "class") = chr "XMLNodeSet"; however, html_node() returns a NULL as the first element (of an 11-item list), whereas html_nodes() returns a list of 10 items, all valid.

With html_node:

 > str(html("http://sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva") %>%      html_nodes(".fightfinder_result tr") %>% html_node("a"))
 List of 11
 $ : NULL
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
  - attr(*, "class")= chr "XMLNodeSet"

vs html_nodes:

> str(html("http://sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva") %>% html_nodes(".fightfinder_result tr") %>% html_nodes("a"))
List of 10
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 - attr(*, "class")= chr "XMLNodeSet"

I'm sorry now you hear about MMA, btw.

Not able to figure out an error while using submit_form()

I was trying a variation of demo/united.R. I wanted to mimic the action of going to united.com and checking the status of flights by specifying origin and destination.

The code is:

library(rvest)
library(magrittr)

united <- html_session("http://www.united.com")

fhist=united%>%html_node("form")%>%extract2(1)%>%html_form()%>%
  set_values(
    'ctl00$ContentInfo$Checkinflightstatus$Origin$txtOrigin'='IND',
    'ctl00$ContentInfo$Checkinflightstatus$Destination$txtDestination'='ORD'
  )

fhist2=united%>%submit_form(fhist,'ctl00$ContentInfo$Checkinflightstatus$btnFlightStatus')

The error I am getting from the last line is

fhist2=united%>%submit_form(fhist,'ctl00$ContentInfo$Checkinflightstatus$btnFlightStatus')
Error in vapply(elements, encode, character(1)) : 
  values must be length 1,
 but FUN(X[[26]]) result is length 0

I would appreciate any thoughts on why this happens

bug in submit_form?

> url <- 'http://eagletreas.mohavecounty.us/treasurer/web/login.jsp'
> #  Form 1, the submit url is `../web/loginPOST.jsp`, session url is http://eagletreas.mohavecounty.us/treasurer/web/login.jsp
> s <- html_session(url)
> form <- html(s) %>% html_form() %>% .[[1]]

> XML::getRelativeURL(s$url, form$url)
[1] "http://eagletreas.mohavecounty.us/treasurer/web/login.jsp"
> XML::getRelativeURL(form$url, s$url)
[1] "http://eagletreas.mohavecounty.us/treasurer/web/loginPOST.jsp"

XML::getRelativeURL(form$url, s$url) is the right way to resolve the relative URL.

encoding issue

I am trying to scrape the web page http://www3.boj.or.jp/market/jp/stat/of141205.htm

require(rvest)
url='http://www3.boj.or.jp/market/jp/stat/of141205.htm'

# bad, return string like: I�t�@�[ (12��5�ú���à��)
html(url, encoding='utf-8') %>% html_nodes('title') %>% html_text()
html(url, encoding='SHIFT_JIS') %>% html_nodes('title') %>% html_text()

# good, return: オファー (12月5日<金>)
html(readLines(url, encoding='utf-8')) %>% html_nodes('title') %>% html_text()

What is the difference between `html` and `readLines` in dealing with encoding?

typo in error message

I just got this error message in rvest:

"Error: Table doesn't has different numbers of columns in different rows. Do you want fill = TRUE?"

Warning message: failed to parse headers

When running the demos I always get the following warning:

In addition: Warning message:
Failed to parse headers:
Vary:User-Agent

What could I do to prevent these from occurring? Setting user agent info via httr::user_agent?

Discrepancy of results between using XML and rvest

I am not sure what I am doing wrong, but when I scrape a page from realtor.com I am not getting the full list with rvest. I have listed my R code below.


#
# Scrape data on house listings from realtor.com
#

# set working directory
setwd("~/notesofdabbler/Rspace/dayoh_housing/")

# load libraries
library(rvest)
library(XML)


#
# Search URL with the following filters applied:
# 3+ bedrooms, 2+ baths, 1800+ sqft, 0-20 years old
#
srchurl <- "http://www.realtor.com/realestateandhomes-search/Centerville_OH/beds-3/baths-2/sqft-8/pfbm-10/show-hide-pending"

# using XML library
housedoc <- htmlTreeParse(srchurl, useInternalNodes = TRUE)
ns_id <- getNodeSet(housedoc, "//ul[@class='listing-summary']//li[@class='listing-location']//a[@href]")
id <- sapply(ns_id, function(x) xmlAttrs(x)["href"])
id

# using rvest library
housedoc <- html(srchurl)
houselist <- housedoc %>% html_node(".listing-summary")
id <- houselist %>% html_node(".listing-location a") %>% html_attr("href")
id

The actual run version of the code with output is here.

html_table() doesn't handle some things that readHTMLTable does

e.g.

library(rvest)
library(XML)
url <- "http://data.fis-ski.com/dynamic/results.html?sector=CC&raceid=22395"

tbls1 <- html_table(html(url))    # Error: Table doesn't have equal number of columns in every row
tbls2 <- readHTMLTable(doc = url) # Fills in with NAs

I'm not convinced that what readHTMLTable does is optimal, but it would be nice to be able to at least get some output from tables like this.

Maybe it would make more sense to return a list with each piece of the table that is "complete", rather than filling with NAs? Just a thought...
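
For what it's worth, html_table() does have an escape hatch for ragged tables (an error message quoted in another issue above even suggests it): fill = TRUE pads short rows with NA, much like readHTMLTable. A sketch:

tbls1 <- html_table(html(url), fill = TRUE)  # pad rows that are missing cells with NA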
