
rvest's Introduction


Overview

rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like beautiful soup and RoboBrowser.

If you’re scraping multiple pages, I highly recommend using rvest in concert with polite. The polite package ensures that you’re respecting the robots.txt and not hammering the site with too many requests.
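For instance, a minimal sketch of the polite workflow (the URL is just an example):

library(polite)

# bow() consults robots.txt and sets a polite crawl delay;
# scrape() is then the well-mannered equivalent of read_html()
session <- bow("https://rvest.tidyverse.org/")
page <- scrape(session)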

Installation

# The easiest way to get rvest is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just rvest:
install.packages("rvest")

Usage

library(rvest)

# Start by reading an HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a css selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film
films <- starwars %>% html_elements("section")
films
#> {xml_nodeset (7)}
#> [1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1999 ...
#> [2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: 20 ...
#> [3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased: 200 ...
#> [4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-25\n ...
#> [5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleased: ...
#> [6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1983 ...
#> [7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 2015- ...

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>
title <- films %>% 
  html_element("h2") %>% 
  html_text2()
title
#> [1] "The Phantom Menace"      "Attack of the Clones"   
#> [3] "Revenge of the Sith"     "A New Hope"             
#> [5] "The Empire Strikes Back" "Return of the Jedi"     
#> [7] "The Force Awakens"

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films %>% 
  html_element("h2") %>% 
  html_attr("data-id") %>% 
  readr::parse_integer()
episode
#> [1] 1 2 3 4 5 6 7
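
Not part of the original example, but once you have these parallel vectors it is natural to combine them into a single data frame:

# One row per film
films_df <- data.frame(episode = episode, title = title)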

If the page contains tabular data you can convert it directly to a data frame with html_table():

html <- read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")

html %>% 
  html_element(".tracklist") %>% 
  html_table()
#> # A tibble: 29 × 4
#>    No.   Title                       `Performer(s)`                       Length
#>    <chr> <chr>                       <chr>                                <chr> 
#>  1 1.    "\"Everything Is Awesome\"" "Tegan and Sara featuring The Lonel… 2:43  
#>  2 2.    "\"Prologue\""              ""                                   2:28  
#>  3 3.    "\"Emmett's Morning\""      ""                                   2:00  
#>  4 4.    "\"Emmett Falls in Love\""  ""                                   1:11  
#>  5 5.    "\"Escape\""                ""                                   3:26  
#>  6 6.    "\"Into the Old West\""     ""                                   1:00  
#>  7 7.    "\"Wyldstyle Explains\""    ""                                   1:21  
#>  8 8.    "\"Emmett's Mind\""         ""                                   2:17  
#>  9 9.    "\"The Transformation\""    ""                                   1:46  
#> 10 10.   "\"Saloons and Wagons\""    ""                                   3:38  
#> # ℹ 19 more rows

rvest's People

Contributors

batpigandme, bbrewington, craigcitro, cwickham, david-jankoski, dholstius, dmi3kno, dpprdan, earino, epiben, hadley, hmalmedal, jimhester, jjchern, jl5000, johncollins, joshualeond, jrnold, marcinkosinski, mattcowgill, michaelchirico, moodymudskipper, renkun-ken, rentrop, rjpat, sfirke, vtroost, wildoane, yutannihilation, zwael


rvest's Issues

why doesn't this work?

library(httr)   # for POST()
library(rvest)

senator.site <- POST(
  "http://bioguide.congress.gov/biosearch/biosearch1.asp",
  body = list(
    lastname = "",
    firstname = "",
    position = "Senator",
    state = "",
    party = "",
    congress = "111"
  ),
  multipart = FALSE
)

senators <- html(senator.site) %>% html_nodes("table") %>% html_table() %>% .[[2]]

This works. But:

url <- "http://bioguide.congress.gov/biosearch/biosearch.asp"
s  <- html_session(url)
f0 <- html_form(s)
f1 <- set_values(f0[[1]], position = "Senator", congress = 111)
test <- submit_form(s, f1)
Error in vapply(elements, encode, character(1)) :
values must be length 1,
but FUN(X[[18]]) result is length 0

Demo "zillow" fails if package tidyr is missing

This is probably documented somewhere, but just in case it's not: the zillow demo will fail if package tidyr is not available:

Error in loadNamespace(name) : there is no package called ‘tidyr’

Very cool package, by the way!

R: harvesting data with rvest fails - consecutive HTML forms

The advanced search on a railway timetable database fails. It is only available after a preceding form submission.

url     <- "http://mobile.bahn.de/bin/mobil/query.exe/dox?country=DEU&rt=1&use_realtime_filter=1&webview=&searchMode=NORMAL"
von     <- "HH"
nach    <- "F"
sitzung <- html_session(url)
p1.form <- html_form(sitzung)[[1]]
p2      <- submit_form(sitzung, p1.form, submit='advancedProductMode')
p2.form <- html_form(p2)[[1]]
form.mod<- set_values( p2.form
                      ,REQ0JourneyStopsS0G     = von
                      ,REQ0JourneyStopsZ0G     = nach
                      )
final   <- submit_form(p2, form.mod, submit='start')

Error in vapply(elements, encode, character(1)) :
values must be length 1,
but FUN(X[[18]]) result is length 0

Same result with:

submit_form(p2, form.mod, submit='start')

See also:

http://stackoverflow.com/posts/27251705/revisions

Resolving form URLs

The code for resolving URLs does not seem to be working, specifically the code in submit_form. There's the line

url <- XML::getRelativeURL(session$url, form$url)

but the signature for XML::getRelativeURL is

getRelativeURL(u, baseURL, sep = "/", addBase = TRUE, simplify = TRUE)

so the "base" URL should go second which means that the code should be

url <- XML::getRelativeURL(form$url, session$url)

Compare

XML::getRelativeURL("http://homepage.com", "form.php")
# [1] "http://homepage.com"
XML::getRelativeURL("form.php", "http://homepage.com")
# [1] "http://homepage.com/form.php"

Getting started with scraping CJSC stats

I have been trying to figure out a way to submit forms to

http://oag.ca.gov/crime/cjsc/stats/arrests

using RCurl and postForm but have been unsuccessful. This evening I stumbled upon rvest and since I am a big fan of plyr I thought I would give it a shot.

Unfortunately, since the package is relatively new there are not a ton of resources out there to work off of. If you take a look at the page source you can see that the second form is significantly more complicated than the examples I have been able to find online. Just curious if you might have any advice or thoughts on getting started.

Mass downloading in a loop: how to overcome bad links in a vector of possible links?

I am using your package in a loop of sorts, and in this large group of URLs there are one or more that return Bad Request. Is there something I can do to override or skip a missing URL:

My warning is UseMethod("html_nodes") :
no applicable method for 'html_nodes' applied to an object of class "NULL"
In addition: Warning message:
In request_GET(session, url) : client error: (400) Bad Request
Called from top level.

Here is what I am using it for: in the topics line below, with [51:100], there is one link among the 500 pages that does not work. How do I override that? I know the method works using [1:50]. As there is no documentation for the session, any help would be great. Thanks
library(rvest)
library(magrittr)
library(reshape2)

# Get the topics that are available for downloading

url_topics = "www.brainyquote.com/quotes/topics.html"
page_topics <- html_session(url_topics)

# Set the range at the end of the next line

topics <- page_topics %>% html_nodes(xpath='//div[@Class="bqLn"]/a') %>% html_text %>% .[51:100]

# Number the topics, since the pages are indexed topic1, topic2, topic3, etc.

topics <- lapply(topics, function(src) {
  paste0(src, seq(1:10))
})
topics <- as.vector(unlist(topics))

# Build a loop to download 25 quotes for each topic, looping through pages 1..10

base <- 'http://www.brainyquote.com/quotes/topics/topic_'
dfs2 <- lapply(topics, function(src) {
  top <- rep(paste0(src), times = 25)
  page <- html_session(paste0(base, src, '.html'))
  list(
    quote  = page %>% html_nodes(xpath = '//span[@Class="bqQuoteLink"]/a') %>% html_text() %>% .[1:25],
    author = page %>% html_nodes(xpath = '//div[@Class="bq-aut"]') %>% html_text() %>% .[1:25],
    topic  = top[1:25]
  )
})

# Combine each result into a master file

quotes <- do.call(rbind, dfs2)
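
Not from the original thread, but a common workaround is to wrap the body of the loop in tryCatch(), so a bad link yields NULL instead of aborting everything (a sketch reusing the names above):

dfs2 <- lapply(topics, function(src) {
  tryCatch({
    page <- html_session(paste0(base, src, '.html'))
    list(
      quote  = page %>% html_nodes(xpath = '//span[@Class="bqQuoteLink"]/a') %>% html_text() %>% .[1:25],
      author = page %>% html_nodes(xpath = '//div[@Class="bq-aut"]') %>% html_text() %>% .[1:25],
      topic  = rep(src, times = 25)
    )
  }, error = function(e) NULL)   # a failing URL becomes NULL instead of an error
})
dfs2 <- Filter(Negate(is.null), dfs2)  # drop the failed pages
quotes <- do.call(rbind, dfs2)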

Using a CSS selector in follow_link() does not work

Seems like so far you only implemented cases where i is either a numerical position index or a string that is matched against the href description/text, but not the case when it is a CSS selector or XPath statement:

s <- html_session("http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
  httr::user_agent("Mozilla/5.0"))

## `i` is text that should be matched:
s %>% follow_link("Prime")

## `i` is valid selector:
try(s %>% follow_link("#result_0 .a-spacing-base+ .a-spacing-mini .a-spacing-none"))

## `i` is invalid selector:
s %>% follow_link("abcd")

delete file R/scrape-package.r?

I am using rvest for a package now and was looking through the function list via RStudio and saw this:

scrape {rvest}  R Documentation
scrape.

Description

scrape.

It confused me. I looked at the code and tracked it back to R/scrape-package.r. Perhaps this file is deprecated and should be deleted now?

Wrong encoding detection

Hi.

rvest::html("http://winrus.com/cpage_r.htm")

returns broken symbols, but

XML::htmlParse("http://winrus.com/cpage_r.htm")

works without any additional actions.

Complex CSS selectors fail

Complex CSS selectors (as identified by SelectorGadget) often fail:

test_that("complex CSS", {
  expect_is(s <- html_session("http://testing-ground.scraping.pro/"), "session")
  expect_is(s %>% follow_link(".caseblock:nth-child(1) a", "css"), "session")
})

testing shiny apps

I have been trying out rvest. Very nice! Question: Could rvest (e.g., set_values) be used to test shiny apps? Could it click a button and change tabs? As an example, could it change tabs from data > manage to data > explore @ http://vnijs.rady.ucsd.edu:3838/quant/

Support form finding/submission by fields

One feature of Perl's Mechanize module I have always appreciated is the ability to find and submit forms using only the form's fields.

I think it would be pretty simple to implement a filter function to do this.

Thanks for making this, I have done scraping with R a few times and doing some things (class selections) with only XPath is a bit clunky.
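
A sketch of that idea on top of the current API (find_form_with_fields is a hypothetical helper, not part of rvest):

# Return the first form in a session that contains all of the given fields
find_form_with_fields <- function(session, wanted) {
  forms <- html_form(session)
  hits <- Filter(function(f) all(wanted %in% names(f$fields)), forms)
  if (length(hits) == 0) stop("no form with fields: ", paste(wanted, collapse = ", "))
  hits[[1]]
}

# e.g. login <- find_form_with_fields(s, c("username", "password"))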

user agent string

Hi Hadley,

I was just wondering which user agent string the html function passes to the scraped website. Is there a way to set the user agent string to look like a real browser?

Thanks,

Alban
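
For reference, html_session() accepts httr config objects, so a browser-like user agent can be supplied directly (the UA string below is only an example):

s <- html_session(
  "http://example.com",
  httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.1.25")
)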

Fatal Error in html_table()

The title is pretty self-explanatory.

Just downloaded the package tonight, so it's the current version of rvest, on R 3.0.3. All dependencies up-to-date, AFAIK.

The error emerges when running the "bonds" example in the RDocs, and in my attempted usage.

The error is Error in FUN(X[[1L]], ...) : unused argument (trim = TRUE).

adding loop to get data from serialized pages

I used the code in a previous issue:

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
html %>%
html_nodes("div.video-summary-data")

column <- function(x) nodes %>% html_node(xpath = x) %>% html_text()

df <- data.frame(
title = column("div[1]//a"),
author = column("div[3]//a"),
date = column("div[4]//small[1]"),
language = column("div[4]//small[2]"),
description = column("div[5]//p"),
stringsAsFactors = FALSE
)

and it worked great

But I need to get the data from a series of pages with URLs http://domain.com/post/X,
where X is a number from 1 to 1000.
How can I make a loop to get the data from these pages, with each page as a new row?
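
A sketch of one approach, reusing the column() pattern above (the URL base and selectors are placeholders for the real page structure):

parse_page <- function(i) {
  nodes <- paste0("http://domain.com/post/", i) %>%
    html() %>%
    html_nodes("div.video-summary-data")   # placeholder selector
  column <- function(x) nodes %>% html_node(xpath = x) %>% html_text()
  data.frame(title = column("div[1]//a"), stringsAsFactors = FALSE)
}

df <- do.call(rbind, lapply(1:1000, parse_page))  # one block of rows per page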

Complex XPath expressions fail

Complex XPath expressions (as generated by SelectorGadget) often fail:

test_that("Complex Xpath", {
  expect_is(s <- html_session("http://testing-ground.scraping.pro/"), "session")
  expect_is(s %>% follow_link(
    '//*[contains(concat( " ", @class, " " ), concat( " ", "caseblock", " " )) and (((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a',
      "xpath"), "session")
})

Error when no matching nodes

library("rvest")

myhtml <- html("http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE21610")
myhtml %>% 
  html_nodes("tr:nth-child(23) .eye-protector-processed a")

Getting this error a bunch now: Error in stri_locate_first_regex(string, pattern, opts_regex = attr(pattern, : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

Was testing this toy example but have gotten it a bunch lately when using html_nodes. It seems like this may be coming from a dependency, but I'm hoping someone knows of a fix. Here is the toy example, not working with the newest rvest:

library(rvest) # devtools::install_github("rvest","hadley")
library(stringr) # install.packages("stringr")

url <- "http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data/page+%d"

trimNewline <- function(p) str_replace(p, "\n","")
asInteger <- function(x) as.integer(str_replace(x,",",""))
fromPercent <- function(x) as.numeric(str_replace(x, "%", ""))/100

table <- 1:3 %>%
  lapply(function(page) {
    nodes <- sprintf(url, page) %>%
      html() %>%
      html_nodes("table tbody tr")

    column <- function(xpath) nodes %>% html_node(xpath = xpath) %>% html_text(trim = TRUE)

    data.frame(
      rank = column("td[1]/div[1]/span") %>%
        str_replace("#(\\d+)(Tie)?", "\\1") %>%
        as.integer,
      score = column("td[1]/span[1]/span") %>%
        str_replace("(\\d+) out of 100.", "\\1") %>%
        asInteger,
      name = column("td[2]/a"),
      location = column("td[2]/p/text()[1]"),
      tuitionAndFees = column("td[3]/text()[1]") %>% trimNewline,
      totalEnrollment = column("td[4]/text()[1]") %>% asInteger,
      fall2013AcceptanceRate = column("td[5]/text()[1]") %>% fromPercent,
      averageFreshmanRetentionRate = column("td[6]/text()[1]") %>% fromPercent,
      sixYearGraduationRate = column("td[7]/text()[1]") %>% fromPercent,
      stringsAsFactors = FALSE
    )
  }) %>%
  do.call(rbind, .)

Make example more robust

Version: 0.1.0
Check: examples, Result: ERROR
  Running examples in 'rvest-Ex.R' failed
  The error most likely occurred in:

  > ### Name: html_table
  > ### Title: Parse an html table into a data frame.
  > ### Aliases: html_table
  >
  > ### ** Examples
  >
  > bonds <- html("http://finance.yahoo.com/bonds/composite_bond_rates")
  Error in html.response(r, encoding = encoding) :
    server error: (502) Bad Gateway
  Calls: html ... html.character -> html -> html.response -> <Anonymous>
  Execution halted
See: <http://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-clang/rvest-00check.html>

Version: 0.1.0
Check: examples, Result: ERROR
  Running examples in 'rvest-Ex.R' failed
  The error most likely occurred in:

  > ### Name: html_table
  > ### Title: Parse an html table into a data frame.
  > ### Aliases: html_table
  >
  > ### ** Examples
  >
  > bonds <- html("http://finance.yahoo.com/bonds/composite_bond_rates")
  Error in html.response(r, encoding = encoding) :
    server error: (500) Internal Server Error
  Calls: html ... html.character -> html -> html.response -> <Anonymous>
  Execution halted
See: <http://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-gcc/rvest-00check.html>,
     <http://www.r-project.org/nosvn/R.check/r-release-windows-ix86+x86_64/rvest-00check.html>

Easier way to scrape regular html data to data frame

Consider a function, say html_df(), which creates a data frame directly by specifying each column with a CSS selector or an XPath query.

The function could be roughly written like this (not fully implemented yet):

html_df <- function(x, columns, ...) {
  coldata <- lapply(columns, function(col) {
    nodes <- html_node(x, "css | xpath") ### not implemented
    html_text(nodes)
  })
  data.frame(coldata,...)
}

An example for http://pyvideo.org/category/50/pycon-us-2014:

library(rvest)
html("http://pyvideo.org/category/50/pycon-us-2014") %>%
  html_node("div.video-summary-data") %>%
  html_df(list(
    title = xpath("div[1]//a//text()"),
    author = xpath("div[3]//a//text()"),
    date = xpath("div[4]//small[1]//text()"),
    language = xpath("div[4]//small[2]//text()"),
    description = xpath("div[5]//p//text()")),
    stringsAsFactors = FALSE)

Here the result should be a data.frame in which each column is specified by a selector, either CSS or XPath, so that a single step can create a data frame from web data that is regular enough.

An example of scraping house listing data from Zillow

I created an example of scraping houselisting data from Zillow. The R code is here.

There is one place where I ran into a bit of trouble (described in lines 28-37 of the code). Sometimes a particular CSS class is not present for all nodes, but I would still like to extract all nodes, with NA for the non-existing entries. In this instance, there are 25 house listings but only 24 have a lot area. To enable combining the different attributes into a single data frame, it would be nice to output 25 lot areas (with NA for the non-existing one). I have a workaround in the code above. I would be interested to know if there is a better way of accomplishing this.
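
In current rvest this is exactly what html_element() is for: called on a set of parent nodes, it returns one result per node, and nodes without a match surface as NA. A sketch with hypothetical selectors:

listings <- page %>% html_elements(".listing")   # hypothetical: one node per house
lot_area <- listings %>%
  html_element(".lot-area") %>%                  # hypothetical selector; one match per listing
  html_text2()                                   # missing matches come back as NA, keeping length 25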

submit_form error

Calling submit_form on a form whose submit button looks like <button type='submit'>login</button> causes the error: Error in names(submits)[[1]] : subscript out of bounds.

A normal submit button looks like <input type="submit" />.

The bad thing is that this submit button doesn't have a name attribute, so I can't specify a name in the submit_form function.

any idea?

submit_form produces <url> malformed error

I had the problem that submit_form always produced the following error when trying to enter a specific web page:

Submitting with 'login'
Error in function (type, msg, asError = TRUE)  : <url> malformed

A couple of days ago someone posted the same issue on SO and the answer given by MrFlick solved my issue:

Before submitting the form you have to explicitly set the url of the login form.

It seems that rvest has some problems when interpreting absolute URLs without the server name.

Reproducible example (The other one can be found on SO):

library(rvest)
library(magrittr)

my_url = "https://www.openair.com/index.pl"
openair <- html_session(my_url)

login <-  html_form(openair) %>%
  extract2(1) %>%
  set_values(
    account_nickname = "does_not_matter_here",
    user_nickname = "does_not_matter_here",
    password = "does_not_matter_here"
  )

openair %<>% submit_form(login)

The code above produces the described error. Taking a look at the beginning of login:

<form> 'login_page' (POST /index.pl)
<input hidden> '_form_has_changed': 0
...

However, adding login$url <- 'https://www.openair.com/index.pl' before submitting the form solves it.

In this case the start of login looks like this:

<form> 'login_page' (POST https://www.openair.com/index.pl)
<input hidden> '_form_has_changed': 0

Rules to automatically distinguish xpath query from css selector

It is nice that rvest supports both CSS selectors and XPath queries in html_node().

Sometimes I must use XPath to query more complicated web pages. Currently I need to set xpath = TRUE to turn on an XPath query.

I wonder whether it would be possible for the function to use some smart rules to easily distinguish an XPath query from a CSS selector, since the two look very different.

Are there such rules?
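
There is no bulletproof rule, but a rough heuristic is possible (a sketch, not behavior rvest implements): XPath queries almost always start with /, //, ( or ., while CSS selectors almost never do.

looks_like_xpath <- function(x) {
  x <- trimws(x)
  grepl("^(/|\\(|\\.\\.?(/|$))", x)   # leading /, //, ( or a . path step
}

looks_like_xpath("//div[@id = 'a']")  # TRUE
looks_like_xpath("div#a > span")      # FALSE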

Client error: (406) Not Acceptable

Hi,
I'm having problems parsing the following:

url <- 'http://www.cccc.cat/base-de-dades'
html(url)

raises a:

Error en html.response(r, encoding = encoding) : 
  client error: (406) Not Acceptable

I've checked the encoding with:

> guess_encoding(url)
    encoding language confidence
1 ISO-8859-1       es       0.65
2 ISO-8859-2       ro       0.56
3 ISO-8859-9       tr       0.37
4      UTF-8                0.10
5  Shift_JIS       ja       0.10
6    GB18030       zh       0.10
7     EUC-JP       ja       0.10
8     EUC-KR       ko       0.10
9       Big5       zh       0.10

but adding an encoding (any of them) explicitly does not work either:

html(url, encoding = 'ISO-8859-1')

However, using the XML package,

htmlParse(url)

does work. Any ideas? I'd rather use rvest, since I need html_session() to work with forms, and the same error persists.
Cheers,
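
A 406 usually means the server rejected the request's Accept or User-Agent headers rather than anything encoding-related, so one workaround (a sketch, untested against this site) is to fetch the page with httr using browser-like headers and parse the response:

library(httr)

resp <- GET(url, user_agent("Mozilla/5.0"), accept("text/html"))
page <- content(resp)   # for a text/html response this returns a parsed document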

install_github fails

The Travis CI icon lists the build as passing, but running install_github fails for me in the vignette code:

install_github("hadley/rvest")
Installing github repo rvest/master from hadley
Downloading master.zip from https://github.com/hadley/rvest/archive/master.zip
Installing package from /var/folders/19/tp3bd8zj3nn559nhlllgsyc457_rdd/T//RtmpVgtQSV/master.zip
arguments 'minimized' and 'invisible' are for Windows only
Installing rvest
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD build
'/private/var/folders/19/tp3bd8zj3nn559nhlllgsyc457_rdd/T/RtmpVgtQSV/devtools124fb58396f88/rvest-master' --no-manual --no-resave-data

  • checking for file '/private/var/folders/19/tp3bd8zj3nn559nhlllgsyc457_rdd/T/RtmpVgtQSV/devtools124fb58396f88/rvest-master/DESCRIPTION' ... OK
  • preparing 'rvest':
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
  • creating vignettes ... ERROR
    Error: processing vignette 'selectorgadget.Rmd' failed with diagnostics:
    unrecognized fields specified in html_dependency: attachment
    Execution halted
    Error: Command failed (1)

When I run update.packages() all packages are up to date. Here is my output from R.Version():

R.Version()
$platform
[1] "x86_64-apple-darwin10.8.0"

$arch
[1] "x86_64"

$os
[1] "darwin10.8.0"

$system
[1] "x86_64, darwin10.8.0"

$status
[1] ""

$major
[1] "3"

$minor
[1] "0.2"

$year
[1] "2013"

$month
[1] "09"

$day
[1] "25"

$svn rev
[1] "63987"

$language
[1] "R"

$version.string
[1] "R version 3.0.2 (2013-09-25)"

$nickname
[1] "Frisbee Sailing"

html_form / parse_options error

Hey Hadley,

I noticed that the html_form function in some cases returns the error:
Error in as.vector(x, "list") :
cannot coerce type 'externalptr' to vector of type 'list'

I started debugging a bit and I noticed that the error is returned during the evaluation of:

parsed <- lapply(options, parse_option)

inside the parse_options() function.

The error happens when the scraped web page contains a drop-down menu.
For instance, running html_form on the web page http://www.echoecho.com/htmlforms11.htm will return the error because of the html form:

<select name="dropdownmenu" size="1">
<option value="Butter">Butter</option>
<option value="Cheese">Cheese</option>
<option value="Milk">Milk</option>
</select>

I fixed the error by changing

parsed <- lapply(options, parse_option)

into

if (length(options) == 1) {
  parsed <- list(parse_option(options))
} else {
  parsed <- lapply(options, parse_option)
}

but this is not very elegant

Cheers
Lorenzo

demo/tripadvisor.R code line 33 question

I just started working through the demo examples. First of all, thank you for starting to create this package. It looks awesome since the syntax is easy to understand and use.

In demo/tripadvisor.R code line 33 is:

review <- reviews %>%
  html_node(".partial_entry") %>%
  html_text()

Sometimes the .partial_entry node has both the review and a response from the manager. So although the number of reviews in this case is 10, it extracts 19 entries. The following code works:

review <- reviews %>% html_node("//p[@class='partial_entry'][1]",xpath=TRUE)%>%html_text()

Is there a way to write the above line without using xpath.

Regards,
Shankar
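
One untested possibility is a CSS structural pseudo-class, which rvest's selector engine (selectr) translates to XPath; this assumes the manager's response is a later sibling <p>:

review <- reviews %>%
  html_node("p.partial_entry:first-of-type") %>%
  html_text()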

Recv failure: Connection reset by peer

trying to scrape http://www.nytimes.com/

html_session('http://www.nytimes.com/', verbose(), user_agent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.1.25 (KHTML, like Gecko) Version/8.0 Safari/600.1.25'))

but got the error

Error in function (type, msg, asError = TRUE)  : 
  Recv failure: Connection reset by peer

I checked this issue, http://stackoverflow.com/questions/10285700/curl-error-recv-failure-connection-reset-by-peer-php-curl, but it seems unrelated.

Significant memory leak on Windows

Hi Hadley,

I'm perfectly aware that this is caused by neither httr nor rvest, but by the XML package. Nonetheless I thought I'd try my luck raising your attention, as this bug has been around for years now and it also affects your packages: omegahat/XML#4.

Thanks a lot should you consider looking into this!

Error with html_node vs html_nodes

When using rvest to scrape data from a web page, sometimes it is useful to use a number of selectors to get at specific data. For example, given the page:

http://www.sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva

I want to extract the href attributes of the anchors in the table cells of the fightfinder_result table. After discussing it with Hadley on Twitter, he suggested:

html("http://sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva")  %>% html_nodes(".fightfinder_result tr") %>% html_node("a") %>% html_attr("href")

However, this failed with:

Error in UseMethod("xmlAttrs", node) : 
   no applicable method for 'xmlAttrs' applied to an object of class "NULL"

When I replaced html_node("a") with html_nodes("a"), it gave the proper results:

> html("http://sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva") %>%     html_nodes(".fightfinder_result tr") %>% html_nodes("a") %>% html_attr("href")
 [1] "/fighter/Anderson-Silva-1356"                   
 [2] "/fighter/Wanderson-Silva-13585"                 
 [3] "/fighter/Anderson-da-Silva-132861"              
 [4] "/fighter/Anderson-Pires-da-Silva-22925"         
 [5] "/fighter/Anderson-Silva-136541"                 
 [6] "/fighter/Janderson-Rodrigues-Silva-141103"      
 [7] "/fighter/Wanderson-Michel-de-Jesus-Silva-149637"
 [8] "/fighter/Anderson-Silva-167443"                 
 [9] "/fighter/Anderson-Silva-169715"                 
[10] "/fighter/Wanderson-Pantoja-da-Silva-172721"     

When I look at the str() of both of these, they are both lists with attr(*, "class") = chr "XMLNodeSet"; however, html_node() returns a NULL as the first element (of an 11-item list), whereas html_nodes() returns a list of 10 items, all valid.

With html_node:

 > str(html("http://sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva") %>%      html_nodes(".fightfinder_result tr") %>% html_node("a"))
 List of 11
 $ : NULL
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
  - attr(*, "class")= chr "XMLNodeSet"

vs html_nodes:

> str(html("http://sherdog.com/stats/fightfinder?SearchTxt=Anderson+Silva") %>% html_nodes(".fightfinder_result tr") %>% html_nodes("a"))
List of 10
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 - attr(*, "class")= chr "XMLNodeSet"

I'm sorry now you hear about MMA, btw.

Not able to figure out an error while using submit_form()

I was trying a variation of demo/united.R. I wanted to mimic the action of going to united.com and checking the status of flights by specifying origin and destination.

The code is:

library(rvest)
library(magrittr)

united <- html_session("http://www.united.com")

fhist=united%>%html_node("form")%>%extract2(1)%>%html_form()%>%
  set_values(
    'ctl00$ContentInfo$Checkinflightstatus$Origin$txtOrigin'='IND',
    'ctl00$ContentInfo$Checkinflightstatus$Destination$txtDestination'='ORD'
  )

fhist2=united%>%submit_form(fhist,'ctl00$ContentInfo$Checkinflightstatus$btnFlightStatus')

The error I am getting from the last line is

fhist2=united%>%submit_form(fhist,'ctl00$ContentInfo$Checkinflightstatus$btnFlightStatus')
Error in vapply(elements, encode, character(1)) : 
  values must be length 1,
 but FUN(X[[26]]) result is length 0

I would appreciate any thoughts on why this happens

bug in submit_form?

> url <- 'http://eagletreas.mohavecounty.us/treasurer/web/login.jsp'
> #  Form 1, the submit url is `../web/loginPOST.jsp`, session url is http://eagletreas.mohavecounty.us/treasurer/web/login.jsp
> s <- html_session(url)
> form <- html(s) %>% html_form() %>% .[[1]]

> XML::getRelativeURL(s$url, form$url)
[1] "http://eagletreas.mohavecounty.us/treasurer/web/login.jsp"
> XML::getRelativeURL(form$url, s$url)
[1] "http://eagletreas.mohavecounty.us/treasurer/web/loginPOST.jsp"

XML::getRelativeURL(form$url, s$url) is the right way to resolve the relative URL.

encoding issue

I am trying to scrape the web page http://www3.boj.or.jp/market/jp/stat/of141205.htm

require(rvest)
url='http://www3.boj.or.jp/market/jp/stat/of141205.htm'

# bad, return string like: I�t�@�[ (12��5�ú���à��)
html(url, encoding='utf-8') %>% html_nodes('title') %>% html_text()
html(url, encoding='SHIFT_JIS') %>% html_nodes('title') %>% html_text()

# good, return: オファー (12月5日<金>)
html(readLines(url, encoding='utf-8')) %>% html_nodes('title') %>% html_text()

What is the difference between `html` and `readLines` in dealing with encoding?

typo in error message

I just got this error message in rvest:

"Error: Table doesn't has different numbers of columns in different rows. Do you want fill = TRUE?"

Warning message: failed to parse headers

When running the demos I always get the following warning:

In addition: Warning message:
Failed to parse headers:
Vary:User-Agent

What could I do to prevent these from occurring? Setting user agent info via httr::user_agent?

Discrepancy of results between using XML and rvest

I am not sure what I am doing wrong, but when I scrape a page from realtor.com I am not getting the full list with rvest. I have listed my R code below.


#
# Scrape data on house listings from realtor.com
#

# set working directory
setwd("~/notesofdabbler/Rspace/dayoh_housing/")

# load libraries
library(rvest)
library(XML)


#
# Search URL with the following filters applied:
# 3+ bedrooms, 2+ baths, 1800+ sqft, 0-20 years old
#
srchurl <- "http://www.realtor.com/realestateandhomes-search/Centerville_OH/beds-3/baths-2/sqft-8/pfbm-10/show-hide-pending"

# using XML library
housedoc <- htmlTreeParse(srchurl, useInternalNodes = TRUE)
ns_id <- getNodeSet(housedoc, "//ul[@class='listing-summary']//li[@class='listing-location']//a[@href]")
id <- sapply(ns_id, function(x) xmlAttrs(x)["href"])
id

# using rvest library
housedoc <- html(srchurl)
houselist <- housedoc %>% html_node(".listing-summary")
id <- houselist %>% html_node(".listing-location a") %>% html_attr("href")
id

The actual run version of the code with output is here.

html_table() doesn't handle some things that readHTMLTable does

e.g.

library(rvest)
library(XML)
url <- "http://data.fis-ski.com/dynamic/results.html?sector=CC&raceid=22395"

tbls1 <- html_table(html(url))    # Error: Table doesn't have equal number of columns in every row
tbls2 <- readHTMLTable(doc = url) # Fills in with NAs

I'm not convinced that what readHTMLTable does is optimal, but it would be nice to be able to at least get some output from tables like this.

Maybe it would make more sense to return a list with each piece of the table that is "complete", rather than filling with NAs? Just a thought...
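
For what it's worth, html_table() does have an escape hatch for ragged tables (an error message quoted in another issue above even suggests it): fill = TRUE pads short rows with NA, much like readHTMLTable. A sketch:

tbls1 <- html_table(html(url), fill = TRUE)  # pad rows that are missing cells with NA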
