htmlunit's Introduction

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

htmlunit

Tools to Scrape Dynamic Web Content via the ‘HtmlUnit’ Java Library

Description

‘HtmlUnit’ (https://htmlunit.sourceforge.net/) is a “‘GUI’-Less browser for ‘Java’ programs”. It models ‘HTML’ documents and provides an ‘API’ that allows one to invoke pages, fill out forms, click links and more just like one does in a “normal” browser. The library has fairly good and constantly improving ‘JavaScript’ support and is able to work even with quite complex ‘AJAX’ libraries, simulating ‘Chrome’, ‘Firefox’ or ‘Internet Explorer’ depending on the configuration used. It is typically used for testing purposes or to retrieve information from web sites. Tools are provided to work with this library at a higher level than provided by the exposed ‘Java’ libraries in the ‘htmlunitjars’ package.

What’s Inside The Tin

The following functions are implemented:

DSL

  • web_client/webclient: Create a new HtmlUnit WebClient instance

  • wc_go: Visit a URL

  • wc_html_nodes: Select nodes from web client active page html content

  • wc_html_text: Extract text from webclient page html content

  • wc_html_attr: Extract attributes from webclient page html content

  • wc_html_name: Extract tag name from webclient page html content

  • wc_headers: Return response headers of the last web request for current page

  • wc_browser_info: Retrieve information about the browser used to create the ‘webclient’

  • wc_content_length: Return content length of the last web request for current page

  • wc_content_type: Return content type of web request for current page

  • wc_render: Retrieve current page contents

  • wc_css: Enable/Disable CSS support

  • wc_dnt: Enable/Disable Do-Not-Track

  • wc_geo: Enable/Disable Geolocation

  • wc_img_dl: Enable/Disable Image Downloading

  • wc_load_time: Return load time of the last web request for current page

  • wc_resize: Resize the virtual browser window

  • wc_status: Return status code of web request for current page

  • wc_timeout: Change default request timeout

  • wc_title: Return page title for current page

  • wc_url: Return the URL of the current page

  • wc_use_insecure_ssl: Enable/Disable Ignoring SSL Validation Issues

  • wc_wait: Block HtmlUnit final rendering until all background JavaScript tasks have finished executing

Just the Content (pls)

  • hu_read_html: Read HTML from a URL with Browser Emulation & in a JavaScript Context

Content++

  • wc_inspect: Perform a “Developer Tools”-like Network Inspection of a URL

Installation

install.packages("htmlunit", repos = c("https://cinc.rud.is", "https://cloud.r-project.org/"))
# or
remotes::install_git("https://git.rud.is/hrbrmstr/htmlunit.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/htmlunit")
# or
remotes::install_gitlab("hrbrmstr/htmlunit")
# or
remotes::install_bitbucket("hrbrmstr/htmlunit")
# or
remotes::install_github("hrbrmstr/htmlunit")

NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.

Usage

library(htmlunit)
library(tidyverse) # for some data ops; not req'd for pkg

# current version
packageVersion("htmlunit")
## [1] '0.5.0'

Something xml2::read_html() cannot do: read the table from https://hrbrmstr.github.io/htmlunitjars/index.html.

test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)

html_table(pg)
## list()

☹️

But, hu_read_html() can!

pg <- hu_read_html(test_url)

html_table(pg)
## [[1]]
##      X1   X2
## 1   One  Two
## 2 Three Four
## 3  Five  Six

All without needing a separate Selenium or Splash server instance.

Content++

We can also get a HAR-like content + metadata dump:

xdf <- wc_inspect("https://rstudio.com")

colnames(xdf)
## [1] "method"         "url"            "status_code"    "message"        "content"        "content_length"
## [7] "content_type"   "load_time"      "headers"

select(xdf, method, url, status_code, content_length, load_time)
## # A tibble: 36 x 5
##    method url                                                                       status_code content_length load_time
##    <chr>  <chr>                                                                           <int>          <dbl>     <dbl>
##  1 GET    https://rstudio.com/                                                              200          14621       495
##  2 GET    https://metadata-static-files.sfo2.cdn.digitaloceanspaces.com/pixel/lp.js         200           3576       221
##  3 GET    https://snap.licdn.com/li.lms-analytics/insight.min.js                            200           1576       162
##  4 GET    https://connect.facebook.net/en_US/fbevents.js                                    200          34269       138
##  5 GET    https://connect.facebook.net/signals/config/151855192184380?v=2.9.23&r=s…         200         134841        66
##  6 GET    https://munchkin.marketo.net/munchkin-beta.js                                     200            752       230
##  7 GET    https://munchkin.marketo.net/159/munchkin.js                                      200           4810        27
##  8 GET    https://x.clearbitjs.com/v1/pk_60c5aa2221e3c03eca10fb6876aa6df7/clearbit…         200          86568       483
##  9 GET    https://cdn.segment.com/analytics.js/v1/gO0uTGfCkO4DQpfkRim9mBsjdKrehtnu…         200          62860       243
## 10 GET    https://static.hotjar.com/c/hotjar-1446157.js?sv=6                                200           1708       212
## # … with 26 more rows

group_by(xdf, content_type) %>% 
  summarise(
    total_size = sum(content_length), 
    total_load_time = sum(load_time)/1000
  )
## # A tibble: 7 x 3
##   content_type             total_size total_load_time
##   <chr>                         <dbl>           <dbl>
## 1 application/javascript       431338           2.58 
## 2 application/json               4118           1.37 
## 3 application/x-javascript     176248           0.623
## 4 image/gif                        35           0.232
## 5 text/html                     16640           1.36 
## 6 text/javascript              254971           0.996
## 7 text/plain                       28           0.189
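
Because the result of wc_inspect() is an ordinary data frame, it can be sorted and filtered like any other. A minimal sketch, assuming the column names shown in the colnames() output above and that dplyr is attached; the variable name `heaviest` and the cutoff of five rows are illustrative choices only:

```r
library(htmlunit)
library(dplyr)

# capture every request made while loading the page
xdf <- wc_inspect("https://rstudio.com")

# keep the five largest responses by reported content length
heaviest <- xdf %>%
  arrange(desc(content_length)) %>%
  select(url, content_type, content_length, load_time) %>%
  head(5)

heaviest
```

The same pattern works for the slowest requests by sorting on load_time instead.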

DSL

wc <- web_client(emulate = "chrome")

wc %>% wc_browser_info()
## < Netscape / 5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 / en-US >

wc <- web_client()

wc %>% wc_go("https://usa.gov/")

# if you want to use purrr::map_ functions the result of wc_html_nodes() needs to be passed to as.list()

wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>% 
  head(10)
##  [1] "Skip to main content"               ""                                   "Español"                            
##  [4] "1-844-USA-GOV1"                     "All Topics and Services"            "About the U.S."                    
##  [7] "American Flag"                      "Branches of the U.S. Government"    "Budget of the U.S. Government"     
## [10] "Data and Statistics about the U.S."

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>% 
  head(10)
##  [1] "Skip to main content"               ""                                   "Español"                            
##  [4] "1-844-USA-GOV1"                     "All Topics and Services"            "About the U.S."                    
##  [7] "American Flag"                      "Branches of the U.S. Government"    "Budget of the U.S. Government"     
## [10] "Data and Statistics about the U.S."

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_attr, "href") %>% 
  head(10)
##  [1] "#content"                "/"                       "/espanol/"               "/phone"                 
##  [5] "/#tpcs"                  "#"                       "/flag"                   "/branches-of-government"
##  [9] "/budget"                 "/statistics"
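
As the comment above notes, the result of wc_html_nodes() has to be passed through as.list() before it can be used with the purrr::map_ functions. A minimal sketch of that conversion, assuming purrr is installed and that wc_html_text() returns a single string per node (which the sapply() output above suggests); the variable name `link_text` is illustrative:

```r
library(htmlunit)
library(purrr)

wc <- web_client()
wc %>% wc_go("https://usa.gov/")

# convert the node collection to a plain list so purrr can iterate over it
link_text <- wc %>%
  wc_html_nodes("a") %>%
  as.list() %>%
  map_chr(wc_html_text, trim = TRUE)

head(link_text, 10)
```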

Handy function to get rendered plain text for text mining:

wc %>% 
  wc_render("text") %>% 
  substr(1, 300) %>% 
  cat()
## Official Guide to Government Information and Services | USAGov
## Skip to main content
## An official website of the United States government Here's how you know
## 
## 
## Main Navigation
## Search
## Search
## Search
## 1-844-USA-GOV1
## All Topics and Services
## Benefits, Grants, Loans
## Government Agencies and Elected Officials
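
The page-level accessors from the DSL list can be called on the same web client once a page has been loaded. A minimal sketch, assuming that wc_title(), wc_status(), wc_url() and wc_content_type() each take the web client as their only required argument (their signatures are not shown in this document, so check the package help pages before relying on this):

```r
library(htmlunit)

wc <- web_client()
wc_go(wc, "https://usa.gov/")

# assumption: each accessor needs only the web client object
page_title  <- wc_title(wc)        # title of the current page
page_status <- wc_status(wc)       # HTTP status code of the request
page_url    <- wc_url(wc)          # URL of the current page
page_type   <- wc_content_type(wc) # content type of the response

page_title
page_status
```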

htmlunit Metrics

Lang    # Files   (%)   LoC   (%)   Blank lines   (%)   # Lines   (%)
R            14  0.70   341  0.72           188  0.70       377  0.82
Java          3  0.15    52  0.11            23  0.09         3  0.01
Rmd           1  0.05    41  0.09            52  0.19        75  0.16
Maven         1  0.05    30  0.06             0  0.00         1  0.00
make          1  0.05    10  0.02             4  0.01         4  0.01

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

htmlunit's People

Contributors

hrbrmstr, steadygiant

htmlunit's Issues

error message: Specified cast is not valid; An unhandled exception was generated during the execution of the current web request.

First, many thanks for (again another) very helpful package!

I am trying to use the htmlunit package to scrape results from https://juris.ohchr.org/Search/Documents

(the site doesn’t show any search results unless you select at least one search option, e.g., treaty).

While browsing the site/reading the results works (most of the time), I get an error message when trying to retrieve the results with the htmlunit package.

Is this error caused solely by the website, in the sense that it is not properly set up, or is there something in htmlunit that triggers it? If so, is there any way to circumvent the error with htmlunit?

If you think this is something better to put on SO let me know.

Thanks again!

library(htmlunit)
#> Loading required package: rJava
#> Loading required package: htmlunitjars
#> Loading required package: rvest
#> Loading required package: xml2
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.6.2

my_site2 <- "http://juris.ohchr.org/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
#my_site2 <- "https://juris.ohchr.org/search/results"

js_pg2 <- htmlunit::hu_read_html(my_site2)
js_pg2
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body bgcolor="white">\r\n    <span>\r\n      <h1>\r\n        Server Erro ...
html_nodes(js_pg2, "td")
#> {xml_nodeset (2)}
#> [1] <td>\r\n              <code>\r\n                \n\nAn unhandled exceptio ...
#> [2] <td>\r\n              <code>\r\n                <pre>\r\n                 ...

Created on 2020-03-05 by the reprex package (v0.3.0)

Goals

  • simple way to read dynamic html from a single page
  • DSL around the underlying power of HtmlUnit. While it's possible to just use htmlunitjars and create Java objects and call Java methods directly, that is a really foreign interface to most R folks and creating a DSL would make it easier for folks to click on forms, etc without having to worry about Java type conversion (which is 'a thing' for some methods in this library)

Error installing htmlunit

Hello,
I am encountering the error below when trying to install htmlunit. Am I missing something?
Many thanks for making R such a joy.

> library(remotes)
> remotes::install_github("hrbrmstr/htmlunit")
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo hrbrmstr/htmlunit@HEAD
√  checking for file 'C:\Users\*****\AppData\Local\Temp\RtmpcfsVIp\remotes194852017dd7\hrbrmstr-htmlunit-b9dd94a/DESCRIPTION' (594ms)
-  preparing 'htmlunit': (7.9s)
√  checking DESCRIPTION meta-information ... 
-  checking for LF line-endings in source and make files and shell scripts (478ms)
-  checking for empty or unneeded directories
-  building 'htmlunit_0.5.0.tar.gz'
   
Installing package into ‘C:/Users/*****/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
* installing *source* package 'htmlunit' ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Error: (converted from warning) package 'rvest' was built under R version 4.0.4
Execution halted
ERROR: lazy loading failed for package 'htmlunit'
* removing 'C:/Users/****/Documents/R/win-library/4.0/htmlunit'
Error: Failed to install 'htmlunit' from GitHub:
  (converted from warning) installation of package ‘C:/Users/*****/AppData/Local/Temp/RtmpcfsVIp/file1948524517f/htmlunit_0.5.0.tar.gz’ had non-zero exit status

Screenshot with htmlunit

I have only one question: is it possible to take screenshots of the web page with htmlunit?

I couldn't find any function in this repo that does that.

add proxy handling

Thanks for this!
Currently the proxy parameters of the web client are not used.
I guess they have to be added during the web client creation? here
