contentmine / thresher
Headless scraperJSON scraping for Node.js
License: MIT License
Currently if multiple files are downloaded with the same filename, subsequent downloads overwrite the first.
We need to do one or both of these:
We should add an option to force thresher to use a scraper whose URL doesn't match, and, when it's not forced, provide a useful error for a non-matching URL.
We need the ability to create nested elements.
An example use-case is getting details for authors in a journal article. Each author might have the following metadata: name, affiliation, and email address.
Ideally these things can be associated with one another, so that the extractor looks something like:
"authors": {
  "selectors": {
    "name": {
      "selector": "//some_selector",
      "attribute": "text"
    },
    "affiliation": {
      "selector": "//some_selector",
      "attribute": "text"
    },
    "email": {
      "selector": "//some_selector",
      "attribute": "text"
    }
  }
}
This allows a nice structured output like:
"authors": [
  {
    "name": "Some Person",
    "affiliation": "Miscellaneous Institute",
    "email": "[email protected]"
  },
  {
    "name": "Another Person",
    "affiliation": "Another Institute",
    "email": "[email protected]"
  }
]
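A minimal sketch of how such a nested definition might be evaluated. The `extractNested` helper and the injected `runSelector` function are illustrative assumptions, not thresher's current API; the sketch assumes each field's selector returns one result per author, in document order.

```javascript
// Combine per-field selector results into an array of structured objects.
// `runSelector` stands in for whatever actually evaluates an XPath
// against the document and returns an array of strings.
function extractNested(def, runSelector) {
  var fields = Object.keys(def.selectors);
  var results = {};
  fields.forEach(function (field) {
    results[field] = runSelector(def.selectors[field].selector);
  });
  // Assume every selector matched the same number of nodes, so that
  // index i across all fields corresponds to author i.
  var count = results[fields[0]] ? results[fields[0]].length : 0;
  var out = [];
  for (var i = 0; i < count; i++) {
    var item = {};
    fields.forEach(function (field) {
      item[field] = results[field][i];
    });
    out.push(item);
  }
  return out;
}
```

Zipping parallel result arrays by index is the simplest association strategy; a more robust version would scope each field's selector to a per-author context node.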
tbc.
The latest versions of the xpath library break namespace-free selectors for normal HTML documents (goto100/xpath#27), so we're stuck using v0.0.6. For now this is fine, but eventually we should seek a resolution, either with a fork of the main xpath lib, or by just maintaining it inside this project.
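In the meantime, pinning the exact dependency in package.json keeps installs on the known-good version (the version number is the one stated above; the rest of the fragment is illustrative):

```json
{
  "dependencies": {
    "xpath": "0.0.6"
  }
}
```

Note the absence of a `^` or `~` prefix, so npm will not silently upgrade to a broken release.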
At the moment a download failure crashes the app. In particular, we want to make sure we never try to download null URLs, and that we handle error events emitted by the downloader.
Using jsdom:
var jsdom = require("jsdom").jsdom;
var serializeDocument = require("jsdom").serializeDocument;

var doc = jsdom("<!DOCTYPE html>hello");

// serializeDocument includes the doctype; outerHTML does not.
serializeDocument(doc) === "<!DOCTYPE html><html><head></head><body>hello</body></html>";
doc.documentElement.outerHTML === "<html><head></head><body>hello</body></html>";
It's worth considering whether first sending a lightweight, non-rendered request is worthwhile. If all the extractable elements can be taken from the raw HTML (e.g. meta tags in the document head), we don't need to fire up PhantomJS, and the site gets hammered less by scraping traffic.
This is a discussion point rather than a feature request
Examples of network problems:
We should detect these situations and handle them appropriately (e.g. by emitting events that could be presented to user as log messages).
Related: ContentMine/quickscrape#44
This bug was initially provided by PMR as ContentMine/quickscrape#88
info: Saving logs to ./test2010-03-01/quickscrape1.2016-09-11-19-16.log
info: quickscrape 0.4.7 launched with...
info: - URLs from file: undefined
info: - Scraperdir: /home/pm286/journal-scrapers/scrapers
info: - Rate limit: 10 per minute
info: - Log level: info
info: urls to scrape: 13110
info: processing URL: http://dx.doi.org/10.1088/0965-0393/18/2/025015
error: Error: ETIMEDOUT so moving on to next url in list
info: processing URL: http://dx.doi.org/10.1088/0965-0393/18/2/025016
error: Error: ETIMEDOUT so moving on to next url in list
info: processing URL: http://dx.doi.org/10.4304/jsw.5.3.304-311
error: page did not return a 200 instead returned 500 so moving on to next url in list
info: processing URL: http://dx.doi.org/10.1209/0295-5075/89/69002
error: Error: ETIMEDOUT so moving on to next url in list
info: processing URL: http://dx.doi.org/10.5373/jaram.223.092109
/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:60
callback(err, response.request.href);
^
TypeError: Cannot read property 'request' of undefined
at Request._callback (/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:60:27)
at self.callback (/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/request/request.js:368:22)
at emitOne (events.js:96:13)
at Request.emit (events.js:188:7)
at Request.onRequestError (/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/request/request.js:1025:8)
at emitOne (events.js:96:13)
at ClientRequest.emit (events.js:188:7)
at Socket.socketErrorListener (_http_client.js:308:9)
at emitOne (events.js:96:13)
at Socket.emit (events.js:188:7)
finished
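The crash above comes from dereferencing `response` at lib/url.js:60 when the request library reports an error: on a network failure such as ETIMEDOUT, the callback receives `err` with `response` undefined, so `response.request.href` throws. A guard of roughly this shape at the callback site would fix it; the function here is a sketch factored out for clarity, not the exact url.js code:

```javascript
// Guarded version of the response handler passed to request().
// Network errors (e.g. ETIMEDOUT) arrive with no response object,
// so response.request.href must never be touched in that case.
function handleResponse(callback) {
  return function (err, response) {
    if (err || !response || !response.request) {
      return callback(err || new Error('no response received'));
    }
    callback(null, response.request.href);
  };
}
```

With this guard the error propagates through the callback and the URL loop can log it and continue, instead of the process crashing with `TypeError: Cannot read property 'request' of undefined`.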
When an invalid XPath expression is used in a scraper, the error currently raised is unhelpful:
Error: XPath parse error
We should report the exact XPath expression that failed.
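One way to do this is to wrap the evaluation and rethrow with the offending expression attached. Here `evaluate` is injected as a stand-in for the real xpath library call, so the wrapper itself is library-agnostic:

```javascript
// Wrap an XPath evaluation so parse errors report the failing expression.
function evaluateWithContext(evaluate, expression, doc) {
  try {
    return evaluate(expression, doc);
  } catch (e) {
    // Prefix the original message with the expression that caused it.
    e.message = 'XPath error in expression "' + expression + '": ' + e.message;
    throw e;
  }
}
```

Since scrapers can contain dozens of selectors, naming the failing expression turns a dead-end error into something the scraper author can act on directly.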
It should be possible to specify a new name for a file on download. This allows the standardisation of scraping results across sites.
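For example, a scraperJSON element definition could carry a target filename alongside its download instruction. The `rename` key below is a proposal, not an existing feature, and the selector is illustrative:

```json
"fulltext_pdf": {
  "selector": "//meta[@name='citation_pdf_url']",
  "attribute": "content",
  "download": {
    "rename": "fulltext.pdf"
  }
}
```

Every site's scraper could then save its PDF as `fulltext.pdf`, so downstream tools see a uniform layout regardless of the source's original filenames.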
What is the relationship between thresher and quickscrape? Replacement or synergistic components?
This should probably be a flag in scraperJSON. We should resolve URLs before doing anything else with them (see https://github.com/andris9/resolver), then use first the unresolved and then the resolved URL to search for a scraper.
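The matching step might look like this sketch, assuming each scraper declares a URL regex in a `url` field (the field name and scraper shape are assumptions):

```javascript
// Find a scraper whose URL pattern matches, preferring the unresolved
// (original) URL, then falling back to the resolved one.
function selectScraper(scrapers, unresolvedUrl, resolvedUrl) {
  function match(url) {
    if (!url) return null;
    for (var i = 0; i < scrapers.length; i++) {
      if (new RegExp(scrapers[i].url).test(url)) return scrapers[i];
    }
    return null;
  }
  return match(unresolvedUrl) || match(resolvedUrl);
}
```

Trying the unresolved URL first means a dx.doi.org link can hit a DOI-specific scraper before redirect resolution hands the URL to a publisher-specific one.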
Ideally the user can pass in a list of element names at the command line, and these will be deleted from the loaded scraper.
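A minimal sketch, assuming the scraper's extraction rules live under an `elements` object keyed by name (the structure and the `--exclude` flag mentioned in the comment are assumptions):

```javascript
// Remove named elements (e.g. from a hypothetical --exclude name1,name2
// command-line flag) from a loaded scraper definition.
function excludeElements(scraper, names) {
  names.forEach(function (name) {
    delete scraper.elements[name];
  });
  return scraper;
}
```

Deleting the entries before scraping starts means the excluded selectors are never evaluated and no files are downloaded for them.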