thresher's People

Contributors

blahah, darobin

thresher's Issues

Handle multiple file downloads with same filename

Currently, if multiple files are downloaded with the same filename, each subsequent download overwrites the previous one.

We need to do one or both of these:

  • automatically add an incrementing unique identifier to each filename if that file already exists (e.g. x.1, x.2, etc.)
  • allow scraperJSON to specify a numbering scheme
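A minimal sketch of the first option, assuming the suffix is simply appended to the full filename as in the x.1, x.2 example. The helper name `uniqueFilename` and the `exists` callback are hypothetical and not part of thresher; `exists` could be backed by `fs.existsSync` in practice.

```javascript
// Hypothetical helper: pick a filename that is not already taken.
// `exists` is any function that returns true when a name is already in use.
function uniqueFilename(name, exists) {
  if (!exists(name)) return name;
  var n = 1;
  // Try name.1, name.2, ... until a free name is found.
  while (exists(name + "." + n)) n += 1;
  return name + "." + n;
}

// Usage with an in-memory set standing in for the output directory.
var taken = new Set(["fulltext.pdf", "fulltext.pdf.1"]);
var next = uniqueFilename("fulltext.pdf", function (f) { return taken.has(f); });
console.log(next); // fulltext.pdf.2
```

Checking and then writing is not atomic, so concurrent downloads would still need some coordination beyond this sketch.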

ScraperJSON feature: ability to nest elements

We need the ability to create nested elements.

An example use-case is getting details for authors in a journal article. Each author might have the following metadata:

  • name
  • affiliation
  • email

Ideally these things can be associated with one another, so that the extractor looks something like:

"authors": {
  "selectors": {
    "name": {
      "selector": "//some_selector",
      "attribute": "text"
    },
    "affiliation": {
      "selector": "//some_selector",
      "attribute": "text"
    },
    "email": {
      "selector": "//some_selector",
      "attribute": "text"
    }
  }
}

This allows a nice structured output like:

"authors": [
  {
    "name": "Some Person",
    "affiliation": "Miscellaneous Institute",
    "email": "[email protected]"
  },
  {
    "name": "Another Person",
    "affiliation": "Another Institute",
    "email": "[email protected]"
  }
]

tbc.
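One possible way to turn the nested definition into the structured output, as a rough sketch only. It assumes that the i-th match of each child selector belongs to the same item, which may not hold for real pages (for example when one author has no email). `extractNested` and the `extract` callback are hypothetical names, not existing thresher functions.

```javascript
// Sketch only: `extract(selector, attribute)` is a hypothetical callback that
// returns an array of matched values in document order.
function extractNested(definition, extract) {
  var keys = Object.keys(definition.selectors);
  // One array of matches per child element (name, affiliation, ...).
  var columns = keys.map(function (key) {
    var child = definition.selectors[key];
    return extract(child.selector, child.attribute);
  });
  // Assumption: matches are aligned by position across child selectors.
  var lengths = columns.map(function (c) { return c.length; });
  var length = Math.max.apply(null, lengths.concat(0));
  var items = [];
  for (var i = 0; i < length; i++) {
    var item = {};
    keys.forEach(function (key, k) {
      // Missing values are filled with null rather than dropped.
      item[key] = i < columns[k].length ? columns[k][i] : null;
    });
    items.push(item);
  }
  return items;
}

// Usage with canned matches standing in for real XPath results.
var fakeMatches = {
  "//name": ["Some Person", "Another Person"],
  "//affiliation": ["Miscellaneous Institute", "Another Institute"]
};
var authors = extractNested({
  selectors: {
    name: { selector: "//name", attribute: "text" },
    affiliation: { selector: "//affiliation", attribute: "text" }
  }
}, function (selector) { return fakeMatches[selector] || []; });
console.log(JSON.stringify(authors, null, 2));
```

A more robust design would probably evaluate the child selectors relative to a parent node for each author, but that needs a parent selector in the scraperJSON definition.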

Fix XPath selector namespace handling for HTML DOMs

The latest versions of the xpath library break namespace-free selectors for normal HTML documents (goto100/xpath#27), so we're stuck using v0.0.6. For now this is fine, but eventually we should seek a resolution, either with a fork of the main xpath lib, or by just maintaining it inside this project.

Handle download errors nicely

At the moment, a download failure crashes the app. In particular, we want to make sure we never try to download null URLs, and that we handle the events emitted by the downloader.
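A rough sketch of what the guard could look like. The `download(url, callback)` function stands in for whatever downloader is actually used, and the event names (`downloadSkipped`, `downloadError`, `downloadSaved`) are made up for illustration; none of these are confirmed thresher APIs.

```javascript
var EventEmitter = require("events");

// Hypothetical wrapper: never passes null or empty URLs to the downloader,
// and reports failures as events instead of letting them crash the process.
function safeDownload(url, download, emitter) {
  if (typeof url !== "string" || url.trim() === "") {
    emitter.emit("downloadSkipped", url);
    return false;
  }
  try {
    download(url, function (err) {
      if (err) emitter.emit("downloadError", url, err);
      else emitter.emit("downloadSaved", url);
    });
  } catch (err) {
    // Synchronous failures are reported the same way.
    emitter.emit("downloadError", url, err);
  }
  return true;
}

// Usage with stub downloaders.
var events = new EventEmitter();
var log = [];
events.on("downloadSkipped", function (url) { log.push("skipped " + url); });
events.on("downloadError", function (url, err) {
  log.push("error " + url + " " + err.message);
});
safeDownload(null, function () {}, events);
safeDownload("http://example.org/a.pdf", function (url, cb) {
  cb(new Error("ETIMEDOUT"));
}, events);
console.log(log);
```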

Option to serialise the rendered DOM to a file

Using jsdom:

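// Note: jsdom() and serializeDocument() belong to the older jsdom API;
// newer jsdom releases use a different interface.
var jsdom = require("jsdom").jsdom;
var serializeDocument = require("jsdom").serializeDocument;

var doc = jsdom("<!DOCTYPE html>hello");

// Full document, including the doctype:
serializeDocument(doc) === "<!DOCTYPE html><html><head></head><body>hello</body></html>";
// Root element only, without the doctype:
doc.documentElement.outerHTML === "<html><head></head><body>hello</body></html>";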

Try HEAD-only scraping before moving to headless

It's worth considering whether first sending a HEAD request is worthwhile. If all the extractable elements are in the head, we don't need to fire up PhantomJS, and the site gets less hammered by scraping traffic.

This is a discussion point rather than a feature request.

Handle network problems gracefully

Examples of network problems:

  • No network connection available at all
  • Can't resolve a URL
  • Can't connect to a URL
  • Connection extremely slow

We should detect these situations and handle them appropriately (e.g. by emitting events that could be presented to user as log messages).

Related: ContentMine/quickscrape#44
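As a starting point, the situations listed above could be recognised from the `code` property that Node.js sets on system errors. This is only a sketch: the mapping is not exhaustive, the messages are placeholders, and `describeNetworkError` is a hypothetical name.

```javascript
// Node.js system error codes mapped onto the situations listed above.
var NETWORK_ERRORS = {
  ENETUNREACH: "no network connection available",
  EAI_AGAIN: "DNS lookup failed temporarily (possibly no network)",
  ENOTFOUND: "could not resolve the host name",
  ECONNREFUSED: "could not connect to the server",
  ECONNRESET: "connection was reset",
  ETIMEDOUT: "connection timed out or was too slow"
};

// Hypothetical helper: turn an error into something that can be logged.
function describeNetworkError(err) {
  var code = err && err.code;
  if (code && Object.prototype.hasOwnProperty.call(NETWORK_ERRORS, code)) {
    return { network: true, code: code, message: NETWORK_ERRORS[code] };
  }
  return {
    network: false,
    code: code || null,
    message: (err && err.message) || "unknown error"
  };
}

// Usage: the result could be emitted as an event or written to the log.
var timeout = new Error("connect ETIMEDOUT");
timeout.code = "ETIMEDOUT";
console.log(describeNetworkError(timeout).message);
```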

Thresher Crashes if the URL Resolution code doesn't get a response

This bug was initially provided by PMR as ContentMine/quickscrape#88

info: Saving logs to ./test2010-03-01/quickscrape1.2016-09-11-19-16.log
info: quickscrape 0.4.7 launched with...
info: - URLs from file: undefined
info: - Scraperdir: /home/pm286/journal-scrapers/scrapers
info: - Rate limit: 10 per minute
info: - Log level: info
info: urls to scrape: 13110
info: processing URL: http://dx.doi.org/10.1088/0965-0393/18/2/025015
error: Error: ETIMEDOUT so moving on to next url in list
info: processing URL: http://dx.doi.org/10.1088/0965-0393/18/2/025016
error: Error: ETIMEDOUT so moving on to next url in list
info: processing URL: http://dx.doi.org/10.4304/jsw.5.3.304-311
error: page did not return a 200 instead returned 500 so moving on to next url in list
info: processing URL: http://dx.doi.org/10.1209/0295-5075/89/69002
error: Error: ETIMEDOUT so moving on to next url in list
info: processing URL: http://dx.doi.org/10.5373/jaram.223.092109
/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:60
    callback(err, response.request.href);
                          ^

TypeError: Cannot read property 'request' of undefined
    at Request._callback (/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:60:27)
    at self.callback (/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/request/request.js:368:22)
    at emitOne (events.js:96:13)
    at Request.emit (events.js:188:7)
    at Request.onRequestError (/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/request/request.js:1025:8)
    at emitOne (events.js:96:13)
    at ClientRequest.emit (events.js:188:7)
    at Socket.socketErrorListener (_http_client.js:308:9)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
finished
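The trace shows `response` being undefined when the request fails, so `response.request.href` throws. A minimal sketch of a guard follows, assuming the callback at lib/url.js:60 receives `(err, response)` in the usual Node style. The function name `onResolved` and the fallback error message are hypothetical, and the surrounding code in url.js is not reproduced here.

```javascript
// Hypothetical replacement for the callback body at lib/url.js:60.
// On failure the error is passed on and `response` is never dereferenced.
function onResolved(err, response, callback) {
  if (err || !response || !response.request) {
    return callback(err || new Error("No response received"), null);
  }
  callback(null, response.request.href);
}

// Usage with stubbed request results.
var results = [];
function record(err, href) {
  results.push(err ? "error: " + err.message : "ok: " + href);
}
onResolved(new Error("ETIMEDOUT"), undefined, record);
onResolved(null, { request: { href: "http://example.org/article" } }, record);
console.log(results);
```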

Give more useful error on XPath failure

When an invalid XPath expression is used in a scraper, currently this unhelpful error is raised:

Error: XPath parse error

We should report the exact XPath expression that failed.
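One way to do this is to wrap the call into the XPath library and rethrow with the expression attached. In this sketch, `select(expression, doc)` stands in for the library's select function and is passed in as a parameter; `selectWithContext` is a hypothetical name.

```javascript
// Hypothetical wrapper: adds the failing expression to the error message.
function selectWithContext(select, expression, doc) {
  try {
    return select(expression, doc);
  } catch (err) {
    var wrapped = new Error(
      "XPath error in expression '" + expression + "': " + err.message
    );
    wrapped.cause = err; // keep the original error for debugging
    throw wrapped;
  }
}

// Usage with a stub that fails the way the library currently does.
function failingSelect() {
  throw new Error("XPath parse error");
}
var message = null;
try {
  selectWithContext(failingSelect, "//div[", null);
} catch (err) {
  message = err.message;
}
console.log(message);
```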

thresher and quickscrape

What is the relationship between thresher and quickscrape? Replacement or synergistic components?
