contentmine / thresher
Headless scraperJSON scraping for Node.js
License: MIT License
Currently if multiple files are downloaded with the same filename, subsequent downloads overwrite the first.
We need to do one or both of these:
We should add an option to force thresher to use a scraper whose URL doesn't match, and, when it's not forced, provide a useful error for a non-matching URL.
We need the ability to create nested elements.
An example use-case is getting details for authors in a journal article. Each author might have the following metadata: name, affiliation, and email address.
Ideally these things can be associated with one another, so that the extractor looks something like:
"authors": {
  "selectors": {
    "name": {
      "selector": "//some_selector",
      "attribute": "text"
    },
    "affiliation": {
      "selector": "//some_selector",
      "attribute": "text"
    },
    "email": {
      "selector": "//some_selector",
      "attribute": "text"
    }
  }
}
This allows a nice structured output like:
"authors": [
  {
    "name": "Some Person",
    "affiliation": "Miscellaneous Institute",
    "email": "[email protected]"
  },
  {
    "name": "Another Person",
    "affiliation": "Another Institute",
    "email": "[email protected]"
  }
]
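A minimal sketch of how such a nested definition might be evaluated. The `extractNested` helper and the injected `runSelector` function are illustrative assumptions, not thresher's current API; the sketch assumes each field's selector returns one result per author, in document order.

```javascript
// Combine per-field selector results into an array of structured objects.
// `runSelector` stands in for whatever actually evaluates an XPath
// against the document and returns an array of strings.
function extractNested(def, runSelector) {
  var fields = Object.keys(def.selectors);
  var results = {};
  fields.forEach(function (field) {
    results[field] = runSelector(def.selectors[field].selector);
  });
  // Assume every selector matched the same number of nodes, so that
  // index i across all fields corresponds to author i.
  var count = results[fields[0]] ? results[fields[0]].length : 0;
  var out = [];
  for (var i = 0; i < count; i++) {
    var item = {};
    fields.forEach(function (field) {
      item[field] = results[field][i];
    });
    out.push(item);
  }
  return out;
}
```

Zipping parallel result arrays by index is the simplest association strategy; a more robust version would scope each field's selector to a per-author context node.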
tbc.
The latest versions of the xpath library break namespace-free selectors for normal HTML documents (goto100/xpath#27), so we're stuck using v0.0.6. For now this is fine, but eventually we should seek a resolution, either with a fork of the main xpath lib, or by just maintaining it inside this project.
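In the meantime, pinning the exact dependency in package.json keeps installs on the known-good version (the version number is the one stated above; the rest of the fragment is illustrative):

```json
{
  "dependencies": {
    "xpath": "0.0.6"
  }
}
```

Note the absence of a `^` or `~` prefix, so npm will not silently upgrade to a broken release.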
At the moment a download failure crashes the app. In particular, we want to make sure we never try to download null URLs, and that we handle error events emitted by the downloader.
Using jsdom:
var jsdom = require("jsdom").jsdom;
var serializeDocument = require("jsdom").serializeDocument;

var doc = jsdom("<!DOCTYPE html>hello");

// serializeDocument includes the doctype; outerHTML does not.
serializeDocument(doc) === "<!DOCTYPE html><html><head></head><body>hello</body></html>";
doc.documentElement.outerHTML === "<html><head></head><body>hello</body></html>";
It's worth considering whether first sending a lightweight, non-rendered request is worthwhile. If all the extractable elements can be taken from the raw HTML (e.g. meta tags in the document head), we don't need to fire up PhantomJS, and the site gets hammered less by scraping traffic.
This is a discussion point rather than a feature request
Examples of network problems:
We should detect these situations and handle them appropriately (e.g. by emitting events that could be presented to user as log messages).
Related: ContentMine/quickscrape#44
This bug was initially provided by PMR as ContentMine/quickscrape#88
info: Saving logs to ./test2010-03-01/quickscrape1.2016-09-11-19-16.log
info: quickscrape 0.4.7 launched with...
info: - URLs from file: undefined
info: - Scraperdir: /home/pm286/journal-scrapers/scrapers
info: - Rate limit: 10 per minute
info: - Log level: info
info: urls to scrape: 13110
info: processing URL: http://dx.doi.org/10.1088/0965-0393/18/2/025015
error: Error: ETIMEDOUT so moving on to next url in list
info: processing URL: http://dx.doi.org/10.1088/0965-0393/18/2/025016
error: Error: ETIMEDOUT so moving on to next url in list
info: processing URL: http://dx.doi.org/10.4304/jsw.5.3.304-311
error: page did not return a 200 instead returned 500 so moving on to next url in list
info: processing URL: http://dx.doi.org/10.1209/0295-5075/89/69002
error: Error: ETIMEDOUT so moving on to next url in list
info: processing URL: http://dx.doi.org/10.5373/jaram.223.092109
/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:60
callback(err, response.request.href);
^
TypeError: Cannot read property 'request' of undefined
at Request._callback (/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:60:27)
at self.callback (/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/request/request.js:368:22)
at emitOne (events.js:96:13)
at Request.emit (events.js:188:7)
at Request.onRequestError (/home/pm286/.nvm/versions/node/v6.3.1/lib/node_modules/quickscrape/node_modules/request/request.js:1025:8)
at emitOne (events.js:96:13)
at ClientRequest.emit (events.js:188:7)
at Socket.socketErrorListener (_http_client.js:308:9)
at emitOne (events.js:96:13)
at Socket.emit (events.js:188:7)
finished
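The crash above comes from dereferencing `response` at lib/url.js:60 when the request library reports an error: on a network failure such as ETIMEDOUT, the callback receives `err` with `response` undefined, so `response.request.href` throws. A guard of roughly this shape at the callback site would fix it; the function here is a sketch factored out for clarity, not the exact url.js code:

```javascript
// Guarded version of the response handler passed to request().
// Network errors (e.g. ETIMEDOUT) arrive with no response object,
// so response.request.href must never be touched in that case.
function handleResponse(callback) {
  return function (err, response) {
    if (err || !response || !response.request) {
      return callback(err || new Error('no response received'));
    }
    callback(null, response.request.href);
  };
}
```

With this guard the error propagates through the callback and the URL loop can log it and continue, instead of the process crashing with `TypeError: Cannot read property 'request' of undefined`.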
When an invalid XPath expression is used in a scraper, the error currently raised is unhelpful:
Error: XPath parse error
We should report the exact XPath expression that failed.
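One way to do this is to wrap the evaluation and rethrow with the offending expression attached. Here `evaluate` is injected as a stand-in for the real xpath library call, so the wrapper itself is library-agnostic:

```javascript
// Wrap an XPath evaluation so parse errors report the failing expression.
function evaluateWithContext(evaluate, expression, doc) {
  try {
    return evaluate(expression, doc);
  } catch (e) {
    // Prefix the original message with the expression that caused it.
    e.message = 'XPath error in expression "' + expression + '": ' + e.message;
    throw e;
  }
}
```

Since scrapers can contain dozens of selectors, naming the failing expression turns a dead-end error into something the scraper author can act on directly.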
It should be possible to specify a new name for a file on download. This allows the standardisation of scraping results across sites.
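For example, a scraperJSON element definition could carry a target filename alongside its download instruction. The `rename` key below is a proposal, not an existing feature, and the selector is illustrative:

```json
"fulltext_pdf": {
  "selector": "//meta[@name='citation_pdf_url']",
  "attribute": "content",
  "download": {
    "rename": "fulltext.pdf"
  }
}
```

Every site's scraper could then save its PDF as `fulltext.pdf`, so downstream tools see a uniform layout regardless of the source's original filenames.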
What is the relationship between thresher and quickscrape? Replacement or synergistic components?
This should probably be a flag in scraperJSON. We should resolve URLs before doing anything else with them (see https://github.com/andris9/resolver), then use first the unresolved and then the resolved URL to search for a scraper.
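The matching step might look like this sketch, assuming each scraper declares a URL regex in a `url` field (the field name and scraper shape are assumptions):

```javascript
// Find a scraper whose URL pattern matches, preferring the unresolved
// (original) URL, then falling back to the resolved one.
function selectScraper(scrapers, unresolvedUrl, resolvedUrl) {
  function match(url) {
    if (!url) return null;
    for (var i = 0; i < scrapers.length; i++) {
      if (new RegExp(scrapers[i].url).test(url)) return scrapers[i];
    }
    return null;
  }
  return match(unresolvedUrl) || match(resolvedUrl);
}
```

Trying the unresolved URL first means a dx.doi.org link can hit a DOI-specific scraper before redirect resolution hands the URL to a publisher-specific one.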
Ideally the user can pass in a list of element names at the command line, and these will be deleted from the loaded scraper.
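A minimal sketch, assuming the scraper's extraction rules live under an `elements` object keyed by name (the structure and the `--exclude` flag mentioned in the comment are assumptions):

```javascript
// Remove named elements (e.g. from a hypothetical --exclude name1,name2
// command-line flag) from a loaded scraper definition.
function excludeElements(scraper, names) {
  names.forEach(function (name) {
    delete scraper.elements[name];
  });
  return scraper;
}
```

Deleting the entries before scraping starts means the excluded selectors are never evaluated and no files are downloaded for them.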