contentmine / scraperjson
The scraperJSON standard for defining web scrapers as JSON objects
License: Creative Commons Zero v1.0 Universal
It would be nice to have real-world example scraperJSON files, together with the directory or file collection produced by running a scraper with each of them.
That would help anyone implementing a scraper that follows the scraperJSON scheme/policy.
A *.zip or *.tgz archive of the results (or of the JSON file plus its results) would make sense as an example, IMHO.
{
  "url": "\\w+",
  "name": "followOn example",
  "followable": {
    "figurePage": {
      "selector": "//a[@class='full-figure']",
      "attribute": "href"
    }
  },
  "elements": {
    "figure": {
      "follow": "figurePage",
      "caption": { "selector": "//figcaption" },
      "img": {
        "selector": "//figure//img",
        "attribute": "src"
      }
    }
  }
}
A common problem when scraping scientific journal articles is that access to the PDF or other file download is not granted, but the server still returns a 200 OK status and sends an HTML document telling the user they don't have access. In this case, a scraperJSON client will simply download the HTML page and may rename it to the user's specified filename, so an HTML document can end up mislabelled as some other filetype.
A solution is to allow a download to specify one or more permitted content-types, or perhaps a regex that the content-type should match. If the content-type does not match, the download is skipped.
The client would implement this by first performing a HEAD request to the download URL, then evaluating the Content-Type HTTP header, then deciding whether to proceed to the full download.
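The HEAD-then-check flow described above could look roughly like the following Python sketch. It is not part of the scraperJSON spec or of any existing client: the function names `content_type_allowed` and `should_download`, and the `allowed_types` and `pattern` parameters, are hypothetical names chosen for illustration.

```python
import re
import urllib.request


def content_type_allowed(header_value, allowed_types=(), pattern=None):
    """Return True if a Content-Type header value is permitted.

    allowed_types and pattern are hypothetical options, not spec keys.
    """
    if not header_value:
        # No Content-Type header at all: treat as not permitted.
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if media_type in {t.lower() for t in allowed_types}:
        return True
    # Fall back to the optional regex, if one was given.
    return pattern is not None and re.search(pattern, media_type) is not None


def should_download(url, allowed_types=(), pattern=None, timeout=10):
    """Send a HEAD request and decide whether to proceed to the full download."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        header_value = response.headers.get("Content-Type")
    return content_type_allowed(header_value, allowed_types, pattern)
```

A client would call something like `should_download(pdf_url, allowed_types=["application/pdf"])` before fetching the file. Keeping the header check separate from the network call makes the matching rule easy to test without a server. Note that some servers answer HEAD differently from GET, so a real client may still want to check the Content-Type of the final response.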
We should use the model PeerLibrary has developed, in which they store essentially all the information available about the location of an annotation.
Add ability to specify regex post-extraction and a way to decide how the captures are handled.
See ContentMine/quickscrape#12 for discussion.
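One possible shape for this, sketched in Python purely for discussion: the function `apply_regex` and its `captures` option (with values `"first"` and `"all"`) are hypothetical and are not taken from the spec or from quickscrape.

```python
import re


def apply_regex(value, pattern, captures="first"):
    """Post-process an extracted string with a regex.

    captures is a hypothetical option controlling how captures are returned:
      "first" -> first capture group of the first match
                 (the whole match if the pattern has no groups)
      "all"   -> a list with one tuple of groups per match
    Returns None when the pattern does not match at all.
    """
    matches = list(re.finditer(pattern, value))
    if not matches:
        return None
    if captures == "first":
        first = matches[0]
        return first.group(1) if first.groups() else first.group(0)
    if captures == "all":
        return [m.groups() if m.groups() else (m.group(0),) for m in matches]
    raise ValueError("unknown captures mode: %r" % captures)
```

For example, `apply_regex("doi:10.1234/abc", r"doi:(\S+)")` keeps only the captured identifier, while `captures="all"` keeps every match so the caller can decide what to do with them.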
Need to document this feature and update the spec.
simple name key
This could be part of the rename option, or could be its own option.