contentmine / scraperjson

The scraperJSON standard for defining web scrapers as JSON objects

License: Creative Commons Zero v1.0 Universal

scraperjson's People

pombredanne wollmers

scraperjson's Issues

Create example scrapers with example results

It would be nice to have real-world example scraperJSON files,
each accompanied by the directory or file collection produced by
running a scraper with that file.

That would help anyone implementing a scraper that follows the scraperJSON scheme.

A *.zip or *.tgz archive of the results (or of the JSON file together with its results) would make sense as an example, IMHO.

feature: ability to nest elements

See ContentMine/thresher#2

feature: followOn

{
  "url": "\\w+",
  "name": "followOn example",
  "followable": {
    "figurePage": {
      "selector": "//a[@class='full-figure']",
      "attribute": "href"
    }
  },
  "elements" : {
    "figure": {
      "follow": "figurePage",
      "caption": { "selector": "//figcaption" },
      "img": {
        "selector": "//figure//img",
        "attribute": "src"
      }
    }
  }
}
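
Reading the example above, each "follow" value appears to name a key under "followable". A minimal sketch of a consistency check a client might run before scraping, assuming exactly that relationship (the function name `undefined_follows` is hypothetical and not part of scraperJSON):

```python
def undefined_follows(definition):
    """Return (element, target) pairs whose "follow" target is not defined.

    Assumption: a "follow" value must match a key under "followable".
    """
    followables = definition.get("followable", {})
    missing = []
    for name, element in definition.get("elements", {}).items():
        target = element.get("follow")
        if target is not None and target not in followables:
            missing.append((name, target))
    return missing


# The followOn example from this issue, written as a Python dict.
definition = {
    "url": "\\w+",
    "name": "followOn example",
    "followable": {
        "figurePage": {
            "selector": "//a[@class='full-figure']",
            "attribute": "href",
        }
    },
    "elements": {
        "figure": {
            "follow": "figurePage",
            "caption": {"selector": "//figcaption"},
            "img": {"selector": "//figure//img", "attribute": "src"},
        }
    },
}

problems = undefined_follows(definition)
```

An empty result means every "follow" refers to a defined followable; a misspelled target would be reported as an (element, target) pair.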

feature: rename files on download

See ContentMine/thresher#3

feature: download only if content-type matches

A common problem when scraping scientific journal articles arises when access to the PDF or other file download is not granted, but the server still returns a 200 OK status and sends an HTML document telling the user they don't have access. In this case, a scraperJSON client will simply download the HTML page and may rename it to the user's specified filename, so an HTML document ends up mislabelled as some other filetype.

A solution is to allow a download to specify one or more content-types that are permitted, or perhaps a regex that should match the content-type. If the content-type does not match, the download is skipped.

The client would implement this by first performing a HEAD request to the download URL, then evaluating the Content-Type HTTP header, and then deciding whether to proceed to the full download.
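
The HEAD-then-check flow described above could be sketched roughly as follows. This is an illustration, not part of any scraperJSON client: the function names and the `allowed_types` and `pattern` options are hypothetical, and a missing Content-Type header is assumed to count as a mismatch.

```python
import re
import urllib.request


def media_type(content_type_header):
    """Return the lowercased media type without parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def content_type_permitted(content_type_header, allowed_types=(), pattern=None):
    """Decide whether a Content-Type header value passes the filter.

    allowed_types: exact media types that are permitted (hypothetical option).
    pattern: optional regex searched in the media type (hypothetical option).
    """
    if not content_type_header:
        return False  # assumption: no header means the download is skipped
    mtype = media_type(content_type_header)
    if mtype in {t.lower() for t in allowed_types}:
        return True
    return pattern is not None and re.search(pattern, mtype) is not None


def should_download(url, allowed_types=(), pattern=None, timeout=10):
    """Send a HEAD request and check Content-Type before any full download."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        header = response.headers.get("Content-Type", "")
    return content_type_permitted(header, allowed_types, pattern)
```

With this sketch, a paywall page served as `text/html; charset=utf-8` would be rejected when only `application/pdf` is permitted, so the client never saves it under a PDF filename. Some servers handle HEAD poorly, so a real client might need a fallback such as inspecting the headers of a GET response before reading the body.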

feature: much more robust location specification

feature: regex post-extraction

documentation: follow and followables

ScraperJson

feature: scraper names

feature: collect multiple downloads in a subdirectory
