scraperjson's People

Contributors

blahah

scraperjson's Issues

Create example scrapers with example results

It would be helpful to have real-world example scraperJSON files, each accompanied by the directory or collection of files produced by running a scraper with that definition.

These examples would help anyone implementing a scraper that follows the scraperJSON scheme.

Distributing the results (or the JSON file together with its results) as a *.zip or *.tgz archive would make sense, IMHO.

feature: followOn

{
  "url": "\\w+",
  "name": "followOn example",
  "followable": {
    "figurePage": {
      "selector": "//a[@class='full-figure']",
      "attribute": "href"
    }
  },
  "elements" : {
    "figure": {
      "follow": "figurePage",
      "caption": { "selector": "//figcaption" },
      "img": {
        "selector": "//figure//img",
        "attribute": "src"
      }
    }
  }
}

feature: download only if content-type matches

A common problem when scraping scientific journal articles arises when access to a PDF or other file download is not granted, but the server still returns a 200 OK status and sends an HTML document telling the user they don't have access. In this case, a scraperJSON client will simply download the HTML page and may rename it to the user's specified filename, leaving an HTML document mislabelled as some other file type.

A solution is to allow a download to specify one or more permitted content types, or perhaps a regex that the content type must match. If the content type does not match, the download is skipped.

The client would implement this by first performing a HEAD request to the download URL, then evaluating the Content-Type HTTP header, and then deciding whether to proceed to the full download.
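
The HEAD-then-check flow described above could look roughly like the following Python sketch. It is only an illustration, not part of scraperJSON or any existing client: the function names content_type_allowed and should_download, and the idea of passing the permitted types as a list of lowercase regex patterns, are assumptions made for this example.

```python
import re
import urllib.request


def content_type_allowed(header_value, allowed_patterns):
    """Return True if the media type of a Content-Type header matches a pattern.

    allowed_patterns is a hypothetical list of lowercase regexes,
    e.g. ["application/pdf", r"image/.*"].
    """
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return any(re.fullmatch(p, media_type) for p in allowed_patterns)


def should_download(url, allowed_patterns, timeout=10):
    """Send a HEAD request and decide whether the full download should proceed."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        header = response.headers.get("Content-Type")
    return content_type_allowed(header, allowed_patterns)
```

A client might call should_download(pdf_url, ["application/pdf"]) and skip the file when it returns False. Some servers reject HEAD requests or answer them with different headers than a GET, so a real implementation would probably need a fallback, such as checking the header of the GET response before writing the file.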
