
Cyborg

Cyborg is an asyncio Python 3 web scraping framework that helps you write programs to extract information from websites by reading and inspecting their HTML.

What?

Scraping websites for data can get fairly complex once you are dealing with data spread across multiple pages, request limits and error handling. Cyborg aims to handle all of this for you transparently, so that you can focus on the actual extraction of data rather than everything around it. It does this by helping you break the process down into smaller chunks, which can be combined into a Pipeline. For example, below is a Pipeline that scrapes takeaway menus from Just-Eat:

from cyborg import Pipeline
from example.scrapers.justeat import *
import json

with open("output", "w") as fd:
  just_eat_pipeline = Pipeline()\
    .set_host("http://just-eat.co.uk")\
    .feed(("chicken", "pizza", "kebab", "american", "italian"))\
    .pipe(AreaScraper)\
    .pipe(TakeawayScraper)\
    .unique("id")\
    .pipe(MenuScraper)\
    .output(lambda o: fd.write(json.dumps(o) + "\n"))

This Pipeline has several stages:

  1. feed(("chicken", "pizza", "kebab", "american", "italian"))
    • This feeds five cuisines into the pipeline. feed() also accepts other inputs, such as a file descriptor or a generator (see the sketch after this list)
  2. pipe(AreaScraper)
    • Each cuisine is piped into the AreaScraper, which yields the URL of every area page listed for that cuisine
  3. pipe(TakeawayScraper)
    • These URLs are piped into the TakeawayScraper, which produces a list of takeaways, each with an ID and a URL
  4. unique("id")
    • This stage of the pipeline only passes on data with a unique "id" key, so if a takeaway is scraped twice the duplicate is filtered out here
  5. pipe(MenuScraper)
    • The unique takeaways are piped into the MenuScraper, which extracts details such as the food on offer, its prices and the takeaway's address
  6. output(lambda o: fd.write(json.dumps(o) + "\n"))
    • This function writes a JSON representation of each result to the output file
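
For example, feeding the pipeline from a generator instead of a tuple might look like this (a minimal sketch: the cuisines() generator is hypothetical, and it assumes feed() consumes any generator as described above):

from cyborg import Pipeline

def cuisines():
  # Hypothetical input source; feed() also accepts a plain tuple,
  # as in the example above, or an open file descriptor.
  yield from ("chicken", "pizza", "kebab")

pipeline = Pipeline()\
  .set_host("http://just-eat.co.uk")\
  .feed(cuisines())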

Running a pipeline is as simple as asyncio.get_event_loop().run_until_complete(pipe.run()). The pipeline then handles things like retrying failed requests, tracking exceptions and errors, and managing parallel connections.
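
Putting that together with the Just-Eat example above (a sketch, using the asyncio.get_event_loop() idiom from the call shown above):

import asyncio

# Block until the pipeline's run() coroutine finishes; every stage
# runs concurrently on the event loop in the meantime.
asyncio.get_event_loop().run_until_complete(just_eat_pipeline.run())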

Currently any exceptions are logged and totalled for each stage within a pipeline. In the future the library will persist more data when an exception occurs, such as the current page's HTML, and offer the ability to replay these failed tasks during development.

Writing a scraper

Writing a scraper is really simple. Here is the entire implementation for the AreaScraper:

from cyborg import Page, Scraper

class AreaScraper(Scraper):
  page_format = Page("/{input}-takeaways")

  def scrape(self, data, response):
    for link_list in response.find(".links"):
      for link in link_list.find("a"):
        yield {}, link.attr["href"]

Every scraper must have a scrape(data, response) method, which should yield (data, url) tuples. The data is passed to the next scraper in the pipeline along with the response for the URL, which can be queried using CSS selectors.
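
To illustrate how data flows between stages, here is a sketch of what a TakeawayScraper might look like. The CSS selector, the data-id attribute and the page_format value are illustrative assumptions, not the real implementation:

from cyborg import Page, Scraper

class TakeawayScraper(Scraper):
  # Assumption: the area URLs yielded by AreaScraper are fetched as-is.
  page_format = Page("{input}")

  def scrape(self, data, response):
    # The selector and "data-id" attribute are guesses at the markup.
    # What matters is that each result carries an "id" key, since the
    # unique("id") stage downstream filters on it.
    for link in response.find(".takeaway a"):
      yield {"id": link.attr["data-id"]}, link.attr["href"]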

Running the example

You can run the example by just executing python3 run.py inside the example/ directory. Every second you will see output like this:

AreaScraper         :      5: {}
TakeawayScraper     :      6: {}
_UniqueProcessor    :     26: {}
MenuScraper         :     16: {}
GeoIPScraper        :      7: {}

This is the status of the pipeline: the number is the count of tasks each stage has processed. The dictionary on the right displays error totals, for example {"notfound": 4, "exception": 2}. Data should start to appear inside the results file as soon as the GeoIPScraper has processed some data.

What works?

This is just an alpha at the moment; the example works, but there is still a lot to be done:

  • Rate limiting
  • Configurable number of workers
  • Testing
  • Parallel pipelines:
    • Pipeline.parallel(pipeline1, pipeline2).pipe(pipeline3) - run two pipelines in parallel and pipe their combined output into a third
  • Documentation
  • Clarify the distinction between a processor and a scraper
