scrapio's Introduction

Scrapio

Asyncio web scraping framework. The project aims to make it easy to write high-performance crawlers with little knowledge of asyncio, while giving users enough flexibility to customise the behaviour of their scrapers. It also supports Uvloop and can be used in conjunction with custom clients, allowing for browser-based rendering.

Install

pip install scrapio

The project can be installed using Pip.

Hello Crawl Example

from collections import defaultdict

import aiofiles # external dependency
import lxml.html as lh
from scrapio.crawlers.base_crawler import BaseCrawler # import from scrapio.scrapers on version 0.14 and lower
from scrapio.utils.helpers import response_to_html


class OurScraper(BaseCrawler):

    def parse_result(self, response):
        html = response_to_html(response)
        dom = lh.fromstring(html)

        result = defaultdict(lambda: "N/A")
        result['url'] = response.url
        title = dom.cssselect('title')
        h1 = dom.cssselect('h1')
        if title:
            result['title'] = title[0].text_content()
        if h1:
            result['h1'] = h1[0].text_content()
        return result

    async def save_results(self, result):
        if result:
            async with aiofiles.open('example_output.csv', 'a') as f:
                url = result.get('url')
                title = result.get('title')
                h1 = result.get('h1')
                await f.write('"{}","{}","{}"\n'.format(url, title, h1))


if __name__ == '__main__':
    scraper = OurScraper('http://edmundmartin.com')
    scraper.run_crawler(10)

The above represents a fully functional scraper using the Scrapio framework. We override the parse_result and save_results methods of the base crawler class. We then initialize the crawler with our start URL and set the number of scraping and parsing processes.
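Scrapio also supports Uvloop. Below is a minimal sketch of opting in, assuming the crawler simply uses whichever event loop policy is installed; this hook is standard uvloop usage, not part of Scrapio itself, and OurScraper is the class defined above.

import asyncio

import uvloop  # external dependency: pip install uvloop

# Install uvloop's faster event loop implementation before the crawler
# creates its event loop.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

if __name__ == '__main__':
    scraper = OurScraper('http://edmundmartin.com')
    scraper.run_crawler(10)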

Custom Link Parsing

The default behaviour of the link parser can be changed by overriding the base URL filtering class, as outlined in the example below.

from collections import defaultdict

import aiofiles # external dependency
import lxml.html as lh
from scrapio.crawlers import BaseCrawler
from scrapio.utils.helpers import response_to_html
from scrapio.structures.filtering import URLFilter


class PythonURLFilter(URLFilter):

    def can_crawl(self, host: str, url: str):
        if 'edmundmartin.com' in host and 'python' in url.lower():
            return True
        return False


class OurScraper(BaseCrawler):

    def parse_result(self, response):
        html = response_to_html(response)
        dom = lh.fromstring(html)

        result = defaultdict(lambda: "N/A")
        result['url'] = response.url
        title = dom.cssselect('title')
        h1 = dom.cssselect('h1')
        if title:
            result['title'] = title[0].text_content()
        if h1:
            result['h1'] = h1[0].text_content()
        return result

    async def save_results(self, result):
        if result:
            async with aiofiles.open('example_output.csv', 'a') as f:
                url = result.get('url')
                title = result.get('title')
                h1 = result.get('h1')
                await f.write('"{}","{}","{}"\n'.format(url, title, h1))


if __name__ == '__main__':
    scraper = OurScraper('http://edmundmartin.com', custom_filter=PythonURLFilter)
    scraper.run_crawler(10)

scrapio's People

Contributors

dependabot[bot], edmundmartin

scrapio's Issues

Subclass ClientResponse

Sub-classing ClientResponse will allow us to save time by building the HTML and DOM for each response only once.
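One possible approach, sketched here with aiohttp's public hooks rather than taken from the project, is a ClientResponse subclass that lazily parses and caches the DOM; aiohttp lets you plug it in via the session's response_class argument.

import aiohttp
import lxml.html as lh


class CachedDOMResponse(aiohttp.ClientResponse):
    """ClientResponse subclass that parses the body into an lxml DOM only once."""

    _dom = None

    async def dom(self):
        if self._dom is None:
            html = await self.text()
            self._dom = lh.fromstring(html)
        return self._dom


# The subclass is handed to aiohttp when the session is created, e.g.:
# session = aiohttp.ClientSession(response_class=CachedDOMResponse)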

Proxies

Greetings,

I'm interested to see how you manage proxies. I want to keep things simple and not have a dedicated DB that manages the status/state of proxies.

I want to place proxies on a cooldown list if they trigger a failed response due to abuse of the targeted site. I was thinking of using a Python dict and recording each proxy's state: either ready to be used, or skipped and timestamped so it can return to the active pool after 15-25 minutes.

However, with more than one worker, I'm not sure how to share this information among them within an asyncio framework.
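One simple approach, sketched below and not an existing Scrapio API, is a plain dict mapping each proxy to the time it becomes usable again. Because every crawler task runs on the same event loop, the workers can share this object directly, provided each update completes without awaiting part-way through.

import time


class ProxyPool:
    """Hypothetical cooldown pool shared by all tasks on one event loop."""

    def __init__(self, proxies, cooldown=20 * 60):
        self._cooldown = cooldown
        # Maps proxy -> monotonic timestamp at which it may be used again.
        self._ready_at = {proxy: 0.0 for proxy in proxies}

    def get(self):
        now = time.monotonic()
        for proxy, ready_at in self._ready_at.items():
            if ready_at <= now:
                return proxy
        return None  # every proxy is currently cooling down

    def report_failure(self, proxy):
        # Bench the proxy for the cooldown period (e.g. 15-25 minutes).
        self._ready_at[proxy] = time.monotonic() + self._cooldown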

Problems with link parsing

On certain sites the crawler is straying off and making requests to URLs which should not be crawled. This seems to be quite a subtle issue, as it is not happening on all sites. I'll look into it sometime in the future when I have the time.

Implement support for Pyppeteer

We already support using Splash; add the ability to use Pyppeteer as an alternative browser rendering engine. It could potentially be quicker than Splash if properly implemented.
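For reference, fetching a rendered page with Pyppeteer looks roughly like this (a sketch of Pyppeteer's own API, not of a Scrapio client):

import asyncio

from pyppeteer import launch  # external dependency: pip install pyppeteer


async def render(url):
    # Launch headless Chromium, render the page and return its HTML.
    browser = await launch()
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await page.content()
    finally:
        await browser.close()


if __name__ == '__main__':
    html = asyncio.get_event_loop().run_until_complete(render('http://edmundmartin.com'))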
