Git Product home page Git Product logo

awesome-crawler's Introduction

Awesome-crawler Awesome

A collection of awesome web crawler,spider and resources in different languages.

Contents

Python

  • Scrapy - A fast high-level screen scraping and web crawling framework.
  • pyspider - A powerful spider system.
  • cola - A distributed crawling framework.
  • Demiurge - PyQuery-based scraping micro-framework.
  • Scrapely - A pure-python HTML screen-scraping library.
  • feedparser - Universal feed parser.
  • you-get - Dumb downloader that scrapes the web.
  • Grab - Site scraping framework.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • portia - Visual scraping for Scrapy.
  • crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
  • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
  • MSpider - A simple ,easy spider using gevent and js render.
  • brownant - A lightweight web data extracting framework.
  • PSpider - A simple spider frame in Python3.
  • Gain - Web crawling framework based on asyncio for everyone.
  • sukhoi - Minimalist and powerful Web Crawler.
  • spidy - The simple, easy to use command line web crawler.
  • newspaper - News, full-text, and article metadata extraction in Python 3

Java

  • ACHE Crawler - An easy to use web crawler for domain-specific search.
  • Apache Nutch - Highly extensible, highly scalable web crawler for production environment.
    • anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
  • Crawler4j - Simple and lightweight web crawler.
  • JSoup - Scrapes, parses, manipulates and cleans HTML.
  • websphinx - Website-Specific Processors for HTML information extraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • Gecco - A easy to use lightweight web crawler
  • WebCollector - Simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.
  • Webmagic - A scalable crawler framework.
  • Spiderman - A scalable ,extensible, multi-threaded web crawler.
    • Spiderman2 - A distributed web crawler framework,support js render.
  • Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
  • SeimiCrawler - An agile, distributed crawler framework.
  • StormCrawler - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
  • Spark-Crawler - Evolving Apache Nutch to run on Spark.
  • webBee - A DFS web spider.

C#

  • ccrawler - Built in C# 3.5 version. it contains a simple extension of web content categorizer, which can saparate between the web page depending on their content.
  • SimpleCrawler - Simple spider base on mutithreading, regluar expression.
  • DotnetSpider - This is a cross platfrom, ligth spider develop by C#.
  • Abot - C# web crawler built for speed and flexibility.
  • Hawk - Advanced Crawler and ETL tool written in C#/WPF.
  • SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.

JavaScript

  • scraperjs - A complete and versatile web scraper.
  • scrape-it - A Node.js scraper for humans.
  • simplecrawler - Event driven web crawler.
  • node-crawler - Node-crawler has clean,simple api.
  • js-crawler - Web crawler for Node.JS, both HTTP and HTTPS are supported.
  • webster - A reliable web crawling framework which can scrape ajax and js rendered content in a web page.
  • x-ray - Web scraper with pagination and crawler support.
  • node-osmosis - HTML/XML parser and web scraper for Node.js.
  • web-scraper-chrome-extension - Web data extraction tool implemented as chrome extension.
  • supercrawler - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

PHP

  • Goutte - A screen scraping and web crawling library for PHP.
  • dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
  • pspider - Parallel web crawler written in PHP.
  • php-spider - A configurable and extensible PHP web spider.
  • spatie/crawler - An easy to use, powerful crawler implemented in PHP. Can execute Javascript.

C++

C

  • httrack - Copy websites to your computer.

Ruby

  • Nokogiri - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
  • upton - A batteries-included framework for easy web-scraping. Just add CSS(Or do more).
  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
  • RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.
  • Spidr - Spider a site ,multiple domains, certain links or infinitely.
  • Cobweb - Web crawler with very flexible crawling options, standalone or using sidekiq.
  • mechanize - Automated web interaction & crawling.

R

  • rvest - Simple web scraping for R.

Erlang

  • ebot - A scalable, distribuited and highly configurable web cawler.

Perl

  • web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Go

  • pholcus - A distributed, high concurrency and powerful web crawler.
  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  • go_spider - An awesome Go concurrent Crawler(spider) framework.
  • dht - BitTorrent DHT Protocol && DHT Spider.
  • ants-go - A open source, distributed, restful crawler engine in golang.
  • scrape - A simple, higher level interface for Go web scraping.
  • creeper - The Next Generation Crawler Framework (Go).
  • colly - Fast and Elegant Scraping Framework for Gophers.

Scala

  • crawler - Scala DSL for web crawling.
  • scrala - Scala crawler(spider) framework, inspired by scrapy.
  • ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

awesome-crawler's People

Contributors

aecio avatar asciimoo avatar briatte avatar brucedone avatar gustavorps avatar jorgelbg avatar kaznovac avatar mayankpratap avatar moklick avatar rivermont avatar sriharshakappala avatar stewartmckee avatar tmos avatar zhuyingda avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.