
commoncrawl-crawler's Introduction

This is the primary repository for the services & map-reduce jobs used to produce the CommonCrawl web corpus.

Tree Structure

  • org.commoncrawl.async - Utility code used to build the async server.
  • org.commoncrawl.hadoop.io - ARCInputFormat and related classes.
  • org.commoncrawl.hadoop.mergeutils - Support for merge-sorts outside the context of a Hadoop job.
  • org.commoncrawl.hadoop.template - Sample Hadoop Job.
  • org.commoncrawl.io - CommonCrawl IO library used by crawlers.
  • org.commoncrawl.mapred - Root for all MapReduce jobs. Also contains data structure definitions shared across jobs (database.jr).
  • org.commoncrawl.mapred.ec2.parser - Code used to generate ARCFiles and intermediate data on EC2 using EMR.
  • org.commoncrawl.mapred.ec2.postprocess.deduper - Code to support a parallel dedupe using a 64-bit Simhash.
  • org.commoncrawl.mapred.ec2.postprocess.linkCollector - Code to merge metadata generated by the parser job.
  • org.commoncrawl.mapred.pipelineV3 - The start of the new Nutch-free map-reduce pipeline used to process crawl metadata and generate new crawl lists.
  • org.commoncrawl.mapred.segmenter - Support code used to generate crawl segments (URL lists consumed by the crawlers).
  • org.commoncrawl.protocol - Shared data structure and enum definitions (generated).
  • org.commoncrawl.rpc - CommonCrawl RPC library used to build distributed systems.
  • org.commoncrawl.server - CommonCrawl Server base class used by various services.
  • org.commoncrawl.service - All long-lived processes in the CommonCrawl system are housed under this directory.
  • org.commoncrawl.service.crawler - The long-running crawler process (consumes crawl lists, writes content to HDFS).
  • org.commoncrawl.service.crawlhistory - A service that manages a crawler's crawl state in a BloomFilter.
  • org.commoncrawl.service.directory - A barebones service used to store and subscribe to lists via a path.
  • org.commoncrawl.service.dns - CommonCrawl DNS Service (used by crawlers to queue up DNS requests).
  • org.commoncrawl.service.listcrawler - A different type of list crawler that supports dynamic uploading and crawling of very large lists of URLs.
  • org.commoncrawl.service.pagerank - PageRank Master / Slave implementations (and related code) used to compute PageRank across the graph.
  • org.commoncrawl.service.parser - The beginnings of a distributed parser service that crawlers can use to do on-demand link extraction.
  • org.commoncrawl.service.queryserver - The (deprecated) crawl metadata service.
  • org.commoncrawl.service.statscollector - Service that receives crawl stats.
  • org.commoncrawl.util - The catch-all repository of Utility classes used by the CommonCrawl system.
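
The deduper entry above mentions a 64-bit Simhash. As a rough illustration of the general technique only (not the code in org.commoncrawl.mapred.ec2.postprocess.deduper), the sketch below computes a 64-bit fingerprint from whitespace-separated tokens and compares fingerprints by Hamming distance. The class name SimhashSketch, the FNV-1a token hash, and the whitespace tokenization are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a 64-bit Simhash; not taken from the repository.
public class SimhashSketch {

    // FNV-1a 64-bit hash of a token (an assumed choice; any well-mixed 64-bit hash works).
    static long hash64(String token) {
        long h = 0xcbf29ce484222325L;           // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                // FNV prime
        }
        return h;
    }

    // Each token votes +1 or -1 on every bit position; the sign of the total sets the bit.
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;                       // skip the empty token produced by blank input
            }
            long h = hash64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    // Number of differing bits; near-duplicate documents tend to have a small distance.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = simhash("the quick brown fox jumps over the lazy dog");
        long b = simhash("the quick brown fox jumped over the lazy dog");
        System.out.println("distance = " + hammingDistance(a, b));
    }
}
```

A dedupe job built on this idea would typically treat two documents as near-duplicates when the distance falls below a chosen threshold; the threshold and any weighting of tokens are design choices this sketch leaves open.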

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Contributors

Ahad Rana (ahad at commoncrawl.org)
