
Common Crawl Support Library

Overview

This library provides support code for consuming the Common Crawl Corpus raw crawl data (ARC files) stored on S3. More information about how to access the corpus can be found at https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set.

You can take two primary routes to consuming the ARC File content:

(1) You can run a Hadoop cluster on EC2 or use EMR to run a Hadoop job. In this case, you can use the ARCFileInputFormat to drive data to your mappers/reducers. There are two versions of the InputFormat: one written to conform to the deprecated mapred package, located at org.commoncrawl.hadoop.io.mapred, and one written for the mapreduce package, correspondingly located at org.commoncrawl.hadoop.io.mapreduce. A sketch of a job using the mapreduce version follows the record format description below.

(2) You can decode data directly by feeding an InputStream to the ARCFileReader class located in the org.commoncrawl.util.shared package. A programmatic sketch follows the sample command line below.

Both routes (the InputFormats or the ARCFileReader directly) produce a tuple consisting of a UTF-8 encoded URL (Text) and the raw content downloaded by the crawler (BytesWritable), including the HTTP headers. The HTTP headers are UTF-8 encoded, and the headers and content are delimited by two consecutive CRLF sequences (an empty line). The content itself, when it is of a text MIME type, is encoded using the source document's text encoding.
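
For route (1), a minimal sketch of a job and mapper built on the mapreduce-package InputFormat might look like the following. It assumes the input format class is ARCFileInputFormat in org.commoncrawl.hadoop.io.mapreduce (per the description above) and that it accepts input paths via FileInputFormat.addInputPath; the mapper splits each record at the CRLF CRLF delimiter and emits the body size for each URL, which is purely illustrative.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import org.commoncrawl.hadoop.io.mapreduce.ARCFileInputFormat;

    // Illustrative job: emits the HTTP body size (in bytes) for each crawled URL.
    public class ArcBodySizeJob {

      public static class BodySizeMapper
          extends Mapper<Text, BytesWritable, Text, LongWritable> {

        @Override
        protected void map(Text url, BytesWritable rawContent, Context context)
            throws IOException, InterruptedException {
          // Only the first getLength() bytes of the backing array are valid.
          byte[] bytes = rawContent.getBytes();
          int length = rawContent.getLength();
          // Headers and body are separated by two consecutive CRLF sequences.
          for (int i = 0; i + 3 < length; i++) {
            if (bytes[i] == '\r' && bytes[i + 1] == '\n'
                && bytes[i + 2] == '\r' && bytes[i + 3] == '\n') {
              context.write(url, new LongWritable(length - (i + 4)));
              return;
            }
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "arc-body-size");   // Job.getInstance(...) on Hadoop 2+
        job.setJarByClass(ArcBodySizeJob.class);
        job.setInputFormatClass(ARCFileInputFormat.class);
        job.setMapperClass(BodySizeMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an s3n:// segment path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The org.commoncrawl.hadoop.io.mapred version is wired up analogously through the older JobConf-based API.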

Build Notes:

  1. You need to define JAVA_HOME and make sure you have Ant and Maven installed.
  2. Set hadoop.path (in build.properties) to point to your Hadoop distribution (see the example below).
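
For example, build.properties in the root of the source tree might contain a single line such as the following (the path is illustrative; point it at wherever your Hadoop distribution lives):

    hadoop.path=/usr/local/hadoop-1.0.3

Running ant from the project root should then produce commoncrawl.jar, assuming the default Ant target builds the jar.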

Sample Usage:

Once commoncrawl.jar has been built, you can validate that the ARCFileReader works for you by executing the following sample command line from the root of the commoncrawl source directory:

./bin/launcher.sh org.commoncrawl.util.shared.ARCFileReader --awsAccessKey <ACCESS KEY> --awsSecret <SECRET> --file s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690164240/1341819847375_4319.arc.gz
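
To consume ARC data programmatically (route 2 above), you construct an ARCFileReader around an InputStream over a .arc.gz file. The sketch below is an outline only: passing an InputStream to the constructor follows the description above, but the iteration calls shown (hasMoreItems / getNext) are hypothetical placeholders; consult the ARCFileReader source in org.commoncrawl.util.shared for the actual record-iteration API.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;

    import org.commoncrawl.util.shared.ARCFileReader;

    public class ARCFileReaderSketch {
      public static void main(String[] args) throws Exception {
        // Any InputStream over a gzipped ARC file will do; a locally downloaded copy is used here.
        try (InputStream in = new FileInputStream(args[0])) {
          ARCFileReader reader = new ARCFileReader(in);    // constructor form assumed
          Text url = new Text();
          BytesWritable rawContent = new BytesWritable();
          // hasMoreItems()/getNext() stand in for the real iteration API -- see the class source.
          while (reader.hasMoreItems()) {
            reader.getNext(url, rawContent);
            System.out.println(url + "\t" + rawContent.getLength() + " bytes");
          }
        }
      }
    }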

Common Crawl Foundation's Projects

cc-legal

Repository for legal documentation at the Common Crawl Foundation

cc-mrjob

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

cc-pyspark

Process Common Crawl data with Python and Spark

cc-warc-examples

CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

cc-webgraph

Tools to construct and process webgraphs from Common Crawl data

cdx-index-client

A command-line tool for using the Common Crawl Index API at http://index.commoncrawl.org/

commoncrawl

Common Crawl support library to access 2008-2012 crawl archives (ARC files)

commoncrawl-examples

A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)

data_tooling

Tools for managing datasets for governance and training.

discussions

For discussions and collaboration among all those who use or seek to use Common Crawl data

example-bill-tracker

MapReduce functions that use Common Crawl data to examine the spread of congressional legislation on the internet
