commoncrawl / commoncrawl

Common Crawl support library to access 2008-2012 crawl archives (ARC files)

Shell 0.11% Java 35.55% C++ 37.44% C 26.79% HTML 0.02% Makefile 0.05% M4 0.05%

inactive archived

commoncrawl's Issues

WARN[0060] error instantiating commoncrawl: commoncrawl.apiResult: decode slice: expect [ or n, but found , error found in #0 byte of ...||..., bigger context ...||...

Build errors and class ArcFileItem

Hi All!

When trying to build the JAR with Ant, I get the following errors:

compile-core-classes:
[javac] Compiling 167 source files to /root/commoncrawl/commoncrawl-commoncrawl-24052ae/build/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.5
[javac] /root/commoncrawl/commoncrawl-commoncrawl-24052ae/src/org/commoncrawl/hadoop/io/ARCSplitReader.java:34: error: package org.commoncrawl.protocol.shared does not exist
[javac] import org.commoncrawl.protocol.shared.ArcFileItem;
[javac] ^
[javac] /root/commoncrawl/commoncrawl-commoncrawl-24052ae/src/org/commoncrawl/hadoop/io/ARCSplitReader.java:42: error: cannot find symbol
[javac] public class ARCSplitReader implements RecordReader<Text, ArcFileItem> {
.....
.....
lots of errors related to ArcFileItem.
...

Maybe some files with the ArcFileItem class definition are missing? Using grep, I didn't find any definition of class ArcFileItem in src/*

Eugene

VerifyError

I'm trying Common Crawl w/ Hadoop 0.20.205 and I'm getting the following:

Exception in thread "main" java.lang.VerifyError: (class: org/commoncrawl/hadoop/io/JetS3tARCSource, method: configureImpl signature: (Lorg/apache/hadoop/mapred/JobConf;)V) Incompatible argument to function
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:819)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:864)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:890)
at org.commoncrawl.hadoop.io.ARCInputFormat.configure(ARCInputFormat.java:159)
at com.digitalpebble.behemoth.io.commoncrawl.CommonCrawlConverterJob.run(CommonCrawlConverterJob.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.digitalpebble.behemoth.io.commoncrawl.CommonCrawlConverterJob.main(CommonCrawlConverterJob.java:75)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I've tried clean builds, etc. Anyone have any ideas? AIUI, this error usually comes from version mismatches. I deleted the jets3t 0.6 version in my Hadoop lib directory, but I suspect there is an older version somewhere else in the classpath. Anyone else seen this?
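
One way to check the classpath suspicion above is to ask the JVM which jar a class was actually loaded from. The following is only a sketch and is not part of the commoncrawl library: the class name `ClassOrigin` and the method `locate` are hypothetical, and `org.jets3t.service.S3Service` is used merely as an example of a JetS3t class to look up.

```java
import java.security.CodeSource;

/** Hypothetical helper (not part of commoncrawl): reports where a class was loaded from. */
public class ClassOrigin {

    /** Returns the jar or directory that supplied the named class, or a short note if unknown. */
    public static String locate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // Classes supplied by the JVM's own runtime have no code source.
            return src == null ? "(JVM runtime)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            // The class is missing from the classpath or failed to link.
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Assumed example class name; pass another fully qualified name as the first argument.
        String name = args.length > 0 ? args[0] : "org.jets3t.service.S3Service";
        System.out.println(name + " -> " + locate(name));
    }
}
```

Running this with the same classpath as the Hadoop job should show whether an older JetS3t jar is being picked up ahead of the intended one.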

comment on public ARCFileReader constructor is confusing

On line 102 of commoncrawl/src/main/java/org/commoncrawl/util/shared/ARCFileReader.java, there's a comment that says the constructor is private (it's actually public), and refers to the "factory method above" even though it's the first method in the file.

  /** 
   * constructor is now private. use the factory method above to construct a reader 
   * @param source
   * @throws IOException
   */
  public ARCFileReader(final InputStream source)throws IOException {
    super(new CustomPushbackInputStream(new CountingInputStream(source),
        _blockSize), new Inflater(true), _blockSize);
    readARCHeader();
  }

Update binaries path in build.xml

Attempting to build with Ant fails to download the Ant tasks because the URL is incorrect.

It should be:

http://mirror.uoregon.edu/apache/maven/ant-tasks/2.1.3/binaries

Add jar to maven central repository?

What do you think about adding a built version of this to the central repository?

Index for WET files?

Hi, hope I am posting this question in the right place...

I found the .WARC format domain index at http://index.commoncrawl.org/CC-MAIN-2016-18//
I wonder if there is any index for .WET format files?

If not, is there any way I could convert the WARC object address to the WET object address?
For example, if I have:
s3://commoncrawl/crawl-data/CC-MAIN-2016-18/segments/1461860125175.9/warc/CC-MAIN-20160428161525-00221-ip-10-239-7-51.ec2.internal.warc.gz
What would be the corresponding .WET file?

Thx...
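
In Common Crawl's published file layout, the WET files appear to mirror the WARC files of the same segment: the `/warc/` path component becomes `/wet/` and the `.warc.gz` suffix becomes `.warc.wet.gz`. A minimal sketch of that mapping, assuming this layout holds for the crawl in question (the class name `WetPath` and the method `warcToWet` are hypothetical, not part of any Common Crawl library):

```java
/** Hypothetical helper: derives a WET path from a WARC path, assuming the usual Common Crawl layout. */
public class WetPath {

    public static String warcToWet(String warcPath) {
        String dirPart = "/warc/";
        String suffix = ".warc.gz";
        int dir = warcPath.lastIndexOf(dirPart);
        if (dir < 0 || !warcPath.endsWith(suffix)) {
            throw new IllegalArgumentException("not a WARC path: " + warcPath);
        }
        // Everything before the "/warc/" directory (bucket, crawl ID, segment).
        String prefix = warcPath.substring(0, dir);
        // File name without the ".warc.gz" suffix.
        String name = warcPath.substring(dir + dirPart.length(),
                                         warcPath.length() - suffix.length());
        return prefix + "/wet/" + name + ".warc.wet.gz";
    }

    public static void main(String[] args) {
        // The WARC address quoted in the question above.
        System.out.println(warcToWet(
            "s3://commoncrawl/crawl-data/CC-MAIN-2016-18/segments/1461860125175.9/warc/"
            + "CC-MAIN-20160428161525-00221-ip-10-239-7-51.ec2.internal.warc.gz"));
    }
}
```

This only rewrites the path; record offsets taken from the WARC index would not carry over to the WET file, so the matching record would still have to be found by URL inside it.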

Different formats ?

Would it be possible to get this in ZIM file format for use with https://kiwix.org/en/ ?
Kiwix is an offline internet project that enables the creation of ZIM files: archives that can be browsed offline safely, including in places with no internet access, such as remote locations.

I have seen copies of parts of this archive on the Internet Archive. The problem is that it should be segmented by language and placed into these archives, so that it can be a useful resource for people who are not data scientists but simple teachers who need offline access to such a large data resource.
At present the files are for the rich only, since you need a cloud just to access them, despite their being shared on various platforms.
In ZIM format the data would be available to everyone.
In the past, the shard files were even corrupt after a painful download (when the internet crawl was much smaller).

Thanks, and please consider this if it is possible.
As a use case, these snapshots could then be browsed archive by archive; the smaller the archives, the easier it is for low-tech people (i.e. each shard should be self-contained and not reliant on the other segments), and hence selectable.

Broken link on main site

I can't find a repository for the main site, but I thought I'd report it

The link "A distributed system for mining Common Crawl using SQS, AWS-EC2 and S3 by Akshay Bhat" on https://commoncrawl.org/the-data/examples/ is broken. From googling, it looks like it should probably point to https://github.com/jrs026/CommonCrawlMiner
