commoncrawl's Issues

Build errors and class ArcFileItem

Hi All!

When I try to build the JAR with Ant, I get the following errors:

compile-core-classes:
[javac] Compiling 167 source files to /root/commoncrawl/commoncrawl-commoncrawl-24052ae/build/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.5
[javac] /root/commoncrawl/commoncrawl-commoncrawl-24052ae/src/org/commoncrawl/hadoop/io/ARCSplitReader.java:34: error: package org.commoncrawl.protocol.shared does not exist
[javac] import org.commoncrawl.protocol.shared.ArcFileItem;
[javac] ^
[javac] /root/commoncrawl/commoncrawl-commoncrawl-24052ae/src/org/commoncrawl/hadoop/io/ARCSplitReader.java:42: error: cannot find symbol
[javac] public class ARCSplitReader implements RecordReader<Text, ArcFileItem> {
.....
.....
lots of errors related to ArcFileItem.
...

Maybe some files containing the ArcFileItem class definition are missing? Using grep, I didn't find any definition of class ArcFileItem in src/*.

Eugene

VerifyError

I'm trying Common Crawl w/ Hadoop 0.20.205 and I'm getting the following:

Exception in thread "main" java.lang.VerifyError: (class: org/commoncrawl/hadoop/io/JetS3tARCSource, method: configureImpl signature: (Lorg/apache/hadoop/mapred/JobConf;)V) Incompatible argument to function
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:819)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:864)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:890)
at org.commoncrawl.hadoop.io.ARCInputFormat.configure(ARCInputFormat.java:159)
at com.digitalpebble.behemoth.io.commoncrawl.CommonCrawlConverterJob.run(CommonCrawlConverterJob.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.digitalpebble.behemoth.io.commoncrawl.CommonCrawlConverterJob.main(CommonCrawlConverterJob.java:75)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I've tried clean builds, etc. Anyone have any ideas? AIUI, this error usually comes from version mismatches. I deleted the JetS3t 0.6 jar from my Hadoop lib directory, but I suspect there is an older version somewhere else on the classpath. Has anyone else seen this?
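
One way to check the "older version somewhere else on the classpath" suspicion is to ask the JVM which jar a given class was actually loaded from. The following is a minimal sketch, not part of the commoncrawl code: `ClassOrigin` and `locate` are hypothetical names, and the class name used in the usage note is only an example of a JetS3t class to probe.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class was loaded from.
final class ClassOrigin {

    /** Returns the jar or directory a class came from, or a note if the JVM does not expose it. */
    static String locate(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no code source.
        return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) throws ClassNotFoundException {
        for (String name : args) {
            // initialize=false: load the class without running its static initializers.
            Class<?> cls = Class.forName(name, false, ClassOrigin.class.getClassLoader());
            System.out.println(name + " -> " + locate(cls));
        }
    }
}
```

Run it with the same classpath the job uses and pass the fully qualified class names to check, for example `org.jets3t.service.S3Service`. If the printed location is not the jar the build was compiled against, a mismatched copy is probably shadowing it.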

comment on public ARCFileReader constructor is confusing

On line 102 of commoncrawl/src/main/java/org/commoncrawl/util/shared/ARCFileReader.java, there's a comment that says the constructor is private (it's actually public), and refers to the "factory method above" even though it's the first method in the file.

  /** 
   * constructor is now private. use the factory method above to construct a reader 
   * @param source
   * @throws IOException
   */
  public ARCFileReader(final InputStream source)throws IOException {
    super(new CustomPushbackInputStream(new CountingInputStream(source),
        _blockSize), new Inflater(true), _blockSize);
    readARCHeader();
  }

Index for WET files?

Hi, I hope I am posting this question in the right place...

I found the domain index for the WARC files at http://index.commoncrawl.org/CC-MAIN-2016-18//
Is there any index for the WET files?

If not, is there any way I could convert a WARC object address to the corresponding WET object address?
For example, if I have:
s3://commoncrawl/crawl-data/CC-MAIN-2016-18/segments/1461860125175.9/warc/CC-MAIN-20160428161525-00221-ip-10-239-7-51.ec2.internal.warc.gz
What would be the corresponding .WET file?

Thx...
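
For the path conversion asked about above, the sketch below assumes the usual Common Crawl layout, in which a WET file sits next to its WARC file with the `warc` directory replaced by `wet` and the `.warc.gz` suffix replaced by `.warc.wet.gz`. `WarcToWet` and `toWetPath` are hypothetical names, and the mapping should be checked against the crawl's published path listings before relying on it.

```java
// Hypothetical helper: derives a WET path from a WARC path by string rewriting.
final class WarcToWet {
    private static final String WARC_DIR = "/warc/";
    private static final String WARC_SUFFIX = ".warc.gz";

    static String toWetPath(String warcPath) {
        int dir = warcPath.lastIndexOf(WARC_DIR);
        if (dir < 0 || !warcPath.endsWith(WARC_SUFFIX)) {
            throw new IllegalArgumentException("not a WARC path: " + warcPath);
        }
        // Everything before "/warc/" (bucket, crawl, segment) is kept unchanged.
        String prefix = warcPath.substring(0, dir);
        // File name without the ".warc.gz" suffix.
        String name = warcPath.substring(dir + WARC_DIR.length(),
                warcPath.length() - WARC_SUFFIX.length());
        return prefix + "/wet/" + name + ".warc.wet.gz";
    }

    public static void main(String[] args) {
        for (String path : args) {
            System.out.println(toWetPath(path));
        }
    }
}
```

This only maps file names. Record offsets taken from the WARC index describe positions inside the WARC file and do not carry over to the WET file.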

Different formats?

Is it possible to get this in ZIM file format, for use with https://kiwix.org/en/ ?
Kiwix is an offline internet project that enables the creation of ZIM files: archives that can be browsed offline safely, including in places that have no access to the internet, such as remote locations.

I have seen copies of parts of this archive on the Internet Archive. The problem is that it should be segmented by language and placed into such archives, so that it can be a useful resource for people who are not data scientists but simple teachers who need offline access to such a large data resource.
At present the files are for the rich man only, since you need a cloud just to be able to access them, despite their being shared on various platforms.
In ZIM format it would be available for everyone to access.
In the past the shard files were even corrupt after a painful download (when the internet crawl was much smaller).

Thanks, and please consider this if it is possible.
As a use case, these snapshots could then be browsed archive by archive, so the smaller the archives, the easier it is for low-tech people (i.e. each shard should be self-contained and not reliant on the other segments, and therefore selectable on its own).
