Git Product home page Git Product logo

dwtc-tools's Introduction

DWTC-Tools: A Java Library for working with the Dresden Web Table Corpus

This repository contains Java classes to help you get started using the Dresden Web Table Corpus, a collection of roughly 100M data tables extracted from a publicly available web crawl, the Common Crawl.

It contains code for:

  • parsing / deserializing the data to Java objects (package webreduce.data)
  • parallel (single machine) iteration over the corpus with examples (package webreduce.iterator). Since the corpus is only 30GB when compressed, processing it on one machine using these specialized iterators is usually faster than using Hadoop or similar software. Still, a sample Hadoop job is also provided, see below.
  • creating a Lucene index over the corpus, including some preprocessing (package webreduce.indexing)
  • accessing the full text of the originating page of each table directly from Common Crawls S3 when needed (package webreduce.fulltext)
  • the Hadoop deduplication job that was applied to the raw extract data, which can also serve as a template for other Hadoop jobs (package webreduce.hadoop)
  • various utilities, e.g., heuristics for typing columns (identifying number columns etc.)

Getting Started

First, visit the DWTC page to download the corpus (or any subset of it) using the instructions found there. There is no need to decompress the data, both the iterators provided here and Hadoop can work on the compressed data directly.

The easiest way to build the library is using Maven (or importing into an Maven-compatible IDE):

mvn install

This will also create a standalone JAR file containing all dependencies in the target directory. Then, to run, for example, the bundled example iterator that extracts the first line of each table:

java -cp target/dwtc-tools-1.0.0-jar-with-dependencies.jar webreduce.iterator.examples.AttributesExtractor ~/dwtc-corpus ~/attributes.csv

The flags used in the example are documented in the code.

Generally, to process the data, either subclass WebreduceIterator as in the provided examples, modify the provided Hadoop job, or work on uncompressed JSON documents using just the class Dataset which provides a fromJson() static method for parsing the lines of the corpus.

dwtc-tools's People

Contributors

julianeberius avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

dwtc-tools's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.