Avro Utils

A utility library for using Avro files as the input and output of Hadoop Map/Reduce jobs, or as the input of Hadoop Streaming.

Avro I/O in Map/Reduce Jobs

Avro Input

Use com.tomslabs.grid.avro.AvroFileInputFormat to use Avro files as the input of Map/Reduce jobs.

map() is called with a key of Avro's GenericRecord type and a NullWritable value.

Avro Output

Use com.tomslabs.grid.avro.AvroFileOuputFormat to use Avro files as the output of Map/Reduce jobs.

In the reduce() method, write to the context with a key of Avro's GenericRecord type and a NullWritable value. Note that the Avro schema for the output data MUST be specified when setting up the Job.

Example

src/test/java/com/tomslabs/grid/avro/AvroWordCount.java is an example showing how to use Avro files for both the input and the output of a Map/Reduce job.

This example can be run as a unit test from AvroWordCountTest.java.

Avro Input for Hadoop Streaming

To use Avro files as input for Hadoop Streaming, use the jar generated by the project and specify the correct input format:

$ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -libjars ./avro-utils-<VERSION>.jar,avro-1.5.1.jar,avro-mapred-1.5.1.jar  \
    -inputformat com.tomslabs.grid.avro.AvroTextFileInputFormat \
    -input <Avro file or dir> \
    -output <output dir> \
    -mapper <map command> \
    -reducer <reducer command>

For example, to run a simple count over the records of an Avro file, use something like:

$ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -libjars ./avro-utils-<VERSION>.jar,avro-1.5.1.jar,avro-mapred-1.5.1.jar  \
    -inputformat com.tomslabs.grid.avro.AvroTextFileInputFormat \
    -input /tmp/word-count.avro \
    -output /tmp/out \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Each line streamed from an Avro file to the mapper has the following format:

<JSON representation of an Avro record>\t

(i.e. each line ends with a trailing tab character)
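
As an illustration of the line format above, here is a minimal sketch of a Python streaming mapper that strips the trailing tab and decodes the JSON record. The function names and the "word" field are hypothetical; the real field names depend on the schema of your Avro file.

```python
import json
import sys


def parse_avro_line(line):
    """Turn one streamed line ('<JSON record>\\t') into a Python dict."""
    # Drop the line terminator first, then the single trailing tab.
    text = line.rstrip("\r\n")
    if text.endswith("\t"):
        text = text[:-1]
    return json.loads(text)


def run_mapper(lines, out):
    """Emit '<word>\\t1' for every record; 'word' is a hypothetical field."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines defensively
        record = parse_avro_line(line)
        out.write("%s\t1\n" % record["word"])


# In a real mapper script you would call:
#     run_mapper(sys.stdin, sys.stdout)
```

Saved as an executable script, this could be passed to -mapper in place of /bin/cat.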

Avro Input for Dumbo

To use Avro files as input for Dumbo, use the jar generated by the project and set the input format to com.tomslabs.grid.avro.AvroAsTextTypedBytesInputFormat:

$ dumbo start <PYTHON_SCRIPT> \
     -input /tmp/word-count.avro \
     -output /tmp/out \
     -libjar avro-1.5.1.jar \
     -libjars avro-mapred-1.5.1.jar  \
     -libjar avro-utils-<VERSION> \
     -inputformat com.tomslabs.grid.avro.AvroAsTextTypedBytesInputFormat \
     -hadoop <HADOOP_HOME> \
     -python <PYTHON_HOME> \
     -outputformat text

The Python script's mapper will get the Avro record as a JSON string in its value parameter (the key parameter is not used).
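
A minimal sketch of what such a Python script might look like, assuming each record has a string field named "word" (a hypothetical name chosen for illustration):

```python
import json


def mapper(key, value):
    # key is unused; value holds the JSON text of one Avro record.
    record = json.loads(value)
    # "word" is a hypothetical field name used for illustration.
    yield record["word"], 1


def reducer(key, values):
    # Sum the counts emitted by the mapper for each word.
    yield key, sum(values)


# In an actual Dumbo script, the module would end with:
#     import dumbo
#     dumbo.run(mapper, reducer)
```

The dumbo import is left in a comment so the functions can be tried without Dumbo installed.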

Avro Output for Dumbo

You can use Avro files as the output for Dumbo. The reducer is expected to emit, as its value, a string containing the JSON representation of an Avro record. To store the output in Avro (binary) files instead of text files, specify the following options when you start Dumbo:

-libjar avro-1.5.1.jar \
-libjars avro-mapred-1.5.1.jar  \
-libjar avro-utils-<VERSION> \
-outputformat com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat \
-hadoopconf avro.output.schema=$SCHEMA

where SCHEMA is a String containing the JSON representation of the Avro schema to use to create the Avro records.
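
As a rough sketch, a Dumbo reducer that emits JSON values, together with a matching schema string, could look like the following. The record name "WordCount", its fields, and the choice to keep the word as the emitted key are assumptions made for illustration, not something the library prescribes.

```python
import json

# Hypothetical schema matching the records the reducer emits.
SCHEMA = json.dumps({
    "type": "record",
    "name": "WordCount",
    "fields": [
        {"name": "word", "type": "string"},
        {"name": "count", "type": "long"},
    ],
})


def reducer(key, values):
    # Emit the JSON text of one record as the value.
    record = {"word": key, "count": sum(values)}
    yield key, json.dumps(record)
```

The SCHEMA string is what would be passed as avro.output.schema; when it goes through a shell, it needs to be quoted so that the JSON survives intact.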

Avro Output for Hadoop Streaming

You can use Avro files as the output for Hadoop Streaming. The reducer expects to receive a Text key containing the JSON representation of an Avro record. To store the output in Avro (binary) files instead of text files, specify the following options:

-D avro.output.schema=$SCHEMA \
-libjars ./avro-utils-<VERSION>.jar,avro-1.5.1.jar,avro-mapred-1.5.1.jar  \
-reducer com.tomslabs.grid.avro.JSONTextToAvroRecordReducer \
-outputformat org.apache.avro.mapred.AvroOutputFormat 

where SCHEMA is a String containing the JSON representation of the Avro schema to use to create the Avro records.
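
Since the reducer reads the JSON record from the key, the mapper has to write each record as a single line. Hadoop Streaming splits mapper output at the first tab, and a line without a tab is treated entirely as the key. The sketch below shows one way a Python mapper might produce such lines; the function names and the record fields are hypothetical.

```python
import json


def format_record(record):
    """Serialize a record to one line of JSON with no raw tab or newline."""
    # json.dumps escapes tabs and newlines inside strings, so the whole
    # line reaches the reducer as the key (streaming splits on the first tab).
    return json.dumps(record, separators=(",", ":"))


def emit(records, out):
    # Write one JSON record per line, as a streaming mapper would.
    for record in records:
        out.write(format_record(record) + "\n")
```

The emitted records must match the schema given in avro.output.schema.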

Contributors

jmesnil, laurentvaills
