PNDA Fork

This fork adds the following:

  • A new converter that converts messages read from Kafka to the PNDA Avro schema (gobblin.pnda.PNDAConverter)
  • A new writer that writes data using the Kite SDK library (gobblin.pnda.PNDAKiteWriterBuilder)

The converter can only work with a Kafka-compatible source. The workflow is:

KafkaSimpleSource ► PNDAConverter ► [SchemaRowCheckPolicy] ► PNDAKiteWriter

Build

Gobblin needs to be compiled against the CDH Hadoop dependencies with the matching version:

$ ./gradlew clean build -PhadoopVersion="2.6.0-cdh5.9.0" -PexcludeHadoopDeps -PexcludeHiveDeps -x gobblin-core:test

Configuration

Kite datasets

Two Kite datasets need to be created.

The first one is the main dataset where all the valid PNDA messages are stored. The schema is:

{"namespace": "pnda.entity",
 "type": "record",
 "name": "event",
 "fields": [
     {"name": "timestamp", "type": "long"},
     {"name": "src",       "type": "string"},
     {"name": "host_ip",   "type": "string"},
     {"name": "rawdata",   "type": "bytes"}
 ]
}
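
For illustration, a record matching this schema can be built with plain Avro. The sketch below is hypothetical (the class name BuildEvent is not part of this project): it assumes the schema above is saved locally as pnda.avsc, and borrows its field values from the example message shown later in this README.

import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class BuildEvent {
    public static void main(String[] args) throws IOException {
        // Parse the PNDA event schema (assumed to be saved as pnda.avsc).
        Schema schema = new Schema.Parser().parse(new File("pnda.avsc"));

        // "rawdata" is an opaque byte payload, so it is wrapped in a ByteBuffer.
        GenericRecord event = new GenericRecordBuilder(schema)
            .set("timestamp", 1462886335L)
            .set("src", "netflow")
            .set("host_ip", "192.168.0.15")
            .set("rawdata", ByteBuffer.wrap(
                "XXXXXxxxxXXXX".getBytes(StandardCharsets.UTF_8)))
            .build();

        System.out.println(event);
    }
}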

The second one is a quarantine dataset where all messages in the wrong format are stored. The schema is:

{"namespace": "pnda.entity",
 "type"     : "record",
 "name"     : "DeserializationError",
 "fields"   : [
   {"name": "topic",     "type": "string"},
   {"name": "timestamp", "type": "long"},
   {"name": "reason",    "type": ["string", "null"]},
   {"name": "payload",   "type": "bytes"}
 ]
}

See the Kite CLI reference for information on creating a Kite dataset.

For example, to create the first dataset with the following partition strategy (stored in a partition.json file):

[
    {"type": "identity", "source": "src", "name": "source"},
    {"type": "year",     "source": "timestamp"},
    {"type": "month",    "source": "timestamp"},
    {"type": "day",      "source": "timestamp"},
    {"type": "hour",     "source": "timestamp"}
]

Then execute the following command:

$ kite-dataset create --schema pnda.avsc dataset:hdfs://192.168.0.227:8020/tmp/PNDA_datasets/master --partition-by partition.json
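
The same dataset can also be created programmatically. The following is a minimal, hypothetical sketch (the class name CreateMasterDataset is illustrative), assuming the Kite SDK is on the classpath and that pnda.avsc and partition.json are the files shown above:

import java.io.File;
import java.io.IOException;

import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Datasets;

public class CreateMasterDataset {
    public static void main(String[] args) throws IOException {
        // Combine the Avro schema and the partition strategy into a descriptor.
        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
            .schema(new File("pnda.avsc"))
            .partitionStrategy(new File("partition.json"))
            .build();

        // Equivalent to the kite-dataset create command above.
        Datasets.create(
            "dataset:hdfs://192.168.0.227:8020/tmp/PNDA_datasets/master",
            descriptor);
    }
}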

Property file

The source.schema property must be set in order to deserialize the Avro data coming from Kafka:

source.schema={"namespace": "pnda.entity", \
               "type": "record",                            \
               "name": "event",                             \
               "fields": [                                  \
                   {"name": "timestamp", "type": "long"},   \
                   {"name": "src",       "type": "string"}, \
                   {"name": "host_ip",   "type": "string"}, \
                   {"name": "rawdata",   "type": "bytes"}   \
               ]                                            \
              }

It can also be set to the filename of an Avro schema file.

The converter.classes property must include the gobblin.pnda.PNDAConverter class. The PNDA.quarantine.dataset.uri property is optional; if it is not set, malformed messages are discarded instead of being quarantined.

converter.classes=gobblin.pnda.PNDAConverter
PNDA.quarantine.dataset.uri=dataset:hdfs://192.168.0.227:8020/tmp/PNDA_datasets/quarantine

The writer.builder.class property must be set to gobblin.pnda.PNDAKiteWriterBuilder to enable the Kite writer. The kite.writer.dataset.uri property is mandatory.

writer.builder.class=gobblin.pnda.PNDAKiteWriterBuilder
kite.writer.dataset.uri=dataset:hdfs://192.168.0.227:8020/tmp/PNDA_datasets/master
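
Once the job has run, ingestion can be checked by reading the dataset back through the Kite API. A minimal sketch (the class name ReadMasterDataset is illustrative), assuming the Kite SDK is on the classpath and the HDFS cluster is reachable:

import org.apache.avro.generic.GenericRecord;

import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.Datasets;

public class ReadMasterDataset {
    public static void main(String[] args) {
        // Load the dataset that the PNDAKiteWriter publishes to.
        Dataset<GenericRecord> events = Datasets.load(
            "dataset:hdfs://192.168.0.227:8020/tmp/PNDA_datasets/master",
            GenericRecord.class);

        // Iterate over the records to confirm data is landing.
        try (DatasetReader<GenericRecord> reader = events.newReader()) {
            for (GenericRecord event : reader) {
                System.out.println(event.get("src") + " " + event.get("timestamp"));
            }
        }
    }
}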

PNDA Dataset creation

In the context of the PNDA platform, Gobblin normally runs every 30 minutes. Datasets are created dynamically on HDFS by discovering the available topics in Kafka.

Note that datasets are not created per Kafka topic, but according to the src field in the PNDA Avro messages.

Datasets are stored on HDFS in the /user/pnda/PNDA_datasets/datasets directory in the Kite format, and are partitioned by source and timestamp.

The current partition strategy will write the data read from Kafka to the following hierarchy:

source=SSS/year=YYYY/month=MM/day=DD/hour=HH

In the path above, SSS represents the value of the src field in the PNDA Avro schema.

For example, the following message read from Kafka:

{
  "timestamp": 1462886335,
  "src": "netflow",
  "host_ip": "192.168.0.15",
  "rawdata": "XXXXXxxxxXXXX"
}

will be written to:

/user/pnda/PNDA_datasets/datasets/source=netflow/year=2016/month=05/day=10/hour=13/random-uuid.avro
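
The year/month/day/hour components of the path are derived from the timestamp field. A quick check with java.time, assuming the timestamp is interpreted as seconds since the epoch in UTC (as the example path implies; the class name PartitionPath is illustrative):

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class PartitionPath {
    public static void main(String[] args) {
        // The timestamp from the example message, read as epoch seconds (UTC).
        ZonedDateTime t = Instant.ofEpochSecond(1462886335L).atZone(ZoneOffset.UTC);

        System.out.printf("source=%s/year=%04d/month=%02d/day=%02d/hour=%02d%n",
            "netflow", t.getYear(), t.getMonthValue(), t.getDayOfMonth(), t.getHour());
        // Prints: source=netflow/year=2016/month=05/day=10/hour=13
    }
}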

Please refer to the Gobblin documentation for more information about Gobblin.

Gobblin

Quick Links

  • Documentation: Check out the Gobblin documentation for a complete description of Gobblin's features
  • Powered By: Check out the list of companies known to use Gobblin
  • Architecture: The Gobblin Architecture page has a full explanation of Gobblin's architecture
  • Getting Started with Gobblin: Refer to the Getting Started Guide on how to get started with Gobblin
  • Building Gobblin: Refer to the page Building Gobblin for directions on how to build Gobblin
  • Javadocs: The full JavaDocs for each released version of Gobblin can be found here
