kafka2s3 overview

This project provides a Spark Structured Streaming job that sinks data from Kafka to S3 (or HDFS).

  • The data are read from a Kafka topic with Spark Structured Streaming.
  • The supported format for input messages is JSON.
  • The data are read in a type-safe way: the messages are cast to the custom case class named in the config file.
  • The write to S3 is triggered at a configurable interval defined in the config file.
  • The data are partitioned by custom columns defined in the config file. For the timestamp column, a custom pattern can be specified for the partitioning.

Build the jar

Create the config file from the example

cp conf/example-default.conf connector.conf

Edit the config file

vim connector.conf

Set the following options:

  • kafka-url: list of Kafka brokers.
  • kafka-topic: the name of the input Kafka topic.
  • case-class-name: the name of your custom case class, which must match the JSON schema of the messages you want to read.
  • batch-interval: the buffering period before the write to S3 is triggered.
  • output-path: the output path. This can be an HDFS folder (fs:///<yourfolder>) or an S3 bucket (s3a://<your bucket>). Note that for the S3 sink you must use the s3a protocol.
  • output-format: the format of the output files. You can use any format supported by Spark (for example parquet, csv, or json).
  • partition-columns: the columns used to partition the output files. The recommended partition is the timestamp (if your data contain one). Examples of the supported time patterns:
    • YYYYMMddHH: hourly
    • YYYYMMdd: daily
    • YYYYMM: monthly
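
To get a feel for the partition values these patterns produce, the sketch below formats one sample timestamp with GNU date. It is an illustration only, not how kafka2s3 computes partitions: the helper name partition_value is hypothetical, and the mapping of YYYY, MM, dd and HH to the date format codes %Y, %m, %d and %H is an assumption that reads the patterns as calendar year, month, day and hour.

```shell
# Illustration only: GNU date format codes stand in for the time patterns above.
# Assumed mapping: YYYY -> %Y, MM -> %m, dd -> %d, HH -> %H.
partition_value() {
  # $1: a timestamp understood by GNU date -d, interpreted as UTC
  # $2: one of the patterns listed above
  case "$2" in
    YYYYMMddHH) fmt='+%Y%m%d%H' ;;   # hourly
    YYYYMMdd)   fmt='+%Y%m%d' ;;     # daily
    YYYYMM)     fmt='+%Y%m' ;;       # monthly
    *) echo "unsupported pattern: $2" >&2; return 1 ;;
  esac
  date -u -d "$1" "$fmt"
}

partition_value '2020-03-15 14:30:00' YYYYMMddHH   # prints 2020031514
partition_value '2020-03-15 14:30:00' YYYYMMdd     # prints 20200315
partition_value '2020-03-15 14:30:00' YYYYMM       # prints 202003
```

A coarser pattern means fewer, larger partitions: every message from March 2020 lands in the same monthly partition, while the hourly pattern creates a new partition each hour.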

Run

Set the AWS access key and secret access key as environment variables. The argument passed to spark-submit is the path to your configuration file.

  • export AWS_ACCESS_KEY_ID=<your AWS access key>
  • export AWS_SECRET_ACCESS_KEY=<your AWS secret access key>

Run the job with spark-submit (Spark 2.4.4 compiled with Hadoop 2.9.2):

spark-submit --class com.connector.kafka2s3.ConnectorApp target/scala-2.11/kafka2s3-assembly-1.0.0.jar connector.conf
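
Because the job depends on both AWS variables and on the config file, it can help to check them before submitting. The following is a minimal sketch of such a check. The preflight function is a hypothetical helper and is not part of kafka2s3. It only verifies that the two variables are non-empty and that the config file is readable, and does not validate the credentials or the config contents.

```shell
# Hypothetical pre-flight helper; not part of kafka2s3 itself.
# Returns 0 when both AWS variables are set and the config file is readable,
# and 1 otherwise, reporting every problem it finds on stderr.
preflight() {
  preflight_conf="$1"
  preflight_status=0
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ]; then
    echo "missing environment variable: AWS_ACCESS_KEY_ID" >&2
    preflight_status=1
  fi
  if [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "missing environment variable: AWS_SECRET_ACCESS_KEY" >&2
    preflight_status=1
  fi
  if [ ! -r "$preflight_conf" ]; then
    echo "cannot read config file: $preflight_conf" >&2
    preflight_status=1
  fi
  return "$preflight_status"
}
```

With the function defined in the current shell, the submit command can be chained after it, for example `preflight connector.conf && spark-submit --class com.connector.kafka2s3.ConnectorApp target/scala-2.11/kafka2s3-assembly-1.0.0.jar connector.conf`, so that spark-submit only starts when the checks pass.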

Contributors

mbayoudh
