
blueshift-1's Introduction


Overview

Flipkart Blueshift is a reliable data migration tool for the Hadoop ecosystem. It comes packed with features that are missing from traditional migration tools such as DistCp:

  • Migration across clusters (secured and unsecured).
  • On-the-fly compression: source data is compressed over the wire. (This helps only when the job is run from the source cluster.)
  • In-place inflate/deflate: after a successful migration, data on the destination cluster can be seamlessly inflated to its original state.
  • Bulk migration in batches of over 10 million files.
  • Multiple state manager options, either HDFS based or DB based. Choose the one that scales for you.
  • Optimized task scheduling with a largest-file-first strategy.
  • Multiplexing of large filesets onto a manageable worker count.
  • Configuration driven: the tool can be tuned at a fine level through configuration.
  • Support for different protocols: HDFS, WebHDFS, HFTP, FTP, custom, etc.
  • Fault tolerant and reliable; restores from the point of failure.
  • DB-based state management, which makes report generation and state handling easy.
  • Transfers can be initiated from the source, the destination, or any third cluster.
  • Capable of using different protocols for source and destination.
  • MD5-based checksums to ensure data is not corrupted during transfer.
  • Support for custom transport codecs.
  • Time-based file filtering, which allows incremental copies of files from one cluster to another.
  • File-size-based filtering, which serves multiple purposes: ignoring zero-sized files, skipping compression on small files, putting a cap on large files, etc.
  • Option to ignore exceptions, continue processing, and report state to the status file and the state DB.
  • Support for various other filters such as include/exclude file lists and include/exclude file patterns.
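
To illustrate how a largest-file-first strategy can multiplex a large fileset onto a fixed number of workers, here is a minimal Python sketch. It is not Blueshift's actual scheduler: the function name `assign_largest_first`, the input shape (a dict of path to size in bytes), and the greedy "give the next file to the least-loaded worker" rule are all assumptions made for illustration.

```python
import heapq


def assign_largest_first(file_sizes, worker_count):
    """Spread files over workers, handling the largest files first.

    file_sizes: dict mapping path -> size in bytes (hypothetical input shape).
    Returns a list with one list of paths per worker.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")

    # Min-heap of (bytes assigned so far, worker index).
    heap = [(0, w) for w in range(worker_count)]
    buckets = [[] for _ in range(worker_count)]

    # Visit files from largest to smallest.
    ordered = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for path, size in ordered:
        load, worker = heapq.heappop(heap)  # least-loaded worker so far
        buckets[worker].append(path)
        heapq.heappush(heap, (load + size, worker))
    return buckets


if __name__ == "__main__":
    sizes = {"/data/a": 100, "/data/b": 60, "/data/c": 50, "/data/d": 10}
    for worker, paths in enumerate(assign_largest_first(sizes, 2)):
        print(worker, paths)
```

Placing big files first keeps one very large file from landing on an already busy worker late in the run, which is the usual motivation for this ordering.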

Upcoming Features

  • Support for other protocols and systems such as Kafka, Cassandra, and databases.
  • On-the-fly compression irrespective of where the job is executed.
  • Support for multiple sinks, to copy data simultaneously to Kafka and HDFS from any source.

Prerequisites

  • Hadoop cluster running version 1.0 or higher
  • Statestore: a relational database such as MySQL or Derby if the Statestore is DB based. No database is required for an HDFS Statestore.

Blueshift Configuration

Blueshift is configured through a JSON configuration file. There are four main sections that need to be configured:

  • Batch Config - This defines the batch properties.
  • Source Config - This defines the Source configuration with connection and protocol details to access the source.
  • Sink Config - This defines the destination Sink configuration with connection properties to access the destination.
  • Statestore DB Config - This defines the database details that will be used as a statestore.

Refer to the configuration page for details on the different config options available.

https://github.com/flipkart-incubator/blueshift/wiki/Configuration

Use-Cases

The following are some of the use cases the tool addresses.

  • Secure to non-secure clusters - Transfers from a secure cluster to a non-secure cluster are unreliable with DistCp. With Blueshift this works seamlessly.
  • Cross-DC transfer of large data with limited bandwidth - Typically, transferring a large dataset across DCs requires compressing the data at the source, running DistCp, and decompressing at the destination. This requires additional effort and space on both the source and destination clusters. Blueshift supports on-the-fly compression at the source, so data is compressed in chunks while it is transferred. At the destination, it is possible to decompress and delete the source files as each file is processed, so additional space usage at the destination is minimized.
  • State management - Transfers of large amounts of data are often interrupted by source or destination cluster issues or by a flaky network. Because transfer state is managed in a MySQL DB, an interrupted transfer can be resumed.
  • Timestamp-based snapshots - The directories being transferred are usually in active use by teams. With Blueshift, we can specify a start time and an end time so that only files modified within that window are transferred. This helps ensure that a consistent copy of the data is migrated, similar to HBase Export/CopyTable.
  • Throttle transfer - We can set the number of parallel transfers at the start so that only that many mappers are launched.
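
The cross-DC use case above relies on compressing data in chunks while it moves and on checksums to detect corruption. The Python sketch below shows the general idea only. It assumes gzip as the codec and computes the MD5 digest over the uncompressed bytes on both sides; the README does not say which codec Blueshift uses or where it computes its checksums, and the names `compress_stream`, `inflate_stream`, and `CHUNK_SIZE` are hypothetical.

```python
import gzip
import hashlib
import io

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, not a Blueshift setting


def compress_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst as gzip, one chunk at a time.

    Returns the MD5 hex digest of the uncompressed bytes that were read.
    """
    digest = hashlib.md5()
    with gzip.GzipFile(fileobj=dst, mode="wb") as gz:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            gz.write(chunk)
    return digest.hexdigest()


def inflate_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Decompress gzip data from src into dst, one chunk at a time.

    Returns the MD5 hex digest of the restored bytes.
    """
    digest = hashlib.md5()
    with gzip.GzipFile(fileobj=src, mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    original = b"blueshift " * 10000
    wire = io.BytesIO()  # stands in for the network link
    sent = compress_stream(io.BytesIO(original), wire)
    wire.seek(0)
    restored = io.BytesIO()
    received = inflate_stream(wire, restored)
    print("checksums match:", sent == received)
```

Because both sides work chunk by chunk, neither end needs room for a full compressed copy plus a full uncompressed copy of the dataset, which is the space saving the use case describes.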

Download

The tool can be downloaded from:

http://github.com/flipkart-incubator/blueshift

Build and Run

Step 1: Clone the project to your local machine and check out the stable version:

    git clone https://github.com/flipkart-incubator/blueshift.git

    cd blueshift

    git checkout Stable-Version-1.0

Step 2: Build the project. From the cloned project directory, execute:

    mvn clean install -DskipTests

Step 3: Copy the newly created jar to your gateways and run the following command:

    hadoop jar blueshift.jar -Pdriver.json

Contributors

Mukund Thakur

Vishal Rajan

Prathyush Babu

Rahul Agarwal

Kasinath

Chetna Chaudhari

Chackaravarthy E

Sushil Kumar S

Guruprasad GV

Dhiraj Kumar

Raj Velu - Architect

Kurian Cheeramelil

Contact

Blueshift is owned and maintained by the Flipkart Data Platform Infra Team.

FAQ

Refer to the FAQ page:

https://github.com/flipkart-incubator/blueshift/wiki/FAQ
