
Cloudera Framework

Provides an example organisation-wide Cloudera (i.e. Hadoop ecosystem) project framework, defining corporate standards for runtime components, datasets, libraries, testing, deployment and project structure, to facilitate operating within a continuous-deployment pipeline. The example includes client/runtime/third-party bills-of-materials, utility and driver libraries, and a unit-test harness with examples, providing full coverage of CDH:

  • MR2
  • Kudu
  • HDFS
  • Flume
  • Kafka
  • Impala
  • ZooKeeper
  • Spark & Spark2
  • Hive/MR & Hive/Spark
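
As a sketch of how a client module might consume a bill-of-materials, a dependent pom.xml could import it via Maven's standard `dependencyManagement` mechanism. The BOM artifactId below is an assumption for illustration; the version is taken from the archetype example later in this document:

```xml
<!-- Illustrative only: the BOM artifactId is assumed, not taken from the project. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.cloudera.framework</groupId>
      <artifactId>cloudera-framework-bom</artifactId>
      <version>2.0.5-cdh5.15.1</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Importing a BOM this way lets client modules omit versions on the managed dependencies, so the framework pins component versions in one place.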

The framework can target managed services provisioned by Cloudera Altus, automated cluster deployments via Cloudera Director, or manually managed clusters via Cloudera Manager.

Examples are included that codify the standards, providing end-to-end data streaming, ingest, modeling and testing pipelines, with synthetic datasets to exercise the codebase.

Finally, a set of archetypes is included to provide bare-bones starter client modules.

Requirements

To compile, build and package from source, this project requires:

  • Java 8
  • Maven 3
  • Scala 2.11
  • Python 2.7
  • Anaconda 4
  • Cloudera Altus CLI 2.2
  • Cloudera Director Client 2.6
  • Python Cloudera Manager API 5

The bootstrap.sh script tests for, configures, and installs (where possible) the required toolchain; it should be sourced like so:

. ./bootstrap.sh environment

To run the unit and integration tests, binaries and metadata are provided for all CDH components on the following platforms:

  • CentOS/RHEL 6.x
  • CentOS/RHEL 7.x
  • Ubuntu LTS 14.04.x
  • macOS 10.13.x (Impala unit tests are no-op'd)

Note that in addition to Maven dependencies, Cloudera parcels are used to manage platform-dependent binaries by way of the cloudera-parcel-plugin. Impala parcels are not available for non-Linux platforms.
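
The general shape of wiring the parcel plugin into a module's pom.xml is sketched below. The coordinates and every configuration element shown are assumptions for illustration only; consult the project's parent POM for the real goals and parameters:

```xml
<!-- Illustrative only: plugin coordinates and configuration elements are assumed. -->
<plugin>
  <groupId>com.cloudera.plugin</groupId>
  <artifactId>cloudera-parcel-plugin</artifactId>
  <version>${cloudera-framework.version}</version>
  <configuration>
    <!-- A parcel is addressed much like a Maven artifact, plus an OS classifier. -->
    <parcel>
      <groupId>com.cloudera.parcel</groupId>
      <artifactId>CDH</artifactId>
      <version>5.15.1</version>
      <classifier>el7</classifier>
    </parcel>
  </configuration>
</plugin>
```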

Limitations

As above, this code does not work out of the box on Windows hosts; only Linux and macOS are supported. If developing on Windows, it is recommended to run a Linux container and develop from within it.
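
As one possible approach (assuming Docker is installed; the image tag and mount paths below are illustrative, not prescribed by the project), a Windows host could run the build inside a CentOS 7 container:

```shell
# Hypothetical helper: start an interactive CentOS 7 container with the
# project checkout mounted at /src, so builds run on a supported Linux OS.
run_dev_container() {
  docker run -it --rm \
    -v "$PWD":/src \
    -w /src \
    centos:7 \
    bash
}
```

From inside the container, the toolchain can then be bootstrapped and the project built exactly as on a native Linux host.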

Install

This project can be compiled, packaged and installed to a local repository, skipping tests, as follows:

mvn install -PPKG

To only compile the project:

mvn install -PCMP

To run the tests for both Scala 2.10 (default) and 2.11 (localhost must be resolvable to run the tests):

mvn test
mvn test -pl cloudera-framework-testing -PSCALA_2.11

The bootstrap script provides convenience mechanisms to build and release the project, like so:

./bootstrap.sh build release

Usage

The cloudera-framework includes a set of examples:

  • Example 1 (Java, HSQL, Flume, MR, Hive/MR, Impala, HDFS)
  • Example 2 (Java, HSQL, Kafka, Hive/Spark, Spark, Impala, S3)
  • Example 3 (Scala, CDSW, Spark2, MLlib, PMML, HDFS)
  • Example 4 (Envelope, Kafka, Spark2 Streaming, Kudu, HDFS)
  • Example 5 (Python, NLTK, PySpark, Spark2, HDFS)

In addition, archetypes are available in various profiles, allowing a new cloudera-framework client module to be bootstrapped.

For example, a project could be created from the Spark2 profile baseline, including a very simple example targeting a Cloudera Altus runtime, as follows:

mvn org.apache.maven.plugins:maven-archetype-plugin:2.4:generate -B \
  -DarchetypeRepository=http://52.63.86.162/artifactory/cloudera-framework-releases \
  -DarchetypeGroupId=com.cloudera.framework.archetype \
  -DarchetypeArtifactId=cloudera-framework-archetype-spark2 \
  -DarchetypeVersion=2.0.5-cdh5.15.1 \
  -DgroupId=com.myorg.mytest \
  -DartifactId=mytest \
  -Dpackage=com.myorg.mytest \
  -DaltusEnv=my_altus_environment \
  -DaltusCluster=my_cluster \
  -DaltusS3Bucket=my_s3_bucket

Note that in order to run against the Cloudera Altus Amazon AWS runtime as above, both "AWS_ACCESS_KEY" and "AWS_SECRET_KEY" must be set in the environment, and each Maven archetype parameter with the "altus" prefix must be given an appropriate value. The "altusS3Bucket" parameter should specify a valid S3 bucket to which the user has read/write access, within the "altusEnv" region, with data stored under the "/data/workload/input" key with a schema like the pristine test dataset.
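
For instance, the credentials could be exported before invoking Maven. The values below are placeholders, not real keys:

```shell
# Placeholder values for illustration only; substitute real IAM credentials
# and avoid committing them to version control.
export AWS_ACCESS_KEY="my-access-key"
export AWS_SECRET_KEY="my-secret-key"
```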
