dataproc-java-submitter

CircleCI Maven Central License

A small Java library for submitting Hadoop jobs to Google Cloud Dataproc from a JVM process.

Why?

In many real-world uses of Hadoop, jobs are parameterized to some degree. Parameters can be anything from job configuration to input paths. These arguments are commonly resolved in some workflow tool that eventually puts them on a command line passed to the Hadoop job. On the job side, the arguments then have to be parsed again using one of several more or less standard tools.

However, if the arguments are resolved in a JVM, dropping down to a shell and invoking a command line is complicated and roundabout. It is also very limiting in terms of what can be passed to the job: it is not uncommon to store more structured data in some serialized format, stage the files, and add custom logic in the job to deserialize them.

This library aims to provide a more seamless bridge between a local JVM instance and the Hadoop application entry point.

Usage

Maven dependency

<dependency>
  <groupId>com.spotify</groupId>
  <artifactId>dataproc-java-submitter</artifactId>
  <version><!-- see version in maven badge above --></version>
</dependency>

Example usage

String project = "gcp-project-id";
String cluster = "dataproc-cluster-id";

DataprocHadoopRunner hadoopRunner = DataprocHadoopRunner.builder(project, cluster).build();
DataprocLambdaRunner lambdaRunner = DataprocLambdaRunner.forDataproc(hadoopRunner);

// Use any structured type that is Java Serializable
MyStructuredJobArguments arguments = resolveArgumentsInLocalJvm();

lambdaRunner.runOnCluster(() -> {

  // This lambda, including its closure, will run on the Dataproc cluster
  System.out.println("Running on the cluster, with " + arguments.inputPaths());

  return 42; // rfc: is it worth supporting a return value from the job?
});

The DataprocLambdaRunner will take care of configuring the Dataproc job so that it can run your lambda function. It will scan your local classpath and ensure that the loaded jars are staged and configured for the Dataproc job. It will also take care of serializing, staging and deserializing the lambda closure that is to be invoked on the cluster.

Note that anything referenced from the lambda has to implement java.io.Serializable.
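
As a minimal sketch of what such a type could look like (MyStructuredJobArguments and resolveArgumentsInLocalJvm() in the example above are placeholders for your own code, and the GCS paths below are hypothetical):

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

// A structured argument type that the lambda closure can capture. Because it
// implements Serializable and only holds serializable fields, it can be
// shipped to the cluster along with the closure.
public class MyStructuredJobArguments implements Serializable {

  private static final long serialVersionUID = 1L;

  private final List<String> inputPaths;

  public MyStructuredJobArguments(List<String> inputPaths) {
    this.inputPaths = inputPaths;
  }

  public List<String> inputPaths() {
    return inputPaths;
  }

  // Stand-in for whatever argument resolution happens in the local JVM,
  // e.g. querying a workflow tool or a metadata service.
  public static MyStructuredJobArguments resolveArgumentsInLocalJvm() {
    return new MyStructuredJobArguments(
        Arrays.asList("gs://my-bucket/input/part-1", "gs://my-bucket/input/part-2"));
  }
}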

Low level usage

This library can also be used to configure the Dataproc job directly.

String project = "gcp-project-id";
String cluster = "dataproc-cluster-id";

DataprocHadoopRunner hadoopRunner = DataprocHadoopRunner.builder(project, cluster).build();

Job job = Job.builder()
    .setMainClass(...)
    .setArgs(...)
    .setProperties(...)
    .setShippedJars(...)
    .setShippedFiles(...)
    .createJob();

hadoopRunner.submit(job);
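
For illustration only, a filled-in job configuration might look like the snippet below (using java.util.Arrays and java.util.Collections). The values are hypothetical, and the parameter types assumed for each setter (a class name string, a list of arguments, a map of properties, and GCS paths for shipped jars and files) should be checked against the library's Javadoc.

// Hypothetical values; setter parameter types are assumptions, not
// confirmed against the library's API.
Job job = Job.builder()
    .setMainClass("com.example.MyHadoopJob")
    .setArgs(Arrays.asList("--input", "gs://my-bucket/input"))
    .setProperties(Collections.singletonMap("mapreduce.job.reduces", "4"))
    .setShippedJars(Arrays.asList("gs://my-bucket/libs/extra-udfs.jar"))
    .setShippedFiles(Arrays.asList("gs://my-bucket/config/job.properties"))
    .createJob();

hadoopRunner.submit(job);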
