martinsaenger / schedoscope

This project forked from ottogroup/schedoscope

Home Page: http://schedoscope.org

License: Apache License 2.0

Scala 88.99% Java 10.87% PigLatin 0.14%

Schedoscope

Introduction

Schedoscope is a scheduling framework for painfree agile development, testing, (re)loading, and monitoring of your datahub, lake, or whatever you choose to call your Hadoop data warehouse these days.

With Schedoscope,

  • you never have to create DDL and schema migration scripts;
  • you do not have to manually determine which data must be deleted and recomputed in the face of retroactive changes to logic or data structures;
  • you specify Hive table structures (called "views"), partitioning schemes, storage formats, dependent views, as well as transformation logic in a concise Scala DSL;
  • you have a wide range of options for expressing data transformations - from file operations and MapReduce jobs to Pig scripts, Hive queries, and Oozie workflows;
  • you benefit from Scala's static type system and your IDE's code completion, so you make fewer typos that would otherwise hit you late, at deployment or runtime;
  • you can easily write unit tests for your transformation logic and run them quickly right out of your IDE;
  • you schedule jobs by expressing the views you need - Schedoscope takes care that all required dependencies - and only those - are computed as well;
  • you achieve a higher utilization of your YARN cluster's resources because job launchers are not themselves YARN applications that consume cluster capacity.
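
The idea of scheduling by naming the views you need can be illustrated with a minimal, self-contained sketch. It is not Schedoscope's API or its actual scheduling algorithm: ViewNode, DependencySketch, materializationOrder, and all view names are hypothetical and exist only for illustration. The sketch assumes an acyclic dependency graph and walks it depth-first from the requested view, so that only that view and its transitive dependencies are selected, dependencies first.

```scala
// Hypothetical model, not Schedoscope's API: a view is a name plus the views it reads from.
final case class ViewNode(name: String, dependsOn: List[String])

object DependencySketch {

  /** Returns the requested view and its transitive dependencies, dependencies first.
    * Assumes the graph is acyclic and that every referenced view is defined;
    * an unknown name makes the map lookup throw.
    */
  def materializationOrder(target: String, views: Map[String, ViewNode]): List[String] = {
    def visit(name: String, done: List[String]): List[String] =
      if (done.contains(name)) done // already selected via another path
      else {
        // Select all dependencies first, threading the accumulated order through.
        val afterDeps = views(name).dependsOn.foldLeft(done)((acc, dep) => visit(dep, acc))
        afterDeps :+ name
      }
    visit(target, Nil)
  }

  // Illustrative view graph; all names are made up.
  val views: Map[String, ViewNode] = List(
    ViewNode("raw_events", Nil),
    ViewNode("sessions", List("raw_events")),
    ViewNode("orders", List("raw_events")),
    ViewNode("daily_report", List("sessions", "orders")),
    ViewNode("unrelated_export", List("raw_events"))
  ).map(v => v.name -> v).toMap
}
```

Requesting daily_report in this sketch selects raw_events, sessions, orders, and daily_report, in that order. unrelated_export is left out because nothing on the path to the requested view depends on it.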

Getting Started

Get a glance at

Follow the Open Street Map tutorial to install, compile, and run Schedoscope in a standard Hadoop distribution image within minutes:

Take a look at the View DSL Primer to get more information about the capabilities of the Schedoscope DSL:

More documentation can be found here:

When is Schedoscope not for you?

Schedoscope is based on the following assumptions:

  • data are largely relational and meaningfully representable as Hive tables;
  • there is enough cluster time and capacity to actually allow for retroactive recomputation of data;
  • it is acceptable to compile table structures, dependencies, and transformation logic into what is effectively a project-specific scheduler.

Should any of those assumptions not hold in your context, you should probably look for a different scheduler.

Origins

Schedoscope was conceived at the Business Intelligence department of Otto Group.

Contributions

The following people have contributed to the various parts of Schedoscope so far:

Utz Westermann (maintainer), Hans-Peter Zorn, Dominik Benz, Annika Leveringhaus

We would love to get contributions from you as well. We haven't got a formalized submission process yet. If you have an idea for a contribution or even coded one already, get in touch with Utz or just send us your pull request. We will work it out from there.

Please help make Schedoscope better!

News

01/22/2016 - Release 0.3.5

We have released Version 0.3.5 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release migrates Schedoscope's Hadoop dependencies to CDH-5.5.1. Furthermore, the test framework has been ported to Hive 1.1.0. Finally, Schedoscope's resilience against Metastore failures has been improved: it can now reconnect and resume work in more of the error cases in which the Metastore has become unavailable.

11/21/2015 - Release 0.3.4

We have released Version 0.3.4 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release fixes a bug in Schedoscope that caused ViewActors not to be instantiated correctly for newly appearing dependencies, for example after date changes. Moreover, the checksum versioning code has been cleaned up. Note that checksumming is not backwards compatible; you might want to execute your next materializations with the -m RESET_TRANSFORMATION_CHECKSUMS option.

11/13/2015 - Release 0.3.3

We have released Version 0.3.3 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release brings some order to the logging framework mess inherited from the various libraries used. It does so by routing Java util logging and Apache commons logging through SLF4J, and SLF4J to logback. With log4j muted and an appropriate logback-test.xml in place, test output is now a lot less chatty.

11/10/2015 - Release 0.3.2

We have released Version 0.3.2 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release fixes a nasty resource leak in the Touch FileSystemTransformation.

11/09/2015 - Release 0.3.1

We have released Version 0.3.1 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Fields can now be given comments as well: val id = fieldOf[String]("An ID.")

11/06/2015 - Release 0.3.0

We have released Version 0.3.0 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This is a big release, with the following major changes:

  • Migration to Scala 2.11 and Akka 2.3.14
  • Support of Hive 1.1.0 in test framework
  • Significant code cleanup
  • Significant round of Scaladoc documentation
  • Significant performance improvements when dealing with many views / partitions

Please note that the cleanup introduced some breaking API changes. In particular, the storage format classes have been moved to a separate package, org.schedoscope.dsl.storageformats. Moreover, the various path builders for views have been renamed more systematically. See Storage Paths.

Community / Forums

Build Status

License

Licensed under the Apache License 2.0
