hydra

Hydra is a distributed data processing and storage system originally developed at AddThis. It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).
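To make the idea of a tree of aggregates concrete, here is a minimal sketch in plain shell and awk. It does not use Hydra or its job configuration format. The input layout (one day and one page per line) and the function name build_tree are assumptions made purely for illustration. Each input line increments a counter on a parent node (the day) and on a child node (the page under that day).

```shell
#!/bin/sh
# Illustration only: this is not Hydra code or a Hydra job config.
# Assumed input: one "<day> <page>" pair per line on stdin.
# Output: one "<node path><TAB><count>" line per tree node, parents first.
build_tree() {
  awk '
    {
      day[$1]++            # parent node: hits per day
      leaf[$1 "/" $2]++    # child node: hits per page within that day
    }
    END {
      for (d in day)  printf "%s\t%d\n", d, day[d]
      for (k in leaf) printf "%s\t%d\n", k, leaf[k]
    }
  ' | LC_ALL=C sort        # byte order puts each parent before its children
}
```

For example, `printf '%s\n' '2014-03-01 index.html' '2014-03-01 index.html' '2014-03-02 index.html' | build_tree` prints a count of 2 for the node 2014-03-01 and for its child 2014-03-01/index.html, and a count of 1 for 2014-03-02 and its child.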

You can run hydra from the command line to slice and dice that Apache access log you have sitting around (or that gargantuan csv file). Or, if terabytes per day is your cup of tea, run a Hydra cluster that supports your job with resource sharing, job management, distributed backups, data partitioning, and efficient bulk file transfer.

Building

Assuming you have Apache Maven installed and configured:

mvn package

This should compile the code and build the jars. All hydra dependencies should be available on Maven Central, but hydra itself is not yet published there.

Berkeley DB Java Edition is used for several core features. The sleepycat license has strong copyleft properties that do not match the rest of the project. It is set as a non-transitive dependency to avoid inadvertently pulling it into downstream projects. In the future hydra should have pluggable storage with multiple implementations.

The hydra-uber module builds an exec jar containing hydra and all of its dependencies. To include BDB JE when building with mvn package, use -P bdbje. The main class of the exec jar launches the various components of a hydra cluster by name.

System dependencies

JDK 7 is required. Hydra has been developed on Linux (CentOS 6) and should work on any modern Linux distro. Other Unix-like systems should work with minor changes but have not been tested. Mac OS X should work for building and running local-stack (see below).

Hydra uses rabbitmq for low volume command and control message exchange. On a modern Linux system, installing it with apt-get install rabbitmq-server and running it with the default settings is adequate in most cases.

To run efficiently, Hydra needs a mechanism to take copy-on-write backups of the output of jobs. This is currently accomplished by adding the fl-cow library to LD_PRELOAD. Other approaches, such as ZFS or cp --reflink, are under consideration.
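As a rough sketch of the cp --reflink approach mentioned above (which is listed only as under consideration, not as what Hydra does today), the helper below snapshots a directory. --reflink=auto is a GNU coreutils option that shares data blocks on filesystems that support it and otherwise falls back to an ordinary copy. The function name snapshot_dir and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper; Hydra itself currently relies on fl-cow via LD_PRELOAD.
# Usage: snapshot_dir <source-dir> <backup-dir>   (backup-dir must not exist)
snapshot_dir() {
  src=$1
  dst=$2
  if [ -e "$dst" ]; then
    echo "snapshot_dir: $dst already exists" >&2
    return 1
  fi
  # Use a copy-on-write clone when this cp advertises --reflink (GNU cp);
  # otherwise fall back to a plain archive copy.
  if cp --help 2>&1 | grep -q -e '--reflink'; then
    cp -a --reflink=auto "$src" "$dst"
  else
    cp -a "$src" "$dst"
  fi
}
```

Refusing to overwrite an existing target keeps each snapshot a complete, separate tree, since copying into an existing directory would nest the source inside it instead.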

Many components assume that there is a local user called hydra and that all minion nodes can ssh as that user to each other. This is used most prominently for rsync based replicas. The user hydra is not necessary when running a local-stack environment (see below).

OS X

On OS X several utilities are necessary to run the local-stack environment:

brew install coreutils
brew install wget

Components

While hydra can be used for ad-hoc analysis of csv and other local files, it's most commonly used in a distributed cluster. In that case the following components are involved:

  • ZooKeeper
  • Spawn: Job control and execution
  • Minion: Task runner
  • QueryMaster: Handler for queries
  • QueryWorker: Handles scatter-gather requests from QueryMaster
  • Meshy: File server

A typical configuration is to have a cluster head with Spawn & QueryMaster backed by a homogeneous cluster of nodes running Minion, QueryWorker, and Meshy.

Local Stack

For local development all of the above components can run together in a single stack run out of hydra-local. There is a local-stack.sh script to assist with this. To run the local stack:

  • Be able to build hydra
  • Have rabbitmq installed
  • Allow your current user to ssh to itself
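These prerequisites can be checked before the first run. The following is a minimal sketch, not part of hydra: the function names are hypothetical, and using localhost as the ssh target is an assumption about what "ssh to itself" means on your machine. It does not check for rabbitmq, since how the broker is installed and started varies by system.

```shell
#!/bin/sh
# Hypothetical preflight helpers for the local stack; not part of hydra.

# Succeeds if every named command is on PATH; reports each missing one.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Succeeds if the current user can ssh to this machine without any prompt.
# BatchMode makes ssh fail instead of asking for a password, so this also
# fails when the host key for localhost has not been accepted yet.
can_ssh_to_self() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true
}
```

A typical call would be `require_cmds mvn java ssh && can_ssh_to_self && echo "ready for local-stack"`.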

The first time the script is run a hydra-local directory will be created.

  • ./hydra-uber/bin/local-stack.sh start - the first run starts ZooKeeper
  • ./hydra-uber/bin/local-stack.sh start - the second run starts spawn, querymaster etc.
  • ./hydra-uber/bin/local-stack.sh seed - add some sample data

You can then navigate to http://localhost:5052/ and you should see the spawn web interface.

When done, ./hydra-uber/bin/local-stack.sh stop will stop everything except ZooKeeper, and running stop a second time will bring that process down as well.
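The full start and stop cycle can be sketched as a small wrapper. The script path and the start, seed, and stop subcommands come from this README; the wrapper functions and the DRY_RUN and LOCAL_STACK variables are hypothetical additions so the sequence can be previewed without starting anything.

```shell
#!/bin/sh
# Hypothetical wrapper around local-stack.sh; run from the hydra checkout root.
LOCAL_STACK=${LOCAL_STACK:-./hydra-uber/bin/local-stack.sh}

# Runs one local-stack.sh subcommand, or only prints it when DRY_RUN=1.
stack() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$LOCAL_STACK $1"
  else
    "$LOCAL_STACK" "$1"
  fi
}

# Bring the whole local stack up and load the sample data.
stack_up() {
  stack start &&   # first start: ZooKeeper
  stack start &&   # second start: spawn, querymaster etc.
  stack seed       # add some sample data
}

# Take the whole local stack down again.
stack_down() {
  stack stop &&    # first stop: everything except ZooKeeper
  stack stop       # second stop: ZooKeeper
}
```

Running `DRY_RUN=1 stack_up` prints the three commands in order without executing them.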

There are sample job configurations located in hydra-uber/local/sample/

Administrative

Discussion

Mailing list: http://groups.google.com/group/hydra-oss

Freenode channel: #hydra

Versioning

It's x.y.z where:

  • x: Something Big Happened
  • y: next release
  • z: strive for bug fix only
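Read as a rule of thumb, the leftmost component that differs between two versions tells you what kind of release you are looking at. A minimal sketch follows; the function name and the labels big, release, and bugfix are shorthand invented here, not project terminology, and it assumes plain x.y.z strings with no suffix such as -SNAPSHOT.

```shell
#!/bin/sh
# Hypothetical helper: classify the change between two x.y.z versions.
# Prints one of: same, big, release, bugfix.
release_kind() {
  old_x=${1%%.*}; old_yz=${1#*.}; old_y=${old_yz%%.*}
  new_x=${2%%.*}; new_yz=${2#*.}; new_y=${new_yz%%.*}
  if [ "$1" = "$2" ]; then
    echo same
  elif [ "$old_x" != "$new_x" ]; then
    echo big        # x changed: Something Big Happened
  elif [ "$old_y" != "$new_y" ]; then
    echo release    # y changed: next release
  else
    echo bugfix     # only z changed: strive for bug fix only
  fi
}
```

For instance, `release_kind 4.2.1 4.3.0` prints release.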

License

hydra is released under the Apache License Version 2.0. See Apache or the LICENSE file in this distribution for details.
