netflix / genie

Distributed Big Data Orchestration Service

Home Page: https://netflix.github.io/genie

License: Apache License 2.0

Java 71.89% HTML 0.03% Python 0.35% Shell 0.35% PLpgSQL 0.80% JavaScript 1.30% CSS 1.98% Groovy 23.24% Dockerfile 0.05% Vim Snippet 0.01%
big-data bigdata orchestration configuration configuration-management java spring-boot distributed-systems netflixoss cloud

genie's Introduction

Genie


Introduction

Genie is a federated Big Data orchestration and execution engine developed by Netflix.

Genie’s value is best described in terms of the problem it solves.

Big Data infrastructure is complex and ever-evolving.

Data consumers (Data Scientists or other applications) need to jump over a lot of hurdles in order to run a simple query:

  • Find, download, install and configure a number of binaries, libraries and tools
  • Point to the correct cluster, using valid configuration and reasonable parameters, some of which are very obscure
  • Manually monitor the query, retrieve its output

What works today, may not work tomorrow. The cluster may have moved, the binaries may no longer be compatible, etc.

Multiply this overhead times the number of data consumers, and it adds up to a lot of wasted time (and grief!).

Data infrastructure providers face a different set of problems:

  • Users require a lot of help configuring their working setup, which is not easy to debug remotely
  • Infrastructure upgrades and expansion require careful coordination with all users

Genie is designed to sit at the boundary of these two worlds, and simplify the lives of people on either side.

A data scientist can “rub the magic lamp” and just say “Genie, run query ‘Q’ using engine SparkSQL against production data”. Genie takes care of all the nitty-gritty details. It dynamically assembles the necessary binaries and configurations, executes the job, monitors it, notifies the user of its completion, and makes the output data available for immediate and future use.

Providers of Big Data infrastructure work with Genie by making resources available for use (clusters, binaries, etc.) and plugging in the magic logic that the user doesn’t need to worry about: which cluster should a given query be routed to? Which version of Spark should a given query be executed with? Is this user allowed to access this data? Etc. Moreover, every job’s details are recorded for later audit or debugging.

Genie is designed from the ground up to be very flexible and customizable. For more details, visit the official documentation.

Builds

Genie builds are run on Travis CI for the master (4.2.x), 4.1.x, and 4.0.x branches, with coverage reported to coveralls.io.

Project structure

genie-app

Self-contained Genie service server.

genie-agent-app

Self-contained Genie CLI job executor.

genie-client

Genie client for interacting with the service via its REST API.

genie-web

The main server library; it can be re-wrapped to inject and override server components.

genie-agent

The main agent library; it can be re-wrapped to inject and override components.

genie-common, genie-common-internal, genie-common-external

Internal component libraries shared by the server, agent, and client modules.

genie-proto

Protobuf messages and gRPC service definitions shared by server and agent. This is not a public API meant for use by other clients.

genie-docs, genie-demo

Documentation and demo application.

genie-test, genie-test-web

Testing classes and utilities shared by other modules.

genie-ui

JavaScript UI to search and visualize jobs, clusters, and commands.

genie-swagger

Auto-configuration of Swagger via Spring Fox. Add to the server's final deployment artifact to enable it.

Artifacts

Genie publishes to Maven Central and Docker Hub.

Refer to the demo section of the documentation for examples, and to the setup section for more detailed instructions on setting up Genie.

Python Client

The Genie Python client is hosted in a different repository.

Further info

For a detailed explanation of Genie architecture, use cases, API documentation, demos, deployment and customization guides, and more, visit the Genie documentation.

Contact

To contact Genie developers with questions and suggestions, please use GitHub Issues.

genie's People

Contributors

ajoymajumdar, amitsharmaak, bhou2, cabhishek, chali, charsmith, dtrebbien, enicloom, irontablee, jkschneider, jmnarloch, kruegerb-rv, mikegrima, mprimi, natadzen, nvhoang, piaozhexiu, rgbkrk, rmeshenberg, rpalcolea, rspieldenner, sghill, skwslide, sriramkrishnan, stephen-mw, sumitnetflix, tgianos, xiao-chen, z1000


genie's Issues

Failed overwriting files on Hadoop 2

Apparently cp has been changed so that it now raises an error when overwriting an existing file:

Copying Hadoop config files...
Copying files file:///home/hadoop/conf/core-site.xml file:///home/hadoop/conf/mapred-site.xml file:///home/hadoop/conf/hdfs-site.xml to file:///mnt/tomcat/genie-jobs/bb8c930d-7a1c-4fd8-b010-a79a153751c2/conf
cp: `file:///mnt/tomcat/genie-jobs/bb8c930d-7a1c-4fd8-b010-a79a153751c2/conf/core-site.xml': File exists
cp: `file:///mnt/tomcat/genie-jobs/bb8c930d-7a1c-4fd8-b010-a79a153751c2/conf/mapred-site.xml': File exists
cp: `file:///mnt/tomcat/genie-jobs/bb8c930d-7a1c-4fd8-b010-a79a153751c2/conf/hdfs-site.xml': File exists

Set up optional limits on the stdout for jobs

Sometimes users tend to do "select * ..", which dumps the results in the stdout. This can cause the Genie nodes to potentially run out of space.

We should add an optional property to Genie to set up a per-job limit. If a job's output exceeds this limit, we should fail the job rather than let it fill up the disk on the node.
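A minimal sketch of such a check, using the netflix.genie.job.max.stdout.size property that appears in the sample configuration later on this page (the stdout file name and the killJob helper are illustrative, not the actual Genie code):

import java.io.File;
import org.apache.commons.configuration.AbstractConfiguration;

// Called periodically by the job monitor; killJob is a hypothetical helper.
void checkStdoutSize(AbstractConfiguration config, File jobWorkingDir, String jobId) {
    Long maxSize = config.getLong("netflix.genie.job.max.stdout.size", (Long) null);
    if (maxSize == null) {
        return; // no limit configured, nothing to enforce
    }
    if (new File(jobWorkingDir, "stdout.log").length() > maxSize) {
        killJob(jobId); // fail the job rather than let it fill the node's disk
    }
}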

Set up caching for Servo Metrics

Genie reports metrics using Servo via the GenieNodeStatistics class. However, some of the methods are less than ideal, since they make database calls each time they are invoked.

We should set up a daemon thread to cache the metrics, so that the getters return instantly.
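A minimal sketch of the caching approach (the class and the underlying query are illustrative):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class CachedNodeStatistics {
    private final AtomicLong runningJobs = new AtomicLong();

    public CachedNodeStatistics() {
        ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "stats-refresher");
            t.setDaemon(true); // don't keep the JVM alive just for the cache
            return t;
        });
        // Refresh from the database on a schedule instead of on every getter call
        refresher.scheduleAtFixedRate(() -> runningJobs.set(countRunningJobsInDb()),
                0, 30, TimeUnit.SECONDS);
    }

    // The Servo getter now returns instantly from the cached value.
    public long getRunningJobs() {
        return runningJobs.get();
    }

    private long countRunningJobsInDb() {
        return 0L; // hypothetical: the query GenieNodeStatistics runs today
    }
}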

Can't clone genie

Hi,

I get the following when cloning the repository, per 'getting started' doc on OSX 10.8.4:

$ git clone git@github.com:Netflix/genie.git

Cloning into 'genie'...
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Am I missing something? Let me know - thanks

John

Genie job submission should support a mechanism to specify file dependencies as attachments

Currently, file dependencies are passed around as locations on S3 (or any Hadoop-supported file system that is accessible to the Genie server). This adds an extra step, where clients have to first stage the dependencies, which then get downloaded to the Genie server.

In addition, Genie should support a mechanism to POST file dependencies as attachments.

Error Running hadoopFSTest.py

I followed the getting started instructions in the Wiki; however, when running hadoopFSTest.py, I get the following error.

This genie instance doesn't support Hadoop version: 1.1.2

2013-06-24 10:18:20,351 DEBUG com.netflix.genie.server.util.StringUtil:95 [http-bio-8080-exec-8] [trimVersion] Returning canonical version for 1.1.2
2013-06-24 10:18:20,351 DEBUG com.netflix.genie.server.util.StringUtil:112 [http-bio-8080-exec-8] [trimVersion] Canonical version for 1.1.2 is 1.1.2
2013-06-24 10:18:20,351 ERROR com.netflix.genie.server.jobmanager.impl.HadoopJobManager:422 [http-bio-8080-exec-8] [initEnv] This genie instance doesn't support Hadoop version: 1.1.2
2013-06-24 10:18:20,351 ERROR com.netflix.genie.server.services.impl.GenieExecutionServiceImpl:226 [http-bio-8080-exec-8] [submitJob] Failed to submit job:
com.netflix.genie.common.exceptions.CloudServiceException: This genie instance doesn't support Hadoop version: 1.1.2
at com.netflix.genie.server.jobmanager.impl.HadoopJobManager.initEnv(HadoopJobManager.java:423)
at com.netflix.genie.server.jobmanager.impl.HadoopJobManager.init(HadoopJobManager.java:287)
at com.netflix.genie.server.jobmanager.impl.HadoopJobManager.launch(HadoopJobManager.java:134)
at com.netflix.genie.server.services.impl.GenieExecutionServiceImpl.submitJob(GenieExecutionServiceImpl.java:210)
at com.netflix.genie.server.resources.JobResourceV0.submitJob(JobResourceV0.java:123)
at sun.reflect.GeneratedMethodAccessor90.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)

Deadlock Issue when submitting new jobs to genie

In Genie, the Janitor Thread periodically runs to check for jobs whose update time has not changed within the configured window. At the same time, when a new job is submitted, after the initial insert into the db, genie tries to run an update query changing the status to RUNNING after launching the job.

The above two threads end up in a deadlock, causing the job to remain in INIT state in db, but the process gets launched successfully.

We should retry this update call to see if that fixes the issue.
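A minimal sketch of such a retry (the retry count, backoff, and update helper are illustrative):

// Retry the INIT -> RUNNING update so a transient deadlock with the janitor's
// transaction doesn't leave the job stuck in INIT while the process runs.
void updateStatusWithRetry(String jobId) throws Exception {
    final int maxRetries = 3;
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            updateJobStatus(jobId, "RUNNING"); // hypothetical: the existing update query
            return;
        } catch (Exception e) {
            if (attempt == maxRetries) {
                throw e; // out of retries; surface the original failure
            }
            Thread.sleep(1000L * attempt); // back off and let the janitor commit
        }
    }
}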

hadoop/conf vs s3 conf files

I'm confused about the settings in the genie.properties file for the hadoop paths vs the *-site.xml files that are downloaded from S3. Since the launcher script copies everything from hadoop/conf, why is it necessary to copy down the S3 files? If you are running Genie on its own server with multiple clusters, do you also need to have Hadoop installed locally, and is that why you have both?

Improve job destruction to not forward kill request if the job is already done

Currently, when a node receives a request to kill a job, it simply forwards the request to the node where the job is running. The node which is running the job then checks if a job is still running - and if so, kills it.

As a simple optimization, any node should first check the job status before trying to forward a request - this will not only be faster, but will also safeguard us in case the original node has been terminated or is dead.
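A minimal sketch of the check (status values and helpers are illustrative):

// Check the job's status in the database before forwarding a kill request.
void killJob(JobInfo job) {
    String status = job.getStatus(); // hypothetical accessor
    if ("SUCCEEDED".equals(status) || "FAILED".equals(status) || "KILLED".equals(status)) {
        return; // already done: nothing to kill, and no dependency on the original node
    }
    if (isRunningLocally(job)) {
        killLocalProcess(job);
    } else {
        forwardKillRequest(job); // existing behavior, now only for live jobs
    }
}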

Investigate Server Side Job Queues for Genie

Currently, when genie is near capacity it auto-scales to the maximum possible size and then starts rejecting jobs, with the onus on the clients to implement retries, throttling, queues, etc. We need to investigate whether it makes sense to have this feature on the server side in genie. There are potential issues in this case, like getting DDoSed, so this might be tricky but is worth thinking about.

Hadoop 2.0 support

The current Genie data model doesn't support Hadoop 2.0, as there is no place to specify yarn-site.xml for a cluster, and also to support application-specific information (such as MR or Tez).

Let's build this support incrementally. As a first step, we can build out support for MR jobs (should be relatively trivial), and then we can enhance the data model to support different applications such as Tez.

Make the host name resolution in the cloud more generic

Currently, Genie uses System.getenv("EC2_PUBLIC_HOSTNAME") to figure out its hostname in the cloud (if the netflix.genie.server.host property is not set).

We should make this more generic by looking at the instance metadata. By default, it should get the host name from http://169.254.169.254/latest/meta-data/public-hostname. If that is unavailable (e.g. in a VPC), it should default to http://169.254.169.254/latest/meta-data/local-ipv4.
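A minimal sketch of the lookup with the VPC fallback (timeouts are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

static String resolveCloudHostName() {
    String host = fetchMetadata("http://169.254.169.254/latest/meta-data/public-hostname");
    if (host == null) {
        // e.g. inside a VPC, where no public hostname is assigned
        host = fetchMetadata("http://169.254.169.254/latest/meta-data/local-ipv4");
    }
    return host;
}

static String fetchMetadata(String url) {
    try {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(2000);
        conn.setReadTimeout(2000);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            return in.readLine();
        }
    } catch (Exception e) {
        return null; // metadata endpoint unavailable or returned an error
    }
}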

Enable Python Eureka client to accept an application name via environment variable

The current Eureka test client uses a keyword arg with appName set to "genie" by default.

Provide a backwards-compatible override mechanism to fetch an alternative application name registered with Eureka, which would take lower priority than the keyword argument itself.

Behavior should be as follows:

  • If appName is set, use it
  • If appName is not set, then fetch it from environment variable NETFLIX_APP
  • If it is still not set, default it to "genie"

Email notifications for jobs

Currently one can get the job status from Genie via a GET method - i.e. it is a pull API. Some users prefer to get notified when any of their long-running jobs have finished.

We should add an optional email notification to Genie, after a job finishes, reporting its final status.

Are "fileDependencies" added to the classpath?

If I want to bring in external libraries to a Pig script, do I need to use the "register" Pig command, or will external jars brought in through fileDependencies get added to the classpath?

Genie sandbox

Since Genie has several prerequisites such as Hadoop, Hive, Pig and Tomcat, and many configurations to tweak, we should create a sandbox (e.g. an AMI, a bootstrap action, a playbook, etc.) that our users can instantiate to get Genie working easily out of the box.

Pass genie id to hadoop jobs

Genie currently passes the genie job id to hive/pig jobs, but does not do so for standard hadoop jobs. This would facilitate tracing between a job and genie through the job config (and the job tracker).

User can search by any field in Jobs

It would be useful to have an "all-field" search API in genie, so applications that query genie can look for any jobs that match the term in job id, job name, username, or tags.

Randomize Sleep time between consecutive runs of JobJanitor Thread

Currently, the janitor thread sleeps for a fixed interval between consecutive cleanups. Unfortunately, all the nodes in a genie cluster run this thread for now, and with today's logic most of them run the code at the same time. We should introduce some randomness in the sleep time so that they are a little staggered.

Eventually we should look at using ZooKeeper or something to have only one of the nodes in the genie ASG to run the janitor code. I will file a separate issue for that.
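A minimal sketch of the staggering, using the existing netflix.genie.server.janitor.sleep.ms property as the base interval (the method wrapper is illustrative):

// Sleep for the configured interval plus a random offset of up to one full
// interval, so janitor runs across the cluster drift apart over time.
void sleepBetweenJanitorRuns(org.apache.commons.configuration.AbstractConfiguration config)
        throws InterruptedException {
    long base = config.getLong("netflix.genie.server.janitor.sleep.ms", 300000L);
    long jitter = (long) (Math.random() * base);
    Thread.sleep(base + jitter);
}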

Upgrade ribbon dependency to get latest bug fixes

This is related to Issue #43.

The current version of Genie uses an older Ribbon client, which has a bug where connections are not released in all cases. We should upgrade to the latest (0.3.1) release.

The flip side is that Ribbon has changed some internal APIs, and this will necessitate some code changes (including minor changes to the client library itself).

"Wrong FS" error when copying files during job setup

When submitting a job, I'm getting errors when Genie is copying the config files to the working directory. From the error message, it would seem that it expects the working dir to be on HDFS, which seems odd.

I created the cluster and Pig config like:

curl -v http://localhost:7001/genie/v0/config/pig -d '
{ 
  "pigConfig": {
    "name": "pig-0.11.1",
    "type": "test",
    "s3PigProperties": "hdfs://localhost:50001/tmp/pig/conf/pig.properties",
    "user": "mumrah",
    "status": "ACTIVE",
    "pigVersion": "0.11.1",
  }
}' -X POST -H "Content-Type: application/json"



curl -v http://localhost:7001/genie/v0/config/cluster -d '
{
  "clusterConfig": {
      "name": "david-laptop",
      "test": true,
      "user": "mumrah",
      "s3MapredSiteXml": "hdfs://localhost:50001/tmp/hadoop/conf/mapred-site.xml",
      "s3CoreSiteXml": "hdfs://localhost:50001/tmp/hadoop/conf/core-site.xml",
      "s3HdfsSiteXml": "hdfs://localhost:50001/tmp/hadoop/conf/hdfs-site.xml",
      "testPigConfigId": "397a2574-fb3e-4903-bef5-439372d68c7c",
      "adHoc": true,
      "status": "UP"
  }
}' -X POST -H "Content-Type: application/json"

Then I submit the job like:

curl -v http://localhost:7001/genie/v0/jobs -d '
{
  "jobInfo": {
    "jobType": "PIG",
    "schedule": "ADHOC",
    "configuration": "TEST",
    "userName": "mumrah",
    "cmdArgs": "-param foo=bar",
    "fileDependencies": "file:///path/to/my/script.pig"
  }
}' -X POST -H "Content-Type: application/json"

No errors on submission

Here is my cmd.log:

Job Execution Parameters
ARGS = 7
CMD = pig
CMDLINE = -D genie.job.id=9732961b-85b5-4b42-867f-0679f3a10105 -D netflix.environment=null -param foo=bar
CURRENT_JOB_FILE_DEPENDENCIES = file:///path/to/my/script.pig
S3_HADOOP_CONF_FILES = hdfs://localhost:50001/tmp/hadoop/conf/core-site.xml,hdfs://localhost:50001/tmp/hadoop/conf/mapred-site.xml,hdfs://localhost:50001/tmp/hadoop/conf/hdfs-site.xml
S3_HIVE_CONF_FILES = 
S3_PIG_CONF_FILES = hdfs://localhost:50001/tmp/pig/conf/pig.properties
CURRENT_JOB_WORKING_DIR = /tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105
CURRENT_JOB_CONF_DIR = /tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105/conf
S3_ARCHIVE_LOCATION = 
HADOOP_USER_NAME = mumrah
HADOOP_GROUP_NAME = hadoop
HADOOP_HOME = /Users/mumrah/Downloads/hadoop-1.0.4
HIVE_HOME = 
PIG_HOME = /opt/pig-0.11.1
HADOOP_S3CP_TIMEOUT = 1800
Creating job conf dir: /tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105/conf
Copying job dependency files: 
Copying files file:///path/to/my/script.pig to file:///tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105
Warning: $HADOOP_HOME is deprecated.

2013-09-04 16:14:38.203 java[80221:1903] Unable to load realm info from SCDynamicStore
Copying Hadoop config files...
Copying files hdfs://localhost:50001/tmp/hadoop/conf/core-site.xml hdfs://localhost:50001/tmp/hadoop/conf/mapred-site.xml hdfs://localhost:50001/tmp/hadoop/conf/hdfs-site.xml to file:///tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105/conf/
Warning: $HADOOP_HOME is deprecated.

2013-09-04 16:14:39.316 java[80253:1903] Unable to load realm info from SCDynamicStore
cp: Wrong FS: file:/tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105/conf, expected: hdfs://localhost:50001
Usage: java FsShell [-cp <src> <dst>]
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.

2013-09-04 16:14:45.422 java[80279:1903] Unable to load realm info from SCDynamicStore
cp: Wrong FS: file:/tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105/conf, expected: hdfs://localhost:50001
Usage: java FsShell [-cp <src> <dst>]
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.

2013-09-04 16:14:51.561 java[80302:1903] Unable to load realm info from SCDynamicStore
cp: Wrong FS: file:/tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105/conf, expected: hdfs://localhost:50001
Usage: java FsShell [-cp <src> <dst>]
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.

2013-09-04 16:14:57.634 java[80325:1903] Unable to load realm info from SCDynamicStore
cp: Wrong FS: file:/tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105/conf, expected: hdfs://localhost:50001
Usage: java FsShell [-cp <src> <dst>]
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.

2013-09-04 16:15:03.718 java[80348:1903] Unable to load realm info from SCDynamicStore
cp: Wrong FS: file:/tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105/conf, expected: hdfs://localhost:50001
Usage: java FsShell [-cp <src> <dst>]
Will retry in 5 seconds to ensure that this is not a transient error
Failed to copy files from hdfs://localhost:50001/tmp/hadoop/conf/core-site.xml hdfs://localhost:50001/tmp/hadoop/conf/mapred-site.xml hdfs://localhost:50001/tmp/hadoop/conf/hdfs-site.xml to file:///tmp/genie-jobs/9732961b-85b5-4b42-867f-0679f3a10105/conf/
Not archiving files in working directory

Here are the files in the working directory:

.
./cmd.log
./conf
./conf/capacity-scheduler.xml
./conf/configuration.xsl
./conf/core-site.xml
./conf/fair-scheduler.xml
./conf/hadoop-env.sh
./conf/hadoop-metrics2.properties
./conf/hadoop-policy.xml
./conf/hdfs-site.xml
./conf/log4j.properties
./conf/mapred-queue-acls.xml
./conf/mapred-site.xml
./conf/masters
./conf/slaves
./conf/ssl-client.xml.example
./conf/ssl-server.xml.example
./conf/taskcontroller.cfg
./script.pig

So it seems to be copying some things fine, but some are failing.

Here is my genie.properties:

netflix.appinfo.name=genie

eureka.registration.enabled=false
netflix.datacenter=ndc

## Karyon settings
# You must provide any specific packages that must be scanned for karyon for finding Application and Component classes.
# By default karyon only searches com.netflix package
# The package specified here is the root package, any subpackages under the root will also be scanned.
com.netflix.karyon.server.base.packages=com.netflix.genie.server.startup

# The health check handler for this application.
com.netflix.karyon.health.check.handler.classname=com.netflix.genie.server.health.HealthCheck

# Comment this property if you want to enable eureka integration, and set up eureka-client.properties
com.netflix.karyon.eureka.disable=true

## Implementation of the execution service resource
netflix.genie.server.executionServiceImpl=com.netflix.genie.server.services.impl.GenieExecutionServiceImpl

## Implementation of the configuration service resources
netflix.genie.server.clusterConfigImpl=com.netflix.genie.server.services.impl.PersistentClusterConfigImpl
netflix.genie.server.hiveConfigImpl=com.netflix.genie.server.services.impl.PersistentHiveConfigImpl
netflix.genie.server.pigConfigImpl=com.netflix.genie.server.services.impl.PersistentPigConfigImpl

## Cluster load balancer settings
netflix.genie.server.clusterLoadBalancerImpl=com.netflix.genie.server.services.impl.RandomizedClusterLoadBalancerImpl

## Execution Service system properties

# java home
netflix.genie.server.java.home=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/

# hadoop home for various versions
#netflix.genie.server.hadoop.home=/apps/hadoop/current
#netflix.genie.server.hadoop.1.0.home=/apps/hadoop/1.0
netflix.genie.server.hadoop.1.0.2.home=/opt/hadoop-1.0.2
netflix.genie.server.hadoop.1.0.4.home=/Users/mumrah/Downloads/hadoop-1.0.4
netflix.genie.server.hadoop.home=/Users/mumrah/Downloads/hadoop-1.0.4

# hive home for various versions
#netflix.genie.server.hive.home=/apps/hive/current
#netflix.genie.server.hive.0.8.home=/apps/hive/0.8
#netflix.genie.server.hive.0.8.1.home=/apps/hive/0.8

# pig home for various versions
#netflix.genie.server.pig.home=/apps/pig/current
#netflix.genie.server.pig.0.9.home=/apps/pig/0.9
#netflix.genie.server.pig.0.9.2.home=/apps/pig/0.9
netflix.genie.server.pig.0.9.2.home=/opt/pig-0.9.2
netflix.genie.server.pig.0.11.1.home=/opt/pig-0.11.1
netflix.genie.server.pig.home=/opt/pig-0.11.1

# REST URI for the execution service
netflix.genie.server.job.resource.prefix=genie/v0/jobs

# If this is commented, Genie will use InetAddress.getLocalHost() in DC,
# or the environment variable EC2_PUBLIC_HOSTNAME in the cloud
# netflix.genie.server.host=localhost

# Location of genie scripts - currently part of genie-web/conf/system/apps/genie/bin
# Change this appropriately to point to above location
netflix.genie.server.sys.home=/Users/mumrah/Code/github/genie/genie-web/conf/system/apps/genie/bin

# The relative path for the prefix directory inside Tomcat's "webapps/ROOT" 
# that Genie will use for its working directory
# This is usually sym-linked to netflix.genie.server.user.working.dir
netflix.genie.server.job.dir.prefix=genie-jobs

# Actual location where output directories are created
# Create a symlink from the above directory to here
# Or simply provide the full path of the above directory
netflix.genie.server.user.working.dir=/tmp/genie-jobs

# The timeout after which an S3 copy will be aborted
netflix.genie.server.hadoop.s3cp.timeout=1800
# Other properties required to make S3 copies work
netflix.genie.server.hadoop.s3cp.opts=-Dfs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
    -Dfs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem \
    -Dfs.s3.awsAccessKeyId=KEY \
    -Dfs.s3.awsSecretAccessKey=SECRET \
    -Dfs.s3n.awsAccessKeyId=KEY \
    -Dfs.s3n.awsSecretAccessKey=SECRET

# if set, archive logs to this location
# netflix.genie.server.s3.archive.location=s3://SOME_S3_PREFIX

## Configuration for janitor thread, which cleans up zombie jobs
# sleep time between iterations
netflix.genie.server.janitor.sleep.ms=300000
# the time delta older than which unupdated jobs are marked as zombies
netflix.genie.server.janitor.zombie.delta.ms=1800000

## Properties for job throttling/forwarding
# max running jobs on this instance, after which 503s are thrown
netflix.genie.server.max.running.jobs=30
# number of running jobs on instance, after which to start forwarding to other instances
# to disable auto-forwarding of jobs, set this to higher than netflix.genie.server.max.running.jobs
netflix.genie.server.forward.jobs.threshold=20
# find an idle instance with fewer running jobs than current, by this delta
# e.g. if netflix.genie.server.forward.jobs.threshold=20, forward jobs to an instance
# with number of running jobs < (20-netflix.genie.server.idle.host.threshold.delta)
netflix.genie.server.idle.host.threshold.delta=5
# max running jobs on instance that jobs can be forwarded to
netflix.genie.server.max.idle.host.threshold=27

# if uncommented, the job will be killed if the size of its stdout is greater than the limit
# netflix.genie.job.max.stdout.size=8589934592

Any clues?

Timeouts

At the end of all of our jobs we see the following in the stderr.log

2014-06-20 15:59:56,791 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2014-06-20 15:59:57,886 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: ip-10-0-1-33.ec2.internal/10.0.1.33:34559. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-06-20 15:59:58,887 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: ip-10-0-1-33.ec2.internal/10.0.1.33:34559. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-06-20 15:59:59,888 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: ip-10-0-1-33.ec2.internal/10.0.1.33:34559. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)

It is definitely not a connection issue; when you hit those ports, there is a series of redirects that ends in a non-existent page.

failed copying hadoop conf files from S3

cat /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/cmd.log

Job Execution Parameters
ARGS = 4
CMD = hadoop
CMDLINE = fs -ls /
CURRENT_JOB_FILE_DEPENDENCIES =
S3_HADOOP_CONF_FILES = file://Library/hadoop/hadoop-1.1.2/conf/core-site.xml,file://Library/hadoop/hadoop-1.1.2/conf/mapred-site.xml,file://Library/hadoop/hadoop-1.1.2/conf/hdfs-site.xml
S3_HIVE_CONF_FILES =
S3_PIG_CONF_FILES =
CURRENT_JOB_WORKING_DIR = /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc
CURRENT_JOB_CONF_DIR = /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf
S3_ARCHIVE_LOCATION =
HADOOP_USER_NAME = genietest
HADOOP_GROUP_NAME = hadoop
HADOOP_HOME = /Library/hadoop/hadoop-1.1.2
HIVE_HOME = /apps/hive/current
PIG_HOME = /apps/pig/current
HADOOP_S3CP_TIMEOUT = 1800
Creating job conf dir: /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf
Copying job dependency files:
Copying Hadoop config files...
Copying files file://Library/hadoop/hadoop-1.1.2/conf/core-site.xml file://Library/hadoop/hadoop-1.1.2/conf/mapred-site.xml file://Library/hadoop/hadoop-1.1.2/conf/hdfs-site.xml to /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf
Warning: $HADOOP_HOME is deprecated.

2013-06-24 12:42:37.030 java[70743:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.

2013-06-24 12:42:43.183 java[70769:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.

2013-06-24 12:42:49.333 java[70794:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.

2013-06-24 12:42:55.491 java[70825:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.

2013-06-24 12:43:01.662 java[70850:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Failed to copy files from file://Library/hadoop/hadoop-1.1.2/conf/core-site.xml file://Library/hadoop/hadoop-1.1.2/conf/mapred-site.xml file://Library/hadoop/hadoop-1.1.2/conf/hdfs-site.xml to /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf
Not archiving files in working directory

conf Directory Listing After Job Fails:

tsunami:09376457-596a-48ac-9e91-1cdf5cdbbffc schappetj$ ls -1 conf/
capacity-scheduler.xml
configuration.xsl
core-site.xml
fair-scheduler.xml
hadoop-env.sh
hadoop-metrics2.properties
hadoop-policy.xml
hdfs-site.xml
log4j.properties
mapred-queue-acls.xml
mapred-site.xml
masters
slaves
ssl-client.xml.example
ssl-server.xml.example
taskcontroller.cfg

System can record more metrics for the persistence layer

Currently, there is no visibility into success and error rates for methods in the PersistenceManager. We should add some Servo metrics to improve visibility into the persistence layer.

For starters, we can add metrics for successful and failed queries.
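A minimal sketch of what the counters could look like with Servo (the wrapper class is illustrative, not the actual PersistenceManager API):

import com.netflix.servo.DefaultMonitorRegistry;
import com.netflix.servo.monitor.Counter;
import com.netflix.servo.monitor.Monitors;

public class InstrumentedPersistence {
    private final Counter successfulQueries = Monitors.newCounter("genie.persistence.successfulQueries");
    private final Counter failedQueries = Monitors.newCounter("genie.persistence.failedQueries");

    public InstrumentedPersistence() {
        DefaultMonitorRegistry.getInstance().register(successfulQueries);
        DefaultMonitorRegistry.getInstance().register(failedQueries);
    }

    // Wrap each persistence call so both outcomes are counted.
    public <T> T execute(Query<T> query) {
        try {
            T result = query.run(); // hypothetical: the existing query/persist call
            successfulQueries.increment();
            return result;
        } catch (RuntimeException e) {
            failedQueries.increment();
            throw e;
        }
    }

    public interface Query<T> {
        T run();
    }
}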

System can elect a leader to run periodic tasks

part of #101

Today every node in Genie runs the janitor thread responsible for marking non-updated jobs in genie as zombies. As a Genie ASG scales, this starts becoming an issue, with all the nodes hammering the DB. We should use ZooKeeper or something similar to elect a single node that is responsible for doing this task.
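A minimal sketch using Apache Curator's LeaderSelector recipe over ZooKeeper (the path and the janitor helper are illustrative):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;

public class JanitorLeaderElector extends LeaderSelectorListenerAdapter {
    private static final long SLEEP_MS = 300000L; // netflix.genie.server.janitor.sleep.ms
    private final LeaderSelector selector;

    public JanitorLeaderElector(CuratorFramework client) {
        this.selector = new LeaderSelector(client, "/genie/leader/janitor", this);
        this.selector.autoRequeue(); // rejoin the election if leadership is ever lost
    }

    public void start() {
        selector.start();
    }

    @Override
    public void takeLeadership(CuratorFramework client) throws Exception {
        // Returning from this method relinquishes leadership, so loop while leader.
        while (!Thread.currentThread().isInterrupted()) {
            markZombieJobs(); // hypothetical: the existing janitor pass
            Thread.sleep(SLEEP_MS);
        }
    }

    private void markZombieJobs() {
        // the current zombie-marking logic would run here, on one node only
    }
}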

Add priority criteria list support to genie to select cluster

Today genie uses three criteria to select the cluster on which to run the job, effectively in the following precedence:

  1. cluster_id
  2. cluster_name
  3. schedule

If the above criteria return multiple clusters it uses its simple load balancer to pick one (randomly as of now).

We should add more flexible cluster-choosing logic to genie, and this can be done by specifying a priority list of criteria. Genie would simply go down the list until it finds a criterion that provides a valid cluster to run on. If any criterion results in multiple clusters, just like today, it will pick one using the load balancer.

An example of the list will look like this:

[
  {"criteria_name": "cluster_id", "criteria_value": ["cluster_1", "cluster_2"]},
  {"criteria_name": "cluster_name", "criteria_value": ["query", "prod"]},
  {"criteria_name": "cluster_id", "criteria_value": "cluster_3"},
  {"criteria_name": "schedule", "criteria_value": ["adhoc", "SLA"]},
  {"criteria_name": "schedule", "criteria_value": "bonus"},
  {"criteria_name": "cluster_id", "criteria_value": "cluster_4"}
]

Genie will take one element at a time from the list and use it to find a cluster it can run the job on. If it cannot find any clusters that are UP, it will move on to the next element. Note that once a valid cluster is selected, the remaining elements in the priority list are ignored, even if job submission runs into other issues.
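A minimal sketch of the selection loop (the Criteria/Cluster types and the lookup helper are illustrative):

Cluster selectCluster(List<Criteria> priorityList) throws CloudServiceException {
    for (Criteria criteria : priorityList) {
        List<Cluster> candidates = findUpClusters(criteria); // only clusters that are UP
        if (!candidates.isEmpty()) {
            // multiple matches: same randomized load balancer as today
            return loadBalancer.selectCluster(candidates);
        }
        // no UP cluster for this element; fall through to the next one
    }
    throw new CloudServiceException(java.net.HttpURLConnection.HTTP_PRECON_FAILED,
            "No cluster satisfies any element of the criteria list");
}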

When we do add this feature, in order to maintain backward compatibility, whichever cluster genie chooses to run the job on, it will update the jobinfo element with the cluster_id, cluster_name and schedule to support the current GET API calls. Also, genie will continue supporting the existing criteria in case the priority list is not specified. This might change in the v1 API, if we ever migrate to it, which will not be backwards compatible.

Thanks,
Amit

Pass lipstick.uuid from Genie to Lipstick

If we are using Lipstick to run Hive/Pig jobs, then it is useful to use the same UUID for both Genie and Lipstick (from a usability standpoint). We can make this happen by passing a lipstick.uuid property through.

System can record metrics for the job launcher

The job launcher script is responsible for staging dependencies, running the queries, and then archiving the logs, if needed. There are retries built in for the staging/archiving, but there is no visibility into any kind of transient failures. It is desirable to expose these as Servo metrics.

However, since the launcher runs as a separate process, it is a bit tricky to pass the metrics back to the Genie server. One possible option is to write a Servo "side-car" service that the launcher can connect to in order to publish its relevant metrics.

Genie and its client shouldn't depend on jersey-bundle

From the description of the jersey-bundle artifact:

"Such a bundle is only intended for developers that do not use Maven's dependency system."

The current version of Genie also has a transitive dependency on jersey-bundle due to its use of ribbon-eureka and eureka-client. This is now fixed by issue Netflix/eureka#53.

NoSuchElementException if netflix.genie.job.max.stdout.size is commented

If netflix.genie.job.max.stdout.size is not set, jobs fail with an HTTP 500, and the following exception is reported by Tomcat:

java.util.NoSuchElementException: 'netflix.genie.job.max.stdout.size' doesn't map to an existing object
at org.apache.commons.configuration.AbstractConfiguration.getLong(AbstractConfiguration.java:862)
at com.netflix.genie.server.jobmanager.impl.JobMonitor.<init>(JobMonitor.java:91)
at com.netflix.genie.server.jobmanager.impl.HadoopJobManager.launch(HadoopJobManager.java:182)
at com.netflix.genie.server.services.impl.GenieExecutionServiceImpl.submitJob(GenieExecutionServiceImpl.java:210)
at com.netflix.genie.server.resources.JobResourceV0.submitJob(JobResourceV0.java:123)

The summary is that the getLong(...) overload used here throws an exception if the property is not set - we need it to default to null instead.
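A minimal sketch of the fix, using the two-argument getLong overload from commons-configuration, which returns the supplied default instead of throwing:

Long maxStdoutSize = config.getLong("netflix.genie.job.max.stdout.size", (Long) null);
if (maxStdoutSize != null) {
    // only enforce the stdout limit when the property is actually set
}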

System can log actual exception in logging messages

In certain places in the code, genie does not log the actual exception, just the error message. One example is in HadoopJobManager.java:

} catch (IOException e) {
    String msg = "Unable to copy attachment correctly: " + attachment.getName();
    logger.error(msg);
    throw new CloudServiceException(
            HttpURLConnection.HTTP_INTERNAL_ERROR, msg);
}
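A minimal sketch of the fix: pass the exception to the logger so the stack trace is recorded along with the message.

} catch (IOException e) {
    String msg = "Unable to copy attachment correctly: " + attachment.getName();
    logger.error(msg, e); // the two-argument overload logs the full stack trace
    throw new CloudServiceException(
            HttpURLConnection.HTTP_INTERNAL_ERROR, msg);
}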

Improve cleanup logic for job directories created by Genie

Currently the cleanup process looks at all files in a directory and deletes them if they are older than n days. We should change it to the following logic:

  1. Create a done file when a genie job finishes
  2. Cleanup looks at all directories for done files
  3. If a done file exists, the directory is deleted

Fix NPE when jobInfo is null during job execution

If a client submits a job request to Genie and the jobInfo object is null, Genie returns a 500 error with a null pointer exception. It should check whether the request is empty and return an HTTP_BAD_REQUEST.
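A minimal sketch of the guard (the request accessor is illustrative):

if (request == null || request.getJobInfo() == null) { // hypothetical accessor
    throw new CloudServiceException(HttpURLConnection.HTTP_BAD_REQUEST,
            "No jobInfo element provided in the request");
}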

Hadoop jobs fail with HTTP 500 error if Hive/Pig homes are not set in properties file

The entire stack trace is:
java.lang.NullPointerException
at java.lang.ProcessEnvironment.validateValue(ProcessEnvironment.java:102)
at java.lang.ProcessEnvironment.access$400(ProcessEnvironment.java:44)
at java.lang.ProcessEnvironment$Value.valueOf(ProcessEnvironment.java:185)
at java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:224)
at java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:203)
at java.util.AbstractMap.putAll(AbstractMap.java:256)
at com.netflix.genie.server.jobmanager.impl.HadoopJobManager.launch(HadoopJobManager.java:168)
at com.netflix.genie.server.services.impl.GenieExecutionServiceImpl.submitJob(GenieExecutionServiceImpl.java:210)
at com.netflix.genie.server.resources.JobResourceV0.submitJob(JobResourceV0.java:123)

Basically, the put on the environment map fails if the values are null. Hive/Pig homes should be optional - only required if Genie is set up for Hive or Pig.
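A minimal sketch of the fix: only set the optional homes when they are configured, since the process environment map rejects null values (variable names are illustrative).

Map<String, String> env = processBuilder.environment();
if (hiveHome != null) {
    env.put("HIVE_HOME", hiveHome);
}
if (pigHome != null) {
    env.put("PIG_HOME", pigHome);
}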

Provide a parameter to disable log archiving

Currently, Genie archives all logs to an S3 location if the property netflix.genie.server.s3.archive.location is set.

We should provide an option in the job request to optionally disable this (for faster turnaround, if a user desires).

Updating Cluster Status Via UI Fails

When you try to update the cluster status via the UI, it returns an error message about "clusters" not being a valid entry:

Error!
PUT: genie/v2/config/clusters/{id}
400: Bad Request
Unrecognized field "clusters" (class com.netflix.genie.common.model.Cluster), not marked as ignorable (10 known properties: "created", "user", "name", "configs", "status", "clusterType", "version", "tags", "id", "updated"]) at [Source: com.netflix.server.base.RequestWrapper$ServletInputStreamImpl@5a499fb6; line: 1, column: 14](through reference chain: com.netflix.genie.common.model.Cluster["clusters"])

This appears to be an issue where the UI is still sending a mix of Genie 1 and Genie 2 payloads to the server.
