Datamot simplifies creating Hadoop batch processing pipelines and solves most of the infrastructural issues that appear when deploying jobs to Hadoop/Oozie.
Datamot helps you organize your analytical code into a project and deploy it to a Hadoop cluster. You don't need to copy files to HDFS or substitute variables in configuration. No script boilerplate. Just push your code to git as you always do and let the magic happen.
Automatic start/stop of jobs. Easy, centralized and local configuration of the project. Configuration profiles (production, test, anything). Freedom in project structure.
Totally not. We are far from marketing; we are engineers. Just install the first version of datamot and you'll get a 50% discount on the next version.
It's free and it's free for commercial use.
Datamot consists of two parts: a CLI tool (which you may use, but don't have to) and a super simple web UI. When you run the web UI for the first time, it creates bare git repositories, one per project listed in the configuration.
Then you just push your code to those repos and it gets deployed to the cluster. All the cluster configuration is kept in a special conf file in your project. After a push you can navigate your project (datasets, coordinators, workflows, code) through the web UI, which greatly helps when troubleshooting bugs.
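Datamot leaves the project structure up to you. A purely illustrative layout (only the conf file in the root is expected; everything else is your choice) might look like:

    my-project/
        my-project.conf      (cluster configuration read by datamot)
        coordinators/        (Oozie coordinator definitions)
        workflows/           (Oozie workflow definitions)
        scripts/             (python, pig, hive code)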
Download the distro. Unpack it. Run datamot/bin/datamot --start. Then go to http://localhost:2718.
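In shell terms (the archive name here is illustrative):

    # unpack the downloaded distro (archive name is illustrative)
    tar xzf datamot.tar.gz
    # start the web UI
    datamot/bin/datamot --start
    # then open http://localhost:2718 in a browser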
Normally you install datamot on a server that you and your teammates can reach over SSH to push to. Your Hadoop cluster and its services, like Oozie and HDFS, need to be reachable from this server. It is also a good idea to create a dedicated user on that server to run datamot.
After creating a user (or deciding to use an existing one), unpack datamot wherever you want and add its bin folder to your PATH. So if, for example, you unpack it to /home/datamot/datamot, extend the PATH environment variable like this:

export PATH=/home/datamot/datamot/bin:$PATH
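To make this permanent for the datamot user, you can append the same line to the shell profile (assuming bash):

echo 'export PATH=/home/datamot/datamot/bin:$PATH' >> ~/.bashrc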
Now datamot --version should work from anywhere. If you are going to use the UI (we advise you to), just run datamot --start. We are currently working on installing datamot as a service; for now you can use screen to run datamot detached, like this:

screen -dmS datamot datamot --start
You can change the default IP and port on which datamot listens; take a look at conf/datamot.conf. remoteUser is the SSH user you will push to git as, and remoteHost is the host to push to; these two variables are needed to show the push URL in the user interface.
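For example, a datamot.conf might contain something like this (listenHost and listenPort are illustrative key names, while remoteUser and remoteHost are the real ones described above; check the file shipped with the distro for the exact syntax):

    # address and port the web UI listens on (key names are illustrative)
    listenHost = 0.0.0.0
    listenPort = 2718

    # used only to render the git push URL in the UI
    remoteUser = datamot
    remoteHost = hadoop-edge.example.com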
We use it in production, but on a cluster with not that many (though quite heavy) jobs: about 30 coordinators. We are in an early alpha state and need to gather more feedback from users. Looking forward to your comments.
Let's create a simple coordinator...
OK, there are several workflow managers for Hadoop. One of the best (in our opinion) is Oozie. You know, it's like a container for your scripts (python, pig, hive-sql, whatever) that are run periodically by a scheduler.
It's declarative and fits the modern view of immutable architecture perfectly. It has all the entities you need to model your calculations: datasets, workflows, coordinators, coordinator bundles (pipelines) and dependencies between them.
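To give you a feel for it, here is a minimal sketch of an Oozie coordinator that runs one workflow daily (the name, dates and HDFS path are placeholders):

    <coordinator-app name="daily-report" frequency="${coord:days(1)}"
                     start="2020-01-01T00:00Z" end="2021-01-01T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
        <action>
            <workflow>
                <!-- HDFS directory holding the workflow.xml -->
                <app-path>${nameNode}/apps/daily-report</app-path>
            </workflow>
        </action>
    </coordinator-app>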
You should really look at the official Oozie docs. It's a great tool! Here you go: http://oozie.apache.org/.
We'll soon prepare a toy project for the Hadoop/Oozie/Spark/Hive stack, with several coordinators and simple toy calculations inside them. Just clone it and push it to the URL shown in the menu of the datamot user interface. But before pushing, fix the .conf file in the project root to match your cluster configuration.
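Once it's published, the flow will look roughly like this (the URLs are placeholders; use the push URL shown in the datamot UI):

    git clone <toy-project-url>
    cd <toy-project>
    # edit the .conf file in the root to match your cluster
    git remote add datamot ssh://datamot@your-server/path/to/toy-project.git
    git push datamot master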
Cool! Create an issue!
Use issues. You can also join us on Telegram.