pachyderm / pachyderm

Data-Centric Pipelines and Data Versioning

Home Page: https://www.pachyderm.com/

License: Apache License 2.0

Go 84.37% Makefile 0.29% Shell 1.40% Dockerfile 0.04% Python 3.69% Jupyter Notebook 0.29% CSS 0.30% HTML 0.09% Mustache 0.82% Jsonnet 0.28% JavaScript 1.59% TypeScript 2.28% Smarty 0.02% Starlark 4.55%
go pachyderm docker analytics big-data containers distributed-systems kubernetes data-science data-analysis

pachyderm's Introduction

Pachyderm: Automate data transformations with data versioning and lineage

Pachyderm is cost-effective at scale, enabling data engineering teams to automate complex pipelines with sophisticated data transformations across any type of data. Our unique approach provides parallelized processing of multi-stage, language-agnostic pipelines with data versioning and data lineage tracking. Pachyderm delivers the ultimate CI/CD engine for data.

Features

  • Data-driven pipelines automatically trigger based on detecting data changes.
  • Immutable data lineage with data versioning of any data type.
  • Autoscaling and parallel processing built on Kubernetes for resource orchestration.
  • Uses standard object stores for data storage with automatic deduplication.
  • Runs across all major cloud providers and on-premises installations.

Getting Started

To start deploying your end-to-end version-controlled data pipelines, run Pachyderm locally, or deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm, start with the documentation below.

Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support via:

  • Follow us on Twitter.
  • Join our community Slack channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! Our GitHub issues labeled "help-wanted" are a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the environment variable METRICS to false in the pachd container.

pachyderm's People

Contributors

acohen4, actgardner, adelelopez, bbonenfant, brendoncarroll, brycemcanally, bufdev, chainlink, derekchiang, dwhitena, fahadbsyed, gabrielgrant, jdoliner, joeyzwicker, jrockway, lukemarsden, molinamelendezj, msteffen, nadegepepin, nick-harvey, pappasilenus, pfedak, pleeko, robert-uhl, sjezewski, sr, svekars, tryneus, tybritten, ysimonson


pachyderm's Issues

Problems with examples/WordCount.md

I understand it in principle.

Getting it to work, not so much!

Problem 1: "The data you want to analyze is stored at pfs://data"

This seems like a novel way to express the location, and I don't see this syntax elsewhere.

I've tried

curl -XPOST localhost/data/shakespeare

and

curl -XPOST localhost/file/data/shakespeare

It's unclear whether there's a need (or a command) to create directories. Or indeed whether this is what "pfs://data" represents.

Problem 2: I don't get what I expect from wildcard listings

...and so can't debug using them.

curl -XPOST localhost/file/shakespeare -T shakespeare.txt
curl localhost/file/*

yields

--54c0e46d0daa62501c32dcb1192fab024dced5c79694e6fc636e65dbc6b4
Content-Disposition: form-data; name=".meta"; filename=".meta"
Content-Type: application/octet-stream

read /var/lib/pfs/vol/data-0-3/master/.meta: is a directory

--54c0e46d0daa62501c32dcb1192fab024dced5c79694e6fc636e65dbc6b4--
--aacee4ccb8343ee523f6756c9682f72a7921929a985e0c1e860fa976532d
Content-Disposition: form-data; name=".meta"; filename=".meta"
Content-Type: application/octet-stream

read /var/lib/pfs/vol/data-1-3/master/.meta: is a directory

--aacee4ccb8343ee523f6756c9682f72a7921929a985e0c1e860fa976532d--
--b00326203901e2f2349f8d6c7e945c342c85f4957327a9b8bbf418662099
Content-Disposition: form-data; name=".meta"; filename=".meta"
Content-Type: application/octet-stream

read /var/lib/pfs/vol/data-2-3/master/.meta: is a directory

--b00326203901e2f2349f8d6c7e945c342c85f4957327a9b8bbf418662099--

Problem 3: Installing the pipeline seems too easy to confuse but... ;-)

I either get a 404, which would make sense, but not for words like "the".

Or the command blocks (seemingly) indefinitely:

curl localhost/pipeline/wordcount/file/counts/the?commit=commit03

Problem 4: The documentation references jobs

But the Wordcount example appears to be just a Dockerfile spec. Should I be wrapping it in a job?

Problem 5: I don't know how to debug

I was expecting to see additional docker container(s) spun up to run the pipeline, but I don't.

Minimize the size of our docker container

Our container's size is starting to be an annoyance. AWS would come up way quicker if we could reduce its size.

This shouldn't be too hard; our whole server can pretty easily be made into a static binary.

Add distributed shell

I often find myself wanting to run grep and wc on stuff in the filesystem or the output of jobs for simple ad hoc analysis. It shouldn't be too hard to add something simple for grep, and hopefully it will pave the way for more sophisticated repl tools.

Simple map-reduce not working

I wanted to create a simple nodejs map-reduce thingy using pachyderm. It's in my playground repo: https://github.com/neojski/pachyderm-playground. For some reason the reduce part doesn't work. At the
beginning I thought the reason was that I used to have map and reduce in the same container but changing that didn't help.

Unfortunately the chess demo doesn't run reduce at all.

I think that a simple working demo with map and reduce would be very useful.

Log below:

master.go:290: Listening on port 80...
master.go:291: dataRepo: data-0-1, compRepo: comp-0-1.
master.go:241: URL in job handler:
/job/countMap
master.go:249: URL with reset path:
/file/job/countMap
master.go:241: URL in job handler:
/job/countReduce
master.go:249: URL with reset path:
/file/job/countReduce
shell.go:12: RunStderr:
&{/bin/btrfs [btrfs property set /var/lib/pfs/vol/data-0-1/master ro true] []  <nil> <nil> <nil> [] <nil> <nil> <nil> ?reflect.Value? false [] [] [] [] <nil>}
shell.go:12: RunStderr:
&{/bin/btrfs [btrfs subvolume snapshot /var/lib/pfs/vol/data-0-1/0a624d53-695d-4780-98eb-60b11c9e513c /var/lib/pfs/vol/data-0-1/master] []  <nil> <nil> <nil> [] <nil> <nil> <nil> ?reflect.Value? false [] [] [] [] <nil>}
mapreduce.go:425: Materialize: data-0-1 master 0a624d53-695d-4780-98eb-60b11c9e513c comp-0-1 job.
mapreduce.go:457: Make docker client.
mapreduce.go:476: Jobs: [0xc20801afa0 0xc20801b090]
mapreduce.go:482: Running job:
countMap
mapreduce.go:497: Job: {map count_input neojski/pachyderm-count-map [/bin/map] 5 5 1200}
mapreduce.go:48: spinupContainer neojski/pachyderm-count-map [/bin/map]
mapreduce.go:482: Running job:
countReduce
mapreduce.go:497: Job: {reduce job/countMap neojski/pachyderm-count-reduce [/bin/reduce] 5 5 1200}
mapreduce.go:48: spinupContainer neojski/pachyderm-count-reduce [/bin/reduce]
mapreduce.go:59: starting container
mapreduce.go:59: starting container
mapreduce.go:40: container started
mapreduce.go:329: Reduce: {reduce job/countMap neojski/pachyderm-count-reduce [/bin/reduce] 5 5 1200} countReduce 
mapreduce.go:88: Retrying due to error: Get http://172.17.42.1:4001/v2/keys/pfs/master?quorum=false&recursive=true&sorted=false: dial tcp 172.17.42.1:4001: connection refused
mapreduce.go:40: container started
mapreduce.go:224: book1.txt: Posting: http://172.17.0.4/book1.txt
mapreduce.go:226: book1.txt: Post done.
mapreduce.go:88: Retrying due to error: Post http://172.17.0.4/book1.txt: dial tcp 172.17.0.4:80: connection refused
mapreduce.go:224: book1.txt: Posting: http://172.17.0.4/book1.txt
mapreduce.go:226: book1.txt: Post done.
mapreduce.go:235: book1.txt: Creating file comp-0-1/master/countMap/book1.txt
mapreduce.go:242: book1.txt: Opened outfile.
mapreduce.go:257: book1.txt: Copying output...
mapreduce.go:262: book1.txt: Done copying.
mapreduce.go:263: Result of timer.Stop(): true
mapreduce.go:88: Retrying due to error: Get http://10.1.42.1:4001/v2/keys/pfs/master?quorum=false&recursive=true&sorted=false: dial tcp 10.1.42.1:4001: i/o timeout

It might just be the case that map returns the wrong value, but I don't see any documentation for that.

Also, I think that you should describe how to prepare docker images. I've seen that you mentioned somewhere that they should be http servers. What about ports? Can I have multiple maps (or map + reduce) in one container?

chess example problem

Tried setting up the chess example on a 2-node coreos cluster - any idea what's wrong here?

The nodes were called coreos-1 and coreos-2. I can reproduce this at will with a script, fyi.

core@coreos-1 ~ $  export SHUTIT_BACKUP_PS1_oO07eN35=$PS1 && PS1='SHUTIT_TMP#oYCuhA5v>' && unset PROMPT_COMMAND && stty cols 320
SHUTIT_TMP#oYCuhA5v>wget -qO- https://github.com/pachyderm-io/pfs/raw/master/deploy/static/1Node.tar.gz | tar -zxf -
SHUTIT_TMP#oYCuhA5v>fleetctl start 1Node/*
Triggered global unit router.service start
Unit master-0-1.service launched on 4a854405.../10.132.128.22
Unit announce-master-0-1.service launched on 4a854405.../10.132.128.22
SHUTIT_TMP#oYCuhA5v>fleetctl list-machines
MACHINE     IP      METADATA
4a854405... 10.132.128.22   -
7de2866d... 10.132.129.103  -
SHUTIT_TMP#oYCuhA5v>fleetctl list-fleetctl list-units^C
SHUTIT_TMP#oYCuhA5v>fleetctl list-units
UNIT                MACHINE             ACTIVE  SUB
announce-master-0-1.service 4a854405.../10.132.128.22   active  running
master-0-1.service      4a854405.../10.132.128.22   active  running
router.service          4a854405.../10.132.128.22   active  running
router.service          7de2866d.../10.132.129.103  active  running
SHUTIT_TMP#oYCuhA5v>git clone https://github.com/pachyderm/chess.git
Cloning into 'chess'...
remote: Counting objects: 22808, done.
remote: Compressing objects: 100% (22781/22781), done.
remote: Total 22808 (delta 27), reused 22798 (delta 17)
Receiving objects: 100% (22808/22808), 14.10 MiB | 6.07 MiB/s, done.
Resolving deltas: 100% (27/27), done.
Checking connectivity... done.
Checking out files: 100% (22757/22757), done.
SHUTIT_TMP#oYCuhA5v>cd chess
SHUTIT_TMP#oYCuhA5v>cd data/
SHUTIT_TMP#oYCuhA5v>./send_sample chess
SHUTIT_TMP#oYCuhA5v>curl pfs/file/chess
curl: (6) Couldn't resolve host 'pfs'
SHUTIT_TMP#oYCuhA5v>curl -XGET localhost/job/chess
Get http://coreos-1:53442/job/chess: dial tcp: lookup coreos-1: no such host
SHUTIT_TMP#oYCuhA5v>ping !$
ping coreos-1
ping: unknown host coreos-1
SHUTIT_TMP#oYCuhA5v>hostname
coreos-1
SHUTIT_TMP#oYCuhA5v>curl -XGET 10.132.128.22/job/chess
Get http://coreos-1:53442/job/chess: dial tcp: lookup coreos-1: no such host
SHUTIT_TMP#oYCuhA5v>ping coreos-1
ping: unknown host coreos-1
SHUTIT_TMP#oYCuhA5v>cat /etc/resolv.conf 
# This file is managed by systemd-resolved(8). Do not edit.
#
# Third party programs must not access this file directly, but
# only through the symlink at /etc/resolv.conf. To manage
# resolv.conf(5) in a different way, replace the symlink by a
# static file or a different symlink.

nameserver 8.8.8.8
nameserver 2001:4860:4860::8844
nameserver 2001:4860:4860::8888
SHUTIT_TMP#oYCuhA5v>cat /etc/hosts
cat: /etc/hosts: No such file or directory
SHUTIT_TMP#oYCuhA5v>docker ps -a
CONTAINER ID        IMAGE                  COMMAND                CREATED             STATUS                      PORTS                   NAMES
471c7d4a8637        pachyderm/pfs:latest   "/go/bin/master 0-1"   12 minutes ago      Up 12 minutes               0.0.0.0:53442->80/tcp   master-0-1             
ddf8422b9a4e        pachyderm/pfs:latest   "mkfs.btrfs /var/lib   12 minutes ago      Exited (0) 12 minutes ago                           desperate_archimedes   
2a231f90f6ae        pachyderm/pfs:latest   "/go/bin/router 1"     12 minutes ago      Up 12 minutes               0.0.0.0:80->80/tcp      router                 
1db59b30d70d        pachyderm/pfs:latest   "truncate /var/lib/p   12 minutes ago      Exited (0) 12 minutes ago                           suspicious_meitner     
SHUTIT_TMP#oYCuhA5v>logout

Be smarter about blocking for pipelines

Take for example the request:

curl localhost/pipeline/foo/file/bar?commit=c1

this will block until the pipeline foo is run on c1 which is a fine behavior. However there are a bunch of ways this can go wrong.

  • if c1 never exists (i.e. someone made a typo), it will block indefinitely.
  • if the pipeline errors, it will block indefinitely.
  • there are probably other reasons it will block indefinitely.

Blocking indefinitely isn't something we should do.

Document the pfs net protocol

One of the more annoying issues in working with HDFS in the early days was that the net protocol was entirely undocumented. Things improved a little bit with Hadoop 2, when the protocol was switched over to Protobuf objects, but protocol semantics are still mostly only documented by code.

It would be great if pfs documented the client<->server network protocol and semantics early on. Thanks!

Dockerized Map Reduce

Dockerized Map Reduce (DMR) allows submitting jobs as http servers in Docker containers.

would love a getting started tutorial

I started dev'ving on a mac a few years ago - not completely sure why, but iMessage is fun haha. But ya, pfs overall needs a good "getting started" that's universal, and with assumptions, ie

  • this is the operating system/things to install that we recommend for developing
  • this is how you start a local cluster and what you need installed
  • here's some vagrant files/docker files/etc we'll keep updated

would help a ton for a lot of people i bet

Dedupe code in tests

There are a few convenience functions in tests that I've started copy and pasting between them. We should factor them out to a common package of some sort.

pre-pulling images for new machines

When you add a new machine to the cluster, it should look at the /jobs path and pull down all required images from the Docker Registry. You don't want to slow down the job when it runs for the first time because new machines need to pull lots of images.
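
The behavior could be sketched like this (illustrative only; the injected pull function and the flat image list are assumptions, not the real pfs API):

```go
package main

import "fmt"

// prePull sketches the idea above: when a machine joins, walk the images
// referenced under /jobs and pull each distinct one before any job runs.
// The pull function is injected so the actual docker call stays swappable.
func prePull(images []string, pull func(image string) error) error {
	seen := map[string]bool{}
	for _, img := range images {
		if seen[img] {
			continue // each image only needs one pull
		}
		seen[img] = true
		if err := pull(img); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var pulled []string
	prePull([]string{"pachyderm/pfs", "pachyderm/pfs", "busybox"}, func(img string) error {
		pulled = append(pulled, img)
		return nil
	})
	fmt.Println(pulled) // [pachyderm/pfs busybox]
}
```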

Wildcards in paths

Allow making requests such as:

curl localhost/pfs/foo/*

which gets all of the files in a directory. This could conceivably also have a recursive parameter added to it but let's leave that out for 0.2.

Deploying with mesos

Pfs currently only supports deploying on CoreOS, but in keeping with our batteries-included-but-removable approach, pfs is designed to be as loosely coupled to the platform as possible. Thus supporting deployment via mesos should be easy.

All that pfs needs for its deployment is:

  • a way to schedule containers
  • a service discovery mechanism like etcd so that the router can find masters

I don't have a ton of experience with Mesos but I'm pretty sure both of those things are well within its core feature set so it should be very doable.

Job output error

Hello guys,

Firstly, congrats for this idea; it's brilliant!
Then I started using it: I deployed a small PFS on AWS CloudFormation, built my Docker container and so on, and ran a Job.

When I try to get the result, here is what I get:

curl -XGET 127.0.0.1/job/test_map/file/f5e1d55937?commit=a335b6db-4a58-4b79-a414-8c1da269c4ce
Failed request (500 Internal Server Error) to http://ip-172-X-X-X.us-west-2.compute.internal:53442/job/test_map/file/f5e1d55937?commit=a335b6db-4a58-4b79-a414-8c1da269c4ce.

It should use 127.0.0.1 but it's using my hostname somehow ...

Am I doing anything wrong, or is it a bug?
Cheers,

Branching

Allow creation of light-weight branches in the file system.

API

curl [-XPOST, -XPUT] localhost/branch/name?commit=<commit>

using overlayfs?

Hey! I just started looking at pfs and noticed it is using btrfs. I was curious what features it uses and if it could possibly use overlayfs instead. Happy to discuss via email too.

Figure out how to allow deletion of commits

It currently has a skip in it. The problem is that by deleting intermediate commits we destroy the btrfs lineage information that we need to reconstruct pulls. The good news is that this isn't exposed to users right now because we don't expose deleting intermediate commits. Long term we'll have to figure out a solution to this since we definitely want to expose that functionality.

Only run specific jobs with commit?run

Right now, /commit?run will run all jobs in that branch. Instead, you should be able to pass an optional argument to only run specific jobs. This is particularly useful when you're developing a job and want to test it. It'd be nice to run only that job and its dependencies instead of the entire DAG.

Proposed API:
  • all jobs: /commit?run
  • single job: /commit?run=job1
  • multiple jobs: /commit?run=job1&run=job2

Merge design discussion

Pfs currently has branching which is very similar to git's but doesn't have a concept of merging. It's not immediately obvious what merge maps to in pfs. One obvious issue is that big data merge conflicts sound pretty miserable. The only idea I have for how to solve that is to let users submit a container as a merging function.

Don't try to send the .meta directories when using *s

curl localhost/file/*

yields

--54c0e46d0daa62501c32dcb1192fab024dced5c79694e6fc636e65dbc6b4
Content-Disposition: form-data; name=".meta"; filename=".meta"
Content-Type: application/octet-stream

read /var/lib/pfs/vol/data-0-3/master/.meta: is a directory

--54c0e46d0daa62501c32dcb1192fab024dced5c79694e6fc636e65dbc6b4--
--aacee4ccb8343ee523f6756c9682f72a7921929a985e0c1e860fa976532d
Content-Disposition: form-data; name=".meta"; filename=".meta"
Content-Type: application/octet-stream

read /var/lib/pfs/vol/data-1-3/master/.meta: is a directory

--aacee4ccb8343ee523f6756c9682f72a7921929a985e0c1e860fa976532d--
--b00326203901e2f2349f8d6c7e945c342c85f4957327a9b8bbf418662099
Content-Disposition: form-data; name=".meta"; filename=".meta"
Content-Type: application/octet-stream

read /var/lib/pfs/vol/data-2-3/master/.meta: is a directory

--b00326203901e2f2349f8d6c7e945c342c85f4957327a9b8bbf418662099--

Should be a pretty quick fix.
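
The fix could be as small as filtering glob results before streaming them back (a sketch; names are illustrative, not the actual pfs code):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// filterGlob sketches the quick fix above: when expanding a *, drop
// internal bookkeeping entries such as the .meta directory so they are
// never sent in the multipart response.
func filterGlob(matches []string) []string {
	var out []string
	for _, m := range matches {
		if filepath.Base(m) == ".meta" {
			continue
		}
		out = append(out, m)
	}
	return out
}

func main() {
	fmt.Println(filterGlob([]string{"shakespeare.txt", ".meta", "data/foo"}))
	// [shakespeare.txt data/foo]
}
```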

not an issue

Hi,

Can you give some insight in term of how "distributed" part works? I know it is in the code, but would like to have some pointers.

Thanks.

Xiaoyun

Stop hardcoding docker socket

Right now the docker socket is hardcoded which is a portability issue. It should be an environment variable.

Grep for: "unix:///var/run/docker.sock" to find the places we hardcode it.

Noob question: does this work on a Mac?

Eager to try pachyderm out, this is what I'm running into:

on a mac, using boot2docker, in a working docker environment:

running "curl www.pachyderm.io/launch | sh"
results in "Please install btrfs-tools. (apt-get install btrfs-tools)"

however, typing "sudo apt-get install btrfs-tools" produces:

Reading Package Lists... Done
Building Dependency Tree... Done
E: Couldn't find package btrfs-tools

I experimented with running a vanilla ubuntu container, and from there I was able to install btrfs-tools. But then I have to add a docker installation to the container, which leads to weird port errors that are probably due to docker-in-docker strangeness, and which are beyond my ability to debug.

Apologies for putting this usage question in the GitHub issues... I don't know a better place short of emailing the pachyderm project leaders.

Anyways, very cool project.

Discussion thread for http API design

Currently the API exposes 3 endpoints:

  • /file
  • /commit
  • /branch

They all behave somewhat similarly in that they use POST/PUT for creation, GET for access and DELETE for deletion (although /commit and /branch currently don't support deletion).

However they don't all refer to objects the same way. For example:

# creating a file called foo
curl -XPUT pfs/file/foo -d @foo
# creating a branch called foo
curl -XPOST pfs/branch?commit=<commit>&branch=foo

Our scheme right now does have one pretty nice feature, which is that objects of the same type always occur in the same place. Filenames are always part of the URL, branches are always the branch parameter, etc. However, this seems like it might be a local maximum design-wise.

are PORT and IMAGE really necessary for all the scripts?

I think IMAGE could go away; there was actually an issue with it being set, in terms of the expected result of running dev-launch. PORT might be better as an environment variable. The main reason I ask is that a lot more of these scripts could be make commands if there were less flag parsing.

Add a /log route.

Currently the only way to get logs out of a server is to go to the local filesystem and find them. It'd be nice if we could just get them via http. We also might want to expose them in other ways but this is a good start.

replace all 'sudo docker' calls with 'docker'?

Wanted to get the motivation here; maybe there's more to it. I know that only using sudo within a script on commands that require it makes for better permission granularity, but...eh :) having sudo preface all docker commands is a pain for development for the many of us who either have a docker group or do something else on a mac (my docker command is aliased to hell, for example).

Some pointers that could become the noob guide

Alright... I've tried every which way but can't get this to work :-(

Deployed a 3-node cluster to GCE. Everything appears to be correct thus far!

Some curiosities:

fleetctl list-machines
MACHINE     IP              METADATA
8ce43ef5... 10.240.63.167   -
c1ecdd2f... 10.240.66.254   -
e0874908... 10.240.235.196  -

and

fleetctl list-units
UNIT                MACHINE                     ACTIVE  SUB
router.service      8ce43ef5.../10.240.63.167   active  running
router.service      c1ecdd2f.../10.240.66.254   active  running
router.service      e0874908.../10.240.235.196  active  running
shard-0-3:0.service e0874908.../10.240.235.196  active  running
shard-0-3:1.service 8ce43ef5.../10.240.63.167   active  running
shard-0-3:2.service c1ecdd2f.../10.240.66.254   active  running
shard-1-3:0.service c1ecdd2f.../10.240.66.254   active  running
shard-1-3:1.service 8ce43ef5.../10.240.63.167   active  running
shard-1-3:2.service e0874908.../10.240.235.196  active  running
shard-2-3:0.service c1ecdd2f.../10.240.66.254   active  running
shard-2-3:1.service 8ce43ef5.../10.240.63.167   active  running
shard-2-3:2.service e0874908.../10.240.235.196  active  running
storage.service     8ce43ef5.../10.240.63.167   active  exited
storage.service     c1ecdd2f.../10.240.66.254   active  exited
storage.service     e0874908.../10.240.235.196  active  exited

storage.service exited and doesn't look like this:
https://github.com/pachyderm/pfs#checking-the-status-of-your-deploy

and

docker ps
CONTAINER ID        IMAGE                  COMMAND                CREATED             STATUS              PORTS                   NAMES
91e40e9a8564        pachyderm/pfs:latest   "/go/bin/shard 1-3 p   14 minutes ago      Up 14 minutes       0.0.0.0:49157->80/tcp   shard-1-3           
7d3e6b483454        pachyderm/pfs:latest   "/go/bin/shard 0-3 p   14 minutes ago      Up 14 minutes       0.0.0.0:49154->80/tcp   shard-0-3           
5564729cafe8        pachyderm/pfs:latest   "/go/bin/router 3"     14 minutes ago      Up 14 minutes       0.0.0.0:80->80/tcp      router              
090276145d96        pachyderm/pfs:latest   "/go/bin/shard 2-3 p   14 minutes ago      Up 14 minutes       0.0.0.0:49160->80/tcp   shard-2-3

Trying to use pfs commands... It's unclear which host and port I'm supposed to use, so I'm going for whichever CoreOS instance I choose and then port 80.

So I downloaded shakespeare and am trying to create/read it and then will try the Wordcount example
https://github.com/pachyderm/pfs/blob/master/examples/WordCount.md

curl -XPOST localhost/pfs/file/shakespeare -T shakespeare.txt

and then

curl -O localhost/pfs/file/shakespeare

indeed appears to do something, but not what I expect. Running more on shakespeare yields:

Welcome to pfs!

This works:

curl -L http://127.0.0.1:4001/v2/keys/pfs/master
{
    "action": "get",
    "node": {
        "key": "/pfs/master",
        "dir": true,
        "nodes": [
            {
                "key": "/pfs/master/0-3",
                "value": "http://pachyderm-0.c.my-proj.internal:49154",
                "expiration": "2015-06-18T16:23:12.932416509Z",
                "ttl": 31,
                "modifiedIndex": 4932,
                "createdIndex": 406
            },
            {
                "key": "/pfs/master/2-3",
                "value": "http://pachyderm-2.c.my-proj.internal:49161",
                "expiration": "2015-06-18T16:23:04.208915382Z",
                "ttl": 22,
                "modifiedIndex": 4908,
                "createdIndex": 4014
            },
            {
                "key": "/pfs/master/1-3",
                "value": "http://pachyderm-1.c.my-proj.internal:49156",
                "expiration": "2015-06-18T16:23:14.508542222Z",
                "ttl": 32,
                "modifiedIndex": 4967,
                "createdIndex": 2588
            }
        ],
        "modifiedIndex": 351,
        "createdIndex": 351
    }
}

Architecture

I am trying to understand / evaluate Pachyderm, hope some one can take few minutes to clarify few questions:

  • In the HDFS variant, the whole key for performance, if any, was "move computation to the data", which was achieved by maintaining multiple parts of a big file across machines as multiple copies.

How is this addressed in pfs: does it also maintain multiple copies of the same file as multiple blocks / parts (and if yes, how)?

  • In the MapReduce variant of hadoop, for any given job, the same copy of the map-reduce method gets run on all machines, which makes it unsuitable for data-flow computations, and which is not the concept behind 'micro-services'. The core concept of micro-services is to build a larger, complex computation out of small, disparate components/services, which gives more control because you can tune individual micro-services without affecting the whole network.

But it looks like (from the blog here http://www.pachyderm.io/blog.html) the same docker image is copied across, thus effectively losing this advantage of micro-services. Here is the quote from that blog:

You give pfs a Docker image and it will automatically distribute it throughout the cluster next to your data.

Streamline Pachyderm installation

I know that it is dead simple already; 3 commands is pretty cool. I was just wondering if something like the following would be useful:

$ wget -qO- install.pachyderm.io | sh

And even having different shell scripts depending on the platform.

$ wget -qO- install.pachyderm.io/mesos | sh
$ wget -qO- install.pachyderm.io/coreos | sh

Although I understand that some people may prefer to know what exactly they are doing.

Support Consul

https://consul.io/

Consul fills a similar role to the one etcd fills now. It also has a very similar interface, so supporting both should be straightforward.

Make job caching

There are a number of things we can do with timestamps and btrfs commits to help us detect when we're doing work that we've already done.

Implement Block Storage

Right now pfs lays out data in a pretty simple way: each path gets hashed to one of N shards, and that shard stores the data. This is problematic if you want to save a file that's too large to fit on a single machine.

Fortunately I think we can implement this pretty cleanly in terms of the primitives we already have. Rather than writing the file big_file, we create a directory big_file and fill it with the blocks big_file/0, big_file/1, ... big_file/n. Each of these blocks will be a manageable size and will wind up spread out around the cluster, since their paths will hash to different shards; this means computations over the file will be parallelized. Reading the file out is as simple as reading the chunks out in order and concatenating the result.

All of this can be implemented with some small tweaks to router; the more complex shard doesn't need to be touched.
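
The layout above can be sketched like this (illustrative names and hash choice; not the actual pfs code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor mirrors the current layout: each path hashes to one of n shards.
func shardFor(path string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(path))
	return h.Sum32() % n
}

// blockPaths shows how a large file could be split into big_file/0 ...
// big_file/n-1 so that the blocks hash to different shards and spread
// across the cluster.
func blockPaths(file string, blocks int) []string {
	paths := make([]string, blocks)
	for i := range paths {
		paths[i] = fmt.Sprintf("%s/%d", file, i)
	}
	return paths
}

func main() {
	for _, p := range blockPaths("big_file", 4) {
		fmt.Println(p, "-> shard", shardFor(p, 3))
	}
}
```

Because the router already hashes paths to shards, only the path-to-block mapping is new; reads just walk big_file/0, big_file/1, ... in order and concatenate.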

Allow dynamic scaling of clusters

Right now clusters are statically configured and thus can't take advantage of new machines that are added to the cluster. This is a basic piece of functionality that needs to be added to pfs.

This issue will serve as the design discussion for this feature. (Details to come later).

We should plan on having this implemented for 0.4.

Fail `fleetctl start 1Node/*`

CoreOS Beta running on qemu.

2014/11/12 08:00:38 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:38 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
2014/11/12 08:00:38 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:38 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
2014/11/12 08:00:38 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:38 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 400ms
2014/11/12 08:00:38 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:38 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 800ms
2014/11/12 08:00:39 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:39 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 1s
2014/11/12 08:00:40 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:40 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 1s
2014/11/12 08:00:41 ERROR fleetctl.go:171: error attempting to check latest fleet version in Registry: timeout reached
2014/11/12 08:00:41 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:41 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/job/announce-master-0-1.service}, retrying in 100ms
2014/11/12 08:00:41 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:41 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/job/announce-master-0-1.service}, retrying in 200ms
2014/11/12 08:00:41 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:41 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/job/announce-master-0-1.service}, retrying in 400ms
2014/11/12 08:00:41 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:41 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/job/announce-master-0-1.service}, retrying in 800ms
2014/11/12 08:00:42 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:42 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/job/announce-master-0-1.service}, retrying in 1s
2014/11/12 08:00:43 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/12 08:00:43 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/job/announce-master-0-1.service}, retrying in 1s
Error creating units: error retrieving Unit(announce-master-0-1.service) from Registry: timeout reached

Better versioning

A few things we could use versioning-wise:

  • A --version flag: we should be able to pass it to our services to get their versions.
  • A /version route: it would also be nice to get the version from a running service.
  • A versioned API. This might be a bit farther out, but it would be nice to be able to post to pfs/<version>/file/foo and know that I'm using a particular version of the API.
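A minimal sketch of all three ideas in Go (the route names, the /pfs path layout, and the build-time version stamp are assumptions for illustration, not the actual pfs design):

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
)

// version would be stamped at build time, e.g. with
// go build -ldflags "-X main.version=0.4".
var version = "0.4"

// versionedPath prefixes a route with the API version, e.g. /pfs/0.4/file/foo,
// so a client knows exactly which API version it is talking to.
func versionedPath(version, route string) string {
	return "/pfs/" + version + route
}

// newMux registers the /version route and a version-prefixed file route.
func newMux(version string) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/version", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, version)
	})
	mux.HandleFunc(versionedPath(version, "/file/"), func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "api %s: %s\n", version, r.URL.Path)
	})
	return mux
}

func main() {
	// --version flag: print the version and exit instead of serving.
	showVersion := flag.Bool("version", false, "print version and exit")
	flag.Parse()
	if *showVersion {
		fmt.Println(version)
		return
	}
	mux := newMux(version)
	_ = mux // in a real service: http.ListenAndServe(":8080", mux)
	fmt.Println(versionedPath(version, "/file/foo"))
	// → /pfs/0.4/file/foo
}
```

Stamping the version once at build time and deriving the flag output, the /version route, and the URL prefix from it keeps all three in sync.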
