
dvid's Introduction


DVID is a Distributed, Versioned, Image-oriented Dataservice written to support neural reconstruction, analysis and visualization efforts at HHMI Janelia Research Center. It provides storage with branched versioning of a variety of data necessary for our research including teravoxel-scale image volumes, JSON descriptions of objects, sparse volumes, point annotations with relationships (like synapses), etc.

Its goal is to provide:

  • A framework for thinking of distribution and versioning of large-scale scientific data similar to distributed version control systems like git.
  • Easily extensible data types (e.g., annotation, keyvalue, and labelmap in figure below) that allow tailoring of APIs, access speeds, and storage space for different kinds of data.
  • The ability to use a variety of storage systems via plugin storage engines, currently limited to systems that can be viewed as (preferably ordered) key-value stores.
  • A stable science-driven HTTP API that can be implemented either by native DVID data types or by proxying to other services.

High-level architecture of DVID

How it differs from other versioned data systems:

  • DVID handles large-scale data: billions or more discrete units of data. At this scale, storing so many files is difficult on a local file system and imposes heavy load even on shared file systems. Cloud storage is always an option (and available in some DVID backends), but it adds latency and doesn't reduce the transfer time for such large numbers of files or data chunks. Database systems (including embedded ones) handle this by consolidating many bits of data into larger files, an approach also described as sharding.
  • All versions are available for queries. There is no checkout to read committed data.
  • The high-level science API uses pluggable datatypes. This allows clients to operate on domain-specific data and operations rather than operations on generic files.
  • Data can be flexibly assigned to different types of storage, so tera- to peta-scale immutable imaging data can be kept in cloud storage while smaller, frequently mutated label data can be kept on fast local NVMe SSDs. This also allows data to be partitioned across databases by data instance. Our recent datasets primarily hold local data in Badger embedded databases, also written in the Go language.
  • (Work in progress) A newer storage backend (DAGStore) will allow "chained storage" such that data published at a particular version, say on AWS Open Data, could be reused for later versions with only new modifications stored locally. This requires extending storage flexibility to versions of data across storage locations. DAGStore will greatly simplify "pull requests" where just the changes within a set of versions are transmitted between separate DVID servers.

While much of the effort has been focused on the needs of the Janelia FlyEM Team, DVID can be used as a general-purpose branched versioning file system that handles billions of files and terabytes of data by creating instances of the keyvalue datatype. Our team uses the keyvalue datatype for branched versioning of JSON, configuration, and other files using the simple key-value HTTP API.
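As a rough illustration, here is a minimal sketch of that key-value HTTP API using Python's requests library. The server address, UUID, instance name "configs", and key "pipeline.json" are hypothetical placeholders, and the /key/<key> endpoint path follows the keyvalue datatype's documented API:

import json
import requests

server = "http://localhost:8000"
uuid = "abc123"  # a version node in the repo's DAG

# Store a JSON document under a key at this version.
payload = {"stage": "segmentation", "threshold": 0.5}
r = requests.post(f"{server}/api/node/{uuid}/configs/key/pipeline.json",
                  data=json.dumps(payload))
r.raise_for_status()

# Read it back from the same version.
r = requests.get(f"{server}/api/node/{uuid}/configs/key/pipeline.json")
print(r.json())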

DVID aspires to be a "github for large-scale scientific data" because a variety of interrelated data (like image volume, labels, annotations, skeletons, meshes, and JSON data) can be versioned together. DVID currently handles branched versioning of large-scale data and does not provide domain-specific diff tools to compare data from versions, which would be a necessary step for user-friendly pull requests and truly collaborative data editing.

Installation

Users should install DVID from the releases. The main branch of DVID may include breaking changes required by our research work.

Developers should consult the install README where our conda-based process is described.

DVID has been tested on MacOS X, Linux (Fedora 16, CentOS 6, Ubuntu), and Windows Subsystem for Linux (WSL2). It comes out of the box with several embedded key-value databases (Badger, Basho's leveldb) for storage, although you can configure other storage backends.

Before launching DVID, you'll have to create a configuration file describing ports, the types of storage engines, and where the data should be stored. Both simple and complex sample configuration files are provided in the scripts/distro-files directory.

Basic Usage

Some documentation on how to start the DVID server is available on the DVID wiki. While the wiki's User Guide provides simple console-based toy examples, our team's actual use of DVID services is much more complex due to our variety of clients and script-based usage. Please see the neuclease python library for more realistic ways to use DVID at scale, particularly for larger image volumes.

More Information

Both high-level and detailed descriptions of DVID and its ecosystem can be found here:

DVID is easily extensible by adding custom data types, each of which fulfills a minimal interface (e.g., HTTP request handling). DVID's initial focus is on efficiently handling data essential for Janelia's connectomics research:

  • image and 64-bit label 3d volumes, including multiscale support
  • 2d images in XY, XZ, YZ, and arbitrary orientation
  • multiscale 2d images in XY, XZ, and YZ, similar to quadtrees
  • sparse volumes, corresponding to each unique label in a volume, that can be merged or split
  • point annotations (e.g., synapse elements) that can be quickly accessed via subvolumes or labels
  • label graphs
  • regions of interest represented via a coarse subdivision of space using block indices
  • 2d and 3d image and label data using Google BrainMaps API and other cloud-based services

Each of the above is handled by built-in data types via a Level 2 REST HTTP API implemented by Go language packages within the datatype directory. When dealing with novel data, we typically use the generic keyvalue datatype and store JSON-encoded or binary data until we understand the desired access patterns and API. When we outgrow the keyvalue type's GET, POST, and DELETE operations, we create a custom datatype package with a specialized HTTP API.
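As a concrete sketch, the snippet below fetches a small subvolume through the raw endpoint quoted later on this page (GET /api/node/<UUID>/<data name>/raw/<dims>/<size>/<offset>). The server address, UUID, and "grayscale" instance name are placeholders, and we assume a uint8 grayscale volume returned as a packed byte array with X varying fastest:

import numpy as np
import requests

server = "http://localhost:8000"
uuid = "abc123"
size, offset = (64, 64, 64), (0, 0, 0)  # voxels along (X, Y, Z)

url = (f"{server}/api/node/{uuid}/grayscale/raw/0_1_2/"
       f"{'_'.join(map(str, size))}/{'_'.join(map(str, offset))}")
r = requests.get(url)
r.raise_for_status()

# Interpret the packed bytes as a (Z, Y, X) numpy array.
vol = np.frombuffer(r.content, dtype=np.uint8).reshape(size[::-1])
print(vol.shape, vol.mean())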

DVID allows you to assign different storage systems to data instances within a single repo, which allows great flexibility in optimizing storage for particular use cases. For example, easily compressed label data can be stored in fast, expensive SSDs while larger, immutable grayscale image data can be stored in petabyte-scale read-optimized systems like Google Cloud Storage.

DVID is written in Go and supports pluggable storage backends, a REST HTTP API, and command-line access (likely to be minimized in the near future). Some components written in C, e.g., storage engines like leveldb and fast codecs like lz4, are embedded or linked as libraries.

Command-line and HTTP API documentation can be found in help constants within packages or by visiting the /api/help HTTP endpoint on a running DVID server.
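For example, a client can dump that documentation programmatically; the server address below is a placeholder:

import requests

print(requests.get("http://localhost:8000/api/help").text)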

Monitoring

Mutations and activity logging can be sent to a Kafka server. We use kafka activity topics to feed Kibana for analyzing DVID performance.

Snapshot of Kibana web page for DVID metrics
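A minimal sketch of tailing such a topic with the kafka-python package; the broker address and the "dvid-activity" topic name are hypothetical placeholders for whatever your DVID configuration publishes to:

from kafka import KafkaConsumer

consumer = KafkaConsumer("dvid-activity",
                         bootstrap_servers=["kafka.example.org:9092"])
for message in consumer:
    print(message.value.decode("utf-8"))  # one activity record per message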

Known Clients with DVID Support

Programmatic clients:

  • neuclease, python library from HHMI Janelia
  • intern, python library from Johns Hopkins APL
  • natverse, R library from Jefferis Lab
  • libdvid-cpp, C++ library from HHMI Janelia FlyEM

GUI clients:

Screenshot of an early web app prototype pulling neuron data and 2d slices from 3d grayscale data:

Web app for 3d inspection being served from and sending requests to DVID

dvid's People

Contributors

dependabot[bot], docsavage, dylex, neomorphic, nmlgc, olbris, pgunn, rivo, stephenplaza, stuarteberg, tartavull, yarikoptic


dvid's Issues

Lightning MDB storage engine gets slower as more labelmap indices are added.

Unlike leveldb variants, computation of spatial and label indices gets slower over time:

2014/04/02 00:32:01 Adding spatial information from label volume superpixels for mapping sp2body...
2014/04/02 00:32:46 Processed all superpixels blocks for layer 1/205: 44.93275195s
2014/04/02 00:33:43 Processed all superpixels blocks for layer 2/205: 56.618448829s
…
2014/04/02 01:00:27 Processed all superpixels blocks for layer 21/205: 2m17.899856337s
…
2014/04/02 04:28:03 Processed all superpixels blocks for layer 56/205: 10m13.924529914s
…
2014/04/02 08:46:46 Processed all superpixels blocks for layer 76/205: 14m52.106129852s
2014/04/02 09:01:51 Processed all superpixels blocks for layer 77/205: 15m5.189581523s

Track down this issue and see whether it's inefficient processing independent of the storage engine or something that lmdb handles poorly compared to leveldb.

Request for API name change: "schema" should be "metadata"

In the DVID REST API, the following call returns a json file describing the axes, resolution, pixel type, etc.

/api/node/<UUID>/<data name>/schema

The word "schema" here is misleading, because that word is traditionally used to describe the structure of e.g. a json or xml tree. This data is not a schema in that sense -- I think the better term is "metadata".

Also: at some point, we will start publishing the json schemas for the messages produced by certain DVID API calls. To avoid confusion, we should not overload the term "schema" for API calls that return anything other than a true json schema.

sparsevol denormalization in splitting

When writing splits of a body back into DVID, the old body, which should be completely gone, left some fragments in sparsevol. The body labels are fine.

Create labelmap identity from labels64

When generating labels via segmentation, labels64 is the natural datatype to use. When revising said segmentation, it needs to be in labelmap form. Please provide a mechanism to create a label map instance from a labels64 instance.

Don't amplify bad labelmap

If Raveler has a bad label mapping, e.g., superpixel X is present in raster but is not present in labelmap, don't abort processing as soon as it's hit. Instead, give the bad superpixel a body 0 label and continue processing other voxels in the block.

labelmap find closest representative point for a label

Please provide an API that allows me to retrieve a representative point for a given label. The client will specify a label and their current coordinates; DVID should return a point where that label can be found. The returned point should ideally be close to the provided point.

labelmap raw does not equal labels64 raw

The data returned by calling raw on the labelmap instance, sp2body, differs from the data returned by the labels64 instance, bodies, in the FIB25 stack. I seem to get only a single label id back when calling the labelmap.

Low-res 3D body viewer

There should be a low-resolution version of the sparse volume viewer that will render typical cell shapes in a fraction of a second. The denormalizations supporting this viewer should be updated efficiently when label merges are performed.

Two sub requirements:

a) The client should be able to specify a bounding box (often just a plane) that will be displayed in the viewer.

b) The user should be able to retrieve rough x,y,z coordinates by picking locations on the body.

It would be good to have this ready within the next few weeks, but the slower viewer is probably tolerable for now.

ROI not giving a regular substack size

The following should give 512x512x512 regions. The first substack returned is larger in Z

curl emdata2:8000/api/node/628/mbroi/partition?batchsize=16

Support schema validation for all messages.

We should start specifying the schema for the messages sent to/from dvid, and dvid should validate the schemas. Most likely, the schemas will be stored in a separate repo (for example, dvidschemas), and pulled into dvid as a submodule as part of the dvid build.

MIME type of "schema" json response should just be "application/json"

When requesting a data volume "schema" (a.k.a. metadata -- see Issue #11), the response comes back with MIME type application/vnd.dvid-nd-data+json. But this call does not include any binary ND data -- it is pure json.

The MIME type of the response should therefore be application/json, and the application/vnd.dvid-nd-data+json MIME type should be reserved for actual volume data as requested in this GET request:

GET  /api/node/<UUID>/<data name>/raw/<dims>/<size>/<offset>[/<format>]

Allow background processes for batch jobs like tile generation, etc.

Non-interactive requests might have to be flagged by the client because it depends on how clients use DVID. For example, Steve has a cluster job status system that polls the keyvalue type.

To accommodate these cases, I've added a query string "interactive=0" or "interactive=false" that allows a client to mark a call as non-interactive. Fixed in ef5af7d

make test failed on Mac OS X 10.7.5

Scanning dependencies of target test

github.com/janelia-flyem/dvid

runtime.main: call to external function main.main
runtime.main: undefined: main.main

API enhancement: blockshape in nd-data volume metadata

In the ND-data API, it would be nice if the volume metadata also included information about the native block shape. This would allow clients to (optionally) choose efficient, block-aligned request boundaries when requesting lots of ND data.

Support Scality as BigData key-value store

Because the Scality sproxyd driver is not an ordered key-value store, I'll have to store keys in a separate, fast store (a SmallData store) or do a brute-force check on every key within a range.

If we do have to store keys, it makes sense to implement content addressable hashing for versioning since we are already paying the extra round-trip to get keys.
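For example, a content-addressable scheme might derive each stored object's name from a digest of its value, so identical payloads across versions map to the same object; this is only a sketch of the idea:

import hashlib

value = b'{"threshold": 0.5}'             # payload to store
key = hashlib.sha256(value).hexdigest()   # digest doubles as the object name
print(key)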

Make API more robust to errors

When an incorrectly formatted API call is given, e.g., a 2d size is given for a 3d request, the server errors and recovers incompletely, failing to fulfill later requests even though the system mostly stays up.

Example:

GET "/api/node/bf1/bodies/raw/0_1_2/749_617/2714_3292_2440"

The size is 2d and causes panic on conversion to 3d point.

Support isotropic tile generation from anisotropic data.

Request from Stephan Gerhard. If voxel data is anisotropic, XZ and YZ tile generation can optionally produce isotropic tiles, which will require interpolation from the original voxel data. There are bandwidth and CPU-utilization arguments for not doing interpolation on the server side and instead just transmitting anisotropic tiles that get scaled appropriately client-side.

After more thought, I think the default behavior should be non-isotropic tile generation, but if an "isotropic=true" parameter is supplied to the tile generation command, DVID will produce isotropic data.
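As a sketch of the client-side alternative, nearest-neighbor stretching of an anisotropic XZ tile is a one-liner in numpy; the 3.1 nm XY and 40 nm Z resolutions below are example values borrowed from metadata elsewhere on this page:

import numpy as np

xy_res, z_res = 3.1, 40.0
factor = int(round(z_res / xy_res))   # ~13x stretch along Z

tile = np.zeros((50, 512), dtype=np.uint8)    # (Z, X) anisotropic tile
isotropic = np.repeat(tile, factor, axis=0)   # (650, 512), nearly isotropic
print(isotropic.shape)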

Problems with Atlantic time?

Building dvid gives the following error on Fedora 20:
--- FAIL: TestParseInSydney (0.00 seconds)
format_test.go:201: ParseInLocation(Feb 01 2013 EST, Sydney) = 2013-02-01 00:00:00 +0000 EST, want 2013-02-01 00:00:00 +1100 AEDT
FAIL
FAIL time 2.508s
ok unicode 0.013s
ok unicode/utf16 0.002s
ok unicode/utf8 0.003s
? unsafe [no test files]
make[3]: *** [/home/jah/BUILDEM/src/golang-1.3.1-stamp/golang-1.3.1-stupid_step] Error 1
make[2]: *** [CMakeFiles/golang-1.3.1.dir/all] Error 2
make[1]: *** [CMakeFiles/dvid.dir/rule] Error 2
make: *** [dvid] Error 2

Refactor build process

Currently, there is a mix of "go get" in the CMakeLists.txt and go package dependencies that are locked via a git repo, "github.com/janelia-flyem/go". The latter is preferable because multiple version control systems under the "go get" umbrella, e.g., hg and bazaar, do not have to be installed on target computers. Also, we can lock down particular versions of the go packages.

Issues with current system:

  • Mix of go package inclusions across CMake and via the janelia-flyem/go repo. Should only be one, preferably a "dvid-deps" repo with all versions of all dependencies.
  • Dependencies of included go packages will reference packages outside the janelia-flyem/go repo. It would be better to reference the standard import path and use GOPATH=myrepos:$GOPATH to prioritize our locked versions of packages. This is how go-deps works. Currently, we must modify import paths in source code.

References to various Go dependency and build approaches:

Expand API grayscale / label64 ND-volume GET/POST to indicate whether DVID is busy

It should be possible for a GET/POST request of an ND volume to result in a 'busy' status if DVID is in fact busy with another GET/POST request. Perhaps adding a new URI for such a call would be sufficient (otherwise you might need to check the size of a request to see whether it is a small ND volume or not). While a sophisticated, global log system could better indicate "load" on a DVID server, a datatype-specific queue is probably more than sufficient.

Discussion: constraints on 'voxels' datatype

I'm wondering if perhaps the 'voxels' datatype specification is a little more flexible than we need. I think clients would benefit from a modest simplification. For purposes of discussion, here's an example metadata request and the corresponding json response:

GET  /api/node/abc123/my_rgb_volume/metadata

...

{
    "Axes": [
        {
            "Label": "X",
            "Resolution": 3.1,
            "Units": "nanometers",
            "Size": 100
        },{
            "Label": "Y",
            "Resolution": 3.1,
            "Units": "nanometers",
            "Size": 200
        },{
            "Label": "Z",
            "Resolution": 40,
            "Units": "nanometers",
            "Size": 400
        }
    ],
    "Values": [
        {
            "DataType": "uint8",
            "Label": "intensity-R"
        },
        {
            "DataType": "uint8",
            "Label": "intensity-G"
        },
        {
            "DataType": "uint8",
            "Label": "intensity-B"
        }
    ]
}

In the example above, all three channels ("Values") are uint8. However, the current API seems to allow each pixel to be composed of channels with multiple datatypes. That is, "R" could be uint8 while "G" could be float32. This means that clients can't treat the resulting data as a simple ND array. While that isn't impossible to deal with, it complicates the clients' job.

In numpy, for example, one could use structured arrays, but I'm not quite sure if the data can be copied directly to/from the raw buffer returned by DVID (I'd have to do some experiments). In C++, even more manual work has to be done. I don't think VIGRA (for example) has a way of dealing with such data directly. It would likely need to be copied into separate arrays anyway.
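For instance, a mixed uint8/float32 pixel can only be described in numpy as a structured array, whose raw buffer has no uniform element type; this is a sketch with hypothetical channel names:

import numpy as np

mixed = np.dtype([("R", np.uint8), ("G", np.float32)])
pixels = np.zeros((200, 100), dtype=mixed)

print(pixels.dtype.itemsize)  # 5 bytes per pixel: no simple ND view exists
print(pixels["G"].dtype)      # float32 view of a single channel

# With a single datatype, the same data is just a plain (Y, X, C) array:
uniform = np.zeros((200, 100, 3), dtype=np.uint8)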

Are there any known use cases for the ability to mix pixel types within a single image? If not, I propose disallowing it. If we disallow it, the DVID metadata response will look something like the following. (As a side note, I think the term "channels" is more descriptive than "values" in this context -- but that's a minor detail.)

{
    "Axes": [
        {
            "Label": "X",
            "Resolution": 3.1,
            "Units": "nanometers",
            "Size": 100
        },{
            "Label": "Y",
            "Resolution": 3.1,
            "Units": "nanometers",
            "Size": 200
        },{
            "Label": "Z",
            "Resolution": 40,
            "Units": "nanometers",
            "Size": 400
        }
    ],
    "DataType": "uint8",
    "Channels": ["intensity-R", "intensity-G", "intensity-B"]
}

Or if you want to get a little more fancy, we can still leave room for additional per-channel metadata, such as the range of possible values in each channel (if it happens to be known):

{
    "Axes": [
        {
            "Label": "X",
            "Resolution": 3.1,
            "Units": "nanometers",
            "Size": 100
        },{
            "Label": "Y",
            "Resolution": 3.1,
            "Units": "nanometers",
            "Size": 200
        },{
            "Label": "Z",
            "Resolution": 40,
            "Units": "nanometers",
            "Size": 400
        }
    ],
    "DataType": "float32",
    "Channels": [
        {
            "Label": "indicator-red",
            "Range": [0.0, 100.0]
        },
        {
            "Label": "indicator-green",
            "Range": [0.0, 750.0]
        }
    ]
}

Permit fast voxel block retrieval API

Rather than have DVID process internal voxel blocks into requested subvolumes and planes, allow a lower-latency API call: given a block index and a number of blocks along x, return the block data, optionally in its default compressed form. This minimizes processing on the DVID side. This also fits how Ting requests grayscale data from arbitrarily shaped bodies.

Add ability to choose compression per data instance.

DVID currently allows a choice of Snappy or LZ4 for compressing data into the key/value store. Allow selection of compression per data instance, storing the selected compression in that instance's record. This is also a first step toward returning compressed data without processing on the DVID side for a request, e.g., storing gzipped tiles that are simply returned on request.

Also add gzip and possibly bzip2 compression as options, including the ability to select the level of compression from 1 (fastest) to 9 (most compression).

New endpoint for ROI as voxels nd-data

For simplicity and rapid prototyping, there should be an endpoint for accessing ROI datasets via the usual voxels nd-data API. Specifically:

  • Available as plain ND-data, not RLE.
  • For simplicity, the mask data should be provided at full resolution, just like the grayscale data. Yes, this wastes space on the wire, because the ROI is defined block-wise. But it is dirt simple and will let us move quickly to start using ROI masks right away.
  • Data should be of type uint8, where 1 means "inside the roi" and 0 means "outside the roi" (see the sketch after this list)
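As a sketch of client-side usage under this proposal, a full-resolution uint8 mask composes directly with a grayscale volume of the same shape; both arrays below are placeholders:

import numpy as np

gray = np.random.randint(0, 256, size=(64, 64, 64), dtype=np.uint8)
mask = np.zeros((64, 64, 64), dtype=np.uint8)
mask[16:48, 16:48, 16:48] = 1   # 1 = inside the ROI, 0 = outside

masked = gray * mask            # zero out voxels outside the ROI
print(masked[mask == 1].mean())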

http status code refinements

Right now DVID returns 400 (Bad Request) when the client requests an item that does not exist on the server. In such cases, a 404 error (Not Found) might make more sense. For example, what status code should DVID return in response to the following query?

GET /api/node/my_dset/doesnt_exist/metadata

On a related note, there's also a question regarding what DVID should return when the user has posted data, and DVID sends back an empty response. For example:

POST /api/node/abc123/mydata/keyA

If successful, DVID will return an empty response body. Should the status code be 200 (OK) or 204 (No Content)? I can see arguments for both cases.

nd-data API returns all zeros

I have a 1020x1020x1020 dataset named "gigacube" which I've initialized as follows:

dvid node a7e6 gigacube load 0,0,0 "/magnetic/gigacube_pngs/*.png"

Requesting a .png works just fine:

http://localhost:8000/api/node/a7/gigacube/raw/0_1/512_256/0_0_100


But when I attempt to request raw nd-data, I get back all zeros. For example, using dvidclient, I can check the returned data:

In [9]: from dvidclient.volume_client import VolumeClient
In [10]: vol_client = VolumeClient( "localhost:8000", "a7", "gigacube" )
In [11]: cutout_array = vol_client.retrieve_subvolume( (0,0,0,0), (1,100,100,100) )
In [12]: print cutout_array.sum()
0

The REST API call used in the above example is something like this:

http://localhost:8000/api/node/a7/gigacube/raw/0_1_2/100_100_100/0_0_0

DVID is sending back a properly formatted message, with the correct buffer size. In the dvid server log, I see nothing unusual:

2014/03/18 17:40:21 HTTP GET: 3d volume (100,100,100) at offset (0,0,0) (/api/node/a7/gigacube/raw/0_1_2/100_100_100/0_0_0): 10.88681ms

Bug in "schema" (a.k.a. metadata) json

When requesting the metadata for a grayscale uint8 volume, I received the following json. Note that the "Values" section lists the pixel type as "T" instead of "uint8". That's a bug, right?

{
  "Axes": [
    {
      "Label": "X",
      "Resolution": 10,
      "Units": "nanometers",
      "Size": 900,
      "Offset": 0
    },
    {
      "Label": "Y",
      "Resolution": 10,
      "Units": "nanometers",
      "Size": 1000,
      "Offset": 0
    },
    {
      "Label": "Z",
      "Resolution": 10,
      "Units": "nanometers",
      "Size": 800,
      "Offset": 0
    }
  ],
  "Values": [
    {
      "T": 0,
      "Label": "grayscale"
    }
  ]
}


Revise volume creation parameters for REST API

Right now the parameters used to create a new volume in dvid are not well documented. But beyond that, it would be nice if the client could specify exactly what the datatype of the pixels is. For example, all of the information in the metadata json should be provided when creating a new volume.

This would require at least the following enhancements:

  • DVID needs to support voxels data with an arbitrary number of channels (currently the client is limited to the predefined datatypes, e.g. grayscale8, rgba8).
  • DVID needs to support float32 as a pixel type
