juliacomputing / datasets.jl Goto Github PK

View Code? Open in Web Editor NEW

107.0 20.0 3.0 794 KB

License: MIT License

Julia 100.00%

datasets.jl's Introduction

DataSets

DataSets helps make data wrangling code more reusable.

We want to make it easy to relocate an algorithm between different data storage environments without code changes. For example from your laptop to the cloud, to another user's machine, or to an HPC system.
We want to reduce coupling between data and code, by storing rich type information in metadata. Metadata bridges the gap between the ad hoc implicit type system of data outside your program and the Julia data structures within your program.

Watch DataSets.jl talk from JuliaCon 2021, or read the latest documentation more information.

Development

datasets.jl's People

Contributors

Stargazers

Watchers

Forkers

jing-xinxing aerdely sidpku

datasets.jl's Issues

Why do I need to `open` a dataset twice?

This workflow feels funny. Am I doing this wrong?

julia> blob = open(Blob, dataset("us_counties"))
📄 data @ JuliaHub/bcf2ed95-b0a2-40bf-8d62-12a0de4e2a44/v1

julia> df = open(io->CSV.read(io, DataFrame), IO, blob)
1105438×6 DataFrame
...

Is there an easier way to CSV.read a blob?

File materialization functions

Conceptually, our datasets don't necessarily have an underlying file object on disk (e.g. the datasets fully stored in TOML files). As such, we only offer access to their contents via an ::IO object. However, sometimes you need to access it a File as a file system file (e.g. #60). We could export two file "materialization" functions:

save(::File/FileTree, path::String): takes the File or FileTree object and writes it to path. The purpose of these functions is to allow the extraction of datasets from data repositories.
materialize(::File): returns a file path; the purpose is to give (read) access to the dataset as a file system file. For datasets that already have a file somewhere on disk anyway, return that. For datasets that don't (e.g. TOML file, remote datasets), it would automatically create a temporary file (lifetime is the Julia session; would fix #60).

Addressing versioned data

Suppose I want to use git to store and version a dataset. How would I open a particular version of that dataset?

One option mentioned elsewhere by @pfitzseb would be to have dataset("name") load the latest version and have some syntax like dataset("name@v2") / dataset("name#hk98s2") load a specific version/hash, much like Pkg.

Another similar idea would be to add keyword arguments to dataset(). But I do think there's some benefits to using syntax within the string rather than keywords. URLs show how useful a standard string representation of resources-with-parameters can be.

The URN RFC is a good source of inspiration here. In particular they specify three sets of parameters, the r-component, q-component and f-component - see https://datatracker.ietf.org/doc/html/rfc8141#page-12 :

r-component - parameters passed to the name resolver. (For us, this corresponds to passing parameters to the AbstractDataProject.)
q-component - parameters passed to either the named resource or a system that can supply the requested service. The q-component is specified to have the same syntax as the query part of a URL. (For us, I guess this corresponds to passing parameters to the storage backend when it open()s the dataset.)
f-component - interpreted by the client as a specification for a location within, or region of, the named resource; similar to the fragment of a URL. (For us, this would be parameters applied to the object which comes from open()ing a dataset. For example, to supply a relative path within a BlobTree.)

While I think the URN RFC has some useful concepts I'm not super keen on their syntax which is like URI syntax but confusingly subtly different, with the normal query part prefixed with ?= as name?+rcomponent?=qcomponent#fcomponent.

But I'm also not sure the Pkg syntax is quite what we want. For packages it's useful to make versioning very central in the syntax to the extent of taking up two different types of syntax just to specify versions. Unlike Pkg I think there could be other parameters we might want to pass when addressing data storage, not just a version.

Better interoperation with Parquet files

This is a joint DataSets and Parquet issue — at the root of issue #59 is really that Parquet.jl is currently entirely file-based (cf JuliaIO/Parquet.jl#145). I would love it to have a more seamless (and efficient!) way to work with Parquet files directly by streaming them and/or only grabbing the parts I need.

Ambiguity with open(::Function, ::DataSet)

julia> open(io->CSV.read(io, DataFrame), dataset("us_counties"))
ERROR: MethodError: open(::var"#13#14", ::DataSet) is ambiguous. Candidates:
  open(f::Function, args...; kwargs...) in Base at io.jl:322
  open(as_type, conf::DataSet) in DataSets at /home/jrun/data/.julia/packages/DataSets/ssxgC/src/DataSets.jl:278
Possible fix, define
  open(::Function, ::DataSet)
Stacktrace:
 [1] top-level scope at REPL[48]:1

Creating, deleting and updating datasets

We need some programmatic way to create datasets, to update their metadata and to delete them. Currently people need to manage this manually by writing TOML but clearly this isn't great.

API musings

One possibility is to overload the dataset() function itself with the ability to create a dataset. For example adding a create=true flag:

dataset("SomeData", create=true, tags=[...], description="some desc", other_args...)
dataset(project, "SomeData", create=true, tags=[...], description="some desc", other_args...)

Another idea would be to pass a verb along as a positional argument, such as

dataset("SomeData", :create; description="some desc", other_args...)
dataset("SomeData", :delete)
dataset("SomeData", :update, description="new desc")

With :read being the default verb. This allows us to reuse the exported dataset() function for all dataset-related CRUD operations.

But let's be honest this is little weird other than being economical with exported names. Perhaps I've been doing too much REST recently :-) Probably a better alternative would be to just have a function per operation:

DataSets.create("SomeData"; description="some desc", other_args...)
DataSets.delete("SomeData")
DataSets.update("SomeData", description="new desc")

update() is a bit of an odd one out of these operations — what if you wanted to delete some metadata? I guess we could pass something like description=nothing for deleting metadata items.

Which data project?

When creating a dataset it needs to be created within "some" data project. Presumably this would be the topmost project in the data project stack, or within a provided project if the project is supplied as the first argument.

Data ownership

Creation — and especially deletion — brings up an additional problem: How do we distinguish between data which is "owned" by a data project (so that the data itself should be deleted when the dataset is removed from the project), vs data which is merely linked to?

For existing data referenced on the filesystem this is particularly relevant. We don't want datasets() to delete somebody's existing data which they're referring to. But neither do we want DataSets.delete() to leave unwanted data lying around.

I think we should have an extra metadata key to distinguish between data which is managed-vs-linked-to by DataSets. Perhaps under the keys linked, or managed or some such. (Should this go within the storage section or not?)

Processing pipeline on dtype level

We want to be able to declare some amount of data processing on the level of DataSets. This is very much related to the users being able to declare layers of processing to help interpret a dataset (e.g. that a random set of bytes is actually a table in a CSV format; c.f. #17).

I would suggest fundamentally thinking about this on the level of dtypes (i.e. File, FileTree right now, but we want others like Table and Image too). That is, it conceptually a function taking one dtype in, and producing another one (possibly the same one). The implementation for each of these processors simply relies on the abstract interface of the dtype (e.g. an IO stream for File, or a Tables.jl-interface for a Table input).

A few examples:

Decompressing a file is a decompress(::File) -> File operations.
Same for decrypting: decrypt(::File; key=...) -> File. But sometimes you need to pass options as well, as we can't always automagically infer everything.
Unpacking a tarball: unpack(::File) -> FileTree.
Parsing a table or an image: table(::File) -> Table or image(::File) -> Image.
Even accessing files from a FileTree is really just a FileTree -> FileTree / File operation.

This quite naturally lends itself to forming a pipeline

open(File, dataset("an_encrypted_tar_gz")) |> decrypt(key = ...) |> decompress |> unpack

I imagine such operations be implemented in a separate packages. They depend on DataSets, and any other packages providing dtypes. On the other hand, they would mostly be interface packages for other package (e.g. DataSetTables.jl would probably depend on multiple tabular data formats, like CSV.jl and Arrow.jl).

I also imagine that this pipeline will largely be implemented lazily, although that would be an implementation detail.

Declared layers

Once you have that general logic of transforming between dtypes, you can take advantage of this to implement the layers of #17. Each layer is just a call to one of these processors.

A question is how to declare this in the metadata (e.g. TOML file). One possibility is to declare the Julia function in processor = "DataSetTables.table". It should then have a method (::File, config::Dict) -> Table method. config would correspond to (optional) configuration parameters defined in the metadata.

For first iteration, I wouldn't worry too much about code loading. It's up to the user to make sure they have the correct package / module in the Project.toml and loaded. At some point we could add package UUIDs and compat checks to the Data.toml.

Allow indexing FileTree with string keys with path components in them

Currently:

julia> filetree["path/to/file.ext"]
ERROR: Path components cannot contain '/' or '\' (got "path/to/file.ext")
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] joinpath(path::DataSets.RelPath, xs::String)
   @ DataSets ~/data/.julia/packages/DataSets/DWg6S/src/paths.jl:28
 [3] getindex(tree::BlobTree{JuliaHubData.DataRepoCache}, name::String)
   @ DataSets ~/data/.julia/packages/DataSets/DWg6S/src/BlobTree.jl:326
 [4] top-level scope
   @ REPL[96]:1

You need to use the special @path_str macro or explicitly construct a RelPath for this to work. It'd be nice to just allow using the complete key as a string.

Fast and flexible iteration and filtering of files in `BlobTree`s

I am working on a dataset with many files (over half a million) and with an intricate file structure. I don't need to use all the files at once. In fact, in different phases of the project, I will need to select different subsets of files. For now, I am using information encoded in the file names or in the file path (like the subdirectory under which a file lives) to decide which files to use, but, ideally, I am looking forward to fast and flexible ways to interact with DataSets.jl datasets.

I have created the following toy TOML-embedded dataset to illustrate the case:

data_config_version = 1

[[datasets]]
description = "Letters data tree embedded in a TOML file"
name = "embedded_letters"
uuid = "b941d775-105a-4606-868b-f81bd02adbe0"
# TOML.print(Dict("datasets" => [Dict("storage" => Dict("data" => Dict("letters" => Dict("a" => Dict("aa" => Dict("aa.file" => "aa file")),"b" => Dict("b.file" => "b file","bb" => Dict("bbb" => Dict("bbb.file" => "bbb file","bbb.filo" => "bbb filo",),),),),),),),],),)

    [datasets.storage]
    driver = "TomlDataStorage"
    type = "BlobTree"

        [datasets.storage.data.letters.b]
        "b.file" = "b file"

            [datasets.storage.data.letters.b.bb.bbb]
            "bbb.file" = "bbb file"
            "bbb.filo" = "bbb filo"

        [datasets.storage.data.letters.a.aa]
        "aa.file" = "aa file"

Imagine I need the files such that the string "b.file" is in the file path.

I can take advantage of the abstract tree interface and iterate over Leaves, but this turns out to be very slow.

using DataSets
using AbstractTrees

ds = open(dataset("embedded_letters"))
for file in Leaves(ds)
    if occursin("b.fil", string(file.path))
        println(file.path)
    end
end
> letters/b/b.file
> letters/b/bb/bbb/bbb.file
> letters/b/bb/bbb/bbb.filo

I thought of an alternative approach using the FileTrees.jl package. Assuming that I can reproduce the file structure of embedded_letters as a FileTrees.jl tree, then I find that filtering files is much faster. (Here, I create the tree manually.)

using FileTrees
using FilePathsBase

ft = maketree(
    "letters" => [
        "a" => ["aa" => ["aa.file"]],
        "b" => ["b.file", "bb" => ["bbb" => ["bbb.file", "bbb.filo"]]],
    ],
)
ft_filtered = filter(x -> occursin("b.fil", x.name), ft, dirs = false);
for file in files(ft_filtered)
    println(Path(file))
end
> letters/b/b.file
> letters/b/bb/bbb/bbb.file
> letters/b/bb/bbb/bbb.filo

I am considering to do the filtering of files using FileTrees.jl, prepare a list of paths as in the example, and finally visit the selected paths in the DataSets.jl dataset. Of course, it would be much better to improve on the DataSets.jl side so that I can avoid the detour.

Circuitscape use case

Background

I've been looking at Circutscape.jl as an interesting use case for DataSets.jl. Here's a design for how DataSets could support circuitscape user workflows.

Circuitscape is an interesting case because it's a complete application with existing data management code etc — there's the Circuitscape.compute() function which takes a config file and uses that to discover the input data and output location, and the Circuitscae.start() function which is a wizard which helps users create such a config file0. Because DataSets tries to do IO management and data discovery, some of the data discovery parts of Circuitscape should be replaced with a DataSets-based interface.

I think users should be able to interactively

Manage their project datasets — provided by the data REPL (in future, perhaps some GUI data browser)
Launch circuitscape jobs — provided by a data REPL run command.

Workflow example

Here's a quick sketch of the workflow:

The wizard Circuitscape.start() acts as it does currently, but instead of linking to existing data in some arbitrary location in the filesystem, it copies the data into a new DataSet. The type of that dataset can be CircuitScapeInput or some such — internally it's just backed by the exact same directory structure as Circutscape currently has.

data> run circuitscape   # If run with no data, calls start (?)

# wizard steps ...

[ Info: Created new input dataset `raster_pairwise_1`

data>

I'm imagining that the Circuitscape.compute() would be replaced by the data REPL run command, and add functionality for listing which data is available for running with. Something like:

Available circuitscape input data:
  📂 raster_pairwise_1      type=CircuitScapeInput
  📂 raster_one_to_all_1    type=CircuitScapeInput

data> run circuitscape raster_pairwise_1 output1!
[ Info: ...

data> ls
  📂 output_1               type=CircuitScapeOutput
  📂 raster_pairwise_1      type=CircuitScapeInput
  📂 raster_one_to_all_1    type=CircuitScapeInput

For run to work, the data REPL needs to be resurrected and taught look at the database of entry points which is currently set up by @datafunc. Then circuitscape would declare several data entry points @datafunc circuitscape to hook into data> run.

Use Extensions to support direct file reading support like `CSV.read(dataset("..."))`

It'd be great to overload some common file reader methods like CSV.read, JSON.read, Parquet.read, etc., to support reading directly from a ::DataSet reference. Having just a handful of these would drastically simplify the use of DataSets

mktemp is using a directory that doesn't yet exist?

I have a blob. I try to open it. I get:

julia> open(io->CSV.read(io, DataFrame), IO, blob)
ERROR: SystemError: mktemp: No such file or directory

I redefined mktemp to show me what it's trying to do:

julia> @eval Base.Filesystem function mktemp(parent::AbstractString=tempdir(); cleanup::Bool=true)
           @show parent
           b = joinpath(parent, temp_prefix * "XXXXXX")
           @show b
           p = ccall(:mkstemp, Int32, (Cstring,), b) # modifies b
           systemerror(:mktemp, p == -1)
           cleanup && temp_cleanup_later(b)
           return (b, fdio(p, true))
       end
mktemp (generic function with 4 methods)

julia> open(io->CSV.read(io, DataFrame), IO, blob)
parent = "/tmp/jl_Oh98Jz"
b = "/tmp/jl_Oh98Jz/jl_XXXXXX"
ERROR: SystemError: mktemp: No such file or directory

I make that directory:

shell> mkdir /tmp/jl_Oh98Jz

And now it works:

julia> open(io->CSV.read(io, DataFrame), IO, blob)
parent = "/tmp/jl_Oh98Jz"
b = "/tmp/jl_Oh98Jz/jl_XXXXXX"
1105438×6 DataFrame

(This is in VS Code on JuliaHub, using the workaround to get a Data.toml from JuliaHub)

API to get the cached path for a blob

All the ways to work with a Blob (er, I suppose after 0.2.7 it's just File) are IO-based. But it's backed by a file that's cached in tmp. Not all packages work with IO-based methods and instead want a direct path to a file. How can I get that file? It reports to Julia that it isfile and will happily return its abspath... but it's not really either!

The role of the dataset UUID

Right now, a dataset has two unique identifiers (unique within a data repository anyhow): name :: String and uuid :: UUID. The UUID is mandatory, but not really used as far as I can tell. I think we should explicitly document its role. A few thoughts:

UUID could be a more permanent way of referencing a dataset. It would also allow you to disambiguate if there are multiple data repositories with different datasets that have the same name.
A data repo should be thought of as UUID => DataSet dictionary. The name is there as additional metadata, for user-convenience.
It should be noted that there is nothing stopping you from having duplicate UUIDs (or names) referring to potentially very different data. However, for UUIDs, we would expect this to be rare.

There are a few tangential API changes we could do:

We should also probably introduce APIs for accessing datasets via a UUID (I don't think that exists right now).
We should probably allow users to rename datasets. It's not recommended to do it often, but it can be handy. The UUID would stay constant through renames, for cases where stability is important. This is really the case where UUID becomes important.

Refresh documentation

At this point with a bit of experience with using this library, the docs should be modernized a bit, including some of the following:

Document the data REPL (and anything else from the JuliaCon talk I'd forgotten to get in the docs!)
Document the new URI-like form of arguments passed to dataset()
De-emphasize the high level @datafunc — perhaps we'll remove it in the future?

Concept of managed datasets for create/update/delete

One issue that is see with implementing create/update/delete operations (#31, #38) here in DataSets is that different data repositories may have a very different ideas of how to execute them, and may require repository-specific information.

A case in point: the TOML-based data repos generally just link to data. Should deletion delete the linked file, or just the metadata? If you create a new dataset, where will the file be? Do you need to pass some options?

One design goal of DataSets is that it provides a universal, relocatable interface. So if you create datasets in a script, that should work consistently, even if you move to a different repository. But if you have to pass repository-specific options, that breaks that principle.

To provide create/update/delete functionality in a generic way, we could have the notion of managed datasets. Basically, the data repository fully owns and controls the storage. When you create a dataset, you essentially just hand it over to the repository, and as the user you can not exercise any more control in your script.

For a remote, managed storage of datasets, this is how it must work by definition. But we should also have this for the local Data.toml-based repositories. I imagine that your repository would manage a directory somewhere where the data actually gets store, e.g.:

my-data-project/Project.toml
               /Data.toml
               /.datasets/<uuid1-for-File>
               /.datasets/<uuid2-for-FileTree>/foo.csv
               /.datasets/<uuid2-for-FileTree>/bar.csv

If now you create a dataset in a local project from a file with something like

DataSets.create("new-ds-name", "local/file.csv")

it will generate a UUID for it and just copy it to .datasets/<uuid>. This way we also do not have any problems with e.g. trying to infer destination file names and running into conflicts.

A few closing thoughts:

A data repo might not support managed datasets at all. That's fine, you just can't create/update/delete datasets then, just read existing ones. It may also have some datasets that are unmanaged, even if it otherwise does support them.
All "linked" datasets in a TOML file would be unmanaged, and hence read-only. It would even be worth implementing them via a separate storage driver, in order not to conflate it with the implementation for standard datasets. Not sure about an API for creating such a dataset -- it probably would have to be specific to a data repo, because such a dataset only make sense for some repositories.
You might be able to convert linked datasets into managed ones though, which will copy it to the repositories storage (whatever that may be).

Rename `BlobTree -> FileTree`

While BlobTree/Blob are "technically correct" names they're not very familiar or intuitive.

Some possible renaming options:

FileTree/File — Pretty good, except clashes with the somewhat unrelated package FileTrees.jl
Directory/File — Not bad, but this isn't actually a filesystem abstraction
Dir/File — Dir seems a bit short. People shouldn't need to type this often
Folder/File — technically correct, but less symmetry and doesn't suggest hierarchy

Having discussed this with a few people, we came to the conclusion that FileTree/File is the most transparent option. The naming clash is unfortunate, but I think we can put up with it for the benefits. (In any case, DataSets.File will clash with FileTrees.File regardless, unless we called it something else. But there's no real obvious alternative to File.)

DataSets for Testing Use Case

As a user, I have a package registered on JuliaHub, and I want to upload a dataset to JuliaHub which a test script (e.g., test/runtests.jl) from my registered package can access.

Here are some questions on how this package should be setup:

Should my package include a Data.toml & the project be activated from the test script using DataSets.jl's API?
Should my package rely on DataSets.jl's global configuration instead of including a Data.toml with my package?
Who should be the owner of the dataset? Me or a group?
Is this a valid use case for DataSets.jl?

The road to DataSets 1.0

Here's a rough list of items I'm considering on the path to a DataSets-1.0 release. Several of these can and should be done prior to version 1.0 in case the APIs need to be adjusted a bit before the 1.0 release.

Using data handles with Distributed.jl

DataSets is kind of hard to use with Distributed.jl.

The main usability problem is that it's impossible to deserialize handles like Blob or BlobTree when they rely on the availability of local resources such as disk caches which are only available on the main node. One is forced into less natural use patterns such as sending keys between the nodes.

Somehow it would be nice to make this more natural.

"Ideally" you'd like to have the dataset open on all nodes, and to transparently hook up any serialized Blob to the local data cache during deserialization. I'm not sure this is really possible, but it's something to aspire to!

Remote drivers

I love everything about this package and how it treats data. But it is really hard to incorporate for non-local data. Is there a sanctioned way to deal with data hosted on, say, AWS S3 or similar?

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

JuliaAstro/JuliaSpace use case

Continuation of our discussion on Zulip: https://julialang.zulipchat.com/#narrow/stream/295423-juliaspace/topic/Lift-off!/near/248190318

CC: @ronisbr

Background

Within the JuliaAstro/JuliaSpace ecosystem there are several packages which need acces to data sets on the internet some of which get updated regularly.

This includes:

Ephemerides (static): https://naif.jpl.nasa.gov/pub/naif/generic_kernels/
Earth Orientation Parameters (weekly updates): https://www.iers.org/IERS/EN/DataProducts/EarthOrientationData/eop.html
Space weather indices (daily updates): e.g. ftp://ftp.seismo.nrcan.gc.ca/spaceweather/solar_flux/daily_flux_values/fluxtable.txt

Workflows

We foresee several different workflows depending on the environment:

The REPL workflow: A REPL (or Pluto/Jupyter) user should be able to start working without worrying about the required data. Data downloading and loading should happen automatically in the background (at package load time or better lazily upon function invocation) and be completely transparent.
"Traditional" operational systems and expert users: It should be possible to override the default mechanism and provide custom data potentially from a central data storage in a traditional file-based space operations system.
Reproducible scientifc analyses: For the sake of reproducibility, users should be able to fix dynamic data to a specific point in time, see ScienceState from Astropy.

Current Solution

We currently use a combination of OptionalData.jl and RemoteFiles.jl to handle workflows 1 & 2. As of now, we do not have a solution for workflow 3.

Here's an example from EarthOrientation.jl:

mutable struct EOParams
   # Fields omitted
end

# OptionalData.jl provides a type-safe wrapper for the data set
@OptionalData EOP_DATA EOParams "Call 'EarthOrientation.update()' to load it."

# RemoteFiles.jl is used to download and update the data
@RemoteFileSet data "IERS Data" begin
    iau1980 = @RemoteFile(
        "https://datacenter.iers.org/data/csv/finals.all.csv",
        file="finals.csv",
        updates=:thursdays,
    )
    iau2000 = @RemoteFile(
        "https://datacenter.iers.org/data/csv/finals2000A.all.csv",
        file="finals2000A.csv",
        updates=:thursdays,
    )
end

# Download data and `push!` it into the optional data set. Can be called from `__init__`
function update(; force=false)
    download(data; force=force)
    push!(EOP_DATA, paths(data, :iau1980, :iau2000)...)
    nothing
end

Issues

Recursive update: Every package in a dependency chain needs to implement an update function for manual updates. For example, AstroBase.jl depends on AstroTime.jl which depends on EarthOrientation.jl. In principle, AstroBase.jl needs an update function which call AstroTime.jl's update function which calls EarthOrientation.jl's update function and so on.
Data dependency injection: It is unclear in which package or function the data dependency should be introduced. (I am not really sure what I mean by this so I will try to give examples).
- In the example above, the dependency on the EOP data is introduced in the lowest level package.
- Another example, is Astrodynamics.jl depending on AstroBase.jl. AstroBase is supposed to keep things abstract and thus does not add a dependency on an ephemeris, the function for planetary positions looks like position(eph::AbstractEphemeris, t, ...). Astrodynamics.jl is the top-layer opinionated metapackage and defines a global default ephemeris via the approch above, e.g. position(t, ...).
- I have no idea if one actually needs both ways or if one pattern is strictly better than the other.

Large dataset examples

It'd be nice to have a practical list of examples of scientific data workflows as inspiration for DataSets.jl.

Here's some examples I came across on discourse:

Reading and chunk-based parallel processing of NetCDF files: https://discourse.julialang.org/t/help-me-beat-my-pythonist-friends-code-speeding-up-data-reading-with-simple-reduction-from-netcdf-file/76457/31
Downloading protein data and untarring it on the fly: https://discourse.julialang.org/t/how-to-plumb-together-download-uncompress-untar-without-writing-full-downloaded-file/76668/6

allow attaching a script file to the dataset

When creating a dataset, it's often convenient to be able to upload a script that was used to generate the dataset. Generating derived datasets can be a complicated process and tracking where the data came from is important in heavily audited environments. Attaching a script and perhaps optional toml files should be sufficient to track where the dataset came from.

Iteration of `BlobTree` — `pairs` or `values`? `basename`?

BlobTree currently iterates values for much the same reasoning that Dictionaries.jl does (broadcast support, etc). For cases where the name is important this might seem inconvenient. But individual Blob or BlobTree elements in the iteration also know their names via basename which makes it possible to extract names where necessary. To resolve this it's probably best to translate some examples of data processing code to the BlobTree API and see whether pairs() or values() is desired for iteration.

As a side note, having values know their own keys via basename is quite an oddity for a dictionary-like datastructure. Is this a problem in itself? Generally, Blob is only a lazy reference to data held outside Julia. An individual Blob object presumably won't be an element of two different BlobTrees — it needs to cause a copy in the background. Perhaps this isn't so bad, then.

DataSets is hard to use with Distributed.jl

A long distributed job is likely to want to open a dataset on each node and keep it open for the whole computation.

The use pattern in Distributed seems to be many RPCs driven by the main node, with the main node kicking off further computation as required, and with Distributed itself owning the event loop which drives the communication.

However, DataSets package mostly supports the scoped form

open(T, dataset("name") do data
    work(data)
end

The idea was that this would wrap around the program at a high level, but this seems largely inconsistent with the fact that one needs to exit the scope to return results to the Distributed event loop. #12 should help with this, by making unscoped open() more reliable and providing another way to manage resources.

A separate usability problem is that it's impossible to serialize handles like Blob or BlobTree when they rely on the availability of local resources such as disk caches which are only available on the main node. One is then forced into less natural use patterns such as sending paths between the nodes.

New storage backend API based on ResourceContexts

Currently the storage backend API isn't very formalized and it has some problems which would be nice to fix.

Firstly, the backend entrypoint it doesn't support ResourceContexts natively, instead it's just a single method which provides the user's callback with an open "dataset root". In #12 I needed to invent ResourceContexts.enter_do to expose the data handle from within the do block but enter_do is kind of hacky/complex/inefficient as it needs to use a separate task stack.
Secondly, the "dataset root" type which is opened by the entrypoint has an informally defined API for use with our BlobTree abstraction; it's kind of complex and poorly distinguished from the internals of BlobTree itself, which makes it hard to implement and leads to suboptimal code reuse between backends.

Likely we should have an AbstractDataStorage type and perhaps define an open_storage function to go with this to solve the first problem. The second problem needs some joint refactoring of the existing tree storage backends to extract the common code.

juliacomputing / datasets.jl Goto Github PK

datasets.jl's Introduction

DataSets

Development

datasets.jl's People

Contributors

Stargazers

Watchers

Forkers

datasets.jl's Issues

API musings

Which data project?

Data ownership

Declared layers

Background

Workflow example

Background

Workflows

Current Solution

Issues

Recommend Projects

Recommend Topics

Recommend Org