
slkspec's Introduction


slkspec

This is a work in progress! This repository showcases how the tape archive can be integrated into the scientific workflow.

Pull requests are welcome!

import fsspec

with fsspec.open("slk:///arch/project/user/file", "r") as f:
    print(f.read())

Loading datasets

import fsspec
import xarray as xr

url = fsspec.open("slk:///arch/project/file.nc", slk_cache="/scratch/b/b12346").open()
dset = xr.open_dataset(url)

Usage in combination with preffs

Installation of additional requirements

mamba env create
mamba activate slkspec
pip install .[preffs]

Open a parquet-referenced zarr file

import xarray as xr
ds = xr.open_zarr(f"preffs::/path/to/preffs/data.preffs",
                storage_options={"preffs":{"prefix":"slk:///arch/<project>/<user>/slk/archive/prefix/"}

Now only those files that are needed for a requested dataset operation are retrieved from tape. Initially, only the files containing the metadata (e.g. .zattrs, .zmetadata) and the coordinates (e.g. time) are requested. Once a file has been retrieved, it is saved at the path given in SLK_CACHE and accessed directly from there.
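
A minimal sketch of this behaviour, assuming the dataset contains a variable tas with a time coordinate (both placeholder names) and that SLK_CACHE points to a writable scratch directory:

import os

import xarray as xr

# Hypothetical cache location; adjust to your own scratch directory.
os.environ["SLK_CACHE"] = "/scratch/<user_id>/slk_cache"

ds = xr.open_zarr(
    "preffs::/path/to/preffs/data.preffs",
    storage_options={
        "preffs": {"prefix": "slk:///arch/<project>/<user>/slk/archive/prefix/"}
    },
)

# Opening the dataset only touches metadata and coordinates; the chunk files
# needed for this selection are fetched from tape (or read from SLK_CACHE on
# repeated access) when the computation is triggered.
monthly_mean = ds["tas"].sel(time="2000-01").mean().compute()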

slkspec's People

Contributors

antarcticrainforest, dependabot[bot], observingclouds


slkspec's Issues

support for `fsspec.open_local`

I'd expect that fsspec.open_local("slk://...") would work and would return the path to a cached copy of some file from tape in my filesystem. However, it raises an error:

ValueError: open_local can only be used on a filesystem which has attribute local_file=True

see also: docs for open_local

This would simplify interfacing with tools that require access to files via a local path.
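
For illustration, the desired usage would look roughly like this (hypothetical, since local_file support does not exist yet; the slk_cache option mirrors the example in the introduction):

import fsspec

# Hypothetical behaviour: would return the path of the cached copy below
# SLK_CACHE instead of raising the ValueError quoted above.
local_path = fsspec.open_local(
    "slk:///arch/project/user/file", slk_cache="/scratch/<user_id>/slk_cache"
)
print(local_path)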

Files remain in queue when opened but not read

Some of the current tests load files from the mock-up tape archive but do not read them. As a consequence, self._cache_files() is never called, and neither are self._file_queue.get() and self._file_queue.task_done(). Those files remain in the queue and cause problems in other tests.

test_reading_dataset currently cannot be called after test_ro_mode

Opening files in a context manager is probably a good idea, but we might also want a way to reset the queue or to only process the part of the queue that we are actually requesting.
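
A minimal sketch of such a reset helper, assuming the filesystem keeps its pending requests in a queue.Queue called _file_queue as referenced above (the helper itself is not part of the slkspec API):

import queue


def reset_file_queue(fs) -> None:
    """Drop all queued but unprocessed retrieval requests."""
    while True:
        try:
            fs._file_queue.get_nowait()
        except queue.Empty:
            break
        fs._file_queue.task_done()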

Add xarray accessor to retrieve files from tape

In addition to xr.Dataset().load() and xr.Dataset().persist(), a method that retrieves files from tape but does not load them into memory would be a useful addition here. xr.Dataset().retrieve() or xr.Dataset().cache() might be options; a sketch of such an accessor is shown below.
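
A minimal sketch of such an accessor, assuming that computing a variable is enough to trigger the slkspec retrieval and cache fill; the accessor name tape and the method name retrieve are suggestions, not existing API:

import xarray as xr


@xr.register_dataset_accessor("tape")
class TapeAccessor:
    """Hypothetical ds.tape accessor."""

    def __init__(self, dataset: xr.Dataset) -> None:
        self._ds = dataset

    def retrieve(self) -> xr.Dataset:
        """Fetch the underlying files into the local cache without keeping
        the whole dataset in memory: each variable is computed separately
        and the loaded copy is discarded again."""
        for var in self._ds.data_vars.values():
            var.compute()  # triggers retrieval/caching; the result is thrown away
        return self._ds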

Combination of retrievals

There are a couple of cases where the combination of retrievals does not yet work efficiently or has issues:

  • retrieval of resources opened with pure fsspec without url-chaining
  • #9
  • #10
  • #11
  • #19

Retrievals in combination with xarray depend on the cluster used

xarray seems to request different numbers of files concurrently depending on the cluster configuration:

levante interactive

import xarray as xr
ds=xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0|0.0.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.1.0|0.1.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"1.0.0|1.0.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"1.1.0|1.1.1"}}]}'

levante compute

import xarray as xr
ds=xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0|0.0.1|0.1.0|0.1.1|1.0.0|1.0.1|1.1.0|1.1.1"}}]}'

Issues with intake-esm

    On levante:
import intake
import json
import pandas as pd
with open("/pool/data/Catalogs/dkrz_cmip6_disk.json") as f:
    catconfig=json.load(f)
testcat=intake.open_esm_datastore(esmcol_obj=pd.read_csv("/home/k/k204210/intake-esm/catalogs/dkrz_cmip6_archive.csv.gz"),
                                  esmcol_data=catconfig)
subset=testcat.search(source_id="MPI-ESM1-2-LR",
               experiment_id="ssp370",
               variable_id="tas",
              table_id="Amon")
import os
if "slk" not in os.environ["PATH"]:
    os.environ["PATH"]=os.environ["PATH"]+":/sw/spack-levante/slk-3.3.67-jrygfs/bin/:/sw/spack-levante/openjdk-17.0.0_35-k5o6dr/bin"
SLK_CACHE="/scratch/k/k204210/INTAKE"
%env SLK_CACHE={SLK_CACHE}
subset.to_dataset_dict()

This calls 33 slk retrieve commands, which in turn launch 33 /sw/spack-levante/slk-3.3.67-jrygfs/lib/slk-cli-tools-3.3.67.jar retrieve processes (66 processes in total), for only 10 unique tar files from the same directory on HSM. That can't be right.

Originally posted by @wachsylon in #3 (comment)
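
A hedged sketch of the missing deduplication step: before any retrieval is issued, the requested files should be reduced to the unique set, so that 33 member requests pointing at 10 tar files result in at most 10 retrievals. The helper below is illustrative and not part of slkspec:

from typing import Iterable, List


def unique_requests(paths: Iterable[str]) -> List[str]:
    """Return the requested paths with duplicates removed, preserving order."""
    seen = set()
    unique = []
    for path in paths:
        if path not in seen:
            seen.add(path)
            unique.append(path)
    return unique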

Temporary Files

Some feature requests:

  • What happens if I delete the temporary file before the retrieval has started? It seems to me that the process hangs...
  • Is there a way to resume an old retrieval? Could we add that?

Queue is not emptied after process fails

The retrieval queue is not emptied when a retrieval fails, leaving the finish pseudo-retrieval and the failed retrieval in the queue. A subsequent retrieval then fails because finish cannot be found.

(Pdb) self._file_queue.queue
deque([('finish', 'finish')])
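
A minimal sketch of a more robust queue worker, assuming entries are processed one by one and a ('finish', 'finish') marker terminates the loop: every entry is marked as done even when the retrieval raises, so no stale entries survive a failure. Names are illustrative only:

def process_queue(file_queue, retrieve) -> None:
    """Drain the queue; entries are marked done even if a retrieval fails."""
    while True:
        item = file_queue.get()
        try:
            if item == ("finish", "finish"):
                break
            retrieve(item)
        finally:
            file_queue.task_done()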

Make slkspec asynchronous

To allow for efficient retrievals, slkspec needs to become asynchronous and collect requests in a queue so that they can be bundled efficiently into a single slk retrieve command.
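
A minimal sketch of the request-bundling idea, assuming requests are pushed onto a queue and a background worker drains it; _retrieve_batch is a placeholder for whatever issues the actual bundled slk retrieve:

import queue
import threading
from typing import List


def _retrieve_batch(paths: List[str]) -> None:
    """Placeholder for a single bundled slk retrieve call."""
    print(f"retrieving {len(paths)} files in one call")


def bundle_worker(file_queue: "queue.Queue[str]", batch_window: float = 0.5) -> None:
    """Drain the queue; requests arriving within batch_window seconds of
    each other end up in the same retrieval."""
    while True:
        batch = [file_queue.get()]
        try:
            while True:
                batch.append(file_queue.get(timeout=batch_window))
        except queue.Empty:
            pass
        _retrieve_batch(batch)
        for _ in batch:
            file_queue.task_done()


# Run the worker in a background thread and put requested paths on the queue.
requests: "queue.Queue[str]" = queue.Queue()
threading.Thread(target=bundle_worker, args=(requests,), daemon=True).start()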

Limit processing of queued files to 500 per job

On the compute, shared and interactive partitions, slk retrieve is allowed to retrieve 500 files at once. Thus, if more than 500 files are requested here, the request should be split into several retrievals. @antarcticrainforest Or would you split the file list into parts of fewer than 501 files anyway before calling this function? I'll try to add this feature ("group_files_by_tape") to the slk_helpers as soon as possible. But currently I am mainly bound to slk testing and user support. So, let's see ;-).

Originally posted by @neumannd in #3 (comment)
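
A minimal sketch of the batching step on the slkspec side (the tape-aware grouping via "group_files_by_tape" would live in the slk_helpers and is not shown):

from typing import Iterator, List

MAX_FILES_PER_RETRIEVAL = 500  # limit on the compute, shared and interactive partitions


def batch_requests(
    paths: List[str], batch_size: int = MAX_FILES_PER_RETRIEVAL
) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size paths."""
    for start in range(0, len(paths), batch_size):
        yield paths[start : start + batch_size]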

Combining retrievals from different subdirectories

Currently, retrievals from different subdirectories are not combined. Initially this was done because an older version of slk did not create subdirectories locally. This has since been changed in slk to allow more efficient retrievals. slkspec should reflect these changes so that retrievals can be merged independent of the directory; a sketch of the merging step follows the example output below.

The following code can be used as a test:

import xarray as xr
xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")

Currently the output looks like the following:

# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/.zmetadata
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/time/0
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/time","$max_depth":1}},{"resources.name":{"$regex":"0"}}]}'
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/time/0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/lat/0
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/lat","$max_depth":1}},{"resources.name":{"$regex":"0"}}]}'
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/lon/0
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/lon","$max_depth":1}},{"resources.name":{"$regex":"0"}}]}'
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/time/0
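
A minimal sketch of the merging step, following the query pattern visible in the output above. Whether a single $or query is the right interface upstream is an open question, so this only illustrates grouping the requested files by directory and combining the per-directory conditions:

import json
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def build_combined_query(paths: Iterable[str]) -> str:
    """Group requested files by parent directory and build one search query."""
    by_dir: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        p = PurePosixPath(path)
        by_dir[str(p.parent)].append(p.name)
    conditions = [
        {
            "$and": [
                {"path": {"$gte": directory, "$max_depth": 1}},
                {"resources.name": {"$regex": "|".join(names)}},
            ]
        }
        for directory, names in by_dir.items()
    ]
    return json.dumps(conditions[0] if len(conditions) == 1 else {"$or": conditions})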

Retrieval starts before queue is filled with all requests

In the following example, the first file request is not combined with the other requests, although all files are in the same directory.

import xarray as xr
ds=xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0"}}]}'
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.1|0.1.0|0.1.1|1.0.0|1.0.1|1.1.0|1.1.1"}}]}'

Can the StrongLink disk cache be used?

To my understanding, StrongLink always puts files into its own disk cache when retrieving them from tape. Is there a way to access this disk cache, i.e. a command similar to slk retrieve that, instead of copying the file to another location (e.g. /scratch), returns the path to the file in the StrongLink cache?

This would

  • simplify the slkspec interface (no per-user cache location has to be specified)
  • reduce the amount of storage required (because I assume StrongLink is smart enough to deduplicate cache entries for multiple users requesting the same file)

Issue with get_mapper

Originally posted by @wachsylon in #3

    Just adding things from the chat:

It would be nice if mappers could work:

import fsspec
import os
if "slk" not in os.environ["PATH"]:
    os.environ["PATH"]=os.environ["PATH"]+":/sw/spack-levante/slk-3.3.67-jrygfs/bin/:/sw/spack-levante/openjdk-17.0.0_35-k5o6dr/bin"
SLK_CACHE="/scratch/k/k204210/INTAKE"
%env SLK_CACHE={SLK_CACHE}

a=fsspec.get_mapper("slk:///arch/ik1017/cmip6/CMIP6/")
b=fsspec.get_mapper(SLK_CACHE)
target_name="AerChemMIP_002.tar"
b[target_name]=a[target_name]

TypeError                                 Traceback (most recent call last)
Cell In [8], line 2
      1 target_name="AerChemMIP_002.tar"
----> 2 b[target_name]=a[target_name]

File ~/.conda/envs/slkspecenv/lib/python3.10/site-packages/fsspec/mapping.py:163, in FSMap.__setitem__(self, key, value)
    161 key = self._key_to_str(key)
    162 self.fs.mkdirs(self.fs._parent(key), exist_ok=True)
--> 163 self.fs.pipe_file(key, maybe_convert(value))

File ~/.conda/envs/slkspecenv/lib/python3.10/site-packages/fsspec/spec.py:737, in AbstractFileSystem.pipe_file(self, path, value, **kwargs)
    735 """Set the bytes of given file"""
    736 with self.open(path, "wb", **kwargs) as f:
--> 737     f.write(value)

File ~/.conda/envs/slkspecenv/lib/python3.10/site-packages/fsspec/implementations/local.py:340, in LocalFileOpener.write(self, *args, **kwargs)
    339 def write(self, *args, **kwargs):
--> 340     return self.f.write(*args, **kwargs)

TypeError: a bytes-like object is required, not '_io.BufferedReader'

Originally posted by @wachsylon in #3 (comment)
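
A possible workaround, assuming the failure is only about the value type handed to the local mapper (the slk filesystem hands back an open file object instead of bytes): read the bytes explicitly before assigning them. Note that this loads the whole tar file into memory and assumes opening the slk path in binary mode works:

import fsspec

SLK_CACHE = "/scratch/k/k204210/INTAKE"
b = fsspec.get_mapper(SLK_CACHE)
target_name = "AerChemMIP_002.tar"

# Read the bytes explicitly instead of passing the file object through the mapper.
with fsspec.open(f"slk:///arch/ik1017/cmip6/CMIP6/{target_name}", "rb") as f:
    b[target_name] = f.read()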

Retrieved files not found

There is currently an issue with SLK/pyslk that causes retrievals to be saved at a different location than expected.
Currently both slkspec and slk create subdirectories, so retrieved files end up in duplicated paths like <SLK_CACHE>/arch/path/to/file/arch/path/to/file/file

When the expected behaviour is clarified upstream, slkspec will be adjusted if needed.
