
slkspec's Introduction


slkspec

This is a work in progress! This repository showcases how the tape archive can be integrated into the scientific workflow.

Pull requests are welcome!

import fsspec

with fsspec.open("slk:///arch/project/user/file", "r") as f:
    print(f.read())

Loading datasets

import fsspec
import xarray as xr

url = fsspec.open("slk:///arch/project/file.nc", slk_cache="/scratch/b/b12346").open()
dset = xr.open_dataset(url)

Usage in combination with preffs

Installation of additional requirements

mamba env create
mamba activate slkspec
pip install .[preffs]

Open a parquet-referenced zarr file

import xarray as xr
ds = xr.open_zarr(f"preffs::/path/to/preffs/data.preffs",
                storage_options={"preffs":{"prefix":"slk:///arch/<project>/<user>/slk/archive/prefix/"}

Now only those files that are needed for a requested dataset operation are retrieved from tape. Initially, only the files containing the metadata (e.g. .zattrs, .zmetadata) and the coordinates (e.g. time) are requested. Once a file has been retrieved, it is saved at the path given in SLK_CACHE and accessed directly from there.
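
A minimal sketch of this behaviour, assuming the dataset contains a variable tas with a time coordinate (both placeholder names) and that SLK_CACHE points to a writable scratch directory:

import os

import xarray as xr

# Hypothetical cache location; adjust to your own scratch directory.
os.environ["SLK_CACHE"] = "/scratch/<user_id>/slk_cache"

ds = xr.open_zarr(
    "preffs::/path/to/preffs/data.preffs",
    storage_options={
        "preffs": {"prefix": "slk:///arch/<project>/<user>/slk/archive/prefix/"}
    },
)

# Opening the dataset only touches metadata and coordinates; the chunk files
# needed for this selection are fetched from tape (or read from SLK_CACHE on
# repeated access) when the computation is triggered.
monthly_mean = ds["tas"].sel(time="2000-01").mean().compute()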

slkspec's People

Contributors

antarcticrainforest, dependabot[bot], observingclouds


slkspec's Issues

support for `fsspec.open_local`

I'd expect that fsspec.open_local("slk://...") would work and would return the path to a cached copy of some file from tape in my filesystem. However, it raises an error:

ValueError: open_local can only be used on a filesystem which has attribute local_file=True

see also: docs for open_local

This would simplify interfacing with tools that require access to files via a local path.
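
For illustration, the desired usage would look roughly like this (hypothetical, since local_file support does not exist yet; the slk_cache option mirrors the example in the introduction):

import fsspec

# Hypothetical behaviour: would return the path of the cached copy below
# SLK_CACHE instead of raising the ValueError quoted above.
local_path = fsspec.open_local(
    "slk:///arch/project/user/file", slk_cache="/scratch/<user_id>/slk_cache"
)
print(local_path)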

Files remain in queue when opened but not read

Some of the current tests load files from the mock-up tape archive but do not read them. As a consequence, self._cache_files() is never called, and neither are self._file_queue.get() and self._file_queue.task_done(). Those files remain in the queue and cause problems in other tests.

test_reading_dataset currently cannot be called after test_ro_mode

Opening files in a context manager is probably a good idea, but we might also want a way to reset the queue or to only process the part of the queue that we are actually requesting.
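
A minimal sketch of such a reset helper, assuming the filesystem keeps its pending requests in a queue.Queue called _file_queue as referenced above (the helper itself is not part of the slkspec API):

import queue


def reset_file_queue(fs) -> None:
    """Drop all queued but unprocessed retrieval requests."""
    while True:
        try:
            fs._file_queue.get_nowait()
        except queue.Empty:
            break
        fs._file_queue.task_done()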

Add xarray accessor to retrieve files from tape

In addition to xr.Dataset().load() and xr.Dataset().persist(), a method that retrieves files from tape but does not load them into memory would be a useful addition here. xr.Dataset().retrieve() or xr.Dataset().cache() might be options; a sketch of such an accessor is shown below.
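
A minimal sketch of such an accessor, assuming that computing a variable is enough to trigger the slkspec retrieval and cache fill; the accessor name tape and the method name retrieve are suggestions, not existing API:

import xarray as xr


@xr.register_dataset_accessor("tape")
class TapeAccessor:
    """Hypothetical ds.tape accessor."""

    def __init__(self, dataset: xr.Dataset) -> None:
        self._ds = dataset

    def retrieve(self) -> xr.Dataset:
        """Fetch the underlying files into the local cache without keeping
        the whole dataset in memory: each variable is computed separately
        and the loaded copy is discarded again."""
        for var in self._ds.data_vars.values():
            var.compute()  # triggers retrieval/caching; the result is thrown away
        return self._ds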

Combination of retrievals

There are a couple of cases where the combination of retrievals does not yet work efficiently or has issues:

  • retrieval of resources opened with pure fsspec without url-chaining
  • #9
  • #10
  • #11
  • #19

Retrievals in combination with xarray depend on the cluster used

xarray seems to request different numbers of files concurrently depending on the cluster configuration:

levante interactive

import xarray as xr
ds=xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0|0.0.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.1.0|0.1.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"1.0.0|1.0.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"1.1.0|1.1.1"}}]}'

levante compute

import xarray as xr
ds=xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0|0.0.1|0.1.0|0.1.1|1.0.0|1.0.1|1.1.0|1.1.1"}}]}'

Issues with intake-esm

    On levante:
import intake
import json
import pandas as pd
with open("/pool/data/Catalogs/dkrz_cmip6_disk.json") as f:
    catconfig=json.load(f)
testcat=intake.open_esm_datastore(esmcol_obj=pd.read_csv("/home/k/k204210/intake-esm/catalogs/dkrz_cmip6_archive.csv.gz"),
                                  esmcol_data=catconfig)
subset=testcat.search(source_id="MPI-ESM1-2-LR",
               experiment_id="ssp370",
               variable_id="tas",
              table_id="Amon")
import os
if "slk" not in os.environ["PATH"]:
    os.environ["PATH"]=os.environ["PATH"]+":/sw/spack-levante/slk-3.3.67-jrygfs/bin/:/sw/spack-levante/openjdk-17.0.0_35-k5o6dr/bin"
SLK_CACHE="/scratch/k/k204210/INTAKE"
%env SLK_CACHE={SLK_CACHE}
subset.to_dataset_dict()

This calls 33 slk retrieve commands, which in turn launch 33 /sw/spack-levante/slk-3.3.67-jrygfs/lib/slk-cli-tools-3.3.67.jar retrieve processes (66 processes in total), for only 10 unique tar files from the same directory on HSM. That can't be right.

Originally posted by @wachsylon in #3 (comment)
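
A hedged sketch of the missing deduplication step: before any retrieval is issued, the requested files should be reduced to the unique set, so that 33 member requests pointing at 10 tar files result in at most 10 retrievals. The helper below is illustrative and not part of slkspec:

from typing import Iterable, List


def unique_requests(paths: Iterable[str]) -> List[str]:
    """Return the requested paths with duplicates removed, preserving order."""
    seen = set()
    unique = []
    for path in paths:
        if path not in seen:
            seen.add(path)
            unique.append(path)
    return unique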

Temporary Files

Some feature requests:

  • What happens if I delete the temporary file before the retrieval has started? It seems to me that the process hangs...
  • Is there a way to resume an old retrieval? Could we add that?

Queue is not emptied after process fails

The retrieval queue is not emptied when a retrieval fails, leaving the finish pseudo-retrieval and the failed retrieval in the queue. A subsequent retrieval then fails because finish cannot be found.

(Pdb) self._file_queue.queue
deque([('finish', 'finish')])
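
A minimal sketch of a more robust queue worker, assuming entries are processed one by one and a ('finish', 'finish') marker terminates the loop: every entry is marked as done even when the retrieval raises, so no stale entries survive a failure. Names are illustrative only:

def process_queue(file_queue, retrieve) -> None:
    """Drain the queue; entries are marked done even if a retrieval fails."""
    while True:
        item = file_queue.get()
        try:
            if item == ("finish", "finish"):
                break
            retrieve(item)
        finally:
            file_queue.task_done()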

Make slkspec asynchronous

To allow for efficient retrievals, slkspec needs to become asynchronous and collect requests in a queue so that they can be bundled efficiently into a single slk retrieve command.
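
A minimal sketch of the request-bundling idea, assuming requests are pushed onto a queue and a background worker drains it; _retrieve_batch is a placeholder for whatever issues the actual bundled slk retrieve:

import queue
import threading
from typing import List


def _retrieve_batch(paths: List[str]) -> None:
    """Placeholder for a single bundled slk retrieve call."""
    print(f"retrieving {len(paths)} files in one call")


def bundle_worker(file_queue: "queue.Queue[str]", batch_window: float = 0.5) -> None:
    """Drain the queue; requests arriving within batch_window seconds of
    each other end up in the same retrieval."""
    while True:
        batch = [file_queue.get()]
        try:
            while True:
                batch.append(file_queue.get(timeout=batch_window))
        except queue.Empty:
            pass
        _retrieve_batch(batch)
        for _ in batch:
            file_queue.task_done()


# Run the worker in a background thread and put requested paths on the queue.
requests: "queue.Queue[str]" = queue.Queue()
threading.Thread(target=bundle_worker, args=(requests,), daemon=True).start()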

Limit processing of queued files to 500 per job

On the compute, shared and interactive partitions, slk retrieve is allowed to retrieve 500 files at once. Thus, if more than 500 files are requested here, the request should be split into several retrievals. @antarcticrainforest Or would you split the file list into parts of fewer than 501 files anyway before calling this function? I'll try to add this feature ("group_files_by_tape") to the slk_helpers as soon as possible. But currently I am mainly bound to slk testing and user support. So, let's see ;-).

Originally posted by @neumannd in #3 (comment)
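
A minimal sketch of the batching step on the slkspec side (the tape-aware grouping via "group_files_by_tape" would live in the slk_helpers and is not shown):

from typing import Iterator, List

MAX_FILES_PER_RETRIEVAL = 500  # limit on the compute, shared and interactive partitions


def batch_requests(
    paths: List[str], batch_size: int = MAX_FILES_PER_RETRIEVAL
) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size paths."""
    for start in range(0, len(paths), batch_size):
        yield paths[start : start + batch_size]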

Combining retrievals from different subdirectories

Currently, retrievals from different subdirectories are not combined. Initially this was done because an older version of slk did not create subdirectories locally. This has since been changed in slk to allow more efficient retrievals. slkspec should reflect these changes so that retrievals can be merged independent of the directory; a sketch of the merging step follows the example output below.

The following code can be used as a test:

import xarray as xr
xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")

Currently the output looks like the following:

# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/.zmetadata
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/time/0
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/time","$max_depth":1}},{"resources.name":{"$regex":"0"}}]}'
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/time/0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/lat/0
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/lat","$max_depth":1}},{"resources.name":{"$regex":"0"}}]}'
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/lon/0
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/lon","$max_depth":1}},{"resources.name":{"$regex":"0"}}]}'
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/time/0
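
A minimal sketch of the merging step, following the query pattern visible in the output above. Whether a single $or query is the right interface upstream is an open question, so this only illustrates grouping the requested files by directory and combining the per-directory conditions:

import json
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def build_combined_query(paths: Iterable[str]) -> str:
    """Group requested files by parent directory and build one search query."""
    by_dir: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        p = PurePosixPath(path)
        by_dir[str(p.parent)].append(p.name)
    conditions = [
        {
            "$and": [
                {"path": {"$gte": directory, "$max_depth": 1}},
                {"resources.name": {"$regex": "|".join(names)}},
            ]
        }
        for directory, names in by_dir.items()
    ]
    return json.dumps(conditions[0] if len(conditions) == 1 else {"$or": conditions})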

Retrieval starts before queue is filled with all requests

In the following example, the first file request is not combined with the other requests, although all files are in the same directory.

import xarray as xr
ds=xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
# /scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0"}}]}'
# slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.1|0.1.0|0.1.1|1.0.0|1.0.1|1.1.0|1.1.1"}}]}'

Can the StrongLink disk cache be used?

To my understanding, StrongLink always puts files into its own disk cache when retrieving them from tape. Is there a way to access this disk cache, i.e. a command similar to slk retrieve that, instead of copying the file to another location (e.g. /scratch), returns the path to the file in the StrongLink cache?

This would

  • simplify the slkspec interface (no per-user cache location has to be specified)
  • reduce the amount of storage required (because I assume StrongLink is smart enough to deduplicate cache entries for multiple users requesting the same file)

Issue with get_mapper

Originally posted by @wachsylon in #3

    Just adding things from the chat:

It would be nice if mappers could work:

import fsspec
import os
if "slk" not in os.environ["PATH"]:
    os.environ["PATH"]=os.environ["PATH"]+":/sw/spack-levante/slk-3.3.67-jrygfs/bin/:/sw/spack-levante/openjdk-17.0.0_35-k5o6dr/bin"
SLK_CACHE="/scratch/k/k204210/INTAKE"
%env SLK_CACHE={SLK_CACHE}

a=fsspec.get_mapper("slk:///arch/ik1017/cmip6/CMIP6/")
b=fsspec.get_mapper(SLK_CACHE)
target_name="AerChemMIP_002.tar"
b[target_name]=a[target_name]

TypeError                                 Traceback (most recent call last)
Cell In [8], line 2
      1 target_name="AerChemMIP_002.tar"
----> 2 b[target_name]=a[target_name]

File ~/.conda/envs/slkspecenv/lib/python3.10/site-packages/fsspec/mapping.py:163, in FSMap.__setitem__(self, key, value)
    161 key = self._key_to_str(key)
    162 self.fs.mkdirs(self.fs._parent(key), exist_ok=True)
--> 163 self.fs.pipe_file(key, maybe_convert(value))

File ~/.conda/envs/slkspecenv/lib/python3.10/site-packages/fsspec/spec.py:737, in AbstractFileSystem.pipe_file(self, path, value, **kwargs)
    735 """Set the bytes of given file"""
    736 with self.open(path, "wb", **kwargs) as f:
--> 737     f.write(value)

File ~/.conda/envs/slkspecenv/lib/python3.10/site-packages/fsspec/implementations/local.py:340, in LocalFileOpener.write(self, *args, **kwargs)
    339 def write(self, *args, **kwargs):
--> 340     return self.f.write(*args, **kwargs)

TypeError: a bytes-like object is required, not '_io.BufferedReader'

Originally posted by @wachsylon in #3 (comment)
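
A possible workaround, assuming the failure is only about the value type handed to the local mapper (the slk filesystem hands back an open file object instead of bytes): read the bytes explicitly before assigning them. Note that this loads the whole tar file into memory and assumes opening the slk path in binary mode works:

import fsspec

SLK_CACHE = "/scratch/k/k204210/INTAKE"
b = fsspec.get_mapper(SLK_CACHE)
target_name = "AerChemMIP_002.tar"

# Read the bytes explicitly instead of passing the file object through the mapper.
with fsspec.open(f"slk:///arch/ik1017/cmip6/CMIP6/{target_name}", "rb") as f:
    b[target_name] = f.read()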

Retrieved files not found

There is currently an issue with SLK/pyslk that causes retrievals to be saved at a different location than expected.
Currently both slkspec and slk create subdirectories, so retrieved files end up in duplicated paths like <SLK_CACHE>/arch/path/to/file/arch/path/to/file/file

When the expected behaviour is clarified upstream, slkspec will be adjusted if needed.
