filesystem_spec's Issues

size_policy is broken

In #46, @martindurant added a size_policy keyword to HTTPFileSystem.

In fsspec 0.2.1, it appears to be broken.

from fsspec.implementations.http import HTTPFileSystem
fs = HTTPFileSystem()
path = 'https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_2160/compressed/0000092160/Theta.0000092160.data.shrunk'
f = fs.open(path, size_policy='get')

raises

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-857d04d27e7c> in <module>()
      2 fs = HTTPFileSystem()
      3 path = 'https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_2160/compressed/0000092160/Theta.0000092160.data.shrunk'
----> 4 f = fs.open(path, size_policy='get')

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/fsspec/spec.py in open(self, path, mode, block_size, **kwargs)
    615             ac = kwargs.pop('autocommit', not self._intrans)
    616             f = self._open(path, mode=mode, block_size=block_size,
--> 617                            autocommit=ac, **kwargs)
    618             if not ac:
    619                 self.transaction.files.append(f)

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/fsspec/implementations/http.py in _open(self, url, mode, block_size, **kwargs)
    131         if block_size:
    132             return HTTPFile(self, url, self.session, block_size,
--> 133                             size_policy=self.size_policy, **kw)
    134         else:
    135             kw['stream'] = True

TypeError: type object got multiple values for keyword argument 'size_policy'
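
The duplicate keyword is presumably because _open passes size_policy=self.size_policy explicitly (visible in the frame above) while the user-supplied size_policy is still present in the forwarded kwargs. A minimal, illustrative reproduction of the failure mode, plus one possible fix pattern (not the actual patch):

# Illustration only: a keyword that is both forwarded in **kwargs and passed
# explicitly collides, exactly as in the traceback above.
class Demo:
    def __init__(self, size_policy=None, **kwargs):
        self.size_policy = size_policy

kw = {"size_policy": "get"}          # user kwargs, as in fs.open(..., size_policy='get')
try:
    Demo(size_policy="head", **kw)   # explicit default + forwarded copy -> TypeError
except TypeError as e:
    print(e)

# One possible fix sketch: prefer the per-call value and pass it only once.
Demo(size_policy=kw.pop("size_policy", "head"), **kw)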

next release?

There has been a bit of development since the 0.2.0 release. I'm wondering when the next release is planned. I am about to release a version of xmitgcm that needs the feature added in #46, so I need to decide what to do about the dependency. I would love to just be able to depend on 0.2.1, rather than github master. But I don't want to rush a release unnecessarily.

add dispatch filesystem

Like open/open_files, which find the file-system of interest and use it with parameters, we could implement a generic top-level file-system which finds the correct instance and uses it, depending on the protocol implicit in the given path. This would become the primary user-facing API.

This could also be the place to implement globbing or recursive operations where multiple files make sense.

Should allow copy/move/(sync?) between file-systems.
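
For concreteness, a hedged sketch of the dispatch idea, choosing the backend from the protocol implicit in the path (function name and behaviour are illustrative, not an agreed API):

import fsspec

def dispatch_filesystem(urlpath):
    """Pick the registered backend based on the URL protocol, defaulting to local files."""
    if "://" in urlpath:
        protocol, path = urlpath.split("://", 1)
    else:
        protocol, path = "file", urlpath
    fs = fsspec.get_filesystem_class(protocol)()   # default storage options only
    return fs, path

fs, path = dispatch_filesystem("memory://container/afile")
print(type(fs).__name__, path)   # MemoryFileSystem container/afile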

Attribute error when using hdfs during program exit

I observe an attribute error when the Python program/session ends after using the PyArrowHDFS implementation. This does not impact the program in any way during execution (as far as I have observed).

Reproducer:

from fsspec.implementations.hdfs import PyArrowHDFS
fs = PyArrowHDFS()

After exit:

AttributeError: 'HadoopFileSystem' object has no attribute 'close'
Exception ignored in: 'pyarrow.lib.HadoopFileSystem.__dealloc__'
AttributeError: 'HadoopFileSystem' object has no attribute 'close'

comments from arrow

@pitrou wrote the following on [email protected]

Here are some comments:

  • API naming: you seem to favour re-using Unix command-line monikers in
    some places, while using more regular verbs or names in other
    places. I think it should be consistent. Since the Unix
    command-line doesn't exactly cover the exposed functionality, and
    since Unix tends to favour short cryptic names, I think it's better
    to use Python-like naming (which is also more familiar to non-Unix
    users). For example "move" or "rename" or "replace" instead of "mv",
    etc.

  • **kwargs parameters: a couple of APIs (mkdir, put...) allow passing
    arbitrary parameters, which I assume are intended to be
    backend-specific. This makes it difficult to add other optional
    parameters to those APIs in the future. So I'd make the
    backend-specific directives a single (optional) dict parameter rather
    than **kwargs (see the sketch after this list).

  • invalidate_cache doesn't state whether it invalidates recursively
    or not (recursively sounds better intuitively?). Also, I think it
    would be more flexible to take a list of paths rather than a single
    path.

  • du: the effect of the deep parameter isn't obvious to me. I don't
    know what it would mean not to recurse here: what is the size of a
    directory if you don't recurse into it?

  • glob may need a formal definition (are trailing slashes
    significant for directory or symlink resolution? this kind of thing),
    though you may want to keep edge cases backend-specific.

  • are head and tail at all useful? They can be easily recreated
    using a generic open facility.

  • read_block tries to do too much in a single API IMHO, and
    using open directly is more flexible anyway.

  • if touch is intended to emulate the Unix API of the same name, the
    docstring should state "Create empty file or update last modification
    timestamp".

  • the information dicts returned by several APIs (ls, info....)
    need standardizing, at least for non backend-specific fields.

  • if the backend is a networked filesystem with non-trivial latency,
    perhaps the operations would deserve being batched (operate on
    several paths at once), though I will happily defer to your expertise
    on the topic.
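
To make the **kwargs point concrete, a hedged sketch of the two signature styles being compared (method and parameter names are illustrative, not fsspec's actual API):

# Style currently in the spec: backend-specific extras ride along in **kwargs.
def mkdir_with_kwargs(path, create_parents=True, **kwargs):
    ...

# Style suggested above: extras live in one optional dict, so new optional
# parameters can be added later without colliding with backend directives.
def mkdir_with_options(path, create_parents=True, backend_options=None):
    backend_options = backend_options or {}
    ...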

Support encoding for LocalFileOpener

Currently the local file open() supports only mode. It would be useful to add at least encoding (since it's implemented e.g. for S3FileSystem), if not all built-in open() parameters (buffering, errors, etc.).
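
For concreteness, the desired behaviour would be something like the following; this is a sketch of the proposed API, mirroring the built-in open(), and does not work today, which is the point of this issue:

import fsspec

fs = fsspec.filesystem("file")
# Proposed: forward text-mode options such as encoding/errors to the underlying open()
with fs.open("notes.txt", mode="r", encoding="utf-8", errors="strict") as f:
    text = f.read()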

Question: What is the difference between this and pyfilesystem?

The description mentions pyfilesystem, but doesn't explain how this is different. What are the differences and why is a new filesystem API needed? It sounds like the other one might already have a community so working within that API (if possible) could make sense.

Support compound URLs

With the cached FS and dask worker FS, not to mention zip, we have file-systems which take as input other file-systems or URLs. It would be convenient, but potentially confusing, to be able to address them with a single url:

"zip://s3://bucket/file.zip"

In practice, it may be tricky to determine which parts of the path belong to which FS. This would probably need more syntax and might lead to more confusion than it clears up.

Cannot install on python 3.6.7

Am I missing something?

$ python3 -V
Python 3.6.7
$ pip3 install fsspec
Collecting fsspec
  Using cached https://files.pythonhosted.org/packages/00/cd/f3636189937a08830ebb488c6acd1ccb9c4f7f54e55c62d0106f675a4887/fsspec-0.1.2.tar.gz
Collecting python>=3.6 (from fsspec)
  Could not find a version that satisfies the requirement python>=3.6 (from fsspec) (from versions: )
No matching distribution found for python>=3.6 (from fsspec)

Glob search breaks when '?' occurs before '*'

I originally filed this under s3fs (here), but adding it here as related:

Consider for example:
fs.glob('mybucket/*/????/*.zip')
returns
['mybucket/folder/2018/file.zip', 'mybucket/folder/2019/file.zip']
whereas
fs.glob('mybucket/folder/????/*.zip')
returns
[]

If the * comes after the ?, the ? ends up in the literal path that gets walked, so the path is never found. Instead of searching for the * only, this bit should consider the other glob characters as well:
https://github.com/intake/filesystem_spec/blob/c0c4f2bfa69127f5d6daf81b98b8584669041468/fsspec/spec.py#L417
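
A hedged sketch of the suggested change: compute the literal prefix by looking at every glob special character, not just "*" (illustrative helper, not the project's code):

def glob_root(path):
    """Longest literal prefix of a glob pattern, cut at the first of *, ? or [."""
    indices = [path.find(c) for c in "*?[" if c in path]
    if not indices:
        return path                          # no magic characters at all
    return path[: min(indices)].rsplit("/", 1)[0]

print(glob_root("mybucket/folder/????/*.zip"))   # mybucket/folder
print(glob_root("mybucket/*/????/*.zip"))        # mybucket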

glob doesn't pass extra parameters to ls

according to this: https://github.com/intake/filesystem_spec/blob/master/fsspec/spec.py#L420 glob should pass extra kwargs to the ls function.

I tried using the detail=True flag, but got an exception

import fsspec
fs = fsspec.get_filesystem_class('s3')()
fs.glob('s3://bucket/*', detail=True)

  File "fsspec_test.py", line 3, in <module>
    fs.glob('s3://bucket/*', detail=True)
  File "./lib/python3.7/site-packages/fsspec/spec.py", line 444, in glob
    allpaths = self.find(root, maxdepth=depth, **kwargs)
  File "./lib/python3.7/site-packages/fsspec/spec.py", line 373, in find
    for path, _, files in self.walk(path, maxdepth, **kwargs):
TypeError: walk() got an unexpected keyword argument 'detail'

Implement `__fspath__`

PEP 519 added path-like support (the __fspath__ protocol) for arbitrary file-like things. We should support that and return full URLs, including the protocol.
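
A toy illustration of the proposal, returning the full URL including the protocol (the class and names are stand-ins, not fsspec code):

import os

class DemoFile:
    """Stand-in for a buffered file object, showing the proposed PEP 519 hook."""
    def __init__(self, protocol, path):
        self.protocol, self.path = protocol, path

    def __fspath__(self):
        # Return the full URL, protocol included, as proposed here
        return self.protocol + "://" + self.path

print(os.fspath(DemoFile("s3", "bucket/key")))   # s3://bucket/key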

What to do about FSs that don’t glob?

Many file system libraries support glob strings (s3fs, gcsfs, and some of the libraries in dask.bytes), however some, like HTTPFileSystem, do not. In this case globbing is difficult (if not impossible), since HTTP servers don't always list directories.

  • What are other examples?
  • Would it be possible to come up with a standard?
  • Or perhaps a fallback plan? For example, in the case of HTTPFileSystem, try to list with a GET request to the parent path and then match on the glob string. If the GET request fails, fall back to S3FS, GCSFS, or FTPFS.

known_implementations showing gcsfs with err even when gcsfs installed

I'm trying to debug gcsfs not working in Dask and have traced it back to fsspec via known_implementations. It says there is an error and that gcsfs needs to be installed (which it is):

>>> import gcsfs
>>> from fsspec.registry import known_implementations
>>> known_implementations
{'file': {'class': 'fsspec.implementations.local.LocalFileSystem'}, 'memory': {'class': 'fsspec.implementations.memory.MemoryFileSystem'}, 'http': {'class': 'fsspec.implementations.http.HTTPFileSystem', 'err': 'HTTPFileSystem requires "requests" to be installed'}, 'https': {'class': 'fsspec.implementations.http.HTTPFileSystem', 'err': 'HTTPFileSystem requires "requests" to be installed'}, 'zip': {'class': 'fsspec.implementations.zip.ZipFileSystem'}, 'gcs': {'class': 'gcsfs.GCSFileSystem', 'err': 'Please install gcsfs to access Google Storage'}, 'gs': {'class': 'gcsfs.GCSFileSystem', 'err': 'Please install gcsfs to access Google Storage'}, 'sftp': {'class': 'fsspec.implementations.sftp.SFTPFileSystem', 'err': 'SFTPFileSystem requires "paramiko" to be installed'}, 'ssh': {'class': 'fsspec.implementations.sftp.SFTPFileSystem', 'err': 'SFTPFileSystem requires "paramiko" to be installed'}, 'ftp': {'class': 'fsspec.implementations.ftp.FTPFileSystem'}, 'hdfs': {'class': 'fsspec.implementations.hdfs.PyArrowHDFS', 'err': 'pyarrow and local java libraries required for HDFS'}, 'webhdfs': {'class': 'fsspec.implementations.webhdfs.WebHDFS', 'err': 'webHDFS access requires "requests" to be installed'}, 's3': {'class': 's3fs.S3FileSystem', 'err': 'Install s3fs to access S3'}, 'cached': {'class': 'fsspec.implementations.cached.CachingFileSystem'}, 'dask': {'class': 'fsspec.implementations.dask.DaskWorkerFileSystem', 'err': 'Install dask distributed to access worker file system'}}
>>> 

This is fsspec 0.4.1 and gcsfs 0.3.0. Should known_implementations be listing it without an err key if it is working correctly?
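
For reference, the err entry appears to be static registry metadata rather than a live check; a hedged way to confirm that resolution actually works is to ask fsspec for the class directly:

import fsspec

# If gcsfs is importable this returns the class; only when the import fails
# should the registered 'err' message surface as an ImportError.
cls = fsspec.get_filesystem_class("gcs")
print(cls)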

Explicit "here's what to implement guide"

Hi!

I am one of the maintainers of stor, a filesystem library that attempts to implement a cross-compatible API for POSIX, OpenStack Swift, S3, and DNAnexus paths. @kyleabeauchamp pointed out your spec and it seemed pretty interesting :) . I am considering providing an alternate version of stor.Path that implements the filesystem_spec API. I know I'm combining a couple of things here, but I didn't see a mailing list so I figured I'd ask 'em together.

  1. What are the specific methods I need to implement to be compatible with the system? What's the minimal set? Is there an "advanced" set of methods? Is the minimal set all methods on AbstractFileSystem? (See the sketch after this list.)
  2. What does maxdepth mean in the context of S3 or OpenStack Swift? Is it referring to directories delimited by /? Is there a particular ordering that's required? (for example, is it necessary to guarantee that you fully list each "directory" before emitting files in another directory?)
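
Not an official answer, but a minimal backend typically subclasses AbstractFileSystem and overrides a small core (roughly ls and _open), with most derived methods (exists, walk, glob, cat, ...) coming for free; a hedged sketch with made-up names:

from fsspec.spec import AbstractFileSystem

class MyStoreFileSystem(AbstractFileSystem):
    """Hedged sketch of a minimal backend; real backends override more."""
    protocol = "mystore"                     # made-up protocol name

    def ls(self, path, detail=True, **kwargs):
        # Return a list of dicts with at least 'name', 'size' and 'type'
        entries = [{"name": path.rstrip("/") + "/example", "size": 0, "type": "file"}]
        return entries if detail else [e["name"] for e in entries]

    def _open(self, path, mode="rb", block_size=None, autocommit=True, **kwargs):
        raise NotImplementedError("return a file-like object for `path` here")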

Aside - as a general comment, I wonder if the API would be better served by having all methods work as iterators and/or have all methods support some kind of limit argument. We originally implemented stor to have most file listing methods return lists; in hindsight, this was a mistake - it's much too easy to accidentally write code that tries to list millions of objects, and thus gets really slow or unintentionally expensive.

Interested to test out the system more! :)

Jeff

Surprising behavior of open_files()

h5py now supports file-like objects. This works with the built-in open(), but I get an error when I try this with fsspec:

import h5py
import fsspec

with fsspec.open_files('test.h5', 'wb')[0] as f:
  with h5py.File(f, 'w') as h5file:
    h5file.attrs['foo'] = 'bar'
---------------------------------------------------------------------------
NotADirectoryError                        Traceback (most recent call last)
<ipython-input-15-bc0b4a6a665e> in <module>()
      2 import fsspec
      3 
----> 4 with fsspec.open_files('test.h5', 'wb')[0] as f:
      5   with h5py.File(f, 'w') as h5file:
      6     h5file.attrs['foo'] = 'bar'

/usr/local/lib/python3.6/dist-packages/fsspec/core.py in __enter__(self)
     57         mode = self.mode.replace('t', '').replace('b', '') + 'b'
     58 
---> 59         f = self.fs.open(self.path, mode=mode)
     60 
     61         fobjects = [f]

/usr/local/lib/python3.6/dist-packages/fsspec/spec.py in open(self, path, mode, block_size, **kwargs)
    562             ac = kwargs.pop('autocommit', not self._intrans)
    563             f = self._open(path, mode=mode, block_size=block_size,
--> 564                            autocommit=ac, **kwargs)
    565             if not ac:
    566                 self.transaction.files.append(f)

/usr/local/lib/python3.6/dist-packages/fsspec/implementations/local.py in _open(self, path, mode, block_size, **kwargs)
     70 
     71     def _open(self, path, mode='rb', block_size=None, **kwargs):
---> 72         return LocalFileOpener(path, mode, **kwargs)
     73 
     74     def touch(self, path, **kwargs):

/usr/local/lib/python3.6/dist-packages/fsspec/implementations/local.py in __init__(self, path, mode, autocommit)
     85         self.autocommit = autocommit
     86         if autocommit or 'w' not in mode:
---> 87             self.f = open(path, mode=mode)
     88         else:
     89             # TODO: check if path is writable?

NotADirectoryError: [Errno 20] Not a directory: 'test.h5/0.part'

Working example with open():

with open('test.h5', 'wb') as f:
  with h5py.File(f, 'w') as h5file:
    h5file.attrs['foo'] = 'bar'

HTTPfile object has no attribute kwargs

urlsmall = 'https://six-library.s3.amazonaws.com/sicd_example_RMA_RGZERO_RE32F_IM32F_cropped_multiple_image_segments.nitf'
#test = cf_sicd.Reader(urlsmall)
test = fsspec.open(urlsmall)

test.fs.exists(test.path)

True

with test as fid:
    fid.read(9)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 with test as fid:
      2     fid.read(9)

C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+28.g9b388d4-py3.7.egg\fsspec\core.py in __enter__(self)
     57         mode = self.mode.replace('t', '').replace('b', '') + 'b'
     58 
---> 59         f = self.fs.open(self.path, mode=mode)
     60 
     61         fobjects = [f]

C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+28.g9b388d4-py3.7.egg\fsspec\spec.py in open(self, path, mode, block_size, **kwargs)
    584             ac = kwargs.pop('autocommit', not self._intrans)
    585             f = self._open(path, mode=mode, block_size=block_size,
--> 586                            autocommit=ac, **kwargs)
    587             if not ac:
    588                 self.transaction.files.append(f)

C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+28.g9b388d4-py3.7.egg\fsspec\implementations\http.py in _open(self, url, mode, block_size, **kwargs)
    128         kw.pop('autocommit', None)
    129         if block_size:
--> 130             return HTTPFile(self, url, self.session, block_size, **kw)
    131         else:
    132             kw['stream'] = True

C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+28.g9b388d4-py3.7.egg\fsspec\implementations\http.py in __init__(self, fs, url, session, block_size, mode, **kwargs)
    175         try:
    176             size = file_size(url, self.session, allow_redirects=True,
--> 177                              **self.kwargs)
    178         except (ValueError, requests.HTTPError):
    179             # No size information - only allow read() and no seek()

AttributeError: 'HTTPFile' object has no attribute 'kwargs'

A simpler caching file-system

The caching file-system downloads from other file systems to a local store, and uses mmap to fetch only the chunks that are accessed, into a sparse file. An alternative mode of the same class would simply download whole files on first access, potentially with on-the-fly decompression, and then provide the local copy. This could then replace most/all of the functionality in Intake's cache layer (although who takes responsibility for metadata for checkpointing remains to be decided).
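
A hedged sketch of how such a whole-file mode might look to the user, assuming it were exposed under its own protocol name (names and options are assumptions, not an implemented API):

import fsspec

fs = fsspec.filesystem(
    "filecache",                       # assumed name for the whole-file caching wrapper
    target_protocol="http",            # the remote filesystem being wrapped
    cache_storage="/tmp/fsspec-cache", # local directory holding the downloaded copies
)
with fs.open("https://example.com/data.bin", "rb") as f:   # placeholder URL
    head = f.read(1024)   # the whole file is downloaded once, then read locally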

using rm with recursive=True on a file should succeed

>>> import os, fsspec
>>> fsspec.__version__
'0.3.6'
>>> fs = fsspec.filesystem("file")
>>> fs.touch(f"{os.getcwd()}/testfile")
>>> fs.rm(f"{os.getcwd()}/testfile", recursive=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kfranz/ae/repo/env/lib/python3.7/site-packages/fsspec/implementations/local.py", line 90, in rm
    shutil.rmtree(path)
  File "/Users/kfranz/ae/repo/env/lib/python3.7/shutil.py", line 491, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/Users/kfranz/ae/repo/env/lib/python3.7/shutil.py", line 410, in _rmtree_safe_fd
    onerror(os.scandir, path, sys.exc_info())
  File "/Users/kfranz/ae/repo/env/lib/python3.7/shutil.py", line 406, in _rmtree_safe_fd
    with os.scandir(topfd) as scandir_it:
NotADirectoryError: [Errno 20] Not a directory: '/Users/kfranz/ae/repo/testfile'
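
A hedged sketch of the behaviour the title asks for, in the local implementation (illustrative, not the actual patch):

import os
import shutil

def rm(path, recursive=False):
    """Remove a file, or a directory tree when recursive=True; a plain file
    with recursive=True is simply removed rather than raising."""
    if recursive and os.path.isdir(path):
        shutil.rmtree(path)
    else:
        os.remove(path)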

Add stdlib-compatible shim layer.

@martindurant This is a trial-balloon issue for this idea, mostly to see if fsspec is the "right" place for this feature and/or if I've missed an existing implementation. If so, I've a working proof-of-concept built off an older version of dask that could serve as a starting-point PR.

fsspec is a great module for new development or projects built on the existing dask ecosystem, and enables an amazing S3-is-my-filesystem paradigm. However, the vast majority of projects make use of Python's built-in file operations, and are tightly coupled to a local filesystem. This causes major development friction when integrating existing third-party libraries into a project, as one almost invariably needs to work out an integration-specific flow between local files and remote storage.

One can work around this problem with FUSE-based mounts, however this complicates deployment and containerization. Alternatively, any of several filesystem abstractions (fsspec, smart_open, pyfilesystem2, et al.) could be used, but updating a third-party component to an alternative filesystem interface is a painful and risky development lift. One either needs to maintain a private fork 😳 or open a massive and risky PR 😬.

I've found that a solid majority of filesystem use cases are covered by a relatively small set of operations, all of which are already covered by fsspec. By providing strictly compatible shims for a small set of the stdlib (e.g. open, os.path.exists, os.remove, glob.glob, shutil.rmtree, et al.) and then swapping these via import-level changes, one can quickly teach most libraries to seamlessly interact with all the file systems supported by fsspec.

This shim layer would mandate strict adherence to standard library semantics for local file operations, likely by directly forwarding all local paths into the standard library and forwarding non-local paths through fsspec-based implementations. The explicit goal would be to enable a majority of basic use cases, deferring to fsspec interfaces for more robust integration and/or specialized use cases. This would turn fsspec into a massively useful layer for updating existing systems to cloud-compatible storage, as updating a library to support s3 and gcsfs would be as simple as:

try:
    from fsspec.stdlib import open           # hypothetical shim modules proposed here
    import fsspec.stdlib.os.path as os_path
    import fsspec.stdlib.shutil as shutil
except ImportError:
    import os.path as os_path
    import shutil

numpy.fromfile from https

Question:
I have one final hurdle in getting HTTPS files read: metadata gets read correctly, but np.fromfile doesn't like the fid.
It's not clear to me whether readbytes is a close replacement.
Since we end up looping, using np.fromfile and fid.seek,
I suspect there's a cleaner way now, using fsspec, to subsample (seek to where we need).
What's the best np.fromfile replacement that uses fsspec?
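
Not an authoritative answer, but one common replacement is to seek on the fsspec file object and wrap the bytes with np.frombuffer; the URL, dtype and counts below are placeholders:

import fsspec
import numpy as np

url = "https://example.com/model_output.data"   # placeholder URL
dtype = np.dtype(">f4")                          # assumed on-disk dtype
offset, count = 0, 1024                          # byte offset and number of values to read

with fsspec.open(url, "rb") as f:
    f.seek(offset)
    data = np.frombuffer(f.read(count * dtype.itemsize), dtype=dtype)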

Best way to determine file or url exists?

with of as f:
    f.exists(path)

That's fine for local files. However, I ultimately want to use this in underlying code that Intake will use, and the path being https, s3 or c:\ are all valid possibilities.
HTTPFile doesn't have an exists(); what is the recommended way to do this?
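
Not an official recommendation, but a protocol-agnostic pattern is to ask the filesystem object, rather than the open file, whether the path exists; a hedged sketch:

import fsspec

def exists(urlpath, **storage_options):
    """Works the same for local paths, s3:// keys and https:// URLs."""
    of = fsspec.open(urlpath, **storage_options)
    return of.fs.exists(of.path)

print(exists("https://example.com/somefile"))   # placeholder URL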

tag 0.3.6

Could you please tag 0.3.6 in git repo?

BytesCache bug leading to duplicate file reads for small files on remote storage

I'm working on an Azure Data Lake Gen2 package using fsspec and encountered strange behavior with BytesCache caching: when a file is read from the datalake and compute() is called on the cached dataframe, the result contains an exact duplicate of the rows of the same dataframe.

It appears to be caused by the _fetch method in BytesCache when reading a file into Dask. During what should be the first read, when calling dd.read_csv(), the value of start (when None) is passed as zero, but when calling the .compute() method on the dataframe the file gets read again, leading to the duplication of rows. Explicitly setting start, end = 0, 0 like:

def _fetch(self, start, end):
    # TODO: only set start/end after fetch, in case it fails?
    # is this where retry logic might go?
    if self.start is None and self.end is None:
        # First read
        start, end = 0, 0
        self.cache = self.fetcher(start, end + self.blocksize)
        self.start = start

appears to fix it for me. Have you observed any issues with BytesCache elsewhere?

Non-plural open_files

It would be nice to have an alias that only opened a single file, e.g., fsspec.open_file or even fsspec.open. Right now idiomatic usage of fsspec for a single file looks pretty awkward with the need for indexing like [0], e.g.,

with fsspec.open_files(path)[0] as f:
    f.write(...)
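
A hedged sketch of the requested alias (essentially what a top-level fsspec.open could be):

import fsspec

def open_file(urlpath, mode="rb", **kwargs):
    """Single-file convenience wrapper around open_files."""
    return fsspec.open_files(urlpath, mode, **kwargs)[0]

# with open_file("test.txt", "wb") as f:
#     f.write(b"hello")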

Committer

Would like some general method by which files can be written to a temporary location, probably on the same file-system, and then all committed together later, when all writing has completed; or aborting the whole process and leaving the target location unchanged. This makes the writing process (almost) atomic.

Ideally this would cover not just single files, but groups such as "all files below path/" or "all writes until a commit is requested". Of course, the default would still be to auto-commit, i.e. to write to the real final location as soon as possible.

In a FS like HDFS, this would be done by mv-ing the files from the temporary location. In a FS like S3, uploads of files larger than the smallest block size use a blocked upload, which needs finalising anyway and can be aborted (whereas there is no move, only copy-and-delete). Exactly how an implementation achieves temporary writes is not to be defined here.

Importantly, the concept should be portable across instances of the file-system in different processes, so that a batch of files can be written in parallel, e.g., with Dask, and committed together. Perhaps each instance notes only the files that it writes, but then all must still be around at commit time. I do not have a good idea of how that might be achieved, suggestions welcome.
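
One possible shape for the user-facing side, loosely based on the autocommit flag already visible in tracebacks elsewhere on this page; this is a hedged sketch only, and whether a given backend can defer commits will vary:

import fsspec

fs = fsspec.filesystem("memory")            # stand-in for any backend with deferred commit

with fs.transaction:                        # writes below are staged, not final
    with fs.open("/staging/part-0", "wb") as f:
        f.write(b"first")
    with fs.open("/staging/part-1", "wb") as f:
        f.write(b"second")
# on clean exit both files are committed together; an exception discards them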

google DRIVE implementation

Most of us university folks have essentially unlimited free storage in google drive. It would be great to be able to work with this storage space in a cloud-native way.

The google drive SDK includes a python API: https://developers.google.com/drive/api/v3/about-sdk

In principle, it should be straightforward to create a google-drive implementation for fsspec.

I believe this would open up a whole world of cool possibilities. In particular, it would lower the bar for putting data on the "cloud." More people are comfortable with google drive than they are with S3 or GCS.

re-add tests

Currently, fuse, dask and hdfs are not being tested on Travis, although tests exist, leading to apparently low code coverage.
Each of these has had working fixtures in the past, and it would be reasonable to make the effort to add them again. HDFS may not be tractable, since it must be run on an edge node - we can indeed run it (dask nightlies is doing this), but the coverage will not show up.

Question about glob implementation

Currently the default implementation of glob is handwritten for this package, and relies on find and some custom regexes (which have given problems in the past/currently). An alternative implementation may be similar to what we used to have in dask. This is based on the implementation in CPython, passes all their tests, and relies on the following standard filesystem methods:

  • ls
  • isdir
  • exists

Is there a reason to prefer the implementation in this repo over the one generalized from CPython we used to have in dask? Would it be worth investigating using the other version?

Issue with Dask read_csv from S3

when doing

import pandas as pd
import s3fs
import dask.dataframe as dd

ddf=dd.read_csv("s3://bucket/data/path.csv")

getting the following error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-64a49ed7781b> in <module>
----> 1 df = dd.read_csv("s3://amra-s3-0066-00-external/promotion-data/transactions/Kay_trans_data_201806_202005.csv", storage_options={'token': fs})

/opt/anaconda/envs/promo/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    580             storage_options=storage_options,
    581             include_path_column=include_path_column,
--> 582             **kwargs
    583         )
    584 

/opt/anaconda/envs/promo/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    407         compression=compression,
    408         include_path=include_path_column,
--> 409         **(storage_options or {})
    410     )
    411 

/opt/anaconda/envs/promo/lib/python3.7/site-packages/dask/bytes/core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, include_path, **kwargs)
     93 
     94     """
---> 95     fs, fs_token, paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs)
     96 
     97     if len(paths) == 0:

/opt/anaconda/envs/promo/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol)
    313         cls = get_filesystem_class(protocol)
    314 
--> 315         options = cls._get_kwargs_from_urls(urlpath)
    316         path = cls._strip_protocol(urlpath)
    317         update_storage_options(options, storage_options)

AttributeError: type object 'S3FileSystem' has no attribute '_get_kwargs_from_urls'

How to use fsspec with persistent local cache in a predefined directory

I'm looking for more details about using local cache with fsspec
specifically:

  1. how to specify whether to use a cache
  2. how to specify whether the cache should be cleaned on process exit
  3. how to control the location of the cache
  4. if a file is in the cache, I assume the version of the file is checked to verify the cache is up to date - is it possible to disable this? (for cases where I trust the remote files don't change)
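
Not an authoritative answer, but the caching wrapper's constructor options are the natural place for (1)-(4); a hedged sketch, with option names that may differ between versions:

import fsspec

fs = fsspec.filesystem(
    "filecache",                      # (1) opt in to caching by wrapping the remote protocol
    target_protocol="s3",             # remote backend being wrapped (assumption)
    cache_storage="/data/fsspec",     # (3) persistent, predefined local directory
    check_files=False,                # (4) trust that the remote files don't change
)
# (2) with a fixed cache_storage the cache is a normal directory, not a temporary
#     one, so nothing is cleaned up on process exit unless you delete it yourself.
with fs.open("s3://bucket/key.csv", "rb") as f:   # placeholder path
    data = f.read()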

local cache file-system

We could create a file-system type which makes local copies of any file that is opened for reading on a given other file-system: download-on-first-access. That would allow for local, and therefore fast, access to some parts of a much bigger data-set. This is probably not much work, but we would need to decide how and where to store the files.

The MMap file-cache also demonstrates downloading only those blocks of a file which are accessed, into a sparse file. That could also be implemented, for absolutely minimal storage.

@rabernat , this is the kind of thing you had in mind, and it is an interesting idea, and probably not hard to implement.

put and get with file fs should support recursive=True

Currently just implemented with shutil.copyfile(), i.e.

    def copy(self, path1, path2, **kwargs):
        """ Copy within two locations in the filesystem"""
        shutil.copyfile(path1, path2)

    get = copy
    put = copy

which only works for single files.
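
A hedged sketch of a recursive-capable version for the local implementation (illustrative, not the actual fix):

import os
import shutil

def copy(path1, path2, recursive=False, **kwargs):
    """Copy a single file, or a whole directory tree when recursive=True."""
    if recursive and os.path.isdir(path1):
        shutil.copytree(path1, path2)
    else:
        shutil.copyfile(path1, path2)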

Handling incorrect content-length headers

From intake/intake#367

>>> x = dd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv").compute()
>>> y = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv")
>>> len(x), len(y)
(519, 1458)

@martindurant thinks this may be because Github's HTTP server isn't respecting the 'identity' request header, and is sending back the compressed content length instead.

How should we handle this situation?

HTTPFileSystem hanging on open

I am trying to use fsspec.implementations.http.HTTPFileSystem to read a byte range from a remote url. Here's what I'm doing

from fsspec.implementations.http import HTTPFileSystem
path = 'https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_4320/compressed/0000010368/Theta.0000010368.data.shrunk'
fs = HTTPFileSystem()
f_obj = fs.open(path)
# ... later would come the reading commands

This hangs forever. I have turned on urllib3 debugging to see what's happening. I get the following log output

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): data.nas.nasa.gov:443

If I interrupt the running command, I get this stack trace

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/anaconda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    376             try:  # Python 2.7, use buffering of HTTP responses
--> 377                 httplib_response = conn.getresponse(buffering=True)
    378             except TypeError:  # Python 3

TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-27-556e6da93e62> in <module>
----> 1 f_obj = fs.open(path)
      2 f_obj

~/anaconda/envs/pangeo/lib/python3.6/site-packages/fsspec/spec.py in open(self, path, mode, block_size, **kwargs)
    562             ac = kwargs.pop('autocommit', not self._intrans)
    563             f = self._open(path, mode=mode, block_size=block_size,
--> 564                            autocommit=ac, **kwargs)
    565             if not ac:
    566                 self.transaction.files.append(f)

~/anaconda/envs/pangeo/lib/python3.6/site-packages/fsspec/implementations/http.py in _open(self, url, mode, block_size, **kwargs)
    117         kw.pop('autocommit', None)
    118         if block_size:
--> 119             return HTTPFile(url, self.session, block_size, **kw)
    120         else:
    121             kw['stream'] = True

~/anaconda/envs/pangeo/lib/python3.6/site-packages/fsspec/implementations/http.py in __init__(self, url, session, block_size, **kwargs)
    165         try:
    166             self.size = file_size(url, self.session, allow_redirects=True,
--> 167                                   **self.kwargs)
    168         except (ValueError, requests.HTTPError):
    169             # No size information - only allow read() and no seek()

~/anaconda/envs/pangeo/lib/python3.6/site-packages/fsspec/implementations/http.py in file_size(url, session, **kwargs)
    370     head = kwargs.get('headers', {})
    371     head['Accept-Encoding'] = 'identity'
--> 372     r = session.head(url, allow_redirects=ar, **kwargs)
    373     r.raise_for_status()
    374     if 'Content-Length' in r.headers:

~/anaconda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in head(self, url, **kwargs)
    566 
    567         kwargs.setdefault('allow_redirects', False)
--> 568         return self.request('HEAD', url, **kwargs)
    569 
    570     def post(self, url, data=None, json=None, **kwargs):

~/anaconda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    531         }
    532         send_kwargs.update(settings)
--> 533         resp = self.send(prep, **send_kwargs)
    534 
    535         return resp

~/anaconda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
    644 
    645         # Send the request
--> 646         r = adapter.send(request, **kwargs)
    647 
    648         # Total elapsed time of the request (approximately)

~/anaconda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    447                     decode_content=False,
    448                     retries=self.max_retries,
--> 449                     timeout=timeout
    450                 )
    451 

~/anaconda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    598                                                   timeout=timeout_obj,
    599                                                   body=body, headers=headers,
--> 600                                                   chunked=chunked)
    601 
    602             # If we're going to release the connection in ``finally:``, then

~/anaconda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    378             except TypeError:  # Python 3
    379                 try:
--> 380                     httplib_response = conn.getresponse()
    381                 except Exception as e:
    382                     # Remove the TypeError from the exception chain in Python 3;

~/anaconda/envs/pangeo/lib/python3.6/http/client.py in getresponse(self)
   1329         try:
   1330             try:
-> 1331                 response.begin()
   1332             except ConnectionError:
   1333                 self.close()

~/anaconda/envs/pangeo/lib/python3.6/http/client.py in begin(self)
    295         # read until we get a non-100 response
    296         while True:
--> 297             version, status, reason = self._read_status()
    298             if status != CONTINUE:
    299                 break

~/anaconda/envs/pangeo/lib/python3.6/http/client.py in _read_status(self)
    256 
    257     def _read_status(self):
--> 258         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    259         if len(line) > _MAXLINE:
    260             raise LineTooLong("status line")

~/anaconda/envs/pangeo/lib/python3.6/socket.py in readinto(self, b)
    584         while True:
    585             try:
--> 586                 return self._sock.recv_into(b)
    587             except timeout:
    588                 self._timeout_occurred = True

~/anaconda/envs/pangeo/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in recv_into(self, *args, **kwargs)
    295     def recv_into(self, *args, **kwargs):
    296         try:
--> 297             return self.connection.recv_into(*args, **kwargs)
    298         except OpenSSL.SSL.SysCallError as e:
    299             if self.suppress_ragged_eofs and e.args == (-1, 'Unexpected EOF'):

~/anaconda/envs/pangeo/lib/python3.6/site-packages/OpenSSL/SSL.py in recv_into(self, buffer, nbytes, flags)
   1819             result = _lib.SSL_peek(self._ssl, buf, nbytes)
   1820         else:
-> 1821             result = _lib.SSL_read(self._ssl, buf, nbytes)
   1822         self._raise_ssl_error(self._ssl, result)
   1823 

KeyboardInterrupt: 

Everything works fine if I use requests.

import requests
r = requests.get(path, headers={"Range": "bytes=0-100"})
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): data.nas.nasa.gov:443
DEBUG:urllib3.connectionpool:https://data.nas.nasa.gov:443 "GET /ecco/download_data.php?file=/eccodata/llc_4320/compressed/0000010368/Theta.0000010368.data.shrunk HTTP/1.1" 206 101
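
For reference, a hedged sketch of probing the size with a ranged GET, which this server evidently handles (per the 206 response above), instead of a HEAD request; this is an illustration, not fsspec's code:

import requests

def file_size_via_get(url, session=None, **kwargs):
    """Return the total size advertised in Content-Range, or None if unavailable."""
    session = session or requests.Session()
    kwargs.setdefault("headers", {})["Range"] = "bytes=0-0"
    r = session.get(url, stream=True, allow_redirects=True, **kwargs)
    r.raise_for_status()
    content_range = r.headers.get("Content-Range", "")   # e.g. "bytes 0-0/123456"
    return int(content_range.rsplit("/", 1)[-1]) if "/" in content_range else None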

AbstractBufferedFile.__fspath__ breaks dask's PyArrow.orc test

With fsspec master, the following dask test fails

$ pytest dask/dataframe/io/tests/test_orc.py::test_orc_with_backend 

The pyarrow error message isn't really helpful. I've traced it down to AbstractBufferedFile.__fspath__. Removing that, the test passes.

diff --git a/fsspec/spec.py b/fsspec/spec.py
index 0b66f99..6567ad9 100644
--- a/fsspec/spec.py
+++ b/fsspec/spec.py
@@ -1162,10 +1162,6 @@ class AbstractBufferedFile(io.IOBase):
 
     def __str__(self):
         return "<File-like object %s, %s>" % (type(self.fs).__name__, self.path)
-
-    def __fspath__(self):
-        return self.fs.protocol + "://" + self.path
-
     __repr__ = __str__
 
     def __enter__(self):

Does __fspath__ make sense on this object? Is it actually present on the filesystem, or just in memory?

xref dask/dask#5267

Support glob paths

Several methods should be able to automatically support paths containing "*" (or lists of paths?), e.g., get/put, delete, copy and move. For methods with a target, the target would be interpreted as a directory.

error when reading empty file using CachingFileSystem

example:

from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(target_protocol='s3')
fs.open('s3://bucket/empty')

{'fn': '01c8b40c487e3c79a7549715f6fff04a15e6b0017bcb83cb98d9278b0b4ed4ec', 'blocks': set()}
Traceback (most recent call last):
  File "/fsspec_test.py", line 3, in <module>
    fs.open('s3://bucket/empty')
  File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/implementations/cached.py", line 157, in <lambda>
    self, *args, **kw
  File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/spec.py", line 669, in open
    autocommit=ac, **kwargs)
  File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/implementations/cached.py", line 157, in <lambda>
    self, *args, **kw
  File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/implementations/cached.py", line 132, in _open
    fn, blocks)
  File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/core.py", line 397, in __init__
    self.cache = self._makefile()
  File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/core.py", line 409, in _makefile
    fd.seek(self.size - 1)
OSError: [Errno 22] Invalid argument
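
The failure looks like a seek to size - 1 on a zero-byte file, i.e. seek(-1); a hedged, standalone illustration of a guard that avoids it (names do not match the real implementation):

import tempfile

def make_cache_file(size):
    """Create a local cache file pre-extended to `size` bytes (size may be 0)."""
    fd = tempfile.TemporaryFile()
    if size:                  # guard: fd.seek(size - 1) would be seek(-1) for size == 0
        fd.seek(size - 1)
        fd.write(b"\0")
        fd.flush()
    return fd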

Add implementations?

Aside from the sibling projects (s3fs, gcsfs...) that this project is meant to unite, there are many other file-system-like things that would be easy to implement. Those things could live directly within this repo, as local, http and memory do, but would make setting up testing increasingly complicated.

  • (s)ftp
  • ssh/scp

Some are already implemented in pyfilesystem2 - can we wrap those or otherwise reuse the code?

possible use case for xmitgcm

We have an obscure package called xmitgcm which provides read-only access to the binary files generated by the mitgcm ocean model in xarray format. The raw data is usually all stored in a single directory, and is a combination of binary data files (possibly large) and text metadata files (always very small).

Currently, the package works only with files stored on disk. Interaction with the filesystem is basically limited to the following operations:

  • os.listdir, to get the files in a particular directory, plus pattern matching (we currently use glob, but it's not necessary)
  • open to read and parse metadata stored in text files
  • os.path.getsize to examine the size of binary objects
  • reading of binary data, as follows:
    • file.seek to point to a specific byte range within the file
    • np.fromfile to load binary data (optionally wrapped in a dask delayed call to make reading lazy). Sometimes invoked with the count argument to read only a specific byte range

We would like to refactor this package to use a more generic idea of filesystems, with the goal of being able to upload the data "as is" directly into cloud storage buckets and have everything work the same.

Based on my limited understanding of cloud storage, this should be doable. The directory listing and text file reading is trivial. The reading of contiguous byte ranges from binary files should also be possible with range requests.

So the only question is, how do we implement this? We could create our own filesystem abstraction and then implement an on-disk class, an s3 class, a gcs class, etc. But I feel that this would be a waste of time, since these things are already done by s3fs, gcsfs, etc.

So I am here seeking a recommendation about the best way to approach this problem.
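
Not a recommendation from the maintainers, just a hedged sketch of how the four operations listed above might map onto fsspec's generic API; the bucket, file names and dtype are placeholders:

import fsspec
import numpy as np

fs = fsspec.filesystem("gcs")                                # or "s3", "file", ...

files = fs.glob("bucket/run01/*.meta")                       # os.listdir + glob
size = fs.info("bucket/run01/T.0000000001.data")["size"]     # os.path.getsize

with fs.open("bucket/run01/available_diagnostics.log", "r") as f:   # text metadata
    header = f.readline()

dtype = np.dtype(">f4")                                      # assumed on-disk dtype
offset, count = 0, 1000
with fs.open("bucket/run01/T.0000000001.data", "rb") as f:   # file.seek + np.fromfile
    f.seek(offset)
    data = np.frombuffer(f.read(count * dtype.itemsize), dtype=dtype)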

cc @raphaeldussin

Stripping of prefix for FSMap

FSMap.__init__ should probably strip the protocol / prefix of the root so that S3Map('s3://mybucket/file') is treated the same as S3Map('mybucket/file').

In [15]: s3 = s3fs.S3FileSystem()

In [16]: map1 = s3fs.S3Map('s3://a/b', s3)

In [17]: map2 = s3fs.S3Map('a/b', s3)

In [18]: assert map1.root == map2.root
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-18-f02e96dd7165> in <module>
----> 1 assert map1.root == map2.root

AssertionError:
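
A hedged sketch of the proposed normalisation; _strip_protocol is the classmethod the backends already use for this, so applying it to root inside FSMap.__init__ would make the two spellings equivalent:

import s3fs

root1 = s3fs.S3FileSystem._strip_protocol("s3://a/b")
root2 = s3fs.S3FileSystem._strip_protocol("a/b")
assert root1 == root2 == "a/b"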

New release

How about pushing a new release to pypi and/or conda-forge with the fixes in the hdfs branch?
