Pythonic file-system for Google Cloud Storage
For documentation, go to readthedocs.
Pythonic file-system interface for Google Cloud Storage
Home Page: http://gcsfs.readthedocs.io/en/latest/
License: BSD 3-Clause "New" or "Revised" License
We should release soon, but I think I would like to include batch-delete first (#69). Any other critical things that should be done?
I have a file of 30M lines and know that each line is at most 2000 bytes long.
If I want to iterate through the "file" (blob), I can limit waste by specifying the number of bytes to fetch, but issuing 30M such requests would incur huge latency.
I wonder whether it would not be better - through an option - to read and cache a full block and serve the lines from there.
This can be done outside of GCSFS, but I thought it might be worth embedding.
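One way to get this behavior outside of GCSFS is to wrap the file object in a large buffered reader, so that line iteration is served from an in-memory block and only occasional large ranged reads hit the store. A minimal sketch (the helper name and default buffer size are mine, not GCSFS API):

```python
import io

def iter_lines(raw, buffer_size=16 * 2**20):
    """Yield lines from a binary file-like object, reading it in large
    buffered chunks rather than issuing one small ranged read per line."""
    buffered = io.BufferedReader(raw, buffer_size=buffer_size)
    for line in buffered:
        yield line
```

With GCSFS this would wrap something like `fs.open('bucket/key', 'rb')`; note that GCSFileSystem also takes a block_size argument, which may already give much of this caching effect.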
In [1]: import gcsfs
In [2]: gcsfs.GCSFileSystem()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-d964e936fddf> in <module>()
----> 1 gcsfs.GCSFileSystem()
/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/gcsfs/core.py in __init__(self, project, access, token, block_size)
158 self.access = access
159 self.dirs = {}
--> 160 self.connect()
161 self._singleton[0] = self
162
/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/gcsfs/core.py in connect(self, refresh)
267 'refresh_token': data['refresh_token'],
268 'grant_type': "refresh_token"})
--> 269 validate_response(r, path)
270 data['timestamp'] = time.time()
271 data['access_token'] = r.json()['access_token']
/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/gcsfs/core.py in validate_response(r, path)
104 raise ValueError("Bad Request: %s" % path)
105 else:
--> 106 raise RuntimeError(msg)
107
108
RuntimeError: b'{\n "error": "deleted_client",\n "error_description": "The OAuth client was deleted."\n}\n'
Looking around on the internet, people seem to suggest that I enable OAuth on my account. This seems odd though because this didn't seem to be necessary before. Any thoughts on what might be going on?
GCS provides size and md5 hash information on written files. At the minimum, the former should be checked against the expected upload size, and optionally the md5 also for a stronger guarantee.
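For illustration, GCS reports the MD5 as a base64-encoded digest of the object bytes, so the verification step could look roughly like this (the helper name is hypothetical, not gcsfs API):

```python
import base64
import hashlib

def check_upload(data, expected_size, expected_md5_b64=None):
    """Verify written bytes against the size and (optionally) the
    base64-encoded MD5 digest that GCS reports for the object."""
    if len(data) != expected_size:
        return False
    if expected_md5_b64 is not None:
        digest = base64.b64encode(hashlib.md5(data).digest()).decode()
        if digest != expected_md5_b64:
            return False
    return True
```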
I'm trying to push a large-ish dataset to GCS via xarray/dask/zarr/gcsfs. Things are generally working during setup and for the first part of the upload. However, after a bit, I'm getting a ConnectionError
that is not recoverable. I'm pushing from a server at the University of Washington to a bucket at "US-CENTRAL1". I would imagine the network at UW is pretty stable.
xarray: jhamman:fix/zarr_set_attrs
dask: 0.16.0
zarr: master
gcsfs: master
Details of the full traceback are in this gist: https://gist.github.com/jhamman/25ddda993ad5b768e4b8289904be6779
Google returns ValueError: Only 'authorized_user' tokens accepted, got: service_account. Is this a limitation, or can it be extended?
Note: @martindurant this is yet another great lib!! Google only implemented cloudstorage
for their GAE Standard. File-like objects are perfect for minimizing memory usage and skipping the need for disk.
I would like to set up GCSFuse in a Docker container pointing to a read-only public bucket. This container will not have any Google authentication by default. Is it possible to create a GCSFileSystem without associating it with a Google account and without using token='cloud'?
In [5]: import pandas as pd
...: import dask.dataframe as dd
...: dd.from_pandas(pd.DataFrame({'x': [1]}), npartitions=1).to_csv(f'gs://{bucket}/test.csv')
...:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/miniconda3/envs/model/lib/python3.6/site-packages/gcsfs/core.py in flush(self, force)
905 if self.location is not None:
--> 906 self._upload_chunk(final=force)
907 if force:
~/miniconda3/envs/model/lib/python3.6/site-packages/gcsfs/core.py in _upload_chunk(self, final)
930 headers=head, data=data)
--> 931 validate_response(r, self.location)
932 if 'Range' in r.headers:
~/miniconda3/envs/model/lib/python3.6/site-packages/gcsfs/core.py in validate_response(r, path)
131 else:
--> 132 raise RuntimeError(m)
133
RuntimeError: b"Invalid request. The number of bytes uploaded is required to be equal or greater than 262144, except for the final request (it's recommended to be the exact multiple of 262144). The received request contained 7 bytes, which does not meet this requirement."
The same behavior happens with a simple string write:
In [8]: with gcsfs.GCSFileSystem(project=project).open(f'{bucket}/test.txt', 'w') as f:
...: f.write('test')
...
RuntimeError: b"Invalid request. The number of bytes uploaded is required to be equal or greater than 262144, except for the final request (it's recommended to be the exact multiple of 262144). The received request contained 4 bytes, which does not meet this requirement."
But weirdly enough, changing the write type to binary resolves this:
In [9]: with gcsfs.GCSFileSystem(project='model-streetlight-toronto').open('persona-upload/test.txt', 'wb') as f:
...: f.write(b'test')
...:
In [10]: # OK
I'm trying to remember off the top of my head if this is new behavior but I'm not sure.
gcsfs==0.0.4
google-cloud-storage==1.7.0
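For context, the error message comes from the GCS resumable-upload protocol: every chunk except the final one must be a non-zero multiple of 256 KiB (262144 bytes); only the final request may be smaller. A small sketch of that rule (the function name is mine, not part of any library):

```python
CHUNK_QUANTUM = 262144  # 256 KiB: the unit GCS requires for non-final chunks

def valid_chunks(sizes):
    """Check a sequence of upload-chunk sizes against the GCS resumable
    upload rule: every chunk except the last must be a non-zero multiple
    of 256 KiB; the final chunk may be any size."""
    *body, final = sizes
    return all(s >= CHUNK_QUANTUM and s % CHUNK_QUANTUM == 0 for s in body)
```

This is presumably why the text-mode write failed: the buffer was flushed as a non-final chunk of only a few bytes.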
I'm trying to read a publicly readable bucket using the GCSMap. My account doesn't have much in the way of permissions, but the bucket is publicly readable.
fs = gcsfs.GCSFileSystem(token='cloud')
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newmann-met-ensemble', gcs=fs)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-1c1ffef3b0ee> in <module>()
1 fs = gcsfs.GCSFileSystem(token='cloud')
----> 2 gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newmann-met-ensemble', gcs=fs)
/opt/conda/lib/python3.6/site-packages/gcsfs/mapping.py in __init__(self, root, gcs, check, create)
41 if create:
42 self.gcs.mkdir(bucket)
---> 43 elif not self.gcs.exists(bucket):
44 raise ValueError("Bucket %s does not exist."
45 " Create bucket with the ``create=True`` keyword" %
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in exists(self, path)
457 return bool(self.info(path))
458 else:
--> 459 return bucket in self.ls('')
460 except FileNotFoundError:
461 return False
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in ls(self, path, detail)
374 def ls(self, path, detail=False):
375 if path in ['', '/']:
--> 376 out = self._list_buckets()
377 else:
378 bucket, prefix = split_path(path)
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in _list_buckets(self)
308 def _list_buckets(self):
309 if '' not in self.dirs:
--> 310 out = self._call('get', 'b/', project=self.project)
311 dirs = out.get('items', [])
312 self.dirs[''] = dirs
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
303 except ValueError:
304 out = r.content
--> 305 validate_response(r, path)
306 return out
307
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in validate_response(r, path)
107 raise IOError("Forbidden: %s\n%s" % (path, msg))
108 elif "invalid" in m:
--> 109 raise ValueError("Bad Request: %s\n%s" % (path, msg))
110 else:
111 raise RuntimeError(m)
ValueError: Bad Request: b/
Invalid argument
@asford https://github.com/dask/gcsfs/blob/master/gcsfs/core.py#L559 - shouldn't this be moved to the line above, i.e. "items": [self._process_object(bucket, i) for i in items]? As it is, the items are being processed and then ignored.
I am trying to use the latest gcsfs master on pangeo.pydata.org (see pangeo-data/pangeo#112 for context).
From within the notebook, I do
! pip install --upgrade --user git+https://github.com/dask/gcsfs.git
When I try import gcsfs, I get the following traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-12-3f25f74e3f1b> in <module>()
----> 1 import gcsfs
~/.local/lib/python3.6/site-packages/gcsfs/__init__.py in <module>()
1 from __future__ import absolute_import
2
----> 3 from .core import GCSFileSystem
4 from .dask_link import register as register_dask
5 from .mapping import GCSMap
~/.local/lib/python3.6/site-packages/gcsfs/core.py in <module>()
985
986
--> 987 GCSFileSystem.load_tokens()
988
989
~/.local/lib/python3.6/site-packages/gcsfs/core.py in load_tokens()
307 tokens = {k: (GCSFileSystem._dict_to_credentials(v)
308 if isinstance(v, dict) else v)
--> 309 for k, v in tokens.items()}
310 except IOError:
311 tokens = {}
~/.local/lib/python3.6/site-packages/gcsfs/core.py in <dictcomp>(.0)
307 tokens = {k: (GCSFileSystem._dict_to_credentials(v)
308 if isinstance(v, dict) else v)
--> 309 for k, v in tokens.items()}
310 except IOError:
311 tokens = {}
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _dict_to_credentials(token)
335 """
336 return Credentials(
--> 337 None, refresh_token=token['refresh_token'],
338 client_secret=token['client_secret'],
339 client_id=token['client_id'],
KeyError: 'refresh_token'
If gcsfs is to be used in the context of dask, it should not pass around the authorization token when pickled. Instead, we should require the user to have distributed a token file to each machine beforehand or otherwise set up permissions on each node, as we do for s3fs.
After overcoming #82, I am on to a new error.
I set my GOOGLE_APPLICATION_CREDENTIALS to point to an appropriate .json file. I verified that the config is working with the following code:
In [10]: from google.cloud import storage
In [11]: storage_client = storage.Client()
In [12]: list(storage_client.list_buckets())
Out[12]:
[<Bucket: pangeo>,
<Bucket: pangeo-data>,
<Bucket: pangeo-data-private>,
<Bucket: zarr_store_test>]
Now I am trying to do the same thing from gcsfs. I get an error: invalid_scope: Empty or missing scope not allowed.
In [8]: fs = gcsfs.GCSFileSystem()
In [9]: fs.buckets
DEBUG:gcsfs.core:_list_buckets(args=(), kwargs={})
DEBUG:gcsfs.core:_call(args=('get', 'b/'), kwargs={'project': 'pangeo-181919'})
ERROR:gcsfs.core:_call exception: ('invalid_scope: Empty or missing scope not allowed.', '{\n "error" : "invalid_scope",\n "error_description" : "Empty or missing scope not allowed."\n}')
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py", line 431, in _call
r = meth(self.base + path, params=kwargs, json=json)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py", line 521, in get
return self.request('GET', url, **kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py", line 198, in request
self._auth_request, method, url, request_headers)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/credentials.py", line 121, in before_request
self.refresh(request)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/service_account.py", line 322, in refresh
request, self._token_uri, assertion)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py", line 145, in jwt_grant
response_data = _token_endpoint_request(request, token_uri, body)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py", line 111, in _token_endpoint_request
_handle_error_response(response_body)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py", line 61, in _handle_error_response
error_details, response_body)
google.auth.exceptions.RefreshError: ('invalid_scope: Empty or missing scope not allowed.', '{\n "error" : "invalid_scope",\n "error_description" : "Empty or missing scope not allowed."\n}')
(the same "invalid_scope" ERROR and traceback are logged three more times, once per retry)
---------------------------------------------------------------------------
RefreshError Traceback (most recent call last)
<ipython-input-9-bb76ad6e7216> in <module>()
----> 1 fs.buckets
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in buckets(self)
449 def buckets(self):
450 """Return list of available project buckets."""
--> 451 return [b["name"] for b in self._list_buckets()["items"]]
452
453 @classmethod
<decorator-gen-122> in _list_buckets(self)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _list_buckets(self)
568 items = []
569 page = self._call(
--> 570 'get', 'b/', project=self.project
571 )
572
<decorator-gen-117> in _call(self, method, path, *args, **kwargs)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
435 logger.exception("_call exception: %s", e)
436 if retry == self.retries - 1:
--> 437 raise e
438 if is_retriable(e):
439 # retry
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
429 try:
430 time.sleep(2**retry - 1)
--> 431 r = meth(self.base + path, params=kwargs, json=json)
432 validate_response(r, path)
433 break
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in get(self, url, **kwargs)
519
520 kwargs.setdefault('allow_redirects', True)
--> 521 return self.request('GET', url, **kwargs)
522
523 def options(self, url, **kwargs):
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py in request(self, method, url, data, headers, **kwargs)
196
197 self.credentials.before_request(
--> 198 self._auth_request, method, url, request_headers)
199
200 response = super(AuthorizedSession, self).request(
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/credentials.py in before_request(self, request, method, url, headers)
119 # the http request.)
120 if not self.valid:
--> 121 self.refresh(request)
122 self.apply(headers)
123
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/service_account.py in refresh(self, request)
320 assertion = self._make_authorization_grant_assertion()
321 access_token, expiry, _ = _client.jwt_grant(
--> 322 request, self._token_uri, assertion)
323 self.token = access_token
324 self.expiry = expiry
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py in jwt_grant(request, token_uri, assertion)
143 }
144
--> 145 response_data = _token_endpoint_request(request, token_uri, body)
146
147 try:
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py in _token_endpoint_request(request, token_uri, body)
109
110 if response.status != http_client.OK:
--> 111 _handle_error_response(response_body)
112
113 response_data = json.loads(response_body)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py in _handle_error_response(response_body)
59
60 raise exceptions.RefreshError(
---> 61 error_details, response_body)
62
63
RefreshError: ('invalid_scope: Empty or missing scope not allowed.', '{\n "error" : "invalid_scope",\n "error_description" : "Empty or missing scope not allowed."\n}')
I am using the latest gcsfs master, installed from pip + git.
It'd be nice to be able to use this library with gcloud's default credentials, instead of having to juggle token files.
Although there is no batch delete as such, GCS does have the concept of a batch request: https://cloud.google.com/storage/docs/json_api/v1/how-tos/batch
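Following the format described in that how-to, a batch delete is a single POST of a multipart/mixed body to the batch endpoint, whose parts are individual DELETE sub-requests. A rough sketch of building such a body (the helper name and boundary string are arbitrary; endpoint and part layout per the linked docs):

```python
from urllib.parse import quote

def batch_delete_body(bucket, keys, boundary="batch_gcsfs_1"):
    """Build a multipart/mixed body for the GCS JSON-API batch endpoint
    (POST https://storage.googleapis.com/batch/storage/v1), containing
    one DELETE sub-request per object key."""
    parts = []
    for i, key in enumerate(keys, start=1):
        parts.append(
            "--{}\r\n"
            "Content-Type: application/http\r\n"
            "Content-ID: <{}>\r\n\r\n"
            "DELETE /storage/v1/b/{}/o/{} HTTP/1.1\r\n\r\n".format(
                boundary, i, bucket, quote(key, safe="")))
    return "".join(parts) + "--{}--".format(boundary)
```

The POST itself would carry a Content-Type: multipart/mixed; boundary=... header matching the boundary used in the body.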
I am trying to use gcsfs via distributed in pangeo-data/pangeo#150. I have uncovered what seems like a serialization bug.
This works from my notebook (the token appears to be cached):
fs = gcsfs.GCSFileSystem(project='pangeo-181919')
fs.buckets
It returns the four buckets: ['pangeo', 'pangeo-data', 'pangeo-data-private', 'zarr_store_test'].
Now I create a distributed cluster and client and use it to run the same command:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(processes=False)
client = Client(cluster)
client.run(lambda : fs.buckets)
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-25-3de328a517c7> in <module>()
----> 1 client.run(lambda : fs.buckets)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/client.py in run(self, function, *args, **kwargs)
1906 '192.168.0.101:9000': 'running}
1907 """
-> 1908 return self.sync(self._run, function, *args, **kwargs)
1909
1910 @gen.coroutine
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
601 return future
602 else:
--> 603 return sync(self.loop, func, *args, **kwargs)
604
605 def __repr__(self):
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
251 e.wait(10)
252 if error[0]:
--> 253 six.reraise(*error[0])
254 else:
255 return result[0]
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/utils.py in f()
235 yield gen.moment
236 thread_state.asynchronous = True
--> 237 result[0] = yield make_coro()
238 except Exception as exc:
239 logger.exception(exc)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/util.py in raise_exc_info(exc_info)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py in run(self)
1067 exc_info = None
1068 else:
-> 1069 yielded = self.gen.send(value)
1070
1071 if stack_context._state.contexts is not orig_stack_contexts:
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/client.py in _run(self, function, *args, **kwargs)
1860 results[key] = resp['result']
1861 elif resp['status'] == 'error':
-> 1862 six.reraise(*clean_exception(**resp))
1863 raise gen.Return(results)
1864
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/six.py in reraise(tp, value, tb)
690 value = tp()
691 if value.__traceback__ is not tb:
--> 692 raise value.with_traceback(tb)
693 raise value
694 finally:
<ipython-input-25-3de328a517c7> in <lambda>()
----> 1 client.run(lambda : fs.buckets)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in buckets()
449 def buckets(self):
450 """Return list of available project buckets."""
--> 451 return [b["name"] for b in self._list_buckets()["items"]]
452
453 @classmethod
<decorator-gen-128> in _list_buckets()
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _tracemethod()
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _list_buckets()
568 items = []
569 page = self._call(
--> 570 'get', 'b/', project=self.project
571 )
572
<decorator-gen-123> in _call()
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _tracemethod()
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _call()
429 try:
430 time.sleep(2**retry - 1)
--> 431 r = meth(self.base + path, params=kwargs, json=json)
432 validate_response(r, path)
433 break
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/requests/sessions.py in get()
519
520 kwargs.setdefault('allow_redirects', True)
--> 521 return self.request('GET', url, **kwargs)
522
523 def options(self, url, **kwargs):
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/google/auth/transport/requests.py in request()
195 request_headers = headers.copy() if headers is not None else {}
196
--> 197 self.credentials.before_request(
198 self._auth_request, method, url, request_headers)
199
AttributeError: 'AuthorizedSession' object has no attribute 'credentials'
I am trying to use gcsfs to access GCS from the NASA Pleiades supercomputer.
I tried to initialize a gcsfs mapping before having my credentials set up properly (although I thought my default credentials were valid).
I was able to get past this error by setting the GOOGLE_APPLICATION_CREDENTIALS
environment variable as described in the Google Cloud Storage docs.
However, the info I got from gcsfs was not very helpful.
In [1]: import logging
...: logging.basicConfig(
...: format="%(created)0.3f %(levelname)s %(name)s %(message)s",
...: level=logging.INFO)
...: logging.getLogger("gcsfs.core").setLevel(logging.DEBUG)
...: logging.getLogger("gcsfs.gcsfs").setLevel(logging.DEBUG)
In [2]: import gcsfs
In [3]: fs = gcsfs.GCSFileSystem(project='pangeo-181919')
1519678908.044 INFO google.auth.compute_engine._metadata Compute Engine Metadata server unavailable.
1519678908.044 DEBUG gcsfs.core Connection with method "google_default" failed
In [4]: gcsmap = gcsfs.mapping.GCSMap('pangeo-data', gcs=fs)
1519679002.503 DEBUG gcsfs.core exists(args=('pangeo-data',), kwargs={})
1519679002.503 DEBUG gcsfs.core _list_buckets(args=(), kwargs={})
1519679002.503 DEBUG gcsfs.core _call(args=('get', 'b/'), kwargs={'project': 'pangeo-181919'})
1519679002.519 ERROR gcsfs.core _call exception: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py", line 166, in connect
conn = self._new_conn()
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
timeout=timeout
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/util/retry.py", line 388, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py", line 120, in __call__
**kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py", line 508, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 90, in refresh
self._retrieve_info(request)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 72, in _retrieve_info
service_account=self._service_account_email)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 179, in get_service_account_info
recursive=True)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 115, in get
response = request(url=url, method='GET', headers=_METADATA_HEADERS)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py", line 124, in __call__
six.raise_from(new_exc, caught_exc)
File "<string>", line 3, in raise_from
google.auth.exceptions.TransportError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py", line 431, in _call
r = meth(self.base + path, params=kwargs, json=json)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py", line 521, in get
return self.request('GET', url, **kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py", line 198, in request
self._auth_request, method, url, request_headers)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/credentials.py", line 121, in before_request
self.refresh(request)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
six.raise_from(new_exc, caught_exc)
File "<string>", line 3, in raise_from
google.auth.exceptions.RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
1519679004.039 ERROR gcsfs.core _call exception: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57d30>: Failed to establish a new connection: [Errno -2] Name or service not known',))
1519679007.052 ERROR gcsfs.core _call exception: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca72f98>: Failed to establish a new connection: [Errno -2] Name or service not known',))
1519679016.526 ERROR gcsfs.core _call exception: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
---------------------------------------------------------------------------
gaierror Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py in _new_conn(self)
140 conn = connection.create_connection(
--> 141 (self.host, self.port), self.timeout, **extra_kw)
142
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
59
---> 60 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
61 af, socktype, proto, canonname, sa = res
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/socket.py in getaddrinfo(host, port, family, type, proto, flags)
744 addrlist = []
--> 745 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
746 af, socktype, proto, canonname, sa = res
gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
NewConnectionError Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
600 body=body, headers=headers,
--> 601 chunked=chunked)
602
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
356 else:
--> 357 conn.request(method, url, **httplib_request_kw)
358
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in request(self, method, url, body, headers, encode_chunked)
1238 """Send a complete request to the server."""
-> 1239 self._send_request(method, url, body, headers, encode_chunked)
1240
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
1284 body = _encode(body, 'body')
-> 1285 self.endheaders(body, encode_chunked=encode_chunked)
1286
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in endheaders(self, message_body, encode_chunked)
1233 raise CannotSendHeader()
-> 1234 self._send_output(message_body, encode_chunked=encode_chunked)
1235
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in _send_output(self, message_body, encode_chunked)
1025 del self._buffer[:]
-> 1026 self.send(msg)
1027
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in send(self, data)
963 if self.auto_open:
--> 964 self.connect()
965 else:
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py in connect(self)
165 def connect(self):
--> 166 conn = self._new_conn()
167 self._prepare_conn(conn)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py in _new_conn(self)
149 raise NewConnectionError(
--> 150 self, "Failed to establish a new connection: %s" % e)
151
NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
MaxRetryError Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
439 retries=self.max_retries,
--> 440 timeout=timeout
441 )
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
638 retries = retries.increment(method, url, error=e, _pool=self,
--> 639 _stacktrace=sys.exc_info()[2])
640 retries.sleep()
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
387 if new_retry.is_exhausted():
--> 388 raise MaxRetryError(_pool, url, error or ResponseError(cause))
389
MaxRetryError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py in __call__(self, url, method, body, headers, timeout, **kwargs)
119 method, url, data=body, headers=headers, timeout=timeout,
--> 120 **kwargs)
121 return _Response(response)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
507 send_kwargs.update(settings)
--> 508 resp = self.send(prep, **send_kwargs)
509
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
617 # Send the request
--> 618 r = adapter.send(request, **kwargs)
619
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
507
--> 508 raise ConnectionError(e, request=request)
509
ConnectionError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
The above exception was the direct cause of the following exception:
TransportError Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py in refresh(self, request)
89 try:
---> 90 self._retrieve_info(request)
91 self.token, self.expiry = _metadata.get_service_account_token(
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py in _retrieve_info(self, request)
71 request,
---> 72 service_account=self._service_account_email)
73
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py in get_service_account_info(request, service_account)
178 'instance/service-accounts/{0}/'.format(service_account),
--> 179 recursive=True)
180
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py in get(request, path, root, recursive)
114
--> 115 response = request(url=url, method='GET', headers=_METADATA_HEADERS)
116
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py in __call__(self, url, method, body, headers, timeout, **kwargs)
123 new_exc = exceptions.TransportError(caught_exc)
--> 124 six.raise_from(new_exc, caught_exc)
125
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
TransportError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
The above exception was the direct cause of the following exception:
RefreshError Traceback (most recent call last)
<ipython-input-4-8e538214a4de> in <module>()
----> 1 gcsmap = gcsfs.mapping.GCSMap('pangeo-data', gcs=fs)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/mapping.py in __init__(self, root, gcs, check, create)
41 if create:
42 self.gcs.mkdir(bucket)
---> 43 elif not self.gcs.exists(bucket):
44 raise ValueError("Bucket %s does not exist."
45 " Create bucket with the ``create=True`` keyword" %
<decorator-gen-131> in exists(self, path)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in exists(self, path)
768 return bool(self.info(path))
769 else:
--> 770 if bucket in self.buckets:
771 return True
772 else:
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in buckets(self)
449 def buckets(self):
450 """Return list of available project buckets."""
--> 451 return [b["name"] for b in self._list_buckets()["items"]]
452
453 @classmethod
<decorator-gen-122> in _list_buckets(self)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _list_buckets(self)
568 items = []
569 page = self._call(
--> 570 'get', 'b/', project=self.project
571 )
572
<decorator-gen-117> in _call(self, method, path, *args, **kwargs)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
435 logger.exception("_call exception: %s", e)
436 if retry == self.retries - 1:
--> 437 raise e
438 if is_retriable(e):
439 # retry
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
429 try:
430 time.sleep(2**retry - 1)
--> 431 r = meth(self.base + path, params=kwargs, json=json)
432 validate_response(r, path)
433 break
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in get(self, url, **kwargs)
519
520 kwargs.setdefault('allow_redirects', True)
--> 521 return self.request('GET', url, **kwargs)
522
523 def options(self, url, **kwargs):
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py in request(self, method, url, data, headers, **kwargs)
196
197 self.credentials.before_request(
--> 198 self._auth_request, method, url, request_headers)
199
200 response = super(AuthorizedSession, self).request(
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/credentials.py in before_request(self, request, method, url, headers)
119 # the http request.)
120 if not self.valid:
--> 121 self.refresh(request)
122 self.apply(headers)
123
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py in refresh(self, request)
94 except exceptions.TransportError as caught_exc:
95 new_exc = exceptions.RefreshError(caught_exc)
---> 96 six.raise_from(new_exc, caught_exc)
97
98 @property
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I used gcsfs to read about 800 files and it failed for 3 of them with this error:
Connection broken: error("(104, \'ECONNRESET\')",)
The call stack looks like this:
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 536, in open
return GCSFile(self, path, mode, block_size)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 622, in __init__
self.details = gcsfs.info(path)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 464, in info
files = self.ls(path, True)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 379, in ls
files = self._list_bucket(bucket)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 320, in _list_bucket
out = self._call('get', 'b/{}/o/', bucket, maxResults=max_results, pageToken=next_page_token)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 299, in _call
json=json)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 658, in send
r.content
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 823, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 748, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: error("(104, \'ECONNRESET\')",)', error("(104, 'ECONNRESET')",))
From the outside, it looks like GCS sometimes suffers transient errors, and GCSFS ought to be able to survive those and retry.
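A minimal sketch of what such a retry loop could look like (the helper name, exception choice, and back-off parameters here are illustrative assumptions, not gcsfs API):

```python
import time


def call_with_retries(func, *args, retries=5, base_delay=0.1, **kwargs):
    """Retry func on transient connection errors with exponential back-off."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except ConnectionError:
            if attempt == retries - 1:
                raise  # retries exhausted; surface the real error
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
```

A real implementation would also catch the requests-level exceptions seen above (e.g. `ChunkedEncodingError`) and limit retries to errors known to be transient.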
Original discussion here
Quick summary: when reading a single file from a bucket containing a very large number of files, gcsfs first downloads the list of all files in the bucket, which can make the read very slow. For a bucket with over 4 million files, this can take up to 15 minutes.
Suggested fix (easy): check whether the object path contains a wildcard. If it does not, just download the file directly. If it does, then proceed to download the list of files.
Suggested fix (hard, not sure how to do yet, suggested by @martindurant ): implement prefix/delimiter listing, i.e., make it possible to use a wildcard in the object path (e.g., bag.read_text('gs://mybucket/2017/01/*.csv')) without having to download the list of files for the entire bucket.
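The "easy" fix could be sketched as a pre-check on the path, with the longest literal prefix extracted for a later prefix-based listing (helper names here are hypothetical, not gcsfs API):

```python
import re

# characters that mark a glob pattern
_WILDCARD = re.compile(r"[*?\[\]]")


def needs_listing(path):
    """Return True only if the path contains a glob wildcard; plain
    object paths can be fetched with a single GET instead of listing
    the whole bucket."""
    return bool(_WILDCARD.search(path))


def split_prefix(path):
    """Split a glob path into (literal prefix, glob remainder), so a
    prefix-based listing can narrow the server-side search."""
    m = _WILDCARD.search(path)
    if m is None:
        return path, ""
    return path[:m.start()], path[m.start():]
```

The "hard" fix would then pass the literal prefix as the `prefix` parameter of the objects-list call, per the JSON API docs linked above.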
I am trying to use gcsfs with a GCS service account .json token. I created a token at https://console.cloud.google.com/iam-admin/serviceaccounts/ and assigned it the role of "Storage Admin". This should have permissions to do anything to my GCS resources. I downloaded the .json token.
I use this with gcsfs as follows:
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token='/home/rpa/pangeo-bf62fe06ed97.json')
fs.buckets
I get this error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-14-81fa34b27c7f> in <module>()
1 # connect to gcs
2 fs = gcsfs.GCSFileSystem(project='pangeo-181919', token='/home/rpa/pangeo-bf62fe06ed97.json')
----> 3 fs.buckets
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in buckets(self)
449 def buckets(self):
450 """Return list of available project buckets."""
--> 451 return [b["name"] for b in self._list_buckets()["items"]]
452
453 @classmethod
<decorator-gen-128> in _list_buckets(self)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _list_buckets(self)
568 items = []
569 page = self._call(
--> 570 'get', 'b/', project=self.project
571 )
572
<decorator-gen-123> in _call(self, method, path, *args, **kwargs)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
430 time.sleep(2**retry - 1)
431 r = meth(self.base + path, params=kwargs, json=json)
--> 432 validate_response(r, path)
433 break
434 except (HtmlError, RequestException, GoogleAuthError) as e:
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in validate_response(r, path)
156 raise FileNotFoundError(path)
157 elif "forbidden" in m:
--> 158 raise IOError("Forbidden: %s\n%s" % (path, msg))
159 elif "invalid" in m:
160 raise ValueError("Bad Request: %s\n%s" % (path, msg))
OSError: Forbidden: b/
[email protected] does not have storage.buckets.list access to project 464800473488.
This doesn't make sense to me. It seems like listing buckets should definitely be within the privileges of the storage admin.
The following causes both terminals to hang in such a way that they cannot be interrupted
$ pip install gcsfs --upgrade
$ mkdir gcs
$ gcsfuse pangeo-data gcs
Mounting bucket pangeo-data to directory gcs
$ ls gcs
When using gcsfs from master it is difficult to tell which commit I'm on. Any thoughts on switching to versioneer for versions? Dask and distributed both do this and could serve as models.
When checking on bucket existence, I get
ValueError: Bad Request: b/
Unknown project id: 0
The issue is actually with _list_buckets() and the get call used to list buckets. The API requires passing the project ID, not the project name - see https://cloud.google.com/storage/docs/json_api/v1/buckets/list and https://cloud.google.com/storage/docs/projects
BTW, that ID is also required to create a bucket, so mkdir() has the same issue.
This was not spotted before if you happened to have project name = ID.
Ancillary issue that would likely be linked to #59
As per the module documentation, oauth2client is officially deprecated in favor of google-auth. As google-auth is an officially supported project with a stable API, any further development of the gcsfs authorization system should target this library.
The current oauth implementation should likely be replaced by a standard implementation using google-auth for token management and the related google-auth-oauthlib.flow for device authorization.
Sorry for logging another one!
This one seems related to the cached credentials.
File "g.py", line 61, in <module>
recs += 1
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 896, in __exit__
self.close()
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 868, in close
self.flush(force=True)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 750, in flush
self._simple_upload()
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 810, in _simple_upload
validate_response(r, path)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 111, in validate_response
raise RuntimeError(m)
RuntimeError: {
"error": {
"errors": [
{
"domain": "global",
"reason": "authError",
"message": "Invalid Credentials",
"locationType": "header",
"location": "Authorization"
}
],
"code": 401,
"message": "Invalid Credentials"
}
}
I wonder if increasing the block size to, say, 50MB would not reduce the likelihood of these errors (overload, 'corrupted' credentials, ...)
Still having problems on pangeo.pydata.org using the latest gcsfs installed this way from my notebook.
! pip install --upgrade --user git+https://github.com/dask/gcsfs.git
From the xarray-data example notebook.
import xarray as xr
import gcsfs
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newman-met-ensemble')
ds = xr.open_zarr(gcsmap)
produces this error
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-4-6411cc50b0ab> in <module>()
6 import gcsfs
7 gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newman-met-ensemble')
----> 8 ds = xr.open_zarr(gcsmap)
9
10 # Print dataset
/opt/conda/lib/python3.6/site-packages/xarray/backends/zarr.py in open_zarr(store, group, synchronizer, auto_chunk, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables)
470 zarr_store = ZarrStore.open_group(store, mode=mode,
471 synchronizer=synchronizer,
--> 472 group=group)
473 ds = maybe_decode_store(zarr_store)
474
/opt/conda/lib/python3.6/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, writer)
276 import zarr
277 zarr_group = zarr.open_group(store=store, mode=mode,
--> 278 synchronizer=synchronizer, path=group)
279 return cls(zarr_group, writer=writer)
280
/opt/conda/lib/python3.6/site-packages/zarr/hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path)
1108
1109 if mode in ['r', 'r+']:
-> 1110 if contains_array(store, path=path):
1111 err_contains_array(path)
1112 elif not contains_group(store, path=path):
/opt/conda/lib/python3.6/site-packages/zarr/storage.py in contains_array(store, path)
70 prefix = _path_to_prefix(path)
71 key = prefix + array_meta_key
---> 72 return key in store
73
74
~/.local/lib/python3.6/site-packages/gcsfs/mapping.py in __contains__(self, key)
86
87 def __contains__(self, key):
---> 88 return self.gcs.exists(self._key_to_str(key))
89
90 def __len__(self):
<decorator-gen-137> in exists(self, path)
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
51 logger.log(logging.DEBUG - 1, tb_io.getvalue())
52
---> 53 return f(self, *args, **kwargs)
54
55 # client created 23-Sept-2017
~/.local/lib/python3.6/site-packages/gcsfs/core.py in exists(self, path)
771 try:
772 if key:
--> 773 return bool(self.info(path))
774 else:
775 if bucket in self.buckets:
<decorator-gen-138> in info(self, path)
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
51 logger.log(logging.DEBUG - 1, tb_io.getvalue())
52
---> 53 return f(self, *args, **kwargs)
54
55 # client created 23-Sept-2017
~/.local/lib/python3.6/site-packages/gcsfs/core.py in info(self, path)
803
804 try:
--> 805 return self._get_object(path)
806 except FileNotFoundError:
807 logger.debug("info FileNotFound at path: %s", path)
<decorator-gen-126> in _get_object(self, path)
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
51 logger.log(logging.DEBUG - 1, tb_io.getvalue())
52
---> 53 return f(self, *args, **kwargs)
54
55 # client created 23-Sept-2017
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _get_object(self, path)
485 raise FileNotFoundError(path)
486
--> 487 result = self._process_object(bucket, self._call('get', 'b/{}/o/{}', bucket, key))
488
489 logger.debug("_get_object result: %s", result)
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
434 time.sleep(2**retry - 1)
435 r = meth(self.base + path, params=kwargs, json=json)
--> 436 validate_response(r, path)
437 break
438 except (HtmlError, RequestException, GoogleAuthError) as e:
~/.local/lib/python3.6/site-packages/gcsfs/core.py in validate_response(r, path)
154 msg = str(r.content)
155
--> 156 if DEBUG:
157 print(r.url, r.headers, sep='\n')
158 if "Not Found" in m:
NameError: name 'DEBUG' is not defined
In anonymous mode, or when accessing a bucket not included in our project, the bucket will not appear in the buckets list (or, indeed, listing buckets may not be possible). In this case, exists() on a bucket should instead attempt to list the bucket and return True if that succeeds.
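A rough sketch of that fallback, assuming a `_call('get', 'b/{bucket}/o/')` listing endpoint like the one visible in the tracebacks in this thread (the helper name and exact call are hypothetical, not gcsfs's actual implementation):

```python
def bucket_exists(fs, bucket):
    """Check bucket existence with a fallback: if the project's bucket
    list is unavailable (anonymous access, bucket owned by another
    project), try listing the bucket itself and treat success as
    existence."""
    try:
        if bucket in fs.buckets:
            return True
    except Exception:
        pass  # listing project buckets may itself be forbidden
    try:
        fs._call('get', 'b/{}/o/'.format(bucket), maxResults=1)
        return True
    except Exception:
        return False
```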
Is it possible to make a release of the library on PyPI?
My script runs fine remotely - from laptop - but when running on a GCE instance, I get the following error:
File "g.py", line 56, in <module>
recs += 1
File "xxx/gcsfs/core.py", line 896, in __exit__
self.close()
File "xxx/gcsfs/core.py", line 868, in close
self.flush(force=True)
File "xxx/gcsfs/core.py", line 754, in flush
self._initiate_upload()
File "xxx/gcsfs/core.py", line 798, in _initiate_upload
self.location = r.headers['Location']
File "xxx/requests/structures.py", line 54, in __getitem__
return self._store[key.lower()][1]
KeyError: 'location'
Exception ValueError: ValueError('Force flush cannot be called more than once',) in <bound method GCSFile.__del__ of <GCSFile testfile>> ignored
Code snippet is:
...
with fs.open(seg_filename, 'wb') as seg_f:
    recs = 0
    while line.startswith(key):
        seg_f.write(line[line_offset:])
        line = f.readline()
        recs += 1
logger.info('Done writing {}: {:,} records'.format(seg_filename, recs))
Unrelated, but I also get a warning: No module named dask.bytes.core
I have faced several times the same 502 error when writing in chunks. Errors are sporadic and look like:
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 896, in __exit__
self.close()
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 868, in close
self.flush(force=True)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 755, in flush
self._upload_chunk(final=force)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 777, in _upload_chunk
validate_response(r, self.location)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 111, in validate_response
raise RuntimeError(m)
RuntimeError: <!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
<title>Error 502 (Server Error)!!1</title>
<style>
*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
</style>
<a href=//www.google.com/><span id=logo aria-label=Google></span></a>
<p><b>502.</b> <ins>That's an error.</ins>
<p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds. <ins>That's all we know.</ins>
Exception ValueError: ValueError('Force flush cannot be called more than once',) in <bound method GCSFile.__del__ of <GCSFile tempfile.txt>> ignored
Google writes this may happen when the system/network is under stress - our script does indeed read and write several GB of files from GCS in chunks, after on-the-fly transformations.
Look at "Handling Errors" at https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload. They indicate to implement exponential backoff: https://cloud.google.com/storage/docs/exponential-backoff
Is this something that can be implemented in GCSFS, i.e. resubmit a chunk when such a 5xx error occurs?
If not, wrapping the write statement in a try/except, deleting the target file written so far and reprocessing is not an option for me, because the records I write are based off data in a compressed GCSFS file I read. ZipFileExt objects are not seekable, so what is read is gone; I would have to delete the whole batch and restart.
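A sketch of what resubmitting a chunk with exponential back-off might look like (names and parameters are illustrative; a real implementation would also re-query the resumable session to find the last committed byte, as the Google docs describe):

```python
import random
import time


def upload_chunk_with_backoff(do_upload, max_tries=6, base_delay=1.0):
    """Resubmit a chunk on 5xx responses, per the GCS exponential
    back-off guidance. `do_upload` performs one PUT of the current
    chunk and returns a response with a `status_code` attribute."""
    for attempt in range(max_tries):
        r = do_upload()
        if r.status_code < 500:
            return r  # success, or a non-retriable client error
        if attempt == max_tries - 1:
            raise RuntimeError("upload failed after %d attempts" % max_tries)
        # wait 1s, 2s, 4s, ... plus random jitter before retrying
        time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```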
More a question than an issue: I found this while looking for an alternative to s3fs (dask/s3fs) to be used as a zarr backend. We are looking to make our processing less dependent on AWS. Would you say gcsfs is already usable as an s3fs replacement for zarr?
(edit: sorry for the confusion, I typed 's3ql' instead of 's3fs' the first time. Corrected now.)
The implementation of walk in gcsfs returns a flat list of all paths encountered when traversing the tree. In contrast, os.walk from the standard library returns an iterator of (dirname, dirs, files), where dirname is the current directory prefix in the traversal, and files and dirs are the names (relative, not absolute) of all files and directories respectively.
This difference in behavior is confusing for those familiar with the standard os.walk method, and also makes it tricky to provide the interface used by pyarrow (which does match the standard library).
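For reference, the flat listing can be converted into os.walk-style triples purely client-side; a sketch (the function name is hypothetical, not part of either library):

```python
import posixpath
from collections import defaultdict


def walk_triples(paths):
    """Convert a flat list of object paths (as gcsfs.walk returns)
    into (dirname, dirs, files) triples in the style of os.walk."""
    dirs = defaultdict(set)
    files = defaultdict(list)
    for path in paths:
        parent, name = posixpath.split(path)
        files[parent].append(name)
        # register every ancestor directory on the way up
        while parent:
            grand, child = posixpath.split(parent)
            dirs[grand].add(child)
            parent = grand
    for dirname in sorted(set(dirs) | set(files)):
        yield dirname, sorted(dirs[dirname]), sorted(files[dirname])
```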
@quasiben and I conda-installed gcsfs in a fresh environment. We found that we needed dask-core and toolz for things to function properly.
When the bucket in the passed path does not exist, GCSFile fails on close (or when the buffer is flushed) with a generic error like
Exception KeyError: ('location',) in <bound method GCSFile.__del__ of <GCSFile new_bucket/test>> ignored
GCSFile.__init__ should check on bucket existence and either create it (preferred) or explicitly error out, e.g. "Bucket xxx does not exist". The first option is preferred to minimize pre-processing, but requires parameters for storage class & location, although these could be given defaults.
Btw - do not use parentheses in a folder name; you cannot browse it in the Google Console (bug reported).
I am trying to create a new GCSFileSystem with the command
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
This triggers my browser to open and I have to enter the code. But I can never seem to finish it fast enough before I get the following error:
---------------------------------------------------------------------------
ApplicationDefaultCredentialsError Traceback (most recent call last)
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/gcsfs/core.py in connect(self, refresh)
241 try:
--> 242 data = self.get_default_gtoken()
243 self.tokens[(project, access)] = data
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/gcsfs/core.py in get_default_gtoken()
303 def get_default_gtoken():
--> 304 au = oauth2client.client.GoogleCredentials.get_application_default()
305 tok = {"client_id": au.client_id, "client_secret": au.client_secret,
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/oauth2client/client.py in get_application_default()
1270 """
-> 1271 return GoogleCredentials._get_implicit_credentials()
1272
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/oauth2client/client.py in _get_implicit_credentials(cls)
1260 # If no credentials, fail.
-> 1261 raise ApplicationDefaultCredentialsError(ADC_HELP_MSG)
1262
ApplicationDefaultCredentialsError: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-2-f5bb2896043c> in <module>()
----> 1 fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/gcsfs/core.py in __init__(self, project, access, token, block_size, consistency)
178 self.consistency = consistency
179 self.dirs = {}
--> 180 self.connect()
181 self._singleton[0] = self
182
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/gcsfs/core.py in connect(self, refresh)
267 if 'error' in data2:
268 if i == self.retries - 1:
--> 269 raise RuntimeError("Waited too long for browser"
270 "authentication.")
271 continue
RuntimeError: Waited too long for browserauthentication.
gcsfs propagates a google.auth.exceptions.RefreshError when executing many concurrent requests from a single node using the google_default credentials class. This is likely due to a repeated, excessive number of requests to the internal metadata service. This is a known bug of the external library at googleapis/google-auth-library-python#211.
Anecdotally, I've primarily observed this in dask.distributed workers and believe it might occur due to the way GCSFiles are distributed. This primarily occurs when a large number of small files are being read from storage and many worker threads are performing concurrent reads. I believe the GCSFiles serialized in dask tasks then each instantiate a separate GCSFileSystem, resolve credentials and open a session.
If this is the case it would be preferable to store a fixed set of AuthorizedSession handles, ideally via a cache on the GCSFileSystem class, and dispatch to an auth-method-specific session in the GCSFileSystem._connect_* connection functions.
As a more specific solution, google.auth.exceptions.RefreshError or its base class could be added to the retrying exception list in _call; however, this may mask legitimate authentication errors. The credentials should probably be "tested" via some call that does not retry this error during session initialization. This may be as simple as calling session.credentials.refresh or performing a single authenticated request after session initialization.
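A minimal sketch of the session-caching idea (not actual gcsfs code; the cache key and factory are assumptions, and a real implementation would need a lock around creation since the whole point is concurrent access):

```python
_session_cache = {}


def get_session(project, access, token, make_session):
    """Return one shared authenticated session per (project, access,
    token) triple, so deserialized GCSFile instances reuse credentials
    instead of each triggering a fresh resolution against the
    metadata service."""
    key = (project, access, token)
    if key not in _session_cache:
        # NOTE: needs a threading.Lock here to be safe under
        # concurrent first access
        _session_cache[key] = make_session()
    return _session_cache[key]
```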
It would be valuable to support the standard text I/O interface in gcsfs for use as a standalone, file-like library. The current behavior requires binary I/O, which is not directly compatible with some standard tools (e.g. json).
For example:
import json
import gcsfs
fs = gcsfs.GCSFileSystem(...)
an_obj = {"foo" : "bar"}
with fs.open("bucket/path", "w") as of:
    json.dump(an_obj, of)
raises NotImplementedError.
While it's possible to work around this via io.TextIOWrapper, it would be best to expose a GCSFile in text mode in order to provide support for gcsfs-specific API extensions on the file object.
import io
import json
import gcsfs
fs = gcsfs.GCSFileSystem(...)
an_obj = {"foo" : "bar"}
with fs.open("bucket/path", "wb") as of:
    json.dump(an_obj, io.TextIOWrapper(of))
Some operations do not seem to show updated file information on the next query - apparently out-of-date information is being incorrectly kept around.
Feature request (or current workaround): a dict parameter at file creation (open in w mode).
I am trying to use gcsfs on a new system. I am invoking it as follows
import gcsfs
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
I get the following error, which I really don't understand or know how to recover from. Any advice would be appreciated. This is on a remote server where I have no gcloud utilities installed. The Python version is 3.5...don't know if that matters.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-1-90388e4b9890> in <module>()
1 import gcsfs
----> 2 fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in __init__(self, project, access, token, block_size)
162 self.access = access
163 self.dirs = {}
--> 164 self.connect()
165 self._singleton[0] = self
166
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in connect(self, refresh)
229 params={'client_id': not_secret['client_id'],
230 'scope': scope})
--> 231 validate_response(r, path)
232 data = json.loads(r.content.decode())
233 print('Enter the following code when prompted in the browser:')
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in validate_response(r, path)
108 raise ValueError("Bad Request: %s\n%s" % (path, msg))
109 else:
--> 110 raise RuntimeError(m)
111
112
RuntimeError: b'{\n "error" : "deleted_client",\n "error_description" : "The OAuth client was deleted."\n}'
Suggestion/request to have a glob option to traverse the full tree, e.g.
GCSFS.glob('bucket/**/*.[gz|zip]', recursive=True)
to list all matching files regardless of where they are located. This mimics glob in Python 3.6.
The workaround I can think of is to use walk() or ls() and then a regex.
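A sketch of that workaround, filtering a flat walk()/ls() listing with fnmatch (note that `[gz|zip]` is a character class in fnmatch terms, so matching two extensions needs two patterns; the function name is hypothetical):

```python
from fnmatch import fnmatch


def recursive_glob(all_paths, patterns=("*.gz", "*.zip")):
    """Filter a flat listing (e.g. from GCSFileSystem.walk) against
    basename patterns, matching at any depth in the tree."""
    return [p for p in all_paths
            if any(fnmatch(p.rsplit('/', 1)[-1], pat) for pat in patterns)]
```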
Opening as a tracking issue for follow-up discussed #58. I will be able to follow-up on this issue the week of 2018-01-28. From that thread:
[Testing #58] on a GCE instance I use as a dev box. On this machine I have the compute sdk installed and I'm using the service account for application default credentials:
fordas@salish:~/distributed_dev/gcsfs$ gcloud auth list
Credentialed Accounts
ACTIVE ACCOUNT
* [email protected]
In this context oauth2client.client.GoogleCredentials.get_application_default returns a token component that manages access to the service credentials and does not include a refresh_token or client_secret. It instead provides access tokens via the get_access_token interface, which handles token expiration and refresh. It may also be used to retrieve auth headers for requests.
This breaks the current master implementation, which expects a refresh token to be available. It will also break this implementation.
After setting application default credentials to a user account via gcloud auth application-default login, oauth2client.client.GoogleCredentials.get_application_default returns a GoogleCredentials object exposing refresh_token et al. This credentials object is capable of handling access token and auth header generation.
Given that behavior, my current recommendation would be to hold the credential object returned by get_application_default and use the interface credential.get_access_token().access_token to manage access. gcsfs would then not be responsible for auth token management and refresh. It would also be desirable to wrap the device credentials generated via the browser auth mode in the same oauth handler and remove the token management logic currently implemented in gcsfs.
In a broader sense, I would suggest emphasizing use of the default app credentials instead of obtaining and directly caching a credential object for gcsfs. In "off cloud" contexts, the user can then manage authorization via the gcloud auth application-default command, which provides login and revoke. "On cloud", this then provides easy access to the default service account, with the option for a user to override this setting via gcloud auth login.
I have a bucket with several directories, each of which has thousands of files. When querying one of the directories, which has 3600 files, a call to fs.ls("bucket/my-dir") takes several minutes - I actually don't know how long it takes, since I've never let it finish. In contrast, the google client takes 3.4 seconds to return the directory listing: [f for f in bucket.list_blobs(prefix='my-dir')] (the comprehension is to force iteration over the iterator object they return).
This can be nice when the other side is relatively fast:
In [1]: import requests
In [2]: %timeit requests.get('https://en.wikipedia.org')
10 loops, best of 3: 116 ms per loop
In [3]: s = requests.Session()
In [4]: %timeit s.get('https://en.wikipedia.org')
10 loops, best of 3: 48.5 ms per loop
It's not clear yet how effective this will be in practice.
I'm seeing intermittent errors using gcsfs==0.3.0 in a dask.distributed cluster that, I believe, are due to an error in credential resolution. Briefly, when writing a large number of partitions to gcs via the dask.bag.to_textfiles interface, a number of work partitions failed with the log message:
Enter the following code when prompted in the browser:
....
Traceback (most recent call last):
File "/home/fordas/.conda/envs/dev/lib/python3.5/site-packages/distributed/protocol/pickle.py", line 59, in loads
return pickle.loads(x)
File "/home/fordas/.conda/envs/dev/lib/python3.5/site-packages/gcsfs/core.py", line 597, in __setstate__
self.connect()
File "/home/fordas/.conda/envs/dev/lib/python3.5/site-packages/gcsfs/core.py", line 249, in connect
raise RuntimeError("Waited too long for browser"
RuntimeError: Waited too long for browserauthentication.
However, a number of the work partitions completed successfully. This occurred while using multithreaded workers, so I assume that some race condition in token retrieval triggered the browser-based authentication flow. Modifying the filesystem interface in dask.bytes.core
appeared to resolve the issue:
import functools
import dask.bytes
import gcsfs
# Re-register the GCS filesystem so dask always passes token="cloud",
# resolving credentials from instance metadata instead of the oauth
# device flow. Registered under both URI schemes.
org_fs = dask.bytes.core._filesystems['gcs']
wrapped_fs = functools.partial(org_fs, token="cloud")
dask.bytes.core._filesystems['gs'] = wrapped_fs
dask.bytes.core._filesystems['gcs'] = wrapped_fs
Credential resolution in the default case (i.e. GCSFileSystem(token=None)) should resolve credentials from the default service account, rather than falling back to the oauth device flow, if the internal metadata service is available. I'm not familiar with the oauth2client.client.GoogleCredentials.get_application_default() interface currently used in master, but if that interface queries instance metadata by default then this is a "doc bug" and the behavior should be clarified in the gcsfs interface. The oauth2client.GoogleCredentials.get_application_default interface appears to provide a token for the GCE service account when invoked on a GCE instance, but does not provide the "refresh_token" expected by gcsfs's token-refresh logic. This poisons gcsfs's token cache and blocks authorization if token=None mode is used from a GCE instance.
Pending the behavior clarification above, it seems strange to me that dask would invoke gcsfs in a mode that could require interactive authorization. I believe gcsfs should support a headless authorization mode which resolves credentials from (a) the ~/.gcs_tokens token cache, (b) the gcloud application default credentials, and (c) instance metadata, but provides an informative error if interactive oauth would be required, and then register this mode (rather than the current token=None mode) with dask. This would allow users to authenticate manually via the current token=None mode, then use the cached token for all further access.
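The proposed resolution order can be sketched as a simple chain. The names here (resolve_headless, HeadlessAuthError) are hypothetical, not gcsfs API; each resolver is a stand-in callable that returns a credentials object or None, and the chain never opens a browser.

```python
class HeadlessAuthError(RuntimeError):
    """Raised when no non-interactive credential source is available."""


def resolve_headless(resolvers):
    """Try each non-interactive source in order; fail with a clear message."""
    for name, resolver in resolvers:
        creds = resolver()
        if creds is not None:
            return name, creds
    raise HeadlessAuthError(
        "No cached token, application-default credentials, or instance "
        "metadata available; run interactive auth once (token=None) first.")


# Illustration with fake resolvers: the token cache misses, gcloud
# application default credentials hit, metadata is never consulted.
resolvers = [
    ("cache", lambda: None),                 # ~/.gcs_tokens
    ("gcloud", lambda: {"token": "adc"}),    # application default credentials
    ("metadata", lambda: {"token": "gce"}),  # GCE metadata service
]
```

Here resolve_headless(resolvers) returns ("gcloud", {"token": "adc"}); with an empty or all-miss chain it raises HeadlessAuthError instead of blocking a worker on browser input.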
See dask/dask#2207 for dealing with complex URIs. Given that there is no current way to include user/path in a GCS URI, this may just be a no-op, but should be considered.
I regularly access Google cloud services with two different google accounts. I typically switch between them by using gcloud auth login
and then selecting the different account. However, even after doing so the gcsfs
library is finding connection information for my other account. What logic does gcsfs use? I'm interested in tracking down from where it's getting its tokens.
Perhaps by default or with a recursive= keyword
Currently mkdir on a subdirectory fails with a non-informative error. It would be nice if this either worked (if that even makes sense) or else raised an informative "actually you don't need to do this" error.