Pythonic file-system for Google Cloud Storage
For documentation, go to readthedocs.
Pythonic file-system interface for Google Cloud Storage
Home Page: http://gcsfs.readthedocs.io/en/latest/
License: BSD 3-Clause "New" or "Revised" License
We should release soon, but I think I would like to include batch-delete first (#69). Any other critical things that should be done?
I have a file of 30M lines and know that each line is at most 2000 bytes long.
If I want to iterate through the "file" (blob), I can limit waste by specifying the number of bytes to fetch, but issuing 30M such requests would incur huge latency.
I wonder whether it would not be better - through an option - to read and cache a full block and serve the lines from there.
This can be done outside of GCSFS, but I thought it might be worth embedding.
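One way to get this behavior outside of GCSFS is to wrap the file object in a large buffered reader, so that line iteration is served from an in-memory block and only occasional large ranged reads hit the store. A minimal sketch (the helper name and default buffer size are mine, not GCSFS API):

```python
import io

def iter_lines(raw, buffer_size=16 * 2**20):
    """Yield lines from a binary file-like object, reading it in large
    buffered chunks rather than issuing one small ranged read per line."""
    buffered = io.BufferedReader(raw, buffer_size=buffer_size)
    for line in buffered:
        yield line
```

With GCSFS this would wrap something like `fs.open('bucket/key', 'rb')`; note that GCSFileSystem also takes a block_size argument, which may already give much of this caching effect.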
In [1]: import gcsfs
In [2]: gcsfs.GCSFileSystem()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-d964e936fddf> in <module>()
----> 1 gcsfs.GCSFileSystem()
/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/gcsfs/core.py in __init__(self, project, access, token, block_size)
158 self.access = access
159 self.dirs = {}
--> 160 self.connect()
161 self._singleton[0] = self
162
/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/gcsfs/core.py in connect(self, refresh)
267 'refresh_token': data['refresh_token'],
268 'grant_type': "refresh_token"})
--> 269 validate_response(r, path)
270 data['timestamp'] = time.time()
271 data['access_token'] = r.json()['access_token']
/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/gcsfs/core.py in validate_response(r, path)
104 raise ValueError("Bad Request: %s" % path)
105 else:
--> 106 raise RuntimeError(msg)
107
108
RuntimeError: b'{\n "error": "deleted_client",\n "error_description": "The OAuth client was deleted."\n}\n'
Looking around on the internet, people seem to suggest that I enable OAuth on my account. This seems odd though because this didn't seem to be necessary before. Any thoughts on what might be going on?
GCS provides size and md5 hash information on written files. At the minimum, the former should be checked against the expected upload size, and optionally the md5 also for a stronger guarantee.
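For illustration, GCS reports the MD5 as a base64-encoded digest of the object bytes, so the verification step could look roughly like this (the helper name is hypothetical, not gcsfs API):

```python
import base64
import hashlib

def check_upload(data, expected_size, expected_md5_b64=None):
    """Verify written bytes against the size and (optionally) the
    base64-encoded MD5 digest that GCS reports for the object."""
    if len(data) != expected_size:
        return False
    if expected_md5_b64 is not None:
        digest = base64.b64encode(hashlib.md5(data).digest()).decode()
        if digest != expected_md5_b64:
            return False
    return True
```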
I'm trying to push a large-ish dataset to GCS via xarray/dask/zarr/gcsfs. Things are generally working during setup and for the first part of the upload. However, after a bit, I'm getting a ConnectionError
that is not recoverable. I'm pushing from a server at the University of Washington to a bucket at "US-CENTRAL1". I would imagine the network at UW is pretty stable.
xarray: jhamman:fix/zarr_set_attrs
dask: 0.16.0
zarr: master
gcsfs: master
Details of the full traceback are in this gist: https://gist.github.com/jhamman/25ddda993ad5b768e4b8289904be6779
Google returns ValueError: Only 'authorized_user' tokens accepted, got: service_account. Is this a limitation, or can it be extended?
Note: @martindurant this is yet another great lib!! Google only implemented cloudstorage
for their GAE Standard. File-like objects are perfect for minimizing memory usage and skipping the need for disk.
I would like to set up GCSFuse in a Docker container pointing to a read-only public bucket. This container will not have any Google authentication by default. Is it possible to create a GCSFileSystem without associating it with a Google account and without using token='cloud'?
In [5]: import pandas as pd
...: import dask.dataframe as dd
...: dd.from_pandas(pd.DataFrame({'x': [1]}), npartitions=1).to_csv(f'gs://{bucket}/test.csv')
...:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/miniconda3/envs/model/lib/python3.6/site-packages/gcsfs/core.py in flush(self, force)
905 if self.location is not None:
--> 906 self._upload_chunk(final=force)
907 if force:
~/miniconda3/envs/model/lib/python3.6/site-packages/gcsfs/core.py in _upload_chunk(self, final)
930 headers=head, data=data)
--> 931 validate_response(r, self.location)
932 if 'Range' in r.headers:
~/miniconda3/envs/model/lib/python3.6/site-packages/gcsfs/core.py in validate_response(r, path)
131 else:
--> 132 raise RuntimeError(m)
133
RuntimeError: b"Invalid request. The number of bytes uploaded is required to be equal or greater than 262144, except for the final request (it's recommended to be the exact multiple of 262144). The received request contained 7 bytes, which does not meet this requirement."
The same behavior happens with a simple string write:
In [8]: with gcsfs.GCSFileSystem(project=project).open(f'{bucket}/test.txt', 'w') as f:
...: f.write('test')
...
RuntimeError: b"Invalid request. The number of bytes uploaded is required to be equal or greater than 262144, except for the final request (it's recommended to be the exact multiple of 262144). The received request contained 4 bytes, which does not meet this requirement."
But weirdly enough, changing the write type to binary resolves this:
In [9]: with gcsfs.GCSFileSystem(project='model-streetlight-toronto').open('persona-upload/test.txt', 'wb') as f:
...: f.write(b'test')
...:
In [10]: # OK
I'm trying to remember off the top of my head if this is new behavior but I'm not sure.
gcsfs==0.0.4
google-cloud-storage==1.7.0
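For context, the error message comes from the GCS resumable-upload protocol: every chunk except the final one must be a non-zero multiple of 256 KiB (262144 bytes); only the final request may be smaller. A small sketch of that rule (the function name is mine, not part of any library):

```python
CHUNK_QUANTUM = 262144  # 256 KiB: the unit GCS requires for non-final chunks

def valid_chunks(sizes):
    """Check a sequence of upload-chunk sizes against the GCS resumable
    upload rule: every chunk except the last must be a non-zero multiple
    of 256 KiB; the final chunk may be any size."""
    *body, final = sizes
    return all(s >= CHUNK_QUANTUM and s % CHUNK_QUANTUM == 0 for s in body)
```

This is presumably why the text-mode write failed: the buffer was flushed as a non-final chunk of only a few bytes.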
I'm trying to read a publicly readable bucket using the GCSMap. My account doesn't have much in the way of permissions, but the bucket is publicly readable.
fs = gcsfs.GCSFileSystem(token='cloud')
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newmann-met-ensemble', gcs=fs)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-1c1ffef3b0ee> in <module>()
1 fs = gcsfs.GCSFileSystem(token='cloud')
----> 2 gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newmann-met-ensemble', gcs=fs)
/opt/conda/lib/python3.6/site-packages/gcsfs/mapping.py in __init__(self, root, gcs, check, create)
41 if create:
42 self.gcs.mkdir(bucket)
---> 43 elif not self.gcs.exists(bucket):
44 raise ValueError("Bucket %s does not exist."
45 " Create bucket with the ``create=True`` keyword" %
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in exists(self, path)
457 return bool(self.info(path))
458 else:
--> 459 return bucket in self.ls('')
460 except FileNotFoundError:
461 return False
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in ls(self, path, detail)
374 def ls(self, path, detail=False):
375 if path in ['', '/']:
--> 376 out = self._list_buckets()
377 else:
378 bucket, prefix = split_path(path)
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in _list_buckets(self)
308 def _list_buckets(self):
309 if '' not in self.dirs:
--> 310 out = self._call('get', 'b/', project=self.project)
311 dirs = out.get('items', [])
312 self.dirs[''] = dirs
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
303 except ValueError:
304 out = r.content
--> 305 validate_response(r, path)
306 return out
307
/opt/conda/lib/python3.6/site-packages/gcsfs/core.py in validate_response(r, path)
107 raise IOError("Forbidden: %s\n%s" % (path, msg))
108 elif "invalid" in m:
--> 109 raise ValueError("Bad Request: %s\n%s" % (path, msg))
110 else:
111 raise RuntimeError(m)
ValueError: Bad Request: b/
Invalid argument
@asford https://github.com/dask/gcsfs/blob/master/gcsfs/core.py#L559 - shouldn't this be moved to the line above, i.e. "items": [self._process_object(bucket, i) for i in items]? As it is, the items are being processed and then ignored.
I am trying to use the latest gcsfs master on pangeo.pydata.org (see pangeo-data/pangeo#112 for context).
From within the notebook, I do
! pip install --upgrade --user git+https://github.com/dask/gcsfs.git
When I try import gcsfs, I get the following traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-12-3f25f74e3f1b> in <module>()
----> 1 import gcsfs
~/.local/lib/python3.6/site-packages/gcsfs/__init__.py in <module>()
1 from __future__ import absolute_import
2
----> 3 from .core import GCSFileSystem
4 from .dask_link import register as register_dask
5 from .mapping import GCSMap
~/.local/lib/python3.6/site-packages/gcsfs/core.py in <module>()
985
986
--> 987 GCSFileSystem.load_tokens()
988
989
~/.local/lib/python3.6/site-packages/gcsfs/core.py in load_tokens()
307 tokens = {k: (GCSFileSystem._dict_to_credentials(v)
308 if isinstance(v, dict) else v)
--> 309 for k, v in tokens.items()}
310 except IOError:
311 tokens = {}
~/.local/lib/python3.6/site-packages/gcsfs/core.py in <dictcomp>(.0)
307 tokens = {k: (GCSFileSystem._dict_to_credentials(v)
308 if isinstance(v, dict) else v)
--> 309 for k, v in tokens.items()}
310 except IOError:
311 tokens = {}
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _dict_to_credentials(token)
335 """
336 return Credentials(
--> 337 None, refresh_token=token['refresh_token'],
338 client_secret=token['client_secret'],
339 client_id=token['client_id'],
KeyError: 'refresh_token'
If gcsfs is to be used in the context of dask, it should not pass around the authorization token when pickled. Instead, we should require the user to have distributed a token file to each machine beforehand or otherwise set up permissions on each node, as we do for s3fs.
After overcoming #82, I am on to a new error.
I set my GOOGLE_APPLICATION_CREDENTIALS to point to an appropriate .json file. I verified that the config is working with the following code:
In [10]: from google.cloud import storage
In [11]: storage_client = storage.Client()
In [12]: list(storage_client.list_buckets())
Out[12]:
[<Bucket: pangeo>,
<Bucket: pangeo-data>,
<Bucket: pangeo-data-private>,
<Bucket: zarr_store_test>]
Now I am trying to do the same thing from gcsfs. I get an error: invalid_scope: Empty or missing scope not allowed.
In [8]: fs = gcsfs.GCSFileSystem()
In [9]: fs.buckets
DEBUG:gcsfs.core:_list_buckets(args=(), kwargs={})
DEBUG:gcsfs.core:_call(args=('get', 'b/'), kwargs={'project': 'pangeo-181919'})
ERROR:gcsfs.core:_call exception: ('invalid_scope: Empty or missing scope not allowed.', '{\n "error" : "invalid_scope",\n "error_description" : "Empty or missing scope not allowed."\n}')
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py", line 431, in _call
r = meth(self.base + path, params=kwargs, json=json)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py", line 521, in get
return self.request('GET', url, **kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py", line 198, in request
self._auth_request, method, url, request_headers)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/credentials.py", line 121, in before_request
self.refresh(request)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/service_account.py", line 322, in refresh
request, self._token_uri, assertion)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py", line 145, in jwt_grant
response_data = _token_endpoint_request(request, token_uri, body)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py", line 111, in _token_endpoint_request
_handle_error_response(response_body)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py", line 61, in _handle_error_response
error_details, response_body)
google.auth.exceptions.RefreshError: ('invalid_scope: Empty or missing scope not allowed.', '{\n "error" : "invalid_scope",\n "error_description" : "Empty or missing scope not allowed."\n}')
(the same "invalid_scope" ERROR and traceback are logged three more times, once per retry)
---------------------------------------------------------------------------
RefreshError Traceback (most recent call last)
<ipython-input-9-bb76ad6e7216> in <module>()
----> 1 fs.buckets
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in buckets(self)
449 def buckets(self):
450 """Return list of available project buckets."""
--> 451 return [b["name"] for b in self._list_buckets()["items"]]
452
453 @classmethod
<decorator-gen-122> in _list_buckets(self)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _list_buckets(self)
568 items = []
569 page = self._call(
--> 570 'get', 'b/', project=self.project
571 )
572
<decorator-gen-117> in _call(self, method, path, *args, **kwargs)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
435 logger.exception("_call exception: %s", e)
436 if retry == self.retries - 1:
--> 437 raise e
438 if is_retriable(e):
439 # retry
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
429 try:
430 time.sleep(2**retry - 1)
--> 431 r = meth(self.base + path, params=kwargs, json=json)
432 validate_response(r, path)
433 break
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in get(self, url, **kwargs)
519
520 kwargs.setdefault('allow_redirects', True)
--> 521 return self.request('GET', url, **kwargs)
522
523 def options(self, url, **kwargs):
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py in request(self, method, url, data, headers, **kwargs)
196
197 self.credentials.before_request(
--> 198 self._auth_request, method, url, request_headers)
199
200 response = super(AuthorizedSession, self).request(
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/credentials.py in before_request(self, request, method, url, headers)
119 # the http request.)
120 if not self.valid:
--> 121 self.refresh(request)
122 self.apply(headers)
123
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/service_account.py in refresh(self, request)
320 assertion = self._make_authorization_grant_assertion()
321 access_token, expiry, _ = _client.jwt_grant(
--> 322 request, self._token_uri, assertion)
323 self.token = access_token
324 self.expiry = expiry
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py in jwt_grant(request, token_uri, assertion)
143 }
144
--> 145 response_data = _token_endpoint_request(request, token_uri, body)
146
147 try:
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py in _token_endpoint_request(request, token_uri, body)
109
110 if response.status != http_client.OK:
--> 111 _handle_error_response(response_body)
112
113 response_data = json.loads(response_body)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/oauth2/_client.py in _handle_error_response(response_body)
59
60 raise exceptions.RefreshError(
---> 61 error_details, response_body)
62
63
RefreshError: ('invalid_scope: Empty or missing scope not allowed.', '{\n "error" : "invalid_scope",\n "error_description" : "Empty or missing scope not allowed."\n}')
I am using the latest gcsfs master, installed from pip + git.
It'd be nice to be able to use this library with gcloud's default credentials, instead of having to juggle token files.
Although there is no batch delete as such, GCS does have the concept of a batch request: https://cloud.google.com/storage/docs/json_api/v1/how-tos/batch
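Following the format described in that how-to, a batch delete is a single POST of a multipart/mixed body to the batch endpoint, whose parts are individual DELETE sub-requests. A rough sketch of building such a body (the helper name and boundary string are arbitrary; endpoint and part layout per the linked docs):

```python
from urllib.parse import quote

def batch_delete_body(bucket, keys, boundary="batch_gcsfs_1"):
    """Build a multipart/mixed body for the GCS JSON-API batch endpoint
    (POST https://storage.googleapis.com/batch/storage/v1), containing
    one DELETE sub-request per object key."""
    parts = []
    for i, key in enumerate(keys, start=1):
        parts.append(
            "--{}\r\n"
            "Content-Type: application/http\r\n"
            "Content-ID: <{}>\r\n\r\n"
            "DELETE /storage/v1/b/{}/o/{} HTTP/1.1\r\n\r\n".format(
                boundary, i, bucket, quote(key, safe="")))
    return "".join(parts) + "--{}--".format(boundary)
```

The POST itself would carry a Content-Type: multipart/mixed; boundary=... header matching the boundary used in the body.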
I am trying to use gcsfs via distributed in pangeo-data/pangeo#150. I have uncovered what seems like a serialization bug.
This works from my notebook (the token appears to be cached):
fs = gcsfs.GCSFileSystem(project='pangeo-181919')
fs.buckets
It returns the four buckets: ['pangeo', 'pangeo-data', 'pangeo-data-private', 'zarr_store_test'].
Now I create a distributed cluster and client and use it to run the same command:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(processes=False)
client = Client(cluster)
client.run(lambda : fs.buckets)
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-25-3de328a517c7> in <module>()
----> 1 client.run(lambda : fs.buckets)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/client.py in run(self, function, *args, **kwargs)
1906 '192.168.0.101:9000': 'running}
1907 """
-> 1908 return self.sync(self._run, function, *args, **kwargs)
1909
1910 @gen.coroutine
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
601 return future
602 else:
--> 603 return sync(self.loop, func, *args, **kwargs)
604
605 def __repr__(self):
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
251 e.wait(10)
252 if error[0]:
--> 253 six.reraise(*error[0])
254 else:
255 return result[0]
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/utils.py in f()
235 yield gen.moment
236 thread_state.asynchronous = True
--> 237 result[0] = yield make_coro()
238 except Exception as exc:
239 logger.exception(exc)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/util.py in raise_exc_info(exc_info)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py in run(self)
1067 exc_info = None
1068 else:
-> 1069 yielded = self.gen.send(value)
1070
1071 if stack_context._state.contexts is not orig_stack_contexts:
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/client.py in _run(self, function, *args, **kwargs)
1860 results[key] = resp['result']
1861 elif resp['status'] == 'error':
-> 1862 six.reraise(*clean_exception(**resp))
1863 raise gen.Return(results)
1864
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/six.py in reraise(tp, value, tb)
690 value = tp()
691 if value.__traceback__ is not tb:
--> 692 raise value.with_traceback(tb)
693 raise value
694 finally:
<ipython-input-25-3de328a517c7> in <lambda>()
----> 1 client.run(lambda : fs.buckets)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in buckets()
449 def buckets(self):
450 """Return list of available project buckets."""
--> 451 return [b["name"] for b in self._list_buckets()["items"]]
452
453 @classmethod
<decorator-gen-128> in _list_buckets()
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _tracemethod()
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _list_buckets()
568 items = []
569 page = self._call(
--> 570 'get', 'b/', project=self.project
571 )
572
<decorator-gen-123> in _call()
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _tracemethod()
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _call()
429 try:
430 time.sleep(2**retry - 1)
--> 431 r = meth(self.base + path, params=kwargs, json=json)
432 validate_response(r, path)
433 break
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/requests/sessions.py in get()
519
520 kwargs.setdefault('allow_redirects', True)
--> 521 return self.request('GET', url, **kwargs)
522
523 def options(self, url, **kwargs):
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/google/auth/transport/requests.py in request()
195 request_headers = headers.copy() if headers is not None else {}
196
--> 197 self.credentials.before_request(
198 self._auth_request, method, url, request_headers)
199
AttributeError: 'AuthorizedSession' object has no attribute 'credentials'
I am trying to use gcsfs to access GCS from the NASA Pleiades supercomputer.
I tried to initialize a gcsfs mapping before having my credentials set up properly (although I thought my default credentials were valid).
I was able to get past this error by setting the GOOGLE_APPLICATION_CREDENTIALS
environment variable as described in the Google Cloud Storage docs.
However, the info I got from gcsfs was not very helpful.
In [1]: import logging
...: logging.basicConfig(
...: format="%(created)0.3f %(levelname)s %(name)s %(message)s",
...: level=logging.INFO)
...: logging.getLogger("gcsfs.core").setLevel(logging.DEBUG)
...: logging.getLogger("gcsfs.gcsfs").setLevel(logging.DEBUG)
In [2]: import gcsfs
In [3]: fs = gcsfs.GCSFileSystem(project='pangeo-181919')
1519678908.044 INFO google.auth.compute_engine._metadata Compute Engine Metadata server unavailable.
1519678908.044 DEBUG gcsfs.core Connection with method "google_default" failed
In [4]: gcsmap = gcsfs.mapping.GCSMap('pangeo-data', gcs=fs)
1519679002.503 DEBUG gcsfs.core exists(args=('pangeo-data',), kwargs={})
1519679002.503 DEBUG gcsfs.core _list_buckets(args=(), kwargs={})
1519679002.503 DEBUG gcsfs.core _call(args=('get', 'b/'), kwargs={'project': 'pangeo-181919'})
1519679002.519 ERROR gcsfs.core _call exception: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py", line 166, in connect
conn = self._new_conn()
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
timeout=timeout
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/util/retry.py", line 388, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py", line 120, in __call__
**kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py", line 508, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 90, in refresh
self._retrieve_info(request)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 72, in _retrieve_info
service_account=self._service_account_email)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 179, in get_service_account_info
recursive=True)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 115, in get
response = request(url=url, method='GET', headers=_METADATA_HEADERS)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py", line 124, in __call__
six.raise_from(new_exc, caught_exc)
File "<string>", line 3, in raise_from
google.auth.exceptions.TransportError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py", line 431, in _call
r = meth(self.base + path, params=kwargs, json=json)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py", line 521, in get
return self.request('GET', url, **kwargs)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py", line 198, in request
self._auth_request, method, url, request_headers)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/credentials.py", line 121, in before_request
self.refresh(request)
File "/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
six.raise_from(new_exc, caught_exc)
File "<string>", line 3, in raise_from
google.auth.exceptions.RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57c18>: Failed to establish a new connection: [Errno -2] Name or service not known',))
1519679004.039 ERROR gcsfs.core _call exception: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca57d30>: Failed to establish a new connection: [Errno -2] Name or service not known',))
1519679007.052 ERROR gcsfs.core _call exception: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcca72f98>: Failed to establish a new connection: [Errno -2] Name or service not known',))
1519679016.526 ERROR gcsfs.core _call exception: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
---------------------------------------------------------------------------
gaierror Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py in _new_conn(self)
140 conn = connection.create_connection(
--> 141 (self.host, self.port), self.timeout, **extra_kw)
142
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
59
---> 60 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
61 af, socktype, proto, canonname, sa = res
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/socket.py in getaddrinfo(host, port, family, type, proto, flags)
744 addrlist = []
--> 745 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
746 af, socktype, proto, canonname, sa = res
gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
NewConnectionError Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
600 body=body, headers=headers,
--> 601 chunked=chunked)
602
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
356 else:
--> 357 conn.request(method, url, **httplib_request_kw)
358
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in request(self, method, url, body, headers, encode_chunked)
1238 """Send a complete request to the server."""
-> 1239 self._send_request(method, url, body, headers, encode_chunked)
1240
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
1284 body = _encode(body, 'body')
-> 1285 self.endheaders(body, encode_chunked=encode_chunked)
1286
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in endheaders(self, message_body, encode_chunked)
1233 raise CannotSendHeader()
-> 1234 self._send_output(message_body, encode_chunked=encode_chunked)
1235
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in _send_output(self, message_body, encode_chunked)
1025 del self._buffer[:]
-> 1026 self.send(msg)
1027
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/http/client.py in send(self, data)
963 if self.auto_open:
--> 964 self.connect()
965 else:
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py in connect(self)
165 def connect(self):
--> 166 conn = self._new_conn()
167 self._prepare_conn(conn)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connection.py in _new_conn(self)
149 raise NewConnectionError(
--> 150 self, "Failed to establish a new connection: %s" % e)
151
NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
MaxRetryError Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
439 retries=self.max_retries,
--> 440 timeout=timeout
441 )
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
638 retries = retries.increment(method, url, error=e, _pool=self,
--> 639 _stacktrace=sys.exc_info()[2])
640 retries.sleep()
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
387 if new_retry.is_exhausted():
--> 388 raise MaxRetryError(_pool, url, error or ResponseError(cause))
389
MaxRetryError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py in __call__(self, url, method, body, headers, timeout, **kwargs)
119 method, url, data=body, headers=headers, timeout=timeout,
--> 120 **kwargs)
121 return _Response(response)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
507 send_kwargs.update(settings)
--> 508 resp = self.send(prep, **send_kwargs)
509
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
617 # Send the request
--> 618 r = adapter.send(request, **kwargs)
619
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
507
--> 508 raise ConnectionError(e, request=request)
509
ConnectionError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
The above exception was the direct cause of the following exception:
TransportError Traceback (most recent call last)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py in refresh(self, request)
89 try:
---> 90 self._retrieve_info(request)
91 self.token, self.expiry = _metadata.get_service_account_token(
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py in _retrieve_info(self, request)
71 request,
---> 72 service_account=self._service_account_email)
73
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py in get_service_account_info(request, service_account)
178 'instance/service-accounts/{0}/'.format(service_account),
--> 179 recursive=True)
180
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py in get(request, path, root, recursive)
114
--> 115 response = request(url=url, method='GET', headers=_METADATA_HEADERS)
116
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py in __call__(self, url, method, body, headers, timeout, **kwargs)
123 new_exc = exceptions.TransportError(caught_exc)
--> 124 six.raise_from(new_exc, caught_exc)
125
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
TransportError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
The above exception was the direct cause of the following exception:
RefreshError Traceback (most recent call last)
<ipython-input-4-8e538214a4de> in <module>()
----> 1 gcsmap = gcsfs.mapping.GCSMap('pangeo-data', gcs=fs)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/mapping.py in __init__(self, root, gcs, check, create)
41 if create:
42 self.gcs.mkdir(bucket)
---> 43 elif not self.gcs.exists(bucket):
44 raise ValueError("Bucket %s does not exist."
45 " Create bucket with the ``create=True`` keyword" %
<decorator-gen-131> in exists(self, path)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in exists(self, path)
768 return bool(self.info(path))
769 else:
--> 770 if bucket in self.buckets:
771 return True
772 else:
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in buckets(self)
449 def buckets(self):
450 """Return list of available project buckets."""
--> 451 return [b["name"] for b in self._list_buckets()["items"]]
452
453 @classmethod
<decorator-gen-122> in _list_buckets(self)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _list_buckets(self)
568 items = []
569 page = self._call(
--> 570 'get', 'b/', project=self.project
571 )
572
<decorator-gen-117> in _call(self, method, path, *args, **kwargs)
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
435 logger.exception("_call exception: %s", e)
436 if retry == self.retries - 1:
--> 437 raise e
438 if is_retriable(e):
439 # retry
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
429 try:
430 time.sleep(2**retry - 1)
--> 431 r = meth(self.base + path, params=kwargs, json=json)
432 validate_response(r, path)
433 break
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in get(self, url, **kwargs)
519
520 kwargs.setdefault('allow_redirects', True)
--> 521 return self.request('GET', url, **kwargs)
522
523 def options(self, url, **kwargs):
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/transport/requests.py in request(self, method, url, data, headers, **kwargs)
196
197 self.credentials.before_request(
--> 198 self._auth_request, method, url, request_headers)
199
200 response = super(AuthorizedSession, self).request(
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/credentials.py in before_request(self, request, method, url, headers)
119 # the http request.)
120 if not self.valid:
--> 121 self.refresh(request)
122 self.apply(headers)
123
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py in refresh(self, request)
94 except exceptions.TransportError as caught_exc:
95 new_exc = exceptions.RefreshError(caught_exc)
---> 96 six.raise_from(new_exc, caught_exc)
97
98 @property
/nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fffcc963128>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I used gcsfs to read about 800 files and it failed for 3 of them with this error:
Connection broken: error("(104, \'ECONNRESET\')",)
The call stack looks like this:
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 536, in open
return GCSFile(self, path, mode, block_size)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 622, in __init__
self.details = gcsfs.info(path)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 464, in info
files = self.ls(path, True)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 379, in ls
files = self._list_bucket(bucket)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 320, in _list_bucket
out = self._call('get', 'b/{}/o/', bucket, maxResults=max_results, pageToken=next_page_token)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 299, in _call
json=json)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 658, in send
r.content
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 823, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 748, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: error("(104, \'ECONNRESET\')",)', error("(104, 'ECONNRESET')",))
From the outside, it looks like GCS sometimes suffers transient errors, and GCSFS ought to be able to survive those and retry.
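A minimal sketch of what such a retry loop could look like (the helper name, exception choice, and back-off parameters here are illustrative assumptions, not gcsfs API):

```python
import time


def call_with_retries(func, *args, retries=5, base_delay=0.1, **kwargs):
    """Retry func on transient connection errors with exponential back-off."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except ConnectionError:
            if attempt == retries - 1:
                raise  # retries exhausted; surface the real error
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
```

A real implementation would also catch the requests-level exceptions seen above (e.g. `ChunkedEncodingError`) and limit retries to errors known to be transient.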
Original discussion here
Quick summary: when reading a single file from a bucket containing a very large number of files, gcsfs first downloads the list of all files in the bucket, which can make the read very slow. For a bucket with over 4 million files, this can take up to 15 minutes.
Suggested fix (easy): check whether the object path contains a wildcard. If it does not, just download the file directly. If it does, then proceed to download the list of files.
Suggested fix (hard, not sure how to do yet, suggested by @martindurant ): implement prefix/delimiter listing, i.e., make it possible to use a wildcard in the object path (e.g., bag.read_text('gs://mybucket/2017/01/*.csv')) without having to download the list of files for the entire bucket.
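The "easy" fix could be sketched as a pre-check on the path, with the longest literal prefix extracted for a later prefix-based listing (helper names here are hypothetical, not gcsfs API):

```python
import re

# characters that mark a glob pattern
_WILDCARD = re.compile(r"[*?\[\]]")


def needs_listing(path):
    """Return True only if the path contains a glob wildcard; plain
    object paths can be fetched with a single GET instead of listing
    the whole bucket."""
    return bool(_WILDCARD.search(path))


def split_prefix(path):
    """Split a glob path into (literal prefix, glob remainder), so a
    prefix-based listing can narrow the server-side search."""
    m = _WILDCARD.search(path)
    if m is None:
        return path, ""
    return path[:m.start()], path[m.start():]
```

The "hard" fix would then pass the literal prefix as the `prefix` parameter of the objects-list call, per the JSON API docs linked above.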
I am trying to use gcsfs with a GCS service account .json token. I created a token at https://console.cloud.google.com/iam-admin/serviceaccounts/ and assigned it the role of "Storage Admin". This should have permissions to do anything to my GCS resources. I downloaded the .json token.
I use this with gcsfs as follows:
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token='/home/rpa/pangeo-bf62fe06ed97.json')
fs.buckets
I get this error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-14-81fa34b27c7f> in <module>()
1 # connect to gcs
2 fs = gcsfs.GCSFileSystem(project='pangeo-181919', token='/home/rpa/pangeo-bf62fe06ed97.json')
----> 3 fs.buckets
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in buckets(self)
449 def buckets(self):
450 """Return list of available project buckets."""
--> 451 return [b["name"] for b in self._list_buckets()["items"]]
452
453 @classmethod
<decorator-gen-128> in _list_buckets(self)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _list_buckets(self)
568 items = []
569 page = self._call(
--> 570 'get', 'b/', project=self.project
571 )
572
<decorator-gen-123> in _call(self, method, path, *args, **kwargs)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
49 logger.log(logging.DEBUG - 1, tb_io.getvalue())
50
---> 51 return f(self, *args, **kwargs)
52
53
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
430 time.sleep(2**retry - 1)
431 r = meth(self.base + path, params=kwargs, json=json)
--> 432 validate_response(r, path)
433 break
434 except (HtmlError, RequestException, GoogleAuthError) as e:
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in validate_response(r, path)
156 raise FileNotFoundError(path)
157 elif "forbidden" in m:
--> 158 raise IOError("Forbidden: %s\n%s" % (path, msg))
159 elif "invalid" in m:
160 raise ValueError("Bad Request: %s\n%s" % (path, msg))
OSError: Forbidden: b/
[email protected] does not have storage.buckets.list access to project 464800473488.
This doesn't make sense to me. It seems like listing buckets should definitely be within the privileges of the storage admin.
The following causes both terminals to hang in such a way that they cannot be interrupted
$ pip install gcsfs --upgrade
$ mkdir gcs
$ gcsfuse pangeo-data gcs
Mounting bucket pangeo-data to directory gcs
$ ls gcs
When using gcsfs from master it is difficult to tell which commit I'm on. Any thoughts on switching to versioneer for versions? Dask and distributed both do this and could serve as models.
When checking on bucket existence, I get
ValueError: Bad Request: b/
Unknown project id: 0
The issue is actually with _list_buckets() and the get call used to list buckets. The API requires passing the project ID, not the project name - see https://cloud.google.com/storage/docs/json_api/v1/buckets/list and https://cloud.google.com/storage/docs/projects
BTW, that ID is also required to create a bucket, so mkdir() has the same issue.
This was not spotted before if you happened to have project name = ID.
Ancillary issue that would likely be linked to #59
As per the module documentation, oauth2client is officially deprecated in favor of google-auth. As google-auth is an officially supported project with a stable API, any further development of the gcsfs authorization system should target this library.
The current oauth implementation should likely be replaced by a standard implementation using google-auth for token management and the related google-auth-oauthlib.flow for device authorization.
Sorry for logging another one!
This one seems related to the cached credentials.
File "g.py", line 61, in <module>
recs += 1
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 896, in __exit__
self.close()
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 868, in close
self.flush(force=True)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 750, in flush
self._simple_upload()
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 810, in _simple_upload
validate_response(r, path)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 111, in validate_response
raise RuntimeError(m)
RuntimeError: {
"error": {
"errors": [
{
"domain": "global",
"reason": "authError",
"message": "Invalid Credentials",
"locationType": "header",
"location": "Authorization"
}
],
"code": 401,
"message": "Invalid Credentials"
}
}
I wonder if increasing the block size to, say, 50MB would not reduce the likelihood of these errors (overload, 'corrupted' credentials, ...)
Still having problems on pangeo.pydata.org using the latest gcsfs installed this way from my notebook.
! pip install --upgrade --user git+https://github.com/dask/gcsfs.git
From the xarray-data example notebook.
import xarray as xr
import gcsfs
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newman-met-ensemble')
ds = xr.open_zarr(gcsmap)
produces this error
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-4-6411cc50b0ab> in <module>()
6 import gcsfs
7 gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newman-met-ensemble')
----> 8 ds = xr.open_zarr(gcsmap)
9
10 # Print dataset
/opt/conda/lib/python3.6/site-packages/xarray/backends/zarr.py in open_zarr(store, group, synchronizer, auto_chunk, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables)
470 zarr_store = ZarrStore.open_group(store, mode=mode,
471 synchronizer=synchronizer,
--> 472 group=group)
473 ds = maybe_decode_store(zarr_store)
474
/opt/conda/lib/python3.6/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, writer)
276 import zarr
277 zarr_group = zarr.open_group(store=store, mode=mode,
--> 278 synchronizer=synchronizer, path=group)
279 return cls(zarr_group, writer=writer)
280
/opt/conda/lib/python3.6/site-packages/zarr/hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path)
1108
1109 if mode in ['r', 'r+']:
-> 1110 if contains_array(store, path=path):
1111 err_contains_array(path)
1112 elif not contains_group(store, path=path):
/opt/conda/lib/python3.6/site-packages/zarr/storage.py in contains_array(store, path)
70 prefix = _path_to_prefix(path)
71 key = prefix + array_meta_key
---> 72 return key in store
73
74
~/.local/lib/python3.6/site-packages/gcsfs/mapping.py in __contains__(self, key)
86
87 def __contains__(self, key):
---> 88 return self.gcs.exists(self._key_to_str(key))
89
90 def __len__(self):
<decorator-gen-137> in exists(self, path)
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
51 logger.log(logging.DEBUG - 1, tb_io.getvalue())
52
---> 53 return f(self, *args, **kwargs)
54
55 # client created 23-Sept-2017
~/.local/lib/python3.6/site-packages/gcsfs/core.py in exists(self, path)
771 try:
772 if key:
--> 773 return bool(self.info(path))
774 else:
775 if bucket in self.buckets:
<decorator-gen-138> in info(self, path)
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
51 logger.log(logging.DEBUG - 1, tb_io.getvalue())
52
---> 53 return f(self, *args, **kwargs)
54
55 # client created 23-Sept-2017
~/.local/lib/python3.6/site-packages/gcsfs/core.py in info(self, path)
803
804 try:
--> 805 return self._get_object(path)
806 except FileNotFoundError:
807 logger.debug("info FileNotFound at path: %s", path)
<decorator-gen-126> in _get_object(self, path)
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
51 logger.log(logging.DEBUG - 1, tb_io.getvalue())
52
---> 53 return f(self, *args, **kwargs)
54
55 # client created 23-Sept-2017
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _get_object(self, path)
485 raise FileNotFoundError(path)
486
--> 487 result = self._process_object(bucket, self._call('get', 'b/{}/o/{}', bucket, key))
488
489 logger.debug("_get_object result: %s", result)
~/.local/lib/python3.6/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
434 time.sleep(2**retry - 1)
435 r = meth(self.base + path, params=kwargs, json=json)
--> 436 validate_response(r, path)
437 break
438 except (HtmlError, RequestException, GoogleAuthError) as e:
~/.local/lib/python3.6/site-packages/gcsfs/core.py in validate_response(r, path)
154 msg = str(r.content)
155
--> 156 if DEBUG:
157 print(r.url, r.headers, sep='\n')
158 if "Not Found" in m:
NameError: name 'DEBUG' is not defined
In anonymous mode, or when accessing a bucket not included in our project, the bucket will not appear in the buckets list (or, indeed, listing buckets may not be possible). In this case, exists() on a bucket should instead attempt to list the bucket and return True if that succeeds.
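A rough sketch of that fallback, assuming a `_call('get', 'b/{bucket}/o/')` listing endpoint like the one visible in the tracebacks in this thread (the helper name and exact call are hypothetical, not gcsfs's actual implementation):

```python
def bucket_exists(fs, bucket):
    """Check bucket existence with a fallback: if the project's bucket
    list is unavailable (anonymous access, bucket owned by another
    project), try listing the bucket itself and treat success as
    existence."""
    try:
        if bucket in fs.buckets:
            return True
    except Exception:
        pass  # listing project buckets may itself be forbidden
    try:
        fs._call('get', 'b/{}/o/'.format(bucket), maxResults=1)
        return True
    except Exception:
        return False
```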
Is it possible to make a release of the library on PyPI?
My script runs fine remotely - from laptop - but when running on a GCE instance, I get the following error:
File "g.py", line 56, in <module>
recs += 1
File "xxx/gcsfs/core.py", line 896, in __exit__
self.close()
File "xxx/gcsfs/core.py", line 868, in close
self.flush(force=True)
File "xxx/gcsfs/core.py", line 754, in flush
self._initiate_upload()
File "xxx/gcsfs/core.py", line 798, in _initiate_upload
self.location = r.headers['Location']
File "xxx/requests/structures.py", line 54, in __getitem__
return self._store[key.lower()][1]
KeyError: 'location'
Exception ValueError: ValueError('Force flush cannot be called more than once',) in <bound method GCSFile.__del__ of <GCSFile testfile>> ignored
Code snippet is:
...
with fs.open(seg_filename, 'wb') as seg_f:
    recs = 0
    while line.startswith(key):
        seg_f.write(line[line_offset:])
        line = f.readline()
        recs += 1
logger.info('Done writing {}: {:,} records'.format(seg_filename, recs))
Unrelated, but I also get a warning: No module named dask.bytes.core
I have faced several times the same 502 error when writing in chunks. Errors are sporadic and look like:
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 896, in __exit__
self.close()
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 868, in close
self.flush(force=True)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 755, in flush
self._upload_chunk(final=force)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 777, in _upload_chunk
validate_response(r, self.location)
File "/usr/local/lib/python2.7/dist-packages/gcsfs/core.py", line 111, in validate_response
raise RuntimeError(m)
RuntimeError: <!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
<title>Error 502 (Server Error)!!1</title>
<style>
*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
</style>
<a href=//www.google.com/><span id=logo aria-label=Google></span></a>
<p><b>502.</b> <ins>That's an error.</ins>
<p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds. <ins>That's all we know.</ins>
Exception ValueError: ValueError('Force flush cannot be called more than once',) in <bound method GCSFile.__del__ of <GCSFile tempfile.txt>> ignored
Google writes this may happen when the system/network is under stress - our script does indeed read and write several GB of files from GCS in chunks, after on-the-fly transformations.
Look at "Handling Errors" at https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload. They indicate to implement exponential backoff: https://cloud.google.com/storage/docs/exponential-backoff
Is this something that can be implemented in GCSFS, i.e. resubmit a chunk when such a 5xx error occurs?
If not, wrapping the write statement in a try/except, deleting the target file written so far and reprocessing is not an option for me, because the records I write are based off data in a compressed GCSFS file I read. ZipFileExt objects are not seekable, so what is read is gone; I would have to delete the whole batch and restart.
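A sketch of what resubmitting a chunk with exponential back-off might look like (names and parameters are illustrative; a real implementation would also re-query the resumable session to find the last committed byte, as the Google docs describe):

```python
import random
import time


def upload_chunk_with_backoff(do_upload, max_tries=6, base_delay=1.0):
    """Resubmit a chunk on 5xx responses, per the GCS exponential
    back-off guidance. `do_upload` performs one PUT of the current
    chunk and returns a response with a `status_code` attribute."""
    for attempt in range(max_tries):
        r = do_upload()
        if r.status_code < 500:
            return r  # success, or a non-retriable client error
        if attempt == max_tries - 1:
            raise RuntimeError("upload failed after %d attempts" % max_tries)
        # wait 1s, 2s, 4s, ... plus random jitter before retrying
        time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```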
More a question than an issue: I found this while looking for an alternative to s3fs (dask/s3fs) to be used as a zarr backend. We are looking to make our processing less dependent on AWS. Would you say gcsfs is already usable as an s3fs replacement for zarr?
(edit: sorry for the confusion, I typed 's3ql' instead of 's3fs' the first time. Corrected now.)
The implementation of walk in gcsfs returns a flat list of all paths encountered when traversing the tree. In contrast, os.walk from the standard library returns an iterator of (dirname, dirs, files), where dirname is the current directory prefix in the traversal, and files and dirs are the names (relative, not absolute) of all files and directories respectively.
This difference in behavior is confusing for those familiar with the standard os.walk method, and also makes it tricky to provide the interface used by pyarrow (which does match the standard library).
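For reference, the flat listing can be converted into os.walk-style triples purely client-side; a sketch (the function name is hypothetical, not part of either library):

```python
import posixpath
from collections import defaultdict


def walk_triples(paths):
    """Convert a flat list of object paths (as gcsfs.walk returns)
    into (dirname, dirs, files) triples in the style of os.walk."""
    dirs = defaultdict(set)
    files = defaultdict(list)
    for path in paths:
        parent, name = posixpath.split(path)
        files[parent].append(name)
        # register every ancestor directory on the way up
        while parent:
            grand, child = posixpath.split(parent)
            dirs[grand].add(child)
            parent = grand
    for dirname in sorted(set(dirs) | set(files)):
        yield dirname, sorted(dirs[dirname]), sorted(files[dirname])
```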
@quasiben and I conda-installed gcsfs in a fresh environment. We found that we needed dask-core and toolz for things to function properly.
When the bucket in the passed path does not exist, GCSFile fails on close (or when the buffer is flushed) with a generic error like
Exception KeyError: ('location',) in <bound method GCSFile.__del__ of <GCSFile new_bucket/test>> ignored
GCSFile.__init__ should check on bucket existence and either create it (preferred) or explicitly error out, e.g. "Bucket xxx does not exist". The first option is preferred to minimize pre-processing, but requires parameters for storage class & location, although these could be given defaults.
Btw - do not use parentheses in a folder name; you cannot browse it in the Google Console (bug reported).
I am trying to create a new GCSFileSystem with the command
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
This triggers my browser to open and I have to enter the code. But I can never seem to finish it fast enough before I get the following error:
---------------------------------------------------------------------------
ApplicationDefaultCredentialsError Traceback (most recent call last)
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/gcsfs/core.py in connect(self, refresh)
241 try:
--> 242 data = self.get_default_gtoken()
243 self.tokens[(project, access)] = data
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/gcsfs/core.py in get_default_gtoken()
303 def get_default_gtoken():
--> 304 au = oauth2client.client.GoogleCredentials.get_application_default()
305 tok = {"client_id": au.client_id, "client_secret": au.client_secret,
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/oauth2client/client.py in get_application_default()
1270 """
-> 1271 return GoogleCredentials._get_implicit_credentials()
1272
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/oauth2client/client.py in _get_implicit_credentials(cls)
1260 # If no credentials, fail.
-> 1261 raise ApplicationDefaultCredentialsError(ADC_HELP_MSG)
1262
ApplicationDefaultCredentialsError: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-2-f5bb2896043c> in <module>()
----> 1 fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/gcsfs/core.py in __init__(self, project, access, token, block_size, consistency)
178 self.consistency = consistency
179 self.dirs = {}
--> 180 self.connect()
181 self._singleton[0] = self
182
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/gcsfs/core.py in connect(self, refresh)
267 if 'error' in data2:
268 if i == self.retries - 1:
--> 269 raise RuntimeError("Waited too long for browser"
270 "authentication.")
271 continue
RuntimeError: Waited too long for browserauthentication.
gcsfs propagates a google.auth.exceptions.RefreshError when executing many concurrent requests from a single node using the google_default credentials class. This is likely due to a repeated, excessive number of requests to the internal metadata service. This is a known bug of the external library at googleapis/google-auth-library-python#211.
Anecdotally, I've primarily observed this in dask.distributed workers and believe it might occur due to the way GCSFiles are distributed. This primarily occurs when a large number of small files are being read from storage and many worker threads are performing concurrent reads. I believe the GCSFiles serialized in dask tasks then each instantiate a separate GCSFileSystem, resolve credentials and open a session.
If this is the case it would be preferable to store a fixed set of AuthorizedSession handles, ideally via a cache on the GCSFileSystem class, and dispatch to an auth-method-specific session in the GCSFileSystem._connect_* connection functions.
As a more specific solution, google.auth.exceptions.RefreshError or its base class could be added to the retrying exception list in _call; however, this may mask legitimate authentication errors. The credentials should probably be "tested" via some call that does not retry this error during session initialization. This may be as simple as calling session.credentials.refresh or performing a single authenticated request after session initialization.
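A minimal sketch of the session-caching idea (not actual gcsfs code; the cache key and factory are assumptions, and a real implementation would need a lock around creation since the whole point is concurrent access):

```python
_session_cache = {}


def get_session(project, access, token, make_session):
    """Return one shared authenticated session per (project, access,
    token) triple, so deserialized GCSFile instances reuse credentials
    instead of each triggering a fresh resolution against the
    metadata service."""
    key = (project, access, token)
    if key not in _session_cache:
        # NOTE: needs a threading.Lock here to be safe under
        # concurrent first access
        _session_cache[key] = make_session()
    return _session_cache[key]
```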
It would be valuable to support the standard text I/O interface in gcsfs for use as a standalone, file-like library. The current behavior requires binary I/O, which is not directly compatible with some standard tools (e.g. json).
For example:
import json
import gcsfs
fs = gcsfs.GCSFileSystem(...)
an_obj = {"foo" : "bar"}
with fs.open("bucket/path", "w") as of:
    json.dump(an_obj, of)
raises NotImplementedError.
While it's possible to work around this via io.TextIOWrapper, it would be best to expose a GCSFile in text mode in order to provide support for gcsfs-specific API extensions on the file object.
import io
import json
import gcsfs
fs = gcsfs.GCSFileSystem(...)
an_obj = {"foo" : "bar"}
with fs.open("bucket/path", "wb") as of:
    json.dump(an_obj, io.TextIOWrapper(of))
Some operations do not seem to show updated file information on the next query - apparently out-of-date information is being incorrectly kept around.
Feature request (or current workaround): a dict parameter at file creation (open in w mode).
I am trying to use gcsfs on a new system. I am invoking it as follows
import gcsfs
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
I get the following error, which I really don't understand or know how to recover from. Any advice would be appreciated. This is on a remote server where I have no gcloud utilities installed. The Python version is 3.5...don't know if that matters.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-1-90388e4b9890> in <module>()
1 import gcsfs
----> 2 fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in __init__(self, project, access, token, block_size)
162 self.access = access
163 self.dirs = {}
--> 164 self.connect()
165 self._singleton[0] = self
166
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in connect(self, refresh)
229 params={'client_id': not_secret['client_id'],
230 'scope': scope})
--> 231 validate_response(r, path)
232 data = json.loads(r.content.decode())
233 print('Enter the following code when prompted in the browser:')
/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/gcsfs/core.py in validate_response(r, path)
108 raise ValueError("Bad Request: %s\n%s" % (path, msg))
109 else:
--> 110 raise RuntimeError(m)
111
112
RuntimeError: b'{\n "error" : "deleted_client",\n "error_description" : "The OAuth client was deleted."\n}'
Suggestion/request to have a glob option to traverse the full tree, e.g.
GCSFS.glob('bucket/**/*.[gz|zip]', recursive=True)
to list all matching files regardless of where they are located. This mimics glob in Python 3.6.
The workaround I can think of is to use walk() or ls() and then a regex.
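A sketch of that workaround, filtering a flat walk()/ls() listing with fnmatch (note that `[gz|zip]` is a character class in fnmatch terms, so matching two extensions needs two patterns; the function name is hypothetical):

```python
from fnmatch import fnmatch


def recursive_glob(all_paths, patterns=("*.gz", "*.zip")):
    """Filter a flat listing (e.g. from GCSFileSystem.walk) against
    basename patterns, matching at any depth in the tree."""
    return [p for p in all_paths
            if any(fnmatch(p.rsplit('/', 1)[-1], pat) for pat in patterns)]
```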
Opening as a tracking issue for follow-up discussed #58. I will be able to follow-up on this issue the week of 2018-01-28. From that thread:
[Testing #58] on a GCE instance I use as a dev box. On this machine I have the compute sdk installed and I'm using the service account for application default credentials:
fordas@salish:~/distributed_dev/gcsfs$ gcloud auth list
Credentialed Accounts
ACTIVE ACCOUNT
* [email protected]
In this context oauth2client.client.GoogleCredentials.get_application_default returns a token component that manages access to the service credentials and does not include a refresh_token or client_secret. It instead provides access tokens via the get_access_token interface, which handles token expiration and refresh. It may also be used to retrieve auth headers for requests.
This breaks the current master implementation, which expects a refresh token to be available. It will also break this implementation.
After setting application default credentials to a user account via gcloud auth application-default login, oauth2client.client.GoogleCredentials.get_application_default returns a GoogleCredentials object exposing refresh_token et al. This credentials object is capable of handling access token and auth header generation.
Given that behavior, my current recommendation would be to hold the credential object returned by get_application_default and use the interface credential.get_access_token().access_token to manage access. gcsfs would then not be responsible for auth token management and refresh. It would also be desirable to wrap the device credentials generated via the browser auth mode in the same oauth handler and remove the token management logic currently implemented in gcsfs.
In a broader sense, I would suggest emphasizing use of the default app credentials instead of obtaining and directly caching a credential object for gcsfs. In "off cloud" contexts, the user can then manage authorization via the gcloud auth application-default command, which provides login and revoke. "On cloud", this then provides easy access to the default service account, with the option for a user to override this setting via gcloud auth login.
I have a bucket with several directories, each of which has thousands of files. When querying one of the directories, which has 3600 files, a call to fs.ls("bucket/my-dir") takes several minutes - I actually don't know how long it takes, since I've never let it finish. In contrast, the google client takes 3.4 seconds to return the directory listing: [f for f in bucket.list_blobs(prefix='my-dir')] (the comprehension is to force iteration over the iterator object they return).
This can be nice when the other side is relatively fast:
In [1]: import requests
In [2]: %timeit requests.get('https://en.wikipedia.org')
10 loops, best of 3: 116 ms per loop
In [3]: s = requests.Session()
In [4]: %timeit s.get('https://en.wikipedia.org')
10 loops, best of 3: 48.5 ms per loop
It's not clear yet how effective this will be in practice.
I'm seeing intermittent errors using gcsfs==0.3.0 in a dask.distributed cluster that, I believe, are due to an error in credential resolution. Briefly, when writing a large number of partitions to gcs via the dask.bag.to_textfiles interface, a number of work partitions failed with the log message:
Enter the following code when prompted in the browser:
....
Traceback (most recent call last):
File "/home/fordas/.conda/envs/dev/lib/python3.5/site-packages/distributed/protocol/pickle.py", line 59, in loads
return pickle.loads(x)
File "/home/fordas/.conda/envs/dev/lib/python3.5/site-packages/gcsfs/core.py", line 597, in __setstate__
self.connect()
File "/home/fordas/.conda/envs/dev/lib/python3.5/site-packages/gcsfs/core.py", line 249, in connect
raise RuntimeError("Waited too long for browser"
RuntimeError: Waited too long for browserauthentication.
However, a number of the work partitions completed successfully. This occurred while using multithreaded workers, so I assume that some race condition in token retrieval triggered the browser-based authentication flow. Modifying the filesystem interface in dask.bytes.core
appeared to resolve the issue:
import functools
import dask.bytes
import gcsfs
# Re-register the GCS filesystem so dask always passes token="cloud",
# resolving credentials from instance metadata instead of the oauth
# device flow. Registered under both URI schemes.
org_fs = dask.bytes.core._filesystems['gcs']
wrapped_fs = functools.partial(org_fs, token="cloud")
dask.bytes.core._filesystems['gs'] = wrapped_fs
dask.bytes.core._filesystems['gcs'] = wrapped_fs
Credential resolution in the default case (i.e. GCSFileSystem(token=None)) should resolve credentials from the default service account, rather than falling back to the oauth device flow, if the internal metadata service is available. I'm not familiar with the oauth2client.client.GoogleCredentials.get_application_default() interface currently used in master, but if that interface queries instance metadata by default then this is a "doc bug" and the behavior should be clarified in the gcsfs interface. The oauth2client.GoogleCredentials.get_application_default interface appears to provide a token for the GCE service account when invoked on a GCE instance, but does not provide the "refresh_token" expected by gcsfs's token-refresh logic. This poisons gcsfs's token cache and blocks authorization if token=None mode is used from a GCE instance.
Pending the behavior clarification above, it seems strange to me that dask would invoke gcsfs in a mode that could require interactive authorization. I believe gcsfs should support a headless authorization mode which resolves credentials from (a) the ~/.gcs_tokens token cache, (b) the gcloud application default credentials, and (c) instance metadata, but provides an informative error if interactive oauth would be required, and then register this mode (rather than the current token=None mode) with dask. This would allow users to authenticate manually via the current token=None mode, then use the cached token for all further access.
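The proposed resolution order can be sketched as a simple chain. The names here (resolve_headless, HeadlessAuthError) are hypothetical, not gcsfs API; each resolver is a stand-in callable that returns a credentials object or None, and the chain never opens a browser.

```python
class HeadlessAuthError(RuntimeError):
    """Raised when no non-interactive credential source is available."""


def resolve_headless(resolvers):
    """Try each non-interactive source in order; fail with a clear message."""
    for name, resolver in resolvers:
        creds = resolver()
        if creds is not None:
            return name, creds
    raise HeadlessAuthError(
        "No cached token, application-default credentials, or instance "
        "metadata available; run interactive auth once (token=None) first.")


# Illustration with fake resolvers: the token cache misses, gcloud
# application default credentials hit, metadata is never consulted.
resolvers = [
    ("cache", lambda: None),                 # ~/.gcs_tokens
    ("gcloud", lambda: {"token": "adc"}),    # application default credentials
    ("metadata", lambda: {"token": "gce"}),  # GCE metadata service
]
```

Here resolve_headless(resolvers) returns ("gcloud", {"token": "adc"}); with an empty or all-miss chain it raises HeadlessAuthError instead of blocking a worker on browser input.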
See dask/dask#2207 for dealing with complex URIs. Given that there is no current way to include user/path in a GCS URI, this may just be a no-op, but should be considered.
I regularly access Google cloud services with two different google accounts. I typically switch between them by using gcloud auth login
and then selecting the different account. However, even after doing so the gcsfs
library is finding connection information for my other account. What logic does gcsfs use? I'm interested in tracking down from where it's getting its tokens.
Perhaps by default or with a recursive= keyword
Currently mkdir on a subdirectory fails with a non-informative error. It would be nice if this either worked (if that even makes sense) or else raised an informative "actually you don't need to do this" error.