cthoyt / pystow
Easily pick a place to store data for your Python code.
Home Page: https://pystow.readthedocs.io
License: MIT License
import pystow
from pystow import ensure_csv
MBGM_HOME = pystow.join("matbench-genmetrics")
ensure_csv(MBGM_HOME, url="https://figshare.com/ndownloader/files/36581838")
..\..\..\..\Miniconda3\envs\matbench-genmetrics\lib\site-packages\pystow\api.py:610: in ensure_csv
    _module = Module.from_key(key, ensure_exists=True)
..\..\..\..\Miniconda3\envs\matbench-genmetrics\lib\site-packages\pystow\impl.py:83: in from_key
    base = get_base(key, ensure_exists=False)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
key = WindowsPath('C:/Users/sterg/.data/matbench-genmetrics')
ensure_exists = False

    def get_base(key: str, ensure_exists: bool = True) -> Path:
        """Get the base directory for a module.

        :param key: The name of the module. No funny characters. The envvar
            <key>_HOME where key is uppercased is checked first before using
            the default home directory.
        :param ensure_exists: Should all directories be created automatically?
            Defaults to true.
        :returns: The path to the given
        :raises ValueError: if the key is invalid (e.g., has a dot in it)
        """
>       if "." in key:
E       TypeError: argument of type 'WindowsPath' is not iterable
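The traceback shows that the first positional argument of ensure_csv is treated as a module key, so the check `"." in key` fails when it receives the WindowsPath returned by pystow.join instead of a string. The sketch below reproduces that situation with the standard library only. `check_key` is a hypothetical helper written for illustration, not pystow's actual code; it shows how an early type check would give a clearer message.

```python
from pathlib import Path


def check_key(key: object) -> str:
    """Hypothetical stand-in for a module key check (not pystow's own code)."""
    if not isinstance(key, str):
        # Fail early with a readable message instead of the opaque
        # "argument of type 'WindowsPath' is not iterable".
        raise TypeError(f"key must be a str, got {type(key).__name__}: {key!r}")
    if "." in key:
        raise ValueError(f"invalid key (contains a dot): {key!r}")
    return key


# A plain string key passes the check.
module_key = check_key("matbench-genmetrics")

# A path object, like the one in the traceback, is rejected up front.
try:
    check_key(Path("C:/Users/sterg/.data/matbench-genmetrics"))
except TypeError as error:
    message = str(error)
```

Under this reading, the call in the report would pass the string "matbench-genmetrics" as the key rather than MBGM_HOME.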
This repo does some analysis, but wasn't able to post its data on GitHub (https://github.com/ky66/ROBIN#data). The data is on figshare at https://figshare.com/ndownloader/files/36477873. It seems this could be a simple wrapper around ensure() that just formats in the file number.
Related: cthoyt/zenodo-client#6
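A minimal sketch of the URL formatting step, assuming the download URL pattern seen in the links in this thread holds for other figshare files. `figshare_url` is a hypothetical name, not an existing pystow function; its result would be handed to ensure() as the url argument.

```python
# URL pattern taken from the figshare links quoted in this thread.
FIGSHARE_DOWNLOAD_URL = "https://figshare.com/ndownloader/files/{file_id}"


def figshare_url(file_id: int) -> str:
    """Build the download URL for a figshare file number (hypothetical helper)."""
    return FIGSHARE_DOWNLOAD_URL.format(file_id=file_id)


url = figshare_url(36477873)
```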
pystow/src/pystow/config_api.py
Lines 151 to 164 in ecbe7ea
I needed to call cfp.add_section(module) before I could use cfp.set(module, key, value), given the file doesn't already exist.
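The behaviour described here can be reproduced with configparser alone: set() raises NoSectionError when the section is missing. The sketch below guards the call with has_section(). `set_config_value` and the example section and key names are made up for illustration.

```python
from configparser import ConfigParser


def set_config_value(cfp: ConfigParser, module: str, key: str, value: str) -> None:
    """Set a value, creating the section first if it does not exist yet."""
    if not cfp.has_section(module):
        cfp.add_section(module)
    cfp.set(module, key, value)


cfp = ConfigParser()  # empty parser, as when the config file does not exist yet
set_config_value(cfp, "mymodule", "mykey", "myvalue")
set_config_value(cfp, "mymodule", "otherkey", "othervalue")  # section is reused
```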
Pystow does not check whether there is an existing .data directory on the user's system and happily commandeers this folder even if it already exists. Since this is a very common and generic name, it is not unlikely that a user already has a .data directory in their home folder, or that some software other than pystow will later try to place a .data directory there. I suggest changing to a less generic name such as ".pystow_data" to avoid potential naming conflicts. I think being able to change the default folder name with an environment variable is insufficient, because the direct users of pystow are Python package developers, not Python package users. We should seek to minimize any cognitive burden or sources of surprise for end users of Python packages that use pystow.
Hello,
I love the project. However, I am missing something like an ensure_zip_file function. While ensuring that the actual .zip is there can be done with ensure, it would be nice to have functionality where I ensure a file from this zip is there and load it as a file object, to then read it in however it is needed.
Poking around in the code, I found Module.ensure_open_zip. I think this context manager can easily be wrapped for the API to get what I need, like this:
def ensure_open_zip(
    key: str,
    *subkeys: str,
    url: str,
    inner_path: str,
    name: Optional[str] = None,
    download_kwargs: Optional[Mapping[str, Any]] = None,
):
    _module = Module.from_key(key, ensure_exists=True)
    # Forward download_kwargs too, so the parameter is not silently ignored.
    return _module.ensure_open_zip(
        *subkeys,
        url=url,
        inner_path=inner_path,
        name=name,
        download_kwargs=download_kwargs,
    )
If it's fine with you I could make a PR and add it (with tests and doc of course)?
For functions like ensure_open_zip, the return value is documented as
:yields: An open file object
so I thought this would be something I can call, e.g., read() on. Checking the type of the returned value, it's contextlib._GeneratorContextManager, and it took me a little while to figure out that this means I have to use with to interact with this function. Are the documentation and type annotation for these functions correct?
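To illustrate the question, here is a standard-library sketch of a generator function decorated with contextlib.contextmanager. Calling it returns a _GeneratorContextManager, and the open file object only becomes available inside a with block. `open_zip_member` is a hypothetical stand-in and does not download anything. The `Iterator[IO[bytes]]` annotation is one common way to type such a generator and is an assumption here, not what pystow uses.

```python
import io
import zipfile
from contextlib import contextmanager
from typing import IO, Iterator


@contextmanager
def open_zip_member(archive: io.BytesIO, inner_path: str) -> Iterator[IO[bytes]]:
    """Yield an open file object for one member of a zip archive."""
    with zipfile.ZipFile(archive) as zip_file:
        with zip_file.open(inner_path) as file:
            yield file


# Build a small in-memory archive to stand in for a downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_file:
    zip_file.writestr("inner/data.txt", "first line\nsecond line\n")

# Calling the function alone gives a context manager, not a file.
manager_type = type(open_zip_member(buffer, "inner/data.txt")).__name__

# Entering the context manager gives the file object that was yielded.
with open_zip_member(buffer, "inner/data.txt") as file:
    first_line = file.readline()
```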
I just noticed that the type information of pystow is not shipped, because no py.typed file is present. See this mypy output I had in another project.
...
sylloge/base.py:28: error: Skipping analyzing "pystow": module is installed, but missing library stubs or py.typed marker [import-untyped]
sylloge/base.py:30: error: Skipping analyzing "pystow.utils": module is installed, but missing library stubs or py.typed marker [import-untyped]
....
This can be easily fixed by adding a py.typed file inside src/pystow.
pystow has methods for syncing with a gzipped file from a URL and dynamically opening it, but if my upstream file is a gzipped sqlite database (e.g. https://s3.amazonaws.com/bbop-sqlite/hp.db.gz), then I need it to be uncompressed in my ~/.data folder before I make a connection to it (the same may hold for things like OWL).
I can obviously do this trivially, but this would require introspecting paths and would seem to defeat the point of having an abstraction layer.
For now I am putting duplicate .db and .db.gz files on S3 and only using the former with pystow, but I would like to migrate away from distributing the uncompressed versions.
What I am imagining is:
url = 'https://s3.amazonaws.com/bbop-sqlite/hp.db.gz'
path = pystow.ensure('oaklib', 'sqlite', url=url, decompress=True)
conn = connect(f"file:///{path}")
Does that make sense?
As an aside, it may also be useful to have specific ensure methods for sqlite and/or sqlalchemy the same way you have for pandas.
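Until something like the imagined decompress=True option exists, the decompression step can be sketched with gzip and shutil from the standard library. `ensure_decompressed` is a hypothetical helper. It assumes the .gz file has already been downloaded (for example with ensure) and writes the uncompressed copy next to it.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def ensure_decompressed(gz_path: Path) -> Path:
    """Decompress name.gz next to itself as name, unless that file already exists."""
    target = gz_path.with_suffix("")  # e.g. hp.db.gz -> hp.db
    if not target.exists():
        with gzip.open(gz_path, "rb") as source, open(target, "wb") as sink:
            shutil.copyfileobj(source, sink)
    return target


# Usage with a throwaway directory standing in for the pystow folder.
with tempfile.TemporaryDirectory() as directory:
    gz_path = Path(directory) / "hp.db.gz"
    with gzip.open(gz_path, "wb") as file:
        file.write(b"example payload")
    target = ensure_decompressed(gz_path)
    target_name = target.name
    restored = target.read_bytes()
```

The returned path can then be passed to the database driver directly.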
If I want to use zenodo:sandbox as a key, it should know to look in zenodo.ini and other zenodo.* files if they exist.
Hello, when using the API function ensure_open_zip, an error is thrown. For example, the following:
import pystow
url = "https://cloud.enterprise.informatik.uni-leipzig.de/index.php/s/LHPbMCre7SLqajB/download/MultiKE_D_Y_15K_V1.zip"
inner_path = "MultiKE/D_Y_15K_V1/721_5fold/1/20210219183115/kg1_ent_ids"
with pystow.ensure_open_zip("kiez", url=url, inner_path=inner_path) as file:
    for line in file:
        print(line)
        break
results in TypeError: '_GeneratorContextManager' object is not iterable.
Using the module does work.
I figured out how to fix this by changing the respective API code to the following:
@contextmanager
def ensure_open_zip(
    key: str,
    *subkeys: str,
    url: str,
    inner_path: str,
    name: Optional[str] = None,
    force: bool = False,
    download_kwargs: Optional[Mapping[str, Any]] = None,
    mode: str = "r",
    open_kwargs: Optional[Mapping[str, Any]] = None,
):
    """Ensure a file is downloaded then open it with :mod:`zipfile`."""
    _module = Module.from_key(key, ensure_exists=True)
    with _module.ensure_open_zip(
        *subkeys,
        url=url,
        inner_path=inner_path,
        name=name,
        force=force,
        download_kwargs=download_kwargs,
        mode=mode,
        open_kwargs=open_kwargs,
    ) as inner_ensure_open_zip:
        # Yield the open file object itself, not a list wrapping it.
        yield inner_ensure_open_zip
Currently, pystow provides a number of ensure_* functions that target different file types, download them if not available, and load them. For instance, ensure_csv takes a path to a CSV file and loads it with pandas. However, all of these functions also require a url argument from which the file is first downloaded if it doesn't already exist. I think it would be useful to have variants of these functions that provide the same loading of canonical file types but without the url/download part, under the assumption that the given file is already there and never necessarily came from a URL in the first place. I would definitely use pickle loading or JSON loading, for instance, knowing that a given file is already there.
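A rough sketch of what such a download-free variant could look like for JSON, using only the standard library. `load_json` and its signature are hypothetical, not an existing pystow function. It assumes the directory is already known (for example from pystow.join) and the file is already in place.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_json(directory: Path, name: str) -> Any:
    """Load a JSON file that is expected to exist already; never downloads."""
    path = directory / name
    if not path.is_file():
        raise FileNotFoundError(path)
    with path.open() as file:
        return json.load(file)


# Usage with a throwaway directory standing in for a pystow module folder.
with tempfile.TemporaryDirectory() as directory:
    folder = Path(directory)
    (folder / "settings.json").write_text(json.dumps({"answer": 42}))
    loaded = load_json(folder, "settings.json")
    try:
        load_json(folder, "missing.json")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```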
In the README, the instructions say
Data gets stored in ~/.data by default. If you want to change the name of the directory,
set the environment variable PYSTOW_NAME. If you want to change the default parent directory
to be other than the home directory, set PYSTOW_HOME.
I interpreted this to say that PYSTOW_HOME can be configured to /path to make it create its .data folder in /path/.data. However, this made it put the individual project folders into /path/project1, /path/project2, etc. I think it would make sense to either interpret PYSTOW_HOME as the parent of .data and PYSTOW_NAME as the name for the .data folder, or change the instructions above to describe the current behavior (i.e., instead of "default parent directory to be other than the home directory" say "default pystow data directory to be other than ~/.data").
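The behaviour reported above can be modelled in a few lines. This is a sketch of the resolution order as described in this issue, not a copy of pystow's implementation. `resolve_base` and `resolve_module` are hypothetical names, and the environment is passed in as a mapping so the example does not depend on the real os.environ.

```python
from pathlib import Path
from typing import Mapping


def resolve_base(env: Mapping[str, str], home: Path) -> Path:
    """Return the data directory as the issue reports it being resolved."""
    if "PYSTOW_HOME" in env:
        # PYSTOW_HOME replaces the whole ~/.data directory, not just its parent.
        return Path(env["PYSTOW_HOME"])
    # Otherwise only the folder name inside the home directory can change.
    return home / env.get("PYSTOW_NAME", ".data")


def resolve_module(env: Mapping[str, str], home: Path, key: str) -> Path:
    """Return the folder for one module (project) under the data directory."""
    return resolve_base(env, home) / key


home = Path("/home/user")
default_dir = resolve_module({}, home, "project1")
renamed_dir = resolve_module({"PYSTOW_NAME": ".pystow_data"}, home, "project1")
custom_home_dir = resolve_module({"PYSTOW_HOME": "/path"}, home, "project1")
```

With PYSTOW_HOME set to /path, the module folder comes out as /path/project1 rather than /path/.data/project1, which matches the observation in the report.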
Assume I start with
workdir = pystow.join('part1', 'part2', ..., 'partn')
This guarantees that workdir will be created if it doesn't yet exist. However, if I now want to make a subfolder inside workdir, I either have to do
work_subdir = workdir.joinpath('part_nplusone')
which doesn't create the folder for me, so I have to do additional bookkeeping to make that happen, or do
workdir = pystow.join('part1', 'part2', ..., 'partn', 'part_nplusone')
which is redundant.
What is the recommended approach here?
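One workaround for the bookkeeping mentioned above is a small wrapper around pathlib that joins and creates in one step. `ensure_subdir` is a hypothetical helper, not part of pystow, and the temporary directory below only stands in for the path returned by pystow.join.

```python
import tempfile
from pathlib import Path


def ensure_subdir(parent: Path, *parts: str) -> Path:
    """Join parts onto parent and create the resulting directory if needed."""
    path = parent.joinpath(*parts)
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as directory:
    workdir = Path(directory)  # stand-in for the result of pystow.join(...)
    work_subdir = ensure_subdir(workdir, "part_nplusone")
    created = work_subdir.is_dir()
    # Calling it again is harmless because of exist_ok=True.
    same_path = ensure_subdir(workdir, "part_nplusone") == work_subdir
```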
Optionally behind a verbose parameter, but it'd be nice in some cases for the user to be aware that downloads are happening in the background.
My workflow involves periodic ls -alt ~/.data/oaklib followed by looking at timestamps, using tacit knowledge about update frequencies of different ontologies, and selectively removing older files.
But if pystow is used in toolchains used by less technical users this could be confusing. What are the long term plans here? Should application developers write bespoke cache management solutions? This is not a bad idea as they can take advantage of specific conventions (e.g. I am caching sqlites of ontology files and I know versionIRI, when present, should uniquely identify the version). But it may be useful to have some kind of general purpose cache management helpers in the core, together with some kind of autoflush-after-N-days type options?
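As a starting point for discussion, an autoflush-after-N-days helper could be as simple as comparing file modification times against a cutoff. The sketch below uses only the standard library. `stale_files` and `flush_stale_files` are hypothetical names, and modification time is only a rough proxy for staleness compared with something like a versionIRI.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 24 * 60 * 60


def stale_files(directory: Path, max_age_days: float, now: Optional[float] = None) -> List[Path]:
    """List files under directory last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in directory.rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def flush_stale_files(directory: Path, max_age_days: float) -> List[Path]:
    """Delete stale files and return the paths that were removed."""
    removed = stale_files(directory, max_age_days)
    for path in removed:
        path.unlink()
    return removed


# Usage with a throwaway directory standing in for ~/.data/oaklib.
with tempfile.TemporaryDirectory() as directory:
    cache = Path(directory)
    old_file = cache / "old.db"
    new_file = cache / "new.db"
    old_file.write_text("old")
    new_file.write_text("new")
    # Backdate one file by 30 days so it counts as stale.
    thirty_days_ago = time.time() - 30 * SECONDS_PER_DAY
    os.utime(old_file, (thirty_days_ago, thirty_days_ago))
    removed_names = [path.name for path in flush_stale_files(cache, max_age_days=7)]
    remaining_names = sorted(path.name for path in cache.iterdir())
```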
These are considered as files by pathlib. Make name of file more explicit and keyword only.
After reading through https://en.wikipedia.org/wiki/Comma-separated_values, I think I can understand the decision behind making tab separators the default as "the only safe option," though it still seems confusing to me that ensure_csv assumes sep="\t". It may be worth mentioning that the default for pandas' DataFrame.to_csv() is a comma, not a tab.
Lines 1414 to 1417 in 2ce9690