elcorto / psweep
Loop like a pro, make parameter studies fun.
Home Page: https://elcorto.github.io/psweep
License: BSD 3-Clause "New" or "Revised" License
Affects all releases later than 0.10.0.
The reason is https://github.com/elcorto/psweep/blob/0.11.0/pyproject.toml#L46, which triggers pypa/twine#726.
The direct reference appears only in project.optional-dependencies,
but PyPI still doesn't allow it.
For the time being, install from source:
git clone ...
cd psweep
pip install [-e] .
Even though very unlikely, there is a small non-zero chance of a UUID collision when creating run_ids and pset_ids. We should add simple checks against existing IDs in the database when assigning new ones at call time.
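A minimal sketch of such a check (the helper name and the way existing IDs are collected are illustrative, not psweep API):

```python
import uuid

def new_unique_id(existing_ids):
    """Draw UUIDs until one is not already present in existing_ids.

    existing_ids: set of ID strings already assigned in the database.
    """
    while True:
        candidate = str(uuid.uuid4())
        if candidate not in existing_ids:
            return candidate

# usage: collect IDs already present before assigning new ones,
# e.g. existing = set(df._pset_id) when a database is loaded
existing = set()
pset_id = new_unique_id(existing)
existing.add(pset_id)
```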
Dask could be used to replace the custom template file workflow. Needed packages:
dask
dask-jobqueue
distributed
psweep does not seem to follow the convention of storing the package version in a __version__ variable in the main package scope.
This makes it difficult to distinguish between versions programmatically in user code.
My particular case where this became a problem was the default database filename changing from 'results.pk' to 'database.pk'. My code used to rename that file from the expected default path to something else. After updating psweep to a newer version, the default database location was different than before. If psweep had __version__ set, I could have handled it with an if statement on the __version__ value.
On a related note, please also consider storing the default value of the database_basename argument as a constant, e.g. DEFAULT_DATABASE_BASENAME.
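For illustration, a sketch of how user code could branch on the version once it is exposed (the `default_db_basename` helper and the 0.11 version boundary are assumptions for illustration, not psweep API; today `importlib.metadata.version("psweep")` can serve as a stand-in for `__version__`):

```python
def default_db_basename(version_string):
    """Pick the default database basename for a given psweep version string.

    The 0.11 boundary is an assumption for illustration; check the
    changelog for the actual release that renamed the file.
    """
    major, minor = (int(x) for x in version_string.split(".")[:2])
    return "database.pk" if (major, minor) >= (0, 11) else "results.pk"

# with a __version__ attribute (or importlib.metadata.version("psweep")):
# basename = default_db_basename(psweep.__version__)
print(default_db_basename("0.10.0"))  # results.pk
print(default_db_basename("0.11.0"))  # database.pk
```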
Use pandas.concat() instead.
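For context, `DataFrame.append()` was deprecated in pandas 1.4 and removed in 2.0; the replacement looks like:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
new_rows = pd.DataFrame({"a": [3]})

# pd.concat() replaces the removed DataFrame.append();
# ignore_index=True renumbers the index of the combined frame
df = pd.concat([df, new_rows], ignore_index=True)
print(df.a.tolist())  # [1, 2, 3]
```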
joblib.hash may be too specific for our purposes in some cases, since it is type-sensitive:
# Python int
>>> ps.pset_hash(dict(a=1))
'64846e128be5c974d6194f77557d0511542835a8'
>>> ps.pset_hash(dict(a=int(1)))
'64846e128be5c974d6194f77557d0511542835a8'
# np.int64
>>> ps.pset_hash(dict(a=np.int64(1)))
'4bbb1de2b27b9cfd2f81aa37df3bb3926b2d584d'
>>> ps.pset_hash(dict(a=np.array([1])[0]))
'4bbb1de2b27b9cfd2f81aa37df3bb3926b2d584d'
In the context of a pset, we wouldn't care what the type is, as long as it is some kind of int. But the type sensitivity can cause problems if we read back params from a database, e.g. when repeating workloads for failed psets.
If we pass in ints as in
>>> params = ps.plist("a", [1,2,3])
pandas will cast them such that in a DataFrame, df.a.values will be a numpy array
>>> df.a.values
array([1, 2, 3])
>>> df.a.values.dtype
dtype('int64')
with each entry being int64, but to_dict() in
>>> strip_pset = lambda pset: {k: v for k,v in pset.items() if not k.startswith("_")}
>>> params_from_df = [strip_pset(row.to_dict()) for _, row in df.iterrows()]
>>> type(params_from_df[0]["a"])
int
will cast back to Python ints.
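One possible mitigation (a sketch, not psweep's actual implementation) is to normalize numpy scalar types to plain Python types before hashing:

```python
import numpy as np

def normalize(value):
    """Convert numpy scalars to plain Python types so that hashing is
    insensitive to e.g. int vs np.int64 round-trips through pandas."""
    if isinstance(value, np.generic):
        return value.item()
    return value

pset = {"a": np.int64(1), "b": 2.0}
normalized = {k: normalize(v) for k, v in pset.items()}
print(type(normalized["a"]))  # <class 'int'>
```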
Multiple _pset_ids can end up having the same index if runs start at the same time.
Now that we have an int index, we may drop _pset_seq, as this is the same as df.index. _pset_seq is set to NaN for parallel runs in run_local() because of random execution order (there's a not too complicated fix, but we didn't have the use case), which is not ideal. So far its only real use case is in prep_batch(), where we use it to enumerate psets in run_{machine.name}.sh, but also only in a comment as a convenience feature.
This is a great package! I was just wondering whether the package supports checkpoints in any way. Sometimes, there could be a crash/error while running a certain parameter set. It would be nice if there were a checkpoint file so that re-running the parameter sweep would not re-run the already-completed part of the sweep.
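Until there is built-in support, a sketch of how already-computed psets could be skipped using the stored pset hashes (the `filter_done` helper and the stand-in hash function are hypothetical, not psweep API):

```python
def filter_done(params, db, hash_func):
    """Drop psets whose hash is already in the database (hypothetical helper).

    params:    list of pset dicts still to run
    db:        mapping with a "_pset_hash" entry listing completed hashes
    hash_func: function computing a stable hash of a pset
    """
    done = set(db["_pset_hash"])
    return [pset for pset in params if hash_func(pset) not in done]

# toy demonstration with a stand-in hash function
hash_func = lambda pset: str(sorted(pset.items()))
db = {"_pset_hash": [hash_func({"a": 1})]}
todo = filter_done([{"a": 1}, {"a": 2}], db, hash_func)
print(todo)  # [{'a': 2}]
```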
Hi!
I enjoy using psweep for local sweeps and started looking into the experimental remote cluster processing bits, since I would like to expand my sweeps onto our HPC. I have some questions on how to use the current implementation:
In your example you have a template run.py, which is pretty much what is executed on each machine. You eventually use the substitute method from the string module to replace placeholders for each parameter. It makes perfect sense, but it is a little less convenient than what was possible with the run_local() function, where the worker function is given directly and the parameter dictionary is passed straight to that function. I was hoping to have a similar mechanism available, i.e., where I just get a dictionary and can have my 'template' run.py simply deal with a dictionary. This would allow an actually executable template run.py that I do not have to modify for each sweep (e.g., when adding a sweep parameter).
The question I have is how to pass such a dictionary to a new script. Do you have a good idea? A straightforward answer could be pickling it and simply loading the dictionary. Since we want to add the parameters to a picklable database, all entries must be picklable anyway. The difficulty I have now is that prepare_batch() does the replacement and I cannot access the pset to create the pickled dict myself. Looking at your suggested workflow here, it seems that there used to be a hook for a func(), which I could have used, but there is none in prepare_batch(). Perhaps I misunderstand something, but could it be that the documentation on the workflow is not quite up to date?
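To sketch the pickle-based handover described above (the directory layout and file names are illustrative, not psweep API):

```python
import pickle
from pathlib import Path

# driver side: dump one pset.pk per calculation directory
# (the layout calc_root/<index>/pset.pk is an illustrative assumption)
def write_psets(params, calc_root):
    for i, pset in enumerate(params):
        d = Path(calc_root) / str(i)
        d.mkdir(parents=True, exist_ok=True)
        with open(d / "pset.pk", "wb") as fh:
            pickle.dump(pset, fh)

# template run.py side: load the dict instead of substituting placeholders
def read_pset(calc_dir):
    with open(Path(calc_dir) / "pset.pk", "rb") as fh:
        return pickle.load(fh)
```

A run.py written this way stays executable as-is and never needs editing when a sweep parameter is added.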
Add an optional prefix pset_ (such as run(..., pset_prefix="pset_")) to all pset variables such that we have a clear distinction in the db:
bookkeeping:
_pset_id
_run_id
_pset_hash
...
pset content:
pset_foo
pset_bar
...
results added by worker() and/or eval scripts (users can also add prefixes of their own here, e.g. `postproc_` or `eval_`):
baz
boing
...
We don't really use all the features of jupyter book, and it has a lot of dependencies. Maybe replace it with a more lightweight standard sphinx + sphinx-book-theme setup.
Correct the README example where we use zip(a,b) and add a warning. The example currently drops the last entry in a.
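To illustrate the pitfall: `zip()` truncates to the shortest input, so trailing entries are silently dropped; `itertools.zip_longest` pads instead (and on Python >= 3.10, `zip(..., strict=True)` raises on a length mismatch):

```python
from itertools import zip_longest

a = [1, 2, 3]
b = ["x", "y"]

# zip() stops at the shortest input: the last entry of a is silently dropped
print(list(zip(a, b)))          # [(1, 'x'), (2, 'y')]

# zip_longest pads the shorter input instead of dropping entries
print(list(zip_longest(a, b)))  # [(1, 'x'), (2, 'y'), (3, None)]
```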
Add a function df_refresh_pset_hash(df, columns=None) to be applied after we manually add a column to the database in eval scripts, where the column name should be part of the pset; the columns default should be all columns which don't start with "_". This needs to respect the naming defined in #14 as well.
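A sketch of what the proposed function might look like (the signature follows the issue; passing the hash function explicitly is an illustrative simplification, real code would call ps.pset_hash):

```python
import pandas as pd

def df_refresh_pset_hash(df, hash_func, columns=None):
    """Recompute the pset hash for each row (sketch of the proposed API).

    columns: pset columns to hash; defaults to all not starting with "_".
    hash_func: stand-in for ps.pset_hash.
    """
    if columns is None:
        columns = [c for c in df.columns if not c.startswith("_")]
    df["_pset_hash"] = [
        hash_func({c: row[c] for c in columns}) for _, row in df.iterrows()
    ]
    return df

# toy demonstration with a stand-in hash function
df = pd.DataFrame({"a": [1, 2], "_pset_id": ["x", "y"]})
hash_func = lambda pset: "|".join(f"{k}={v}" for k, v in sorted(pset.items()))
df = df_refresh_pset_hash(df, hash_func)
print(df["_pset_hash"].tolist())  # ['a=1', 'a=2']
```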
Need to check: we support database_dir != calc_dir but rarely use it so far, and usually assume that they are the same. When run_local(..., database_dir=...) is not the default (same as calc_dir), then this will most likely break when we use run_local(..., database_dir=..., simulate=True).
Users can catch exceptions and add a _failed field like so:

import traceback

def func(pset):
    ...  # here be code
    return dict(...)

def safe_func(pset):
    try:
        ret = func(pset)
        ret.update(_failed=False, _exc_txt=None)
    except Exception:
        # ret may be unbound if func() raised, so build a fresh dict here
        txt = traceback.format_exc()
        print(f"{pset=} failed, traceback:\n{txt}")
        ret = dict(_failed=True, _exc_txt=txt)
    return ret

ps.run(safe_func, params)
Maybe make this a feature, e.g. run(..., guard_worker=True).