
psweep's People

Contributors

dependabot[bot], elcorto, pre-commit-ci[bot]


Forkers

rrezaev, abdelix

psweep's Issues

Add measures to avoid UUID collisions

Even though a collision is very unlikely, there is a small non-zero chance of creating duplicate UUIDs when generating run_ids and pset_ids. We should add simple checks against existing IDs in the database, and against other IDs assigned in the same call.
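
A minimal sketch of such a check, assuming existing_ids is collected from the database's _run_id/_pset_id columns plus any IDs assigned earlier in the same call (the function name is made up):

import uuid

def new_unique_id(existing_ids):
    # With uuid4 a collision is astronomically unlikely, so this
    # loop virtually always runs exactly once.
    candidate = str(uuid.uuid4())
    while candidate in existing_ids:
        candidate = str(uuid.uuid4())
    return candidate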

Please add __version__ to package main module

psweep does not seem to follow the convention of storing the package version in a __version__ variable in the main package scope.
This makes it difficult to distinguish between versions programmatically in user code.

My particular case where this became a problem was the default database filename changing from 'results.pk' to 'database.pk'. My code used to rename that file from the expected default path to something else. After updating psweep to a newer version, the default database location was different than before. If psweep had __version__ set, I could have handled this with an if statement on the __version__ value.

On a related note, please also consider storing the default value of the database_basename argument as a constant, e.g. DEFAULT_DATABASE_BASENAME.
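
In the meantime, a sketch of the kind of guard this enables; importlib.metadata already works today without the attribute, and the version cutoff below is a placeholder, not the actual release where the default changed:

from importlib.metadata import version

# Installed psweep version as a comparable tuple, e.g. (0, 9).
ps_version = tuple(int(x) for x in version("psweep").split(".")[:2])

# Placeholder cutoff: substitute the release in which the default
# database basename actually changed.
db_basename = "database.pk" if ps_version >= (1, 0) else "results.pk"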

Type-related issues in pset hashes

joblib.hash may be too specific for our purposes in some cases, since it is type-sensitive:

>>> import numpy as np
>>> import psweep as ps

# Python int
>>> ps.pset_hash(dict(a=1))
'64846e128be5c974d6194f77557d0511542835a8'
>>> ps.pset_hash(dict(a=int(1)))
'64846e128be5c974d6194f77557d0511542835a8'

# np.int64
>>> ps.pset_hash(dict(a=np.int64(1)))
'4bbb1de2b27b9cfd2f81aa37df3bb3926b2d584d'
>>> ps.pset_hash(dict(a=np.array([1])[0]))
'4bbb1de2b27b9cfd2f81aa37df3bb3926b2d584d'

In the context of a pset, we wouldn't care what the type is, as long as it is some kind of int. But the type sensitivity can cause problems if we read back params from a database, e.g. when repeating workloads for failed psets.

If we pass in ints as in

>>> params = ps.plist("a", [1,2,3])

pandas will cast them such that in a DataFrame, df.a.values will be a numpy array

>>> df.a.values
array([1, 2, 3])
>>> df.a.values.dtype
dtype('int64')

with each entry being int64, but to_dict() in

>>> strip_pset = lambda pset: {k: v for k,v in pset.items() if not k.startswith("_")}
>>> params_from_df = [strip_pset(row.to_dict()) for _, row in df.iterrows()]
>>> type(params_from_df[0]["a"])
int

will cast back to Python ints.
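
One possible workaround, sketched here as an assumption rather than anything psweep currently does, is to coerce numpy scalars to builtin types before hashing:

import numpy as np

def normalize(pset):
    # np.generic is the base class of all numpy scalar types
    # (np.int64, np.float64, ...); .item() converts them to the
    # matching builtin Python type.
    return {k: v.item() if isinstance(v, np.generic) else v
            for k, v in pset.items()}

# With this, ps.pset_hash(normalize(dict(a=np.int64(1)))) equals
# ps.pset_hash(normalize(dict(a=1))).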

Consider dropping `_pset_seq` in favor of `df.index`

Now that we have an int index, we may drop _pset_seq since it is the same as df.index. _pset_seq is set to NaN for parallel runs in run_local() because of the random execution order, which is not ideal (there is a not-too-complicated fix, but we haven't had the use case). So far its only real use case is in prep_batch(), where we use it to enumerate psets in run_{machine.name}.sh, and even there only in a comment, as a convenience feature.

Checkpoint/backup in case of crash

This is a great package! I was just wondering whether the package supports checkpoints in any way. Sometimes there can be a crash or error while running a certain parameter set. It would be nice if there were a checkpoint file so that re-running the parameter sweep would not re-run the parameter sets that have already completed.
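
One way to approximate this manually, sketched under the assumption that the database stores a _pset_hash column (as shown in the other issues; func stands in for your worker):

import os
import pandas as pd
import psweep as ps

def func(pset):
    ...  # your worker

db = "calc/database.pk"
params = ps.plist("a", [1, 2, 3])

if os.path.exists(db):
    done = set(pd.read_pickle(db)["_pset_hash"])
    # Drop psets whose hash is already in the database.
    params = [p for p in params if ps.pset_hash(p) not in done]

ps.run(func, params)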

cluster processing: passing the parameter dictionary to each job directly

Hi!
I enjoy using psweep for local sweeps and started looking into the experimental remote cluster processing bits, since I would like to expand my sweeps onto our HPC. I have some questions on how to use the current implementation:

In your example you have a template run.py, which is pretty much what is executed on each machine. You eventually use the substitute method from the string module to replace placeholders for each parameter. It makes perfect sense, but it is a little less convenient than what was possible with the run_local() function, where the worker function is given directly and the parameter dictionary is passed straight to that function. I was hoping to have a similar mechanism available, i.e. where I just get a dictionary and my 'template' run.py simply deals with that dictionary. This would make the template run.py actually executable, and I would not have to modify it for each sweep (e.g. when adding a sweep parameter).

The question I have is how to pass such a dictionary to a new script. Do you have a good idea? A straightforward answer could be pickling it and simply loading the dictionary in the script. Since we want to add the parameters to a picklable database, all entries must be picklable anyway. The difficulty I have now is that prepare_batch() does the replacement and I cannot access the pset to create the pickled dict myself. Looking at your suggested workflow here, it seems that there used to be a hook for a func(), which I could have used, but there is none in prepare_batch(). Perhaps I misunderstand something, but could it be that the documentation on the workflow is not quite up to date?
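
For what it's worth, a rough sketch of that pickling route; the file name pset.pk and the directory layout are assumptions, not psweep's API:

import pickle

# Submit side: write one pset.pk per calc directory before job
# submission. params is the list of pset dicts; _pset_id is assumed
# to have been assigned by the batch preparation step.
for pset in params:
    with open(f"calc/{pset['_pset_id']}/pset.pk", "wb") as fd:
        pickle.dump(pset, fd)

# A generic, sweep-independent run.py then just loads the dict:
#
#     import pickle
#     with open("pset.pk", "rb") as fd:
#         pset = pickle.load(fd)
#     result = worker(pset)  # same signature as with run_local()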

Configurable naming of database columns

Add an optional prefix pset_ (such as run(..., pset_prefix="pset_")) to all pset variables such that we have a clear distinction in the db (a sketch of the renaming follows the grouping below):

book keeping
  _pset_id
  _run_id
  _pset_hash
  ...

pset content
  pset_foo
  pset_bar
  ...

results added by worker() and/or eval scripts. Users can also add
their own prefixes here, e.g. `postproc_` or `eval_`.
  baz
  boing
  ...
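
The renaming itself is simple; a sketch (where it would hook into run() is left open):

def add_prefix(pset, prefix="pset_"):
    # Book keeping fields (leading underscore) keep their names,
    # everything else gets the prefix.
    return {k if k.startswith("_") else prefix + k: v
            for k, v in pset.items()}

# add_prefix(dict(foo=1, _pset_id="abc"))
# -> {"pset_foo": 1, "_pset_id": "abc"}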

Replace jupyter book

We don't really use all the features of jupyter book, and it has a lot of dependencies. Maybe replace it with a more lightweight standard setup: sphinx + sphinx-book-theme.

Option to refresh the pset hash

Add a function df_refresh_pset_hash(df, columns=None) to be applied after we manually add a column to the database in eval scripts, when the new column should be part of the pset. The columns default should be all columns which don't start with "_". Needs to respect the naming defined in #14 as well.
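
A minimal sketch of what this could look like, using ps.pset_hash as in the hash examples above (the prefix handling from #14 is left out):

import psweep as ps

def df_refresh_pset_hash(df, columns=None):
    if columns is None:
        # Default: all columns that are not book keeping fields.
        columns = [c for c in df.columns if not c.startswith("_")]
    df["_pset_hash"] = [
        ps.pset_hash({c: row[c] for c in columns})
        for _, row in df.iterrows()
    ]
    return df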

Setting `database_dir` in `run()` may break simulate workflow

Need to check: we support database_dir != calc_dir but have rarely used it so far and usually assume that the two are the same. When database_dir is not the default (i.e. not equal to calc_dir), run_local(..., database_dir=..., simulate=True) will most likely break.

Guard failing `worker()` calls

Users can catch exceptions and add a _failed field like so:

import traceback

import psweep as ps

def func(pset):
    ... here be code ...
    return dict(...)

def safe_func(pset):
    try:
        ret = func(pset)
        ret.update(_failed=False, _exc_txt=None)
    except Exception:
        # ret may never have been assigned if func() raised, so
        # build a fresh dict instead of calling ret.update()
        txt = traceback.format_exc()
        print(f"{pset=} failed, traceback:\n{txt}")
        ret = dict(_failed=True, _exc_txt=txt)
    return ret

ps.run(safe_func, params)

Maybe make this a feature, e.g. run(..., guard_worker=True).
