ucl-ccs / easyvvuq

Python 3 framework to facilitate verification, validation and uncertainty quantification (VVUQ) for a wide variety of simulations.

Home Page: https://easyvvuq.readthedocs.io/

License: GNU Lesser General Public License v3.0

Python 17.75% Makefile 0.01% C++ 0.07% Shell 0.01% Dockerfile 0.01% Jupyter Notebook 82.16% Jinja 0.01%
hpc-applications python simulation uncertainty-quantification verification vvuq

easyvvuq's People

Contributors

anton-le, arabnejad, bartoszbosak, cspgdds, d7919, davidpcoster, djgroen, dww100, goghino, jlakhlili, jonmccullough, lgtm-migrator, maxvdkolk, orbitfold, raar1, wedeling, zedthree


easyvvuq's Issues

Shared files for Campaigns

We should have a mechanism for specifying resources (files) needed by all runs in a Campaign (I'm thinking of calling them 'fixtures'). Example:
Running MD simulations with the same initial structure and topology but different cutoffs.

A vector of parameters cannot currently be defined in the present json format

Somewhat related to #57

To allow for a "parameter" which is actually a vector of varying parameters, we would have to provide e.g. a "vector" type (currently we have e.g. "Real"). But then what would the "min" and "max" fields mean? They could of course just be the physical range of any one element of the vector, but that might not be appropriate in all cases. Most likely we will simply make those verification fields conditional on the specified type.

This issue affects the ability to store the campaign object (as a json file), because part of that is storing the parameter space description - the "params" dict. Even if #57 is resolved, we still need a standard way to record these vectors so that the state of the campaign can be saved and reloaded later.
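As a minimal sketch (the field names follow the existing Real-style entries; the "length" field and the element-wise reading of "min"/"max" are assumptions, not an agreed schema), a "vector" type with type-conditional verification might look like:

```python
# Hypothetical params dict mixing a scalar and a vector parameter.
params = {
    "cutoff": {"type": "real", "min": 0.0, "max": 12.0, "default": 9.0},
    "forces": {
        "type": "vector",
        "length": 3,
        # for vectors, min/max are read as element-wise bounds
        "min": -1.0,
        "max": 1.0,
        "default": [0.0, 0.0, 0.0],
    },
}

def validate(name, value, spec):
    """Check a value against verification fields conditional on its type."""
    if spec["type"] == "vector":
        return (len(value) == spec["length"]
                and all(spec["min"] <= v <= spec["max"] for v in value))
    return spec["min"] <= value <= spec["max"]
```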

Stochastic collocation (sampling) method from CWI

Wouter just presented a python class he wrote that does stochastic collocation. It essentially generates a list of runs (sets of params), and then performs the final analysis once they have run.

As presented, it appears to slot into EasyVVUQ almost instantly, so we should really look at incorporating it quickly if possible. I'll need to speak to him about it first (this afternoon), but he says the code is already uploaded to the AUC github repo.

Integration with QCG-PJ

I've just created the qcgpj_integration branch with a new test, pce_pj, based on a test of the pce application, which proposes an integration of EasyVVUQ with the QCG-PilotJob Manager. This proposal is external to the current EasyVVUQ code - everything is contained inside the pce_pj directory.

Since I developed it using PyCharm and a Vagrant VM on my personal computer, it could be quite difficult to test the solution in a different environment, but you can certainly look at the code. The idea is to have a PJConfigurator object that can store the execution configuration in one process and load it in another. Currently it allows the encoding and execution phases to run in parallel using PJ. Doing the same for aggregation was more difficult, so it is still performed serially.

I think it would be nice to move some of the code in PJConfigurator into EasyVVUQ but, as you can see, the integration can to a large extent be realised externally. What do you think of this solution?

Extending Campaigns and using multiple Campaigns in a UQP

I was originally thinking of two things:

  1. Need to add more runs to enhance sampling (i.e. more replicas to get converged result)
  2. Might want to use different applications in a UQP that use different encoders

Could also have a situation where I run Campaign A and then Campaign B to sample different parameters but I now want to combine in a single analysis.

This also gets me wondering about scanning the runs dictionary to check for certain parameters and return relevant runs.

Readers return type is unchecked

At present, an easyvvuq reader will return a list (multidimensional, numpy, if desired).

This returned list is used as the direct input to the analysis UQPs, but there is no checking of it whatsoever. If a user configures their reader slightly wrong, they'll most likely get incomprehensible python errors that won't help them fix it.

I'm thinking that instead of a list, we should return an object which the analysis UQP can query (through whatever methods) to see if it is in the form needed. Either we write this ourselves (not hard) or we use e.g. a pandas dataframe. Pandas is a hefty dependency to add, but in the long run it may be worth it for the tons of extra features.

To reduce the noise at the start of each analysis UQP, I was thinking of writing a decorator (in the style of @accepts) to do whatever input validation is needed there, leaving the rest of the UQP code clean.
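Such a decorator could be sketched as follows (the name is borrowed from the @accepts style mentioned above; the required-field mechanism is an assumption, and a real version might check a pandas DataFrame's columns instead of dict keys):

```python
import functools

def accepts(*required_fields):
    """Validate the input object to an analysis UQP before running it,
    so users get a readable error instead of a cryptic numpy traceback."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(data, *args, **kwargs):
            missing = [f for f in required_fields if f not in data]
            if missing:
                raise ValueError(
                    f"{func.__name__}: input is missing required "
                    f"fields {missing}")
            return func(data, *args, **kwargs)
        return wrapper
    return decorator

@accepts("run_id", "result")
def mean_result(data):
    """Toy analysis UQP: mean over the 'result' column of the input."""
    return sum(data["result"]) / len(data["result"])
```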

What do you think?

Reminder to implement parameter physical range checking

The JSON input file has contained "allowed" parameter ranges for some time, but EasyVVUQ doesn't do anything with that yet. The idea is to immediately catch situations in which (input or output) values fall outside physically acceptable ranges (as specified by the user).

Pandas and analysis as bridge between UQPs and (multiple) Campaigns

I was playing with the idea of aggregating the results of Campaigns via UQPs and pandas. I've created a new branch pandas_start which currently only really adds a to_dataframe method to a Campaign and pandas to the installation requirements.

Doing so brought a couple of things to light which we could/should think about:

  1. We need to be able to reference more complicated results (e.g. mean and uncertainty in multiple variables)
  2. We may need to refer back to UQPs and find out which variables they processed.
  3. The horrible hacks in to_dataframe show that we might need to think about how we organize the information in the results/runs dictionary hierarchy (and params_info).
  4. I really think that combining pandas.DataFrames is the right way to go about aggregating Campaigns, and the idea put forward by @raar1 of a CampaignManager might be a good way to do this.

Ideas for the input file

A few ideas that came out of recent discussions I had with @djgroen and @dww100:

  1. The input file should only contain application parameters (i.e. "__wrapper" type variables should be elsewhere). However, this may be addressed by creating a "header" and "params" section in that file.
  2. @djgroen would prefer yaml instead of JSON. @dww100 has expressed a similar preference in the past; therefore, in accordance with the will of the people, we shall use yaml instead.
  3. We may need to think more carefully about how/what we define about the parameters. Currently we specify either a static value, a range, or a number of samples drawn from a normal distribution. However, this may presuppose the use of a certain (class of) UQP. Perhaps instead we would just give an indication of the variable type (int, float, string), and some bounds on its value (e.g. > 0) and then let the UQP choose the manner of varying in accordance with its algorithm.
  4. The "splitter" could/should be a function in a python library. This library should also contain functions for validation of the input JSON (soon-to-be yaml), acting as effective enforcement of the API.
  5. A basic "key-value pair + input script template" wrapper should be provided, which would allow 80%+ of applications to set themselves up to use this toolkit without needing to write a single line of code. It would still be possible (and easy) to write your own wrappers for application cases with more complex needs.
  6. What calls the library? Should parts of it be in FabSim, as a plugin (later, when allowed)? To promote wide usage, it is desirable to keep the toolkit distinct from FabSim (it should work standalone if necessary), but that doesn't mean FabSim can't communicate with it natively, in some sense.

Any other discussion points I may have forgotten, just add them below.
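For concreteness, a minimal sketch of what points 1-3 could look like in yaml (every field name here is an assumption, not an agreed schema; the idea is just a "header" section plus typed, bounded parameters with the varying strategy left to the UQP):

```yaml
header:
  app: gauss
  output_dir: results/
params:
  temperature:
    type: float
    bounds: "> 0"
    default: 300.0
  seed:
    type: int
    default: 42
  label:
    type: string
    default: run
```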

Targets for v0.3

  • Allow extension of existing runs
  • Versioning of outputs
  • Combining of campaigns
  • More complex/realistic encoders/decoders
    • Ideally from outside collaborators
  • Tools to pick up existing datasets

add_default_run() method for campaign class

For the purposes of writing the simplest possible tutorial (e.g. a "helloworld" example) or for a first time test, it would be useful to have an add_default_run() method to Campaign. This would add one run only to the run dict, using only the default values for each param. This avoids needing to go into detail on the sampling stuff before people have even run a single job.
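A sketch of the proposed method (add_default_run is the name suggested above, but the Campaign internals shown here are assumptions, not the real EasyVVUQ class):

```python
class Campaign:
    """Toy stand-in for the Campaign class, holding a params description
    ({name: {"default": ...}, ...}) and a run dict."""
    def __init__(self, params):
        self.params = params
        self.runs = {}

    def add_default_run(self):
        """Add a single run using only the default value of each param,
        so a 'helloworld' tutorial needs no sampling setup at all."""
        run = {name: spec["default"] for name, spec in self.params.items()}
        self.runs[f"Run_{len(self.runs) + 1}"] = run
        return run
```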

Depth option for fixture path usage

The idea is that if we are in Run1/stuff/things/ we need a different path than if we are in Run1/stuff/ when looking for fixtures. The use case is NAMD, where files are looked up relative to the run script rather than the directory where execution takes place.

Run results checking and run extension

Need a way of:

  • Not re-running analysis on some runs using some sort of lock in the UQPs
  • Deleting the lock if we add to the sampling/extend a run
    • This obviously means we need a way of extending the runs

Confusion on UQP ->primitives -> elements

Looking at the submodule organization: I thought we had moved from uqp/analysis/ to primitives/analysis/ etc., but it seems we haven't. Are we now just going to rename uqp to elements and bring collate inside that?

Transition all EasyVVUQ distributions to chaospy distributions

We have agreed that having chaospy as a dependency is a very good idea, but now we need to adapt the existing elements so that they all use the chaospy distributions. At present a weird mixture of the old and "new" distributions is used.

A potential problem here is that certain types that we want don't exist - e.g. histogram?

Support for gradual collation

As part of the scalability push, and also with the PJM work, we must now expect two things:

  • Runs will be executed in batches (e.g. 100 at a time) and not all at once
  • There may be so many runs that it is not viable to store all their output on disk at the same time

This obviously conflicts with the way aggregation is currently done, which expects all run output to be present, and decodes it all into a big dataframe in one go. We therefore need some sort of collation element that can be run on the campaign after each batch, that identifies/decodes/aggregates from those runs which are completed (and in the right order) and updates the output dataframe.

This shouldn't be too hard, but still leaves us with some issues to consider:

  • We want to be able to "restart" the Campaign, which means the collation element must be able to store its present (partially aggregated) state.
  • How often do we do this? When the dataframe is massive (>1e6 jobs) the I/O would seem excessive if we did this every batch of 100 jobs, for example. Maybe we'd want to store only every aggregation of 1000 jobs?
  • If we store by appending to a file, do we want it in binary format to take up less space? (rather than a plain text tsv). Or do we want to store in the database itself using sqlalchemy? How expensive in terms of storage would that be?
  • The whole point of this gradual approach is to relieve the file system from too many output files, so something would have to be deleting those run directories. Would that be this collation element itself? Or do we want the user to explicitly ask for that action in their script?
  • Would some sort of pipeline setup be better, in which a collation element is set (just as an encoder, decoder, or sampling element would be) and automatically gets piped the decoded output as and when jobs terminate? I feel we need something like this for the PJM case to look elegant.
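A minimal sketch of such a collation element (the class name, the CSV-on-disk store, and the run-tracking mechanism are all assumptions; the real design might use a database and a serialisable state as discussed above):

```python
import csv
import os

class Collater:
    """After each batch, decode only the newly completed runs and append
    them to an on-disk aggregate, remembering what was already collated
    so the campaign can be restarted part-way through."""
    def __init__(self, store_path):
        self.store_path = store_path
        self.collated = set()  # run ids already appended to the store

    def collate_batch(self, completed_runs, decode):
        """completed_runs: {run_id: run_output}; decode(output) -> row dict.
        Returns the number of newly collated runs."""
        new = sorted(r for r in completed_runs if r not in self.collated)
        if not new:
            return 0
        write_header = not os.path.exists(self.store_path)
        with open(self.store_path, "a", newline="") as f:
            writer = None
            for run_id in new:
                row = decode(completed_runs[run_id])
                row["run_id"] = run_id
                if writer is None:
                    writer = csv.DictWriter(f, fieldnames=sorted(row))
                    if write_header:
                        writer.writeheader()
                writer.writerow(row)
                self.collated.add(run_id)
        return len(new)
```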

Targets for v0.2

  • Documentation on basic usage
  • CI
  • Tools to pick up existing datasets

Should the campaign object be reimagined as a (sort of) finite state machine?

Currently a sampling element, for example, will be created in the user's python script, and then will dump its generated runs into the campaign object. So in this case, the campaign is just a repository for runs, with extra logging capabilities.

This approach is problematic for cases with a very large number of samples (such as Jalal's case, with > 10^6 runs). In such cases, we only want to generate runs e.g. 100 at a time, so they can be encoded-executed-decoded and the output added to the final dataframe.

Adding 100 jobs at a time to the campaign, and then running those, seems clumsy. Especially since, if we want to stop the script part-way through, we lose the state that the sampling element had reached. So we can't restart.

However, if the campaign object was more a sort of finite state machine, then this might be possible. So, it might look more like:

my_campaign.set_encoder(AppEncoder())
my_campaign.set_decoder(AppDecoder())
my_campaign.set_execution_fn(execute)
my_campaign.set_sampler(PCESampler(order=2, ...))
my_campaign.set_analyser(PCEAnalyser())
while my_campaign.more_runs():
    my_campaign.run(100)
my_campaign.analyse()
my_campaign.set_sampler(OtherSampler(blah))
my_campaign.set_analyser(OtherAnalyser())
my_campaign.run_all()
my_campaign.analyse()

So the Campaign object is now always set to be in a particular state (either sampling or analysing) and it runs the elements within itself, rather than those elements being external objects which act on the Campaign.

If we enforce that every EasyVVUQ element must have a "serialize" function implemented, this makes it easier to store the whole campaign object's state at once. Note that now it would also be storing the states of every element working within it.

It would still also perform logging duties etc.

This shape fits a lot more closely with something the PJM might work with, and could make Vytas' database simpler. I think it also makes the user's python script a lot simpler.

Scalable method of run information storage required

The Campaign as currently designed uses dictionaries and files. In particular, the number of files may prove a pain for applications with millions of runs.

Solving this will involve some refactoring and feature additions:

  • Abstract the run related functions in Campaign (add run etc.)
  • Change state saving to allow run data storage outside the JSON state file
  • Implement a scalable backend - initially SQL via sqlalchemy

Later on we will also want to support HDF5 files (main reason for this is supporting the VECMA Fusion use case)
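A sketch of what the SQL backend could look like (the class and method names are assumptions echoing the interface discussed below; stdlib sqlite3 is used here just to keep the sketch self-contained, whereas the plan above is sqlalchemy):

```python
import json
import sqlite3

class SQLRunStore:
    """Store run parameters as JSON blobs in a single SQL table instead
    of millions of files, so adding/querying runs scales."""
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS runs ("
            "run_id TEXT PRIMARY KEY, params TEXT, completed INTEGER)")

    def add_run(self, run_id, params):
        self.db.execute("INSERT INTO runs VALUES (?, ?, 0)",
                        (run_id, json.dumps(params)))

    def get_run(self, run_id):
        row = self.db.execute(
            "SELECT params FROM runs WHERE run_id = ?",
            (run_id,)).fetchone()
        return json.loads(row[0]) if row else None

    def list_incomplete_runs(self):
        return [r[0] for r in self.db.execute(
            "SELECT run_id FROM runs WHERE completed = 0")]
```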

Separate Campaign database into a separate object?

Okay, so my thought here is that we may want to provide the same database interface to both the Campaign and Worker parts of VVUQ workflows.

The idea being that eventually we can separate concerns such that:

  • Campaign deals with book keeping (logging operations such as sampling etc.) and validation (as such needs to store parameter space description and sampler description, along with access to Database).
  • Database stores application specific information (Encoder, Decoder, execution command), collation information (target directory for decoded data, aggregation method) and individual Run parameters.

This might additionally require a change to the existing database plan (i.e. a collation table needs adding).

As such, the Database interface would need:

Setters

  • app
  • collation_info(campaign_dir, collation_method)
  • add_run

Getters

  • app
  • encoder : convenience - info in app table
  • decoder : convenience - info in app table
  • execution : convenience - info in app table
  • get_run(run_id)
  • list_incomplete_runs

Another part of this redesign would be related to the Samplers and the selection of variables for inclusion in the sampling. We need two lists here: variables which are 'variable', in the sense that the user potentially wants them varied, and variables which are 'varying', meaning currently being assessed through sampling. The former should be stored in the Campaign directly and the latter in the Sampler (which should log its serialised form in the Campaign upon update).

Thoughts?

Support distributions not in Chaospy

The two cases we have identified so far are:

  • Custom distribution from histogram
  • Discrete distribution (uniform_integer) which we need, for example, to make random seeds
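Both cases can be covered without chaospy if necessary. The sketch below (class and function names are assumptions, not chaospy API) samples a histogram distribution by inverse-CDF over its bins, and provides a discrete uniform for seeds:

```python
import bisect
import random

class HistogramDistribution:
    """Custom distribution built from a histogram: pick a bin with
    probability proportional to its count, then sample uniformly in it."""
    def __init__(self, bin_edges, counts):
        total = float(sum(counts))
        self.edges = bin_edges
        # cumulative probability at each bin's right edge
        self.cum = []
        acc = 0.0
        for c in counts:
            acc += c / total
            self.cum.append(acc)

    def sample(self, rng=random):
        u = rng.random()
        i = bisect.bisect_left(self.cum, u)   # choose the bin
        lo, hi = self.edges[i], self.edges[i + 1]
        return lo + (hi - lo) * rng.random()  # uniform within the bin

def uniform_integer(low, high, rng=random):
    """Discrete uniform over [low, high], e.g. for random seeds."""
    return rng.randint(low, high)
```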

Planning

My idea is to set up three project boards with associated milestones, based on 3 point releases, aiming to create a version of EasyVVUQ for dissemination and external user testing. To that end I have created three issues for versions 0.1 (#20), 0.2 (#21) and 0.3 (#22).

The idea being we should agree on the targets for each version and then work towards creating and testing them. We should also aim to keep higher level discussion here (i.e. which testing/CI frameworks to use).

Once we decide on a plan I'll create individual tickets for each goal and then associate them with the projects/milestones as appropriate.

Define the UQP JSON format and input to wrappers

I have adapted the gauss_json wrapper to use one run cut from the multi-run UQP output - there are probably better ways, for example taking the full multi-run JSON + run_name.

We should decide how we want this done early.

Refactoring Code for v0.1

Looks like we now have a set of refactoring tasks which might be best handled prior to v0.1 (such as those arising from #32, #36 & #38). It is worth discussing what we want to do to meet the goal of a mid-month "release", and/or whether a delay to that plan is needed.

Clean implementation of nodes/weights communication between sampling and analysis elements in PCE/Stoch.Coll.

At present, neither the PCE nor the Stoch. Coll. approach has anywhere standard to store the nodes and weights info needed by the analysis step, and the current hack is to effectively monkey patch the campaign object with those vars during execution of the sampling element.

A possible fix is to add a dedicated __weights variable to every individual run dictionary. By default this would be null and unused, but it would provide a storage space for the PCE/SC sampling elements when necessary.

Would this generally work? Do all such collocation-style methods have a single weight var for each job run for each node?
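The proposed per-run storage is simple to sketch (the dict layout and helper name are assumptions, not agreed API):

```python
# Every run dict carries a "__weights" slot, null by default, which the
# PCE/SC sampling element fills instead of monkey patching the campaign.
runs = {
    "Run_1": {"params": {"x": 0.1}, "__weights": None},
    "Run_2": {"params": {"x": 0.9}, "__weights": None},
}

def attach_weights(runs, weights):
    """weights: {run_id: weight} produced by the sampling element."""
    for run_id, weight in weights.items():
        runs[run_id]["__weights"] = weight
```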

Proposal to specify varying parameters directly to sampler rather than in general

At present, we have a list of params (with default values), and then in the script we designate some of them as varying, setting a probability distribution for each. This may actually cause problems when we have to keep turning them on/off (i.e. varying/non-varying) for a more complex sequence of elements. Some elements can only act on specific distributions, and it will be confusing or ill-defined what to do for a given combination of varying params/distributions.

I wonder if we might instead simply tell the campaign object all possible parameters (and default values), but then pass the actual variables and corresponding distributions as arguments to the Sampling elements directly. You would still be able to see what happened with which vars, because that would be logged in the campaign object (as info about that element's application).
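A sketch of the proposal (all class and function names here are hypothetical): the campaign knows every parameter and its default, while the varying subset and its distributions go straight to the sampling element.

```python
import random

params = {"x": 1.0, "y": 2.0, "z": 3.0}  # all params, with default values

def uniform(lo, hi):
    """Return a draw function for a uniform distribution on [lo, hi]."""
    return lambda: random.uniform(lo, hi)

class Sampler:
    """Sampling element that receives the varying variables and their
    distributions directly, rather than reading them from the campaign."""
    def __init__(self, defaults, vary):
        self.defaults = defaults
        self.vary = vary                  # {name: draw function}

    def draw(self):
        run = dict(self.defaults)         # non-varying params keep defaults
        for name, draw_fn in self.vary.items():
            run[name] = draw_fn()         # override only the varying ones
        return run

sampler = Sampler(params, vary={"x": uniform(0.0, 2.0)})
```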

We should have sensible logging for all elements

Eventually this should be standard Python logging - allowing warnings, info etc. - and, longer term, element execution order (now being moved into the campaign database). We can't know how workflows will be implemented, so messages should provide as much context as possible (albeit we cannot rely on having execution level information like node ID, etc.).

Identified issues that need to be dealt with

  • Version of EasyVVUQ and the CampaignDB should be stored and a warning output if they are different
  • Elements should log start and finish where appropriate (a key example here is samplers, which could conceivably need to be one shot).

Visualisation of results

One of the major primitives currently completely absent from this prototype is any kind of visualisation of the results.

I've found this python library which is built on matplotlib:
https://seaborn.pydata.org/

Seems to do nice visualisations of data. Might be worth looking into.

Error when number_of_replicas variable is set to 1 or 0

My script, based on the example script run_workflow.py, raised an error when I set the number of replicas to 1 or 0 (or even when I didn't initialise any replicas variable).

The replica initiation looks like the following lines:

number_of_replicas = 1  # or 0
replicator = uq.elements.sampling.Replicate(my_campaign, replicates=number_of_replicas)
my_campaign.add_runs(replicator)

The (long) error:

Traceback (most recent call last):
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 2670, in agg_series
    return self._aggregate_series_fast(obj, func)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 2690, in _aggregate_series_fast
    result, counts = grouper.get_result()
  File "pandas/_libs/reduction.pyx", line 420, in pandas._libs.reduction.SeriesGrouper.get_result
  File "pandas/_libs/reduction.pyx", line 404, in pandas._libs.reduction.SeriesGrouper.get_result
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 1062, in <lambda>
    f = lambda x: func(x, *args, **kwargs)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/easyvvuq-0.0.1.dev1-py3.6.egg/easyvvuq/elements/analysis/ensemble_boot.py", line 116, in <lambda>
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/easyvvuq-0.0.1.dev1-py3.6.egg/easyvvuq/elements/analysis/ensemble_boot.py", line 92, in bootstrap
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/easyvvuq-0.0.1.dev1-py3.6.egg/easyvvuq/elements/analysis/ensemble_boot.py", line 65, in confidence_interval
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/numpy-1.12.1-py3.6-linux-x86_64.egg/numpy/lib/function_base.py", line 4116, in percentile
    interpolation=interpolation)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/numpy-1.12.1-py3.6-linux-x86_64.egg/numpy/lib/function_base.py", line 3828, in _ureduce
    a = np.asanyarray(a)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/numpy-1.12.1-py3.6-linux-x86_64.egg/numpy/core/numeric.py", line 583, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/series.py", line 767, in __getitem__
    result = self.index.get_value(self, key)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/indexes/base.py", line 3118, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 958, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 964, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_workflow_nbody.py", line 82, in <module>
    results, output_file = ensemble_boot.apply()
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/easyvvuq-0.0.1.dev1-py3.6.egg/easyvvuq/elements/analysis/base.py", line 116, in apply
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/easyvvuq-0.0.1.dev1-py3.6.egg/easyvvuq/elements/analysis/ensemble_boot.py", line 208, in _apply_analysis
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/easyvvuq-0.0.1.dev1-py3.6.egg/easyvvuq/elements/analysis/ensemble_boot.py", line 122, in ensemble_bootstrap
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 4656, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 4087, in aggregate
    result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/base.py", line 490, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/base.py", line 441, in _agg
    result[fname] = func(fname, agg_how)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/base.py", line 424, in _agg_1dim
    return colg.aggregate(how, _level=(_level or 0) + 1)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 3492, in aggregate
    return self._python_agg_general(func_or_funcs, *args, **kwargs)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 1068, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 2672, in agg_series
    return self._aggregate_series_pure_python(obj, func)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 2703, in _aggregate_series_pure_python
    res = func(group)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/groupby/groupby.py", line 1062, in <lambda>
    f = lambda x: func(x, *args, **kwargs)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/easyvvuq-0.0.1.dev1-py3.6.egg/easyvvuq/elements/analysis/ensemble_boot.py", line 116, in <lambda>
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/easyvvuq-0.0.1.dev1-py3.6.egg/easyvvuq/elements/analysis/ensemble_boot.py", line 92, in bootstrap
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/easyvvuq-0.0.1.dev1-py3.6.egg/easyvvuq/elements/analysis/ensemble_boot.py", line 65, in confidence_interval
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/numpy-1.12.1-py3.6-linux-x86_64.egg/numpy/lib/function_base.py", line 4116, in percentile
    interpolation=interpolation)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/numpy-1.12.1-py3.6-linux-x86_64.egg/numpy/lib/function_base.py", line 3828, in _ureduce
    a = np.asanyarray(a)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/numpy-1.12.1-py3.6-linux-x86_64.egg/numpy/core/numeric.py", line 583, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/series.py", line 767, in __getitem__
    result = self.index.get_value(self, key)
  File "/home_nfs_robin_ib/bmonniern/VECMA/install_EasyVVUQ_intel_19.1.144/lib/python3.6/site-packages/pandas-0.23.4-py3.6-linux-x86_64.egg/pandas/core/indexes/base.py", line 3118, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 958, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 964, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

Is it a 'real' error, or does having no replicas make no sense? (Or something else?)

Prototype VVP

The more I look at the library structure so far, the more it seems clear that the UQP/VVP distinction is artificial - helpful for the proposal writing, but not necessarily relevant for implementation.

For example, the uqp/ directory contains two sub dirs - sampling and analysis. This would imply that any VVPs we implement would now go into a new top level dir called vvp/. However, I think a lot of validation primitives would naturally fit into uqp/analysis, or at least have a massive overlap with that.

As I see it, there are four categories:
sampling/ analysis/ comparison/ vis/

Do we insist on the UQP/VVP distinction?

uqp/sampling
uqp/analysis
vvp/comparison
vvp/vis

Should we do away with this split and have each category sit on the top level? Does comparison belong in its own dir at all?

Allow generic encoder to use a symbol other than $

The marker character used in the text substitution is $ by default, but we need to be able to set it to something else too. We're currently making EasyVVUQ work with lammps, and the dollar signs in the input script cause problems for the generic encoder.
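One possible implementation sketch uses Python's string.Template, which supports overriding the delimiter via subclassing (whether the generic encoder is actually built on string.Template is an assumption; the helper name is hypothetical):

```python
from string import Template

class AtTemplate(Template):
    # use @ instead of $, so lammps inputs with literal $ are left alone;
    # the delimiter must be set in the class body because string.Template
    # compiles its matching pattern at class-creation time
    delimiter = "@"

def encode(template_text, params):
    """Substitute @name / @{name} markers, leaving $ untouched."""
    return AtTemplate(template_text).safe_substitute(params)
```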

Programmatic way to specify parameter space

During the hackathon it became apparent that the current way of specifying parameter space (by providing a params dict in the input json file) is problematic for approaches such as @jlakhlili's PCE case, in which a potentially large number of varying parameters may be needed. This can happen e.g. when you need a large vector of parameters.

The most intuitive/logical approach in such cases would be to programmatically construct the params dictionary during the user script.

I propose that we add functionality to Campaign to allow the params dictionary to be built in-script.
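For instance, a vector of 100 coefficients could be generated in a loop rather than written out by hand. The helper name and the exact entry schema ("type"/"default"/"min"/"max") below are assumptions for illustration, not a fixed EasyVVUQ format.

```python
# Hypothetical sketch: build the params dict programmatically in the user
# script instead of hand-writing it in a JSON file.
def build_vector_params(name, size, default=0.0, lo=-1.0, hi=1.0):
    """One entry per vector component, e.g. c_0 ... c_(size-1)."""
    return {
        "{}_{}".format(name, i): {
            "type": "float", "default": default, "min": lo, "max": hi,
        }
        for i in range(size)
    }

params = {"timestep": {"type": "float", "default": 0.01}}
params.update(build_vector_params("c", 100))
print(len(params))  # 101 entries: timestep plus c_0 ... c_99
```

The resulting dict could then be handed to the Campaign in place of the JSON-file params block.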

Elements that act on campaigns, but don't have to

We need to better formalize the dual application of collation and analysis elements to campaigns and non-campaigns. By this I mean that certain elements can, in principle, only be applied to a Campaign object (or the current algorithm only works with one), whereas others (primarily analysis elements) can take a Campaign object, a pandas DataFrame, a CSV file, etc. as input.

When a campaign is provided, the element application can be logged with that campaign; otherwise we could log it elsewhere (or not at all). The point is that the code is getting messy with all these possibilities, and we need a more elegant way to abstract out the data source and the logging around it.
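One way to contain the mess is a single normalising helper that every analysis element calls, so the element body only ever sees one data shape. This is a stdlib-only sketch; `get_collation_result` is an assumed method name, and real code would likely normalise to a pandas DataFrame instead of row dicts.

```python
import csv

# Hypothetical sketch: accept a campaign-like object, already-loaded records,
# or a CSV path, and normalise all three to a list of row dicts.
def as_records(source):
    if isinstance(source, list):                 # already collated records
        return source
    if isinstance(source, str):                  # treat strings as CSV paths
        with open(source, newline="") as f:
            return list(csv.DictReader(f))
    if hasattr(source, "get_collation_result"):  # campaign-like object
        return source.get_collation_result()
    raise TypeError("Unsupported data source: " + type(source).__name__)
```

Logging could then hang off the same dispatch: if the source was a campaign, log there; otherwise skip or log locally.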

Check that python version is > 3.6

EasyVVUQ doesn't work on anything below Python 3.6 (mainly because of f-strings and similar features).
The resulting errors are incomprehensible, though, and no user would understand why it failed.
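A minimal guard run at package import time would fail with a clear message instead of an opaque SyntaxError. Note that the guard itself must avoid f-strings, otherwise it would already be a SyntaxError on the very interpreters it is meant to diagnose. The function name is illustrative.

```python
import sys

# Sketch of an early Python-version guard for the package's __init__.py.
def require_python(minimum=(3, 6)):
    if sys.version_info < minimum:
        raise RuntimeError(
            "EasyVVUQ requires Python {}.{} or later (found {}.{})".format(
                minimum[0], minimum[1],
                sys.version_info.major, sys.version_info.minor))

require_python()  # clear error on old interpreters, no-op otherwise
```

The same floor should also be declared via `python_requires` in setup.py so pip refuses to install on old interpreters in the first place.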

Priority inclusions and checks for a v0.1 release

  1. Fixtures/file associations: need a way to deal with files that are common to multiple runs (i.e. #11) and with large files.
  2. A very simple VVP - to allow testing only
  3. Some tests
  4. Categorical variables
  5. Decide on where aggregation code should be (currently an analysis UQP)
  6. Fix issues:

Allow no encoder to be specified (run var info passed to user defined execution function)

In @jlakhlili's case we currently generate a small input file for each run. This file is immediately read by the Fortran code and contains information that could instead have been passed via command-line arguments. For 1e6 runs, this is an unnecessary overhead.

I propose that the user-defined execution function also be passed the same information as the encoder function (essentially, the dictionary containing all the parameter values needed for input creation). The execution command could then be generated directly from this dictionary, passing the values to the simulation code as command-line arguments.

In such cases you probably wouldn't want to run any encoder at all. We could have something along the lines of: my_campaign.set_encoder(None)

I think this could avoid the creation of millions of tiny input files when they are not actually needed by the simulation code.
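A sketch of what such a user-supplied execution function might look like under this proposal; the function name, the binary path, and the `--key=value` flag style are all illustrative assumptions, not an existing EasyVVUQ interface.

```python
# Hypothetical sketch: with set_encoder(None), the run's parameter dict is
# handed straight to the user's execution function, which builds a command
# line instead of writing a per-run input file.
def make_command(run_info, binary="./my_sim"):
    args = " ".join(
        "--{}={}".format(key, value)
        for key, value in sorted(run_info.items()))
    return "{} {}".format(binary, args)

cmd = make_command({"cutoff": 2.5, "steps": 1000})
print(cmd)  # ./my_sim --cutoff=2.5 --steps=1000
```

For 1e6 runs this replaces a million tiny input files with a million command strings, which the execution layer can stream to the scheduler directly.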

Access instance dictionary in _log_analysis

To save the application log after running element-specific analysis, the _log_analysis routine (called by the apply routine) accesses the instance dictionary via __dict__. This restricts the types that attributes of BaseAnalysisElement subclasses may have.
I have cases in PCEAnalysis where I want to store descriptive statistics in dictionaries as new attributes, but each value must then be a list, a tuple, or another dictionary. I can't use an ndarray, for example, otherwise I get the following error:

AttributeError: 'numpy.ndarray' object has no attribute __dict__

Can we handle this with some conversion before the application log is written? I also suggest using pickle, for example; manipulating __dict__ directly is not recommended.
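One possible conversion step before logging, sketched with the stdlib only: serialise the state with a `json.dumps` fallback that turns array-like objects (anything with a `.tolist()` method, such as a numpy ndarray) into plain lists. The helper name is illustrative, and pickle remains the simpler alternative when the log need not be human-readable.

```python
import json

# Hypothetical sketch: a json.dumps `default` hook that converts array-like
# attributes to plain lists so the analysis log can store them.
def to_loggable(obj):
    if hasattr(obj, "tolist"):   # numpy ndarrays expose .tolist()
        return obj.tolist()
    raise TypeError("Cannot serialise " + type(obj).__name__)

state = {"moments": [1.0, 2.0], "label": "pce"}
print(json.dumps(state, default=to_loggable))
```

With this in place, subclasses could keep ndarray attributes and the log writer would flatten them transparently instead of choking on `__dict__`.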

Campaign object should have __version__ and check against database

The Campaign object should have something along the lines of get_version() (or a __version__ member variable). This could simply be the current version of EasyVVUQ, and would also be stored in the database to indicate which version created it. That way, if a user attempts to continue a campaign using a newer version of EasyVVUQ, it would (at the very least) issue a warning along the lines of "WARNING: Loaded database was created with an earlier version (0.1.1) of EasyVVUQ than the current version (0.1.4)", or words to that effect.

License - did we pick one?

If yes - let's add it now, if no lets pick one and then add it now.

After that I think we can make the repo public. Do you agree?
