
Comments (18)

dww100 commented on June 18, 2024

In principle I like this idea a lot.

What does the new database look like?

from easyvvuq.

raar1 commented on June 18, 2024

My idea is that the database will have a new, single slot for storing the element that is currently being applied.

So when you call my_campaign.set_element(PCESampler()), Campaign.current_element is set to the newly created PCESampler instance. And since all elements will be forced to implement a .serialize() function (or equivalent), whenever the campaign saves its state to the database, the exact current state of the element is saved as well.

This means Campaign databases can act as effective "restart" files for the VVUQ workflow. If the sampler only got through 53 of a total of 2700 runs, it can be reinitialised in whatever manner allows it to continue from that point. I expect Analysis elements to have their states similarly stored and loaded from the database.

Campaign objects only have one element slot because only one element can be active at any one time.
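To make the single-slot idea concrete, here is a hedged sketch; CampaignSketch, set_element and save_state are hypothetical names, not the real EasyVVUQ API:

```python
# Hypothetical sketch of a campaign with a single element slot.
# Saving the campaign state also serializes the active element,
# so the saved state can act as a restart file.
import json

class CampaignSketch:
    def __init__(self):
        self.current_element = None  # only one element active at a time

    def set_element(self, element):
        self.current_element = element

    def save_state(self):
        # Capture the exact state of whatever element is active.
        element_state = (self.current_element.serialize()
                         if self.current_element else None)
        return json.dumps({"element_state": element_state})
```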


raar1 commented on June 18, 2024

Based on discussions with @dww100 (and on the work of @orbitfold with the DB backend) I think the base element definition will now have three extra functions:

serialize(), deserialize() and is_restartable()

The first two are to be used to store the (current) state of the element in the database, and to restart from that stored state if needed. is_restartable() will return a bool to indicate whether it is possible to restart such an element.

It is my expectation that completely stochastic sampling elements should be able to save their state easily (for example, simply how many more draws from the distribution they have left). Similarly, Stochastic Collocation and PCE both seem to generate their nodes/weights in a pre-determined sequence, so such a sampling element should be able to store what iteration it reached, and restart from there.
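For example (a hedged sketch only; the class and method names are illustrative, not the actual element interface), a random sampler whose entire restart state is a draw count:

```python
# Illustrative restartable sampler: its whole state is the total number
# of draws and how many have been completed so far.
import json

class RandomSamplerSketch:
    def __init__(self, num, completed=0):
        self.num = num
        self.completed = completed

    def is_restartable(self):
        return True

    def sample(self):
        # Yield one run specification per remaining draw.
        while self.completed < self.num:
            self.completed += 1
            yield {"run_id": self.completed}

    def serialize(self):
        return json.dumps({"num": self.num, "completed": self.completed})

    @classmethod
    def deserialize(cls, state):
        d = json.loads(state)
        return cls(num=d["num"], completed=d["completed"])
```

A sampler that got through 53 of 2700 runs would deserialize with completed=53 and simply continue from run 54.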

Most analysis elements, by their very nature, will likely not be restartable. If the HPC job fails half way through an analysis element, the campaign will simply have to start over for the analysis step.


dww100 commented on June 18, 2024

Should is_restartable be a function or an attribute?


raar1 commented on June 18, 2024

I was hoping to put it in the element base.py as e.g.

def is_restartable(self):
    raise NotImplementedError

so anyone making a new element is forced to make it return something. I'm not sure how to enforce that if it's just a variable with (presumably) a default value.


dww100 commented on June 18, 2024

My feeling is that by default it should just be is_restartable = False, the assumption being that if you are implementing a one-shot sampler, you know what you are doing.
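For comparison, the attribute-based variant could look like this (a sketch assuming a shared base class; all names are illustrative):

```python
# Sketch of the attribute approach: a safe class-level default that
# subclasses override only when they genuinely support restarting.
class BaseElement:
    is_restartable = False  # conservative default

class OneShotSampler(BaseElement):
    pass  # inherits the default: not restartable

class ResumableSampler(BaseElement):
    is_restartable = True  # explicit opt-in
```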


raar1 commented on June 18, 2024

OK, that's fine by me. But either way we should probably make it consistent with the way versions are handled for elements currently.


dww100 commented on June 18, 2024

@djgroen You should have a quick review of this to see that it makes sense to you.


dww100 commented on June 18, 2024

@raar1 I think we should write up something that looks like a design document based on this. The one issue I think I would like clarified is the execute step. My understanding is that this would not be necessary - i.e. I could:

  1. Sample and Encode a load of runs, save Campaign state, finish script.
  2. Run however I want (i.e. hand rolled batch script, FabSim, small Dask script, QCG, whatever).
  3. Recreate Campaign from saved state in new script, run analysis.
  4. Celebrate great victory.


raar1 commented on June 18, 2024

I agree with drafting the design document. Here's a first stab (very preliminary) at some pseudocode that kind of fits the general idea:

# User defined execution function. Has to accept run_dir (the campaign
# passes it in) but doesn't have to use it.
import os

def user_def_exec_fn(run_dir):
    os.system("cd " + run_dir + " && simulation_code")

# Set encoder, decoder, aggregator and user defined execution function for this campaign
my_campaign.set_encoder(EncoderGeneric(delimiter="#"))
my_campaign.set_decoder(DecoderCSV(output_filename=output_filename, output_columns=output_columns))
my_campaign.set_aggregator(Collate())
my_campaign.set_execution(ExecuteLocal(user_def_exec_fn))

# Set the campaign to use a sampling element
sampler = uq.elements.sampling.RandomSampler(num=number_of_samples)
my_campaign.set_element(sampler)

# Run batches of 100 jobs.
while my_campaign.has_runs_remaining():
    my_campaign.run(100)
    my_campaign.aggregate()

# Analysis
...

Note that the execution function is completely user defined and can contain anything. The campaign merely passes relevant info to that function (info the user wouldn't otherwise know, such as where that specific run directory has been placed), but the user doesn't have to use any of it. Then if you ask the campaign to run 100 jobs, it will automatically sandwich that call between the encoding and decoding steps, and clean up files afterwards.

If you prefer, we could do this in a more fine-grained way, in which the user has to explicitly call the encoding and decoding steps, but note that if they call the decoder themselves then we can't have it (optionally) automatically remove files upon completion because we don't know at what point they choose to call the decoder.

I suppose I really want to avoid the "encode everything -> execute everything -> decode everything -> aggregate everything" approach we currently have, since it just isn't scalable. I was hoping to enforce (somehow) that a "run" would always encode-execute-decode in one inseparable triplet. Running the decoder as part of the run means we immediately get a distillation of what we want as soon as the execution step terminates, so we can optionally delete the rest of the output files right away (assuming no errors occurred) in cases where space is tight.
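The inseparable triplet could be sketched roughly like this (a toy stand-in with invented names, not a proposed implementation):

```python
# Toy sketch of one run as an encode-execute-decode unit. Because the
# decode happens immediately, raw output can be deleted straight away.
import os
import shutil
import tempfile

def run_one(run_id, params, exec_fn, cleanup=True):
    run_dir = tempfile.mkdtemp(prefix="run_{}_".format(run_id))
    # Encode: write this run's input file into its directory.
    with open(os.path.join(run_dir, "input.txt"), "w") as f:
        f.write(repr(params))
    # Execute: the user-defined function receives the run directory.
    exec_fn(run_dir)
    # Decode: distil the quantity of interest from the raw output.
    with open(os.path.join(run_dir, "output.txt")) as f:
        result = float(f.read())
    # Optionally remove the raw files now that the result is extracted.
    if cleanup:
        shutil.rmtree(run_dir)
    return result
```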

We also need to think about how/when the aggregator is called, as it will need to work in a gradual fashion.

Any thoughts? Can we stick to pseudocode for now, just because I find it easier to evaluate what design choices will mean when I see what the final script might look like.


dww100 commented on June 18, 2024

So the encode and decode happen as part of campaign.run()?

Either way I'd rather not have a user create an essentially dummy function - maybe have one as the default execution?


raar1 commented on June 18, 2024

I was hoping to have encode and decode essentially happen as part of run(), yes. But I'm not sure how best to do this, especially in the PJM case where we want to make the encoding/decoding happen "in parallel". This would require encode, for example, to be some sort of standalone script? How do we pass run info to it, without effectively doing a ton of file coupling?

I'm not too sure what you mean by a dummy function - the user needs to specify what should happen for execution somehow, no? So I would always expect that function to contain some kind of code? Or have I misunderstood your point?


raar1 commented on June 18, 2024

Perhaps for the standalone encode problem, we could go back to having an if __name__ == "__main__": block that runs using command-line arguments when it detects the encoder is being called as a separate script. This would work with the PJM case. But we'd need to think about how we're telling it what it should "encode" - do we pass the file name of a dict/JSON? Or do we pass the name of the campaign, and have it look in the database? None of these seem elegant...
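One possible shape for the JSON-file option (the file name convention and the toy encoding are assumptions, sketched only to make the idea concrete):

```python
# encoder.py (hypothetical): usable as a library function, or invoked
# standalone by a PJM task with the path of a JSON parameter file.
import json
import sys

def encode(params):
    # Toy encoding: render parameters as a key = value input file.
    return "\n".join("{} = {}".format(k, v) for k, v in params.items())

if __name__ == "__main__":
    # e.g.  python encoder.py run_params.json > input.file
    with open(sys.argv[1]) as f:
        print(encode(json.load(f)))
```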


djgroen commented on June 18, 2024

Hi guys,

Just my two cents, but I think "run" can be a bit ambiguous. It could refer to just executing a job, or to running the whole application. Perhaps it's better to have a separate "execute" function which only does the execution step, and a "do_campaign" function that does encode, execute, decode and aggregation?

Feel free to bully or ignore me if this is off-target, but I thought it could be worthwhile to at least articulate this thought somewhere ;).


bartoszbosak commented on June 18, 2024

@raar1 I think that if we look at EasyVVUQ from a slightly different perspective, and assume that the campaign can easily be recovered from the DB and used to execute/run each of the steps (e.g. encoding, execution, decoding, ...) by "worker" processes independently, the actual processing can be moved to a Pilot Job or FabSim3. It would just require defining a common interface for the standalone processes, but it seems feasible.

I also vote for @djgroen's proposition to have "do_campaign" rather than "run".


raar1 commented on June 18, 2024

@bartoszbosak OK I think we're all essentially arguing for the same thing, but what form the interface should take is still ambiguous. Let's say we have N worker processes, what should happen? I can see at least two different classes of approach:

  1. Each (individual) process loads the campaign DB and then starts running jobs (encoding, execution, decoding), but how does it know which runs are assigned to it? This could be predetermined in the single-threaded region of the workflow, or decided dynamically. In the latter case the database would have to be capable of parallel writes to stay up-to-date with which jobs have been "claimed".

  2. There is a script running in a single thread that is farming out one run at a time to whichever worker processes are free (obviously the PJM handles this). I believe this is essentially what the example pseudocode presented by PSNC suggested (right?). In such a case, we would have to specify a user-defined execute() function, that would simply submit a job to the PJM to run a small script that itself does encoding, simulation execution and decoding within it. This script could be made very concise with a do_campaign() function as @djgroen suggests, although I find that name misleading too and would prefer a different one (I think aggregation will need to be done in the single threaded region).

I suppose we will, at least at this stage, opt for the type of approach detailed in 2? That's certainly fine with me, but I don't necessarily want to rule out other execution patterns, seeing as this is an HPC project. Also, I certainly do want to push this processing onto the middleware (QCG, FabSim, whatever) and that has always been the goal, but it needs to be done in a controlled and formalized way. I find the approach in 2 to be still quite hacky, and would prefer there to be some kind of formal execution/middleware plugin class that can have different implementations depending on the middleware being used.
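To illustrate what the dynamic claiming in approach 1 would require of the database, here is a hedged SQLite sketch (the table and column names are invented; the real campaign DB schema may differ):

```python
# Sketch of atomic run claiming: each worker takes the write lock,
# picks one pending run, and marks it claimed so no other worker can
# take it. SQLite here is only a stand-in for the campaign database.
import sqlite3

def claim_next_run(db_path, worker_id):
    # Autocommit mode so we control the transaction explicitly.
    con = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    try:
        con.execute("BEGIN IMMEDIATE")  # take the write lock up front
        row = con.execute(
            "SELECT id FROM runs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            con.rollback()
            return None  # nothing left to claim
        con.execute(
            "UPDATE runs SET status = 'claimed', worker = ? WHERE id = ?",
            (worker_id, row[0]),
        )
        con.commit()
        return row[0]
    finally:
        con.close()
```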


bartoszbosak commented on June 18, 2024

Hi @raar1. I think that our idea is a mix of your two ideas ;-) Let's give our idea the number 3:

  1. As in your proposition 2, there is one master script, written with the help of the preferred tool (e.g. QCG-PJ or FabSim), that manages the whole experiment. In the case of QCG-PJ it is started as the first Pilot Job task. It initialises the campaign object and then submits a number of subsequent tasks (processes) that are responsible for encoding, execution, decoding etc. I can imagine that, depending on particular needs, the processes can have different granularity (e.g. encoding, execution & decoding can be joined or processed independently; an execution task can take many runs or just a single run, and so on). The master script can also do some other processing after the parallel execution step. It is probably doable to keep information about how the elementary tasks are performed at the level of the execution middleware, but given the DB, it seems more natural to leave this job to EasyVVUQ. In that case each task should be allowed to read the campaign object from the DB and to write it (or at least information about the particular run) back to the DB. This seems to be what you proposed in version 1. Whether that is possible is a question for @orbitfold, but I hope it is.

I think this somewhat relaxed interface wouldn't complicate the usage of the tool for its users (the high-level methods can still be available), but would allow the execution of complex and resource-demanding workflows to be optimised at the level of the particular execution middleware, which seems to be the right place for this job.
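A middleware-agnostic stand-in for that master/worker pattern (concurrent.futures is used here purely as a placeholder for QCG-PJ or FabSim task submission; all names are illustrative):

```python
# Sketch: one single-threaded master farms out per-run tasks, then
# continues in the single-threaded region once all tasks finish.
from concurrent.futures import ThreadPoolExecutor

def process_run(run_id):
    # Stand-in for one encode-execute-decode task on a worker.
    return run_id * run_id

def master(run_ids, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(process_run, run_ids))
    # Aggregation over `results` would happen back here in the master.
    return results
```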


raar1 commented on June 18, 2024

The PJM integration discussion has now moved to issue #73.

