llnl / ams

License: Apache License 2.0

CMake 8.20% Dockerfile 1.14% Python 33.89% Shell 3.41% C++ 53.36%

machine-learning hpc

ams's Introduction

Autonomous MultiScale Library (AMS)

A library (under construction) to simplify machine learning surrogate model integration in HPC codes.

Getting Involved

AMS is an open-source project, and we welcome contributions from the community.

Contributions

We welcome all kinds of contributions: new features, bug fixes, documentation edits; it's all great!

To contribute, make a pull request, with develop as the destination branch.

Authors

Thanks to all AMS contributors.

AMS was created under the AMS LDRD-SI project (22-SI-004).

Citation

If you use this software, please cite it as below:

Bhatia, Harsh, Patki, Tapasya A., Brink, Stephanie, Pottier, Loïc, Stitt, Thomas M., Parasyris, Konstantinos, Milroy, Daniel J., Laney, Daniel E., Blake, Robert C., Yeom, Jae-Seung, Bremer, Peer-Timo, and Doutriaux, Charles. Autonomous MultiScale Library. Computer Software. https://github.com/LLNL/AMS. US DOE National Nuclear Security Administration (NNSA). 01 May. 2023. Web. doi:10.11578/dc.20230721.1.

or get the format you prefer from here

Release

AMSLib is released under the Apache License (Version 2.0) with LLVM exceptions. For more details, please see the LICENSE file.

LLNL-CODE-851455

Installation Instructions

See INSTALL.md for full instructions.

ams's People

Contributors

Stargazers

Watchers

Forkers

milroy alizalisan tpatki r-yin ggeorgakoudis jaeseungyeom slabasan

ams's Issues

Allow surrogate-model inference using single precision.

We currently always expect the model to use the same precision as the physics code. This can limit the performance gains of the model. We should allow casting the inputs to single precision and then casting the outputs back to double.
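The cast-in, cast-out idea can be sketched as follows. This is a minimal illustration, not AMS code: FloatModel and inferInSinglePrecision are hypothetical names, and the model is represented by a plain callable rather than an actual surrogate.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for a single-precision surrogate model:
// it maps a float input buffer to a float output buffer.
using FloatModel = std::function<std::vector<float>(const std::vector<float>&)>;

// Run a float model on double-precision data: narrow the inputs,
// evaluate, then widen the outputs back to double.
std::vector<double> inferInSinglePrecision(const FloatModel& model,
                                           const std::vector<double>& inputs)
{
  // Cast double -> float (this may lose precision; that is the trade-off).
  std::vector<float> narrowed(inputs.begin(), inputs.end());

  std::vector<float> result = model(narrowed);

  // Cast float -> double so the physics code sees its native type.
  return std::vector<double>(result.begin(), result.end());
}
```

The two extra copies cost memory bandwidth, so whether this pays off depends on how much faster the model runs in single precision.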

AMS workflow python package installation warning

Running make install prints this:

Installing collected packages: argparse, ams-wf                                                                                          
  Attempting uninstall: ams-wf            
    Found existing installation: ams-wf 1.0
    Uninstalling ams-wf-1.0:
      Successfully uninstalled ams-wf-1.0
  DEPRECATION: ams-wf is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559

Besides the warning, I also observed unnecessary re-installs of the package.

The miniapp ams-example segfaults with 10 elements

Running as

LIBAMS_VERBOSITY_LEVEL=-1 /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/examples/ams_example --precision single --uqtype deltauq-mean -db ./db -S /p/vast1/ggeorgak/projects/ams/AMS/tests/tuple-single.torchscript -e 10

Segfaults with output:

--> cycle: 1
 material 0: using sparse packing for -5 elems
[ Workflow ] Entering Evaluate with problem dimensions [(-320, 2, -320, 4)]
[ Workflow ] Memory usage at Start is VM:1.37981e+06 RS:141504
[ ResourceManager ] Requesting to allocate -320 values using allocator :mmp-host-quickpool
terminate called after throwing an instance of 'umpire::util::Exception'
  what():  ! Umpire Exception [/var/tmp/blake14/spack-stage/spack-stage-umpire-2022.03.1-zzq5wck6qqis42cbreyywthqbw2vcb2j/spack-src/src/umpire/alloc/MallocAllocator.hpp:43]:  allocate malloc( bytes = 18446744073709551312 ) failed
    Backtrace: 10 frames
    0 0x1555550c6a07 No dladdr: /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/src/libAMS.so(_ZN6umpire5alloc15MallocAllocator8allocateEm+0x527) [0x1555550c6a07]
    1 0x1555550c744e No dladdr: /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/src/libAMS.so(_ZN6umpire8resource21DefaultMemoryResourceINS_5alloc15MallocAllocatorEE8allocateEm+0x4e) [0x1555550c744e]
    2 0x1555550ad940 No dladdr: /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/src/libAMS.so(_ZN6umpire8strategy6mixins17AlignedAllocation16aligned_allocateEm+0x20) [0x1555550ad940]
    3 0x15555505f036 No dladdr: /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/src/libAMS.so(+0x3b036) [0x15555505f036]
    4 0x40cd44 No dladdr: /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/examples/ams_example(_ZN6umpire9Allocator8allocateEm+0x304) [0x40cd44]
    5 0x15555506c355 No dladdr: /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/src/libAMS.so(_ZN3ams15ResourceManager8allocateIbEEPT_m15AMSResourceType+0x65) [0x15555506c355]
    6 0x15555507cd2d No dladdr: /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/src/libAMS.so(_ZN3ams11AMSWorkflowIfE8evaluateEPviPPKfPPfiii+0x1cd) [0x15555507cd2d]
    7 0x407976 No dladdr: /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/examples/ams_example() [0x407976]
    8 0x15554ac1dd85 No dladdr: /lib64/libc.so.6(__libc_start_main+0xe5) [0x15554ac1dd85]
    9 0x4097ee No dladdr: /p/vast1/ggeorgak/projects/ams/AMS/build_ruby/examples/ams_example() [0x4097ee]

Debugger (lldb):

(lldb) 
frame #9: 0x000015555507cd2d libAMS.so`ams::AMSWorkflow<float>::evaluate(this=0x000000000227fcd0, probDescr=0x000000000226fd00, totalElements=-320, inputs=<unavailable>, outputs=<unavailable>, inputDim=2, outputDim=4, Comm=1140850688) at workflow.hpp:333:65
   330        return;
   331      }
   332      // The predicate with which we will split the data on a later step
-> 333      bool *p_ml_acceptable = ams::ResourceManager::allocate<bool>(totalElements);
   334 
   335      // -------------------------------------------------------------
   336      // STEP 1: call the hdcache to look at input uncertainties
(lldb) p totalElements
(const int) $0 = -320
(lldb) up
frame #10: 0x0000000000407976 ams_example`main(argc=<unavailable>, argv=<unavailable>) at main.cpp:636:32
   633 
   634  #ifdef USE_AMS
   635  #ifdef __ENABLE_MPI__
-> 636            AMSDistributedExecute(workflow[mat_idx],
   637                                  MPI_COMM_WORLD,
   638                                  static_cast<void *>(eoses[mat_idx]),
   639                                  num_elems_for_mat * num_qpts,
(lldb) 
frame #11: 0x000015554ac1dd85 libc.so.6`__libc_start_main + 229
libc.so.6`__libc_start_main:
->  0x15554ac1dd85 <+229>: movl   %eax, %edi
    0x15554ac1dd87 <+231>: callq  0x15554ac34380            ; exit
    0x15554ac1dd8c <+236>: movq   0x8(%rsp), %rax
    0x15554ac1dd91 <+241>: leaq   0x14b07d(%rip), %rdi
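The debugger output shows a negative totalElements reaching an allocation call, and the huge byte count in the malloc failure is consistent with a negative int wrapping around when converted to an unsigned size. A minimal sketch of a guard that fails early with a readable message is shown below. The helper name checkedElementCount is hypothetical, and this does not address why the mini-app computes a negative element count in the first place.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Validate a signed element count before it is converted to an unsigned
// allocation size. Converting a negative int to std::size_t wraps around
// to a very large value instead of reporting an error.
std::size_t checkedElementCount(int totalElements)
{
  if (totalElements < 0) {
    throw std::invalid_argument("negative element count: " +
                                std::to_string(totalElements));
  }
  return static_cast<std::size_t>(totalElements);
}
```

Calling such a check before the allocation would turn the allocator exception into an error that names the offending value.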

Simplify cmake file for tests

Currently the CMake file builds identical test binaries multiple times. We should update the build to avoid this duplication and speed it up.

Provide a common interface for UQ + Surrogate across all UQ methods

We need an interface to support different types of UQ + Surrogates. We currently have 3 UQ methods, with their code intertwined. We need a single entry point. This is necessary for us to cleanly support different precisions in the surrogate implementations.
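One possible shape for such a single entry point is an abstract base class that every UQ method implements. The sketch below is an assumption for illustration only: UQBackend, ThresholdUQ, and countAcceptable are hypothetical names, and the toy threshold rule is not one of the AMS UQ methods.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical common entry point: every UQ method implements the same
// call and returns one accept/reject flag per input element.
class UQBackend {
public:
  virtual ~UQBackend() = default;
  virtual std::string name() const = 0;
  // Element i of the result is true when the surrogate may be trusted
  // for input element i.
  virtual std::vector<bool> evaluate(const std::vector<double>& inputs) const = 0;
};

// Toy backend used only to exercise the interface: it accepts inputs
// whose magnitude is below a threshold.
class ThresholdUQ : public UQBackend {
public:
  explicit ThresholdUQ(double threshold) : threshold_(threshold) {}
  std::string name() const override { return "threshold"; }
  std::vector<bool> evaluate(const std::vector<double>& inputs) const override
  {
    std::vector<bool> acceptable;
    acceptable.reserve(inputs.size());
    for (double x : inputs) {
      acceptable.push_back(x < threshold_ && x > -threshold_);
    }
    return acceptable;
  }

private:
  double threshold_;
};

// Callers depend only on the interface, so adding another UQ method
// does not change this code path.
std::size_t countAcceptable(const UQBackend& uq, const std::vector<double>& inputs)
{
  std::size_t n = 0;
  for (bool ok : uq.evaluate(inputs)) n += ok ? 1 : 0;
  return n;
}
```

Supporting several precisions could then be handled behind the interface, for example by templating the backends on the value type, without touching the callers.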

Proper error handling in RMQ DB.

Currently, when an error occurs in the AMS library's RMQ database, execution tends to fail completely. We need to continue execution reliably and re-establish the connection.

This should be done after #30 is merged.
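A retry loop with reconnection is one way to approach this. The sketch below assumes the RMQ operations can be wrapped in two callables that throw on failure. publishWithRetry and both callables are hypothetical placeholders, not the actual AMS or RabbitMQ client API.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Try `publish`; on failure wait, call `reconnect`, and retry, up to
// maxAttempts times. Returns true on success, false if every attempt failed.
bool publishWithRetry(const std::function<void()>& publish,
                      const std::function<void()>& reconnect,
                      int maxAttempts,
                      std::chrono::milliseconds initialDelay)
{
  std::chrono::milliseconds delay = initialDelay;
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    try {
      publish();
      return true;
    } catch (const std::exception&) {
      if (attempt == maxAttempts) break;
      std::this_thread::sleep_for(delay);
      delay *= 2;  // exponential backoff between attempts
      try {
        reconnect();
      } catch (const std::exception&) {
        // Reconnect failed; fall through to the next attempt.
      }
    }
  }
  return false;
}
```

Returning false instead of throwing lets the caller decide whether to buffer the data, drop it, or stop, so the simulation itself can keep running.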

Include PFA build in CI tests

As discussed with @koparasy last week, opening an issue so we remember to include PFA build in the CI tests, so we detect issues early.

Stager and Orchestrator communicate with RMQ through a different interface/API

Provide a common RMQ interface for both AMS-stager and AMS-orchestrator.

Create clang-format checker in CI

Create an end-to-end, whole workflow test

Adiak is not installed in our CI container

Our CI container does not include adiak. The currently requested adiak version is not available in the spack packages. Please update the container, and correct both the CMake files and the example code so they use the same macro guard (__AMS_ENABLE_ADIAK__).

Update model

As pointed out by @ggeorgakoudis, we are not passing strings as references when updating a model. We should do so.

Additionally, when an update is requested for the FAISS UQ, we should throw an error.
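For the string-passing point, a minimal sketch of the intended signature style is shown below. ModelHandle and its members are hypothetical names and do not reflect the actual AMS update API.

```cpp
#include <string>

// Hypothetical holder for the path of the currently loaded surrogate model.
class ModelHandle {
public:
  // Taking the path by const reference avoids copying the caller's string
  // just to pass it in; one copy is made when it is stored.
  void update(const std::string& newPath) { path_ = newPath; }

  // Returning a const reference avoids a copy on read as well.
  const std::string& path() const { return path_; }

private:
  std::string path_;
};
```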

AMSspack install issue with git repo

Hi all

I am trying to install AMS using spack 0.23.0. It gets through its dependencies but halts on an error about openssl when trying to pull down AMS itself. In the package.py file you have

git = "[email protected]:LLNL/AMS.git"

When I try replacing this with

git = "https://github.com/LLNL/AMS.git"

It gets as far as downloading the module, but then fails on compilation because cmake cannot find Umpire (which was earlier installed with spack).

Is this a known issue? Is there some sort of permissions problem I might be having with github? Are the repo permissions set ok for people outside the dev team to be able to pull it down?

Implement CUDA run tests

The proposed approach is to hook into LLNL's internal GitLab to run the tests on LLNL GPU-enabled runners.

Workflow orchestrator

We are missing a mechanism to bootstrap FLUX and correctly connect all the components. There are bash scripts under ./scripts/ that perform some of these actions and can serve as a guideline. We need to port them to Python and abstract out the specifics of the mini-app.

Read and test cuda return values

We do not always check the return values of CUDA API calls. We should fix this.
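A common pattern is to wrap every call in a checking macro. The sketch below is written as plain C++ so it compiles without the CUDA toolkit: checkStatus and CHECK_STATUS are hypothetical names. In CUDA code the status would be the cudaError_t returned by a runtime call, the success value would be cudaSuccess, and the describe function would be cudaGetErrorString.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Throw if `status` differs from `success`, recording the failing
// expression, its source location, and a readable description.
template <typename Status, typename Describe>
void checkStatus(Status status, Status success, Describe describe,
                 const char* expr, const char* file, int line)
{
  if (status == success) return;
  std::ostringstream msg;
  msg << file << ":" << line << ": " << expr << " failed: " << describe(status);
  throw std::runtime_error(msg.str());
}

// Wrap a call so the expression text and location are captured automatically.
#define CHECK_STATUS(call, success, describe) \
  checkStatus((call), (success), (describe), #call, __FILE__, __LINE__)
```

With CUDA available, a call site could look like CHECK_STATUS(cudaMalloc(&ptr, bytes), cudaSuccess, cudaGetErrorString). Whether to throw, abort, or log on failure is a design choice left open here.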