bovi-analytics / chain-simulator Goto Github PK

View Code? Open in Web Editor NEW

2.0 2.0 1.0 20.65 MB

Generic, highly scalable platform for simulating digital twins using Markov chains.

Home Page: https://bovi-analytics.github.io/chain-simulator/

Python 100.00%

chain-simulator's People

Contributors

Stargazers

Watchers

Forkers

sumerudataanalaytics

chain-simulator's Issues

Update `src/chain_simulator/implementations/chain_simulator` to NumpyDoc

Rename `validate_matrix_positive` to `validate_matrix_negative`

The module src/chain_simulator/utilities.py provides a validator named validate_matrix_positive. All this function does is check for negative probabilities in a transition matrix. A more suitable name would be validate_matrix_negative.

Validator `validate_matrix_positive` fails with coo_array

Function chain_simulator.utilities.validate_matrix_positive doesn't work with a Coordinate (COO) array/matrix. Update the function to be compatible with this type and possiby other sparse matrix formats.

Deprecate old NumPy/SciPy/CuPy vector processors in `_simulation` module

There are a few functions that can be deprecated and removed from the _simulation.py module. These functions still use the slow (see #23) matrix-matrix multiplications and vector_processor effectively has the same functionality.

The following needs to be done:

Update `src/chain_simulator/utilities/validate_matrix_positive` to NumpyDoc

Update `src/chain_simulator/implementations/array_assembler` to NumpyDoc

Update and expand `src/chain_simulator/implementations.py` in NumpyDoc

Add caching-mechanism for transition matrices

Update `src/chain_simulator/utilities/validate_matrix_sum` to NumpyDoc

Update API documentation to NumpyDoc

The entiry library is still documented in reStructuredText. A more suitable documentation style is NumpyDoc. The following needs to be converted:

Update repository raneme throughout project

Places to update links:

README.md
setup.cfg
changelog.rst

Optimise matrix multiplications

In cuSPARSE 12, matrix-matrix multiplications consume vast amounts of VRAM. Recent versions of cuSPARSE reworked sparse matrix-matrix multiplication and the current algorithm is very VRAM-hungry. There are several algorithms available (refer to cusparseSpGEMM()), out of which the non-default CUSPARSE_SPGEMM_ALG2 looks most promising.

These are very low-level functions written in C/C++. CuPy makes interacting with the low-level cuSPARSE library possible but is not up-to-date enough to expose CUSPARSE_SPGEMM_ALG2. To access this functionality, Cython can be used, a cuSPARSE wrapper in C/C++ can be written or wait for CuPy to update their wrapper.

In the meantime, vector-matrix multiplications can be used to achieve the same output with a vastly smaller VRAM footprint.

Add support for TensorFlow

Remove deprecated AbstractDigitalTwinFacade

The old AbstractDigitalTwinFacade is deprecated and no longer used. This first implementation for constructing transition matrices is slow, it has a complexity of $O(n^2)$ and is entirely sequential. The implementation calculates the probability of every single cell in transition matrices. This inefficient as most of these probabilities are 0.

https://github.com/Bovi-analytics/DigitalCowSimulationPlatform/blob/308057b0d9b2df0fe23deadadbbd3baedd2f412a/src/chain_simulator/abstract.py#L16

Polish up Sphinx documentation

Todo:

Installation guide
Welcome page
...

Update and expand `src/chain_simulator/init.py` to NumpyDoc

Make step size available as parameter in callback mechanism

Sometimes it may be useful to access the step size of a simulation output. This step size can be used to extrapolate data in a callback function.

Paralellize state change probability calculation

Problem

Currently, users of this library need to calculate state change probabilities themselves. Whether this is done by hand or in code, fact is that users need to provide state change probabilities themselves. This library will eventually make heavy use of parallel processing and even clusters. This library should provide a way to calculate state change probablities in parallel so that this process doesn't become a bottleneck of a single CPU core or single node in a cluster.

Proposed solution

There are currently two libraries contending for operating on a cluster: PySpark and Dask. One of these (or maybe both) will be implemented in the library. Investigate which library will suit this project best and provide hooks via these libraries to calulate state change probabilities.

Validator `validate_matrix_sum` not tested throughout

Function chain_simulator.utilities.validate_matrix_sum is only tested with numpy.ndarray and scipy.sparse.csr_array (refer to tests).

It should be tested against more sparse matrix formats from the SciPy library, namely:

coo_array

Remove deprecated ScipyCSRAssembler

Linked to #7, this object was once used to assemble Compressed Sparse Row (CSR) arrays from the scipy.sparse library. This implementation is replaced with chain_simulator.implementations.array_assembler.

Add GitHub workflow to test documentation

Currently the API documentation and userguide need to be tested manually for errors. When this is forgotten, errors can easily slip in the main/stable branch and cause confusion among end-users. Therefore it would be handy to test the documentation every time the main/stable branch receives an update. This way documentation errors can be spotted immediately.

Add callback mechanism for intermediate state processing

The number of possible paths to traverse through a transition matrix increases drastically as a transition matrix grows in size. This process, in turn, takes longer and longer. Instead of finding possible paths a posteriori, this can be done with each intermediate/final state vector.

To realise this intermediate processing, a callback mechanism needs to be developed. Users should be able to provide functions as calbacks which process intermediate/final state vectors. These callback cuntions need to return ONE variable as these variables are accumulated and summed to produce a final variable.

Concept:

from decimal import Decimal
from typing import Callable, Concatenate, ParamSpec

import numpy as np

P = ParamSpec("P")
T = TypeVar("T", np.ndarray)
U = TypeVar("U", int, float, Decimal)


# Decorator, used to reister callback functions.
def register_callback(Callable[Concatenate[T, P], U]) -> Callable[P, U]:
	...


# Function to call after every n-th matrix multiplication.
@register_callback
def callback(state_vector) -> int | float | Decimal:
	...

Use `warnings.warn` instead of `logging.warn`

What's wrong?

In src/chain_simulator/utilities.py, warnings are communicated to the user using logging.Logger.warning. Based on Logging HOWTO - When to use logging, those warnings should be communicated using warnings. This is because these are cases where the client appication should be modified to rid the warnings.

Resolution

Function validate_matrix_sum should be updated to use warnings:
https://github.com/Bovi-analytics/DigitalCowSimulationPlatform/blob/308057b0d9b2df0fe23deadadbbd3baedd2f412a/src/chain_simulator/utilities.py#L47

And function validate_matrix_positive should be updated too to use warnings:
https://github.com/Bovi-analytics/DigitalCowSimulationPlatform/blob/308057b0d9b2df0fe23deadadbbd3baedd2f412a/src/chain_simulator/utilities.py#L87

It may be handy to define a custom warning, something like InvalidTransitionMatrixWarning?

Todo

Update validate_matrix_sum to use warnings
Add warning to documentation in validate_matrix_sum
Update validate_matrix_positive to use warnings
Add warning to documentation in validate_matrix_positive

Update and expand `src/chain_simulator/utilities.py` to NumpyDoc

Fatal Python error: PyFrame_BlockPop: block stack underflow

Issue: When trying to assemble a transition matrix from a generator, Python crashes.

Environment:

Operating system: Windows (11?)
Python version: 3.10.?

How to reproduce:

Unknown

Traceback:

Fatal Python error: PyFrame_BlockPop: block stack underflow
Python runtime state: initialized
Thread 0x000036bc (most recent call first):
  File "C:\Program Files\Python310\lib\threading.py", line 324 in wait
  File "C:\Program Files\Python310\lib\threading.py", line 607 in wait
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\pydevd.py", line 150 in _on_run
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 218 in run
  File "C:\Program Files\Python310\lib\threading.py", line 1016 in _bootstrap_inner
  File "C:\Program Files\Python310\lib\threading.py", line 973 in _bootstrap
Thread 0x00003614 (most recent call first):
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 292 in _on_run
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 218 in run
  File "C:\Program Files\Python310\lib\threading.py", line 1016 in _bootstrap_inner
  File "C:\Program Files\Python310\lib\threading.py", line 973 in _bootstrap
Thread 0x0000338c (most recent call first):
  File "C:\Program Files\Python310\lib\threading.py", line 324 in wait
  File "C:\Program Files\Python310\lib\queue.py", line 180 in get
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 367 in _on_run
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 218 in run
  File "C:\Program Files\Python310\lib\threading.py", line 1016 in _bootstrap_inner
  File "C:\Program Files\Python310\lib\threading.py", line 973 in _bootstrap
Current thread 0x000013d0 (most recent call first):
  File "C:\Users\gabev\PycharmProjects\DigitalCow\DigitalCow.py", line 718 in state_probability_generator
  File "C:\Users\gabev\PycharmProjects\DigitalCow\venv\lib\site-packages\chain_simulator\implementations.py", line 121 in array_assembler
  File "C:/Users/gabev/PycharmProjects/DigitalCow/main.py", line 123 in chain_simulator_test
  File "C:/Users/gabev/PycharmProjects/DigitalCow/main.py", line 137 in <module>
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18 in execfile
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\pydevd.py", line 1483 in _exec
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\pydevd.py", line 1476 in run
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\pydevd.py", line 2164 in main
  File "C:\Users\gabev\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\213.6777.50\plugins\python\helpers\pydev\pydevd.py", line 2173 in <module>
Extension modules: _pydevd_bundle.pydevd_cython_win32_310_64, _pydevd_bundle.pydevd_cython, _pydevd_frame_eval.pydevd_frame_evaluator_common, _pydevd_frame_eval.pydevd_frame_evaluator_win32_310_64, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg._cythonized_array_utils, scipy.linalg._flinalg, scipy.linalg._solve_toeplitz, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_lapack, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering (total: 41)
Process finished with exit code -1073740791 (0xC0000409)

Document installation of optional dependencies

The library offers optional dependencies. These can be installed using pip install chain-simulator[gpu]. This will install the optional dependency CuPy, which needs some special treatment.

As described in CuPy installation, there are some requirements that the user needs to install themselves. This is not clear and should be clearly documented in how to install chain-simulator.

First optimised matrix multiplier

This will be the first optimised version of a transition matrix multiplier. Below is a general flowchart of how the function will work.

The function should be compaible with CuPy sparse matrices, but fall back to a CPU implementation using NumPy/SciPy for matrix multiplications.

Parameters needed:

Transition matrix
Days or steps in time to simulate
Interval for returning intermediary matrices, optional

Function `chain_simulator` progresses one step too far

What's wrong?

The current implementation of src/chain_simulator/_simulation.py:chain_simulator returns a transition matrix which is always progressed one step too far forward in time. The function receives a transition matrix which is already the 'next' step in time. So, the configuration chain_simulator(transition_matrix, 1) should return the unmodified transition matrix.

This becomes apparent when multiplying an initial state vector with the initial transition matrix. The initial transition matrix is the next step in time.

A more elaborate example

Suppose we want to calculate a final state vector for 1 day, 2 days and 3 days in the future, we would do the following:

>>> import numpy as np
>>> initial_state_vector = np.array([1, 0, 0])
>>> initial_transition_matrix = np.array([[0, 1, 0], [0, 1/2, 1/2], [0, 0, 1]])
>>> initial_state_vector @ initial_transition_matrix
np.array([0, 1, 0])
>>> initial_state_vector @ initial_transition_matrix @ initial_transition_matrix
np.array(0, 1/2, 1/2)
>>> initial_state_vector @ initial_transition_matrix @ initial_transition_matrix @ initial_transition_matrix
np.array([0, 1/4, 3/4])

The current implementation gives the following output:

>>> import numpy as np
>>> transition_matrix = np.array([[0, 1, 0], [0, 1/2, 1/2], [0, 0, 1]])  # Actually the 1st step forward in time.
>>> simulator = chain_simulator(transition_matrix, 3, 1)
>>> next(simulator)[0]
np.array([[0, 1/2, 1/2], [0, 1/4, 3/4], [0, 0, 1]])  # Actually the 2nd step forward in time.
>>> next(simulator)[0]
np.array([[0, 1/4, 3/4], [0, 1/8, 7/8], [0, 0, 1]])  # Actually the 3rd step forward in time.
>>> next(simulator)[0]
np.array([[0, 1/8, 7/8], [0, 1/16, 15/16], [0, 0, 1]])  # Actually the 4th step forward in time.

The expected output would look like this:

>>> import numpy as np
>>> transition_matrix = np.array([[0.0, 1.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]])
>>> simulator = chain_simulator(transition_matrix, 3, 1)
>>> next(simulator)[0]
np.array([[0, 1, 0], [0, 1/2, 1/2], [0, 0, 1]])  # 1st step forward in time, identical to input matrix.
>>> next(simulator)[0]
np.array([[0, 1/2, 1/2], [0, 1/4, 3/4], [0, 0, 1]])  # 2nd step forward in time.
>>> next(simulator)[0]
np.array([[0, 1/4, 3/4], [0, 1/8, 7/8], [0, 0, 1]])  # 3rd step forward in time.

Add licence

Accumulate intermediate/final state vectors into a sparse matrix

Function state_vector_processor in src/chain_simulator/_simulation.py returns sparse arrays as numpy.ndarray-types. There are currently no sparse formats in the NumPy or SciPy libraries that provide a sparse format for 1d arrays (only 2d arrays/matrices). As the number of states in transition matrices and the number of intermediate state vectors grow, more and more memory will be used. This is wastful as arrays with sparse data are treaded and stored as dense.

To combat this, both intermediate and final state vectors could be stacked incrementally into a matrix-form. This allows the use of sparse formats like CSC and CSR to vastly reduce memory usage. Both formats support efficient arithmetic operations, CSC supports efficient column slicing and CSR supports efficient row slicing.

bovi-analytics / chain-simulator Goto Github PK

chain-simulator's People

Contributors

Stargazers

Watchers

Forkers

chain-simulator's Issues

Problem

Proposed solution

What's wrong?

Resolution

Todo

What's wrong?

A more elaborate example

Recommend Projects

Recommend Topics

Recommend Org