astatide / legion

Repo for exploratory work on a BD (Brownian Dynamics) engine in Chapel

Chapel 88.94% Makefile 3.14% Dockerfile 4.90% Nix 3.02%

legion's Introduction

Legion

README First Public Draft

Legion is a heavily-work-in-progress MD/Brownian simulation engine / framework package. The design goal is to be scalable, modifiable, and most of all, easy to read; an experienced programmer, lacking knowledge of chemistry, should reasonably be able to run an MD simulation. Conversely, a non-programmer who knows chemistry should be able to infer what is happening merely by reading source code, relying instead on their knowledge of math. Finally, an experienced programmer/chem-aware individual should be able to create/play around with experimental/educational simulation cores to determine where and why something is done in a simulation.

Extensibility and correctness are important design goals; ML integration via an embedded Python interpreter is planned (à la a SmartSim integration).

Much of the groundwork is laid, but there's still more to do. Overall, the lower-level work is meant to enable a high-level API that can be easily rearranged to create custom simulation algorithms/routines/cores. To that end, Legion makes heavy use of operator overloads and functions à la functional programming (mimicking the style of function one might use in an equation).

Note that literally everything is under heavy construction and is not yet at the alpha stage.

Dynamics Cores

There are currently plans for 3 built-in dynamics cores:

LAMIA, a Linear Atomic/Molecular Integration Algorithm: the basic, "correct" integrator that does not utilize any parallel speedups (except for those which Chapel provides natively). This dynamics core prioritizes readability, simplicity, and correctness above all else, and will serve as the benchmark against which other cores should be judged. It is in progress, and will support ML/Python integration for online analysis/experimentation/visualization. Yes, the acronym meaning is a backronym.
UNNAMED: single-node performant dynamics core. This will not be multi-node, but will prioritize performance over readability (while still ensuring that any data transformations done from a high level API are explicitly noted). LAMIA may ultimately fit this bill, in which case this core may never exist.
BEHEMOTH, a massively scalable dynamics core: this is designed to be performant in the scaling sense, with the goal to serve Incredibly Large Simulations (cell scale? Multi-protein?) for educational or research purposes. Not yet started.

Force Fields

In addition, Legion has an (in-progress) built-in force field and force field API, named SIn/SInFF (you may stylize however you see fit): the Sorta-Inaccurate Force Field. SIn serves three purposes:

An example of the internal force field interface.
A template for training of any sort, whether that's ML guided or otherwise. Python integration and exposure of active forces/parameters are planned, allowing this to be modified online so that a simulation may be tuned on-the-fly and in real time. This may include ML training or 'mimicry' of existing force fields whose data formats are not compatible (example: mimic an existing force field, then add more parameters).
A 'reasonable' baseline, with the caveat that pre-1.0 versions of SInFF have the goal of "not blowing the simulation up". Note that this is a goal.

Future versions of SInFF are expected to be more reasonable starting points for simulation work, and possibly even accurate.
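
As a rough illustration of what "exposure of active parameters" could look like, the sketch below keeps force field parameters in a dictionary that is read on every evaluation, so a changed value takes effect on the next call. It is written in Python (the planned integration language) rather than Chapel, and the class name `TunableForceField`, its methods, and the parameters `k` and `r0` are hypothetical; none of them are part of SInFF.

```python
class TunableForceField:
    """Hypothetical force field whose parameters can be changed mid-run."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_param(self, name, value):
        # Refuse unknown names so a typo cannot silently add a parameter.
        if name not in self.params:
            raise KeyError(f"unknown parameter: {name}")
        self.params[name] = value

    def bond_energy(self, r):
        # Harmonic spring; parameters are looked up on every call.
        k, r0 = self.params["k"], self.params["r0"]
        return 0.5 * k * (r - r0) ** 2


ff = TunableForceField(k=100.0, r0=1.0)
before = ff.bond_energy(1.5)   # evaluated with k = 100
ff.set_param("k", 200.0)       # the "online" tuning step
after = ff.bond_energy(1.5)    # evaluated with k = 200
```

Because the energy function reads `self.params` at call time, an external tuner (ML driven or otherwise) would only need access to `set_param` to adjust a running simulation.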

Legion is done in my spare time and as such is as profane and metaphor heavy as I wish (Legion is named for the planned scalable/modifiable functionality). "Demons" are the metaphor in use here.

First Class Functions / Functional Programming / Vectors

Where possible (such as in creating integrator functions and force field functions), "first class functions" are in use; these allow us to use generators/functions to specify large sets of functions. In addition, it enables the use of operator overloads; one such (planned) operator is the reverse operator/function, which will allow us to write code similar to

R*f(A,B) == f(B,A)

given that

f(A,B) != f(B,A)

It also allows us to do weird stuff like utilize multiple integrator functions, or add damping forces simply by creating a new function, etc.
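
A minimal sketch of the reverse operator idea, written in Python for illustration rather than in Chapel. The names `Reverse`, `R`, and `displacement` are assumptions made for this example. Note that Python's precedence rules require parentheses, `(R * f)(A, B)`, where the notation above writes `R*f(A,B)`.

```python
class Reverse:
    """Hypothetical reverse operator: (R * f)(a, b) evaluates f(b, a)."""

    def __mul__(self, func):
        def reversed_func(a, b):
            return func(b, a)
        return reversed_func


R = Reverse()


def displacement(a, b):
    """Vector pointing from a to b; deliberately not symmetric in its arguments."""
    return tuple(bi - ai for ai, bi in zip(a, b))


A = (0.0, 0.0, 0.0)
B = (1.0, 2.0, 3.0)

forward = displacement(A, B)          # points from A to B
backward = (R * displacement)(A, B)   # same as displacement(B, A)
```

Since `R * f` is itself an ordinary function, it can be passed anywhere `f` could be, which is what makes this style convenient for composing integrators and force terms.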

Vectors are essentially wrappers around tuples; given Chapel's performance characteristics, this will allow us to easily distribute atoms/particles across nodes while allowing a lot of math-type operations (matrix application, dot products, etc).
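
To make "wrappers around tuples" concrete, here is a small Python sketch of a vector type that stores its components in a tuple and adds math-style operations on top. The class name `Vec3`, its methods, and the helper `apply_matrix` are hypothetical and do not reflect Legion's actual Chapel types.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Vec3:
    """Hypothetical vector type: a thin, immutable wrapper around a 3-tuple."""
    xyz: tuple

    def __add__(self, other):
        return Vec3(tuple(a + b for a, b in zip(self.xyz, other.xyz)))

    def __sub__(self, other):
        return Vec3(tuple(a - b for a, b in zip(self.xyz, other.xyz)))

    def __mul__(self, scalar):
        # Scale every component by a plain number.
        return Vec3(tuple(scalar * a for a in self.xyz))

    def dot(self, other):
        return sum(a * b for a, b in zip(self.xyz, other.xyz))

    def norm(self):
        return math.sqrt(self.dot(self))


def apply_matrix(matrix, vec):
    """Apply a 3x3 matrix (a tuple of row tuples) to a Vec3."""
    return Vec3(tuple(sum(m * v for m, v in zip(row, vec.xyz)) for row in matrix))


a = Vec3((1.0, 2.0, 3.0))
b = Vec3((4.0, 5.0, 6.0))
swap_xy = ((0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
swapped = apply_matrix(swap_xy, a)
```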

System Builder

Currently hardcoded, but creates a system instance off of a pyridine.xyz file, then starts instantiating all of the necessary bits to create a system suitable for production. Can currently load XYZ files.
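
For readers unfamiliar with the format: an XYZ file starts with the atom count, followed by a comment line, followed by one `element x y z` line per atom. The Python sketch below parses that layout; it is an illustration only, not Legion's Chapel loader, and the function name `parse_xyz` and the water coordinates are made up for the example.

```python
def parse_xyz(text):
    """Parse XYZ-formatted text into (comment, [(symbol, (x, y, z)), ...])."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])          # line 1: number of atoms
    comment = lines[1] if len(lines) > 1 else ""  # line 2: free-form comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, (float(x), float(y), float(z))))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


# Illustrative input only; these are not reference coordinates.
example = """3
water (illustrative coordinates)
O   0.000  0.000  0.000
H   0.757  0.586  0.000
H  -0.757  0.586  0.000
"""

comment, atoms = parse_xyz(example)
```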

More to come on atoms (particles) and molecules (internally, a set of particles that may or may not have pairwise forces).

TRY ME!

The easiest way to try Legion is to use Docker and VS Code. Simply load the workspace, use the remote-containers functionality to open your workspace inside the container, then open the integrated terminal. From there, run:

make legionNoPy
./legion

It'll produce a silly little trajectory.xyz file, which in no way conforms to anything realistic, dynamics or otherwise. Please do not draw any conclusions about pyridine based on this.

legion's Issues

I'm a test issue!

I want this issue to test the new FabricBot flighting feature, so I've selected the FLIGHTING tag in my FabricBot configuration to be what the flighting configuration applies to, and I've gone ahead and tagged this issue with it as well.

What is MD? How are atoms? How do they work? A Primer

What are atoms/electrons, and what is MD?

Introduction: on physical scale

Everything is quantum mechanical in nature; that is, the way everything moves and exists is subject to the laws of quantum physics (this statement is vague for a reason). We don't experience this in our day-to-day lives because the effects of quantum physics tend to be dominant only at very small length scales. What we experience as classical mechanics (Newton's laws; apples fall, if a bowling ball hits you, the bowling ball, too, is hit by you, etc) is just the average impact of quantum mechanics over all of our particles. tl;dr: we're too big, so we experience the oddness of quantum as the sort of physics we're used to.

Consider that our average length scale is the meter; we talk about rooms in square meters, height of people in meters, etc. Atoms exist at the nanometer scale, or 10^-9 meters; that is 10x10x10x10x10x10x10x10x10 times smaller than the normal length scale we think about. We know that at this scale, quantum effects are important, and that at our scale, we can just ignore them entirely. Where's the line, though? At what length scale can you say, "hey, it's good enough to ignore?"

The answer, it turns out, is "it depends". If we wish to simulate a game of drunken pool, we don't need to evaluate quantum mechanics to know whether a ball is going to drop into a pocket. What if we wish to play a very tiny game of pool, however?

On the physical nature of the molecule

One of the more common sayings about the electron is that it is both a particle and a wave, which is a totally useless explanation for anyone not doing the hard math. It's better to think of an electron as neither a particle nor a wave, but having properties of both. Think of it like the platypus: an egg-laying venomous beast with a duck's bill. You would not describe a platypus as a "duck beaver snake". You would describe it as a platypus. It's just that a platypus has properties of ducks, beavers, and snakes.
It's a little easier, then, to parse an electron as an electron, and not as a "particle wave".

Molecules are made of atoms connected by bonds; what this really means is that you have a jumble of atoms that are in just such a configuration; each atom has a certain number of electrons it can contribute to the total number of electrons. A molecule is like an atomic pie potluck where everybody has a certain place at the table. Everyone brings a certain number of pies, but then some people will take multiple pies, and others will only take little tiny bits of pies.

We can think of the fractional pie distribution as being like the electron distribution around the molecule; there's a bit of electron mist that must be distributed around the atoms, and how it's distributed depends on the nature of the atoms (opposites attract, like repels; how much of a positive charge does the atom have?), as well as their configuration relative to each other and the atoms around the molecule (if an oxygen, which is notoriously electron greedy, is surrounded by positively charged molecules, it'll try and nab some of the internal electron cloud to balance this out; likewise, hydrogen atoms, which like electrons but aren't, like, super into them, will be happy just being around an electron-dense oxygen atom).

As the molecule moves in space, this electron distribution is bound to change; after all, the local electronic environment around the molecule is changing, and so the electron cloud will shift about to make up for it. This may result in the atoms inside the molecule moving, too. Or vice versa (cause and effect and all that). It's a balance of forces!

So, what about that tiny game of pool?

So, you wish to play the world's smallest game of pool; that's what atoms/molecules moving around is, after all. Unlike ordinary pool, though, we generally have to take quantum effects into account if we want things to be totally accurate. Packages such as CP2K do this: they calculate what the electrostatic forces acting on a body must be, for each time step, using quantum mechanics.

Classical MD is a balancing act: how many simplified approximations can we make to the quantum mechanical nature of the atom without sacrificing accuracy? Generally, rather than treat each molecule as a set of atoms with fluffy electron clouds, we treat each atom as a pool ball connected by springs, representing bonds, to other pool balls; we then coat each pool ball in either a sticky substance, or a greasy, slippery substance, and then wreak havoc with a pool cue. "Force fields" for classical MD are simply the parameters/instructions for how strong those springs should be, which atoms should be coated in grease/sticky, and how strong said grease/sticky substance should be. The accuracy of the simulation then depends not just on how we're breaking down time (more on that later) but on those force field parameters. And the amount of grease/sticky and spring strength is determined once and set for the entire simulation. This is a remarkably predictive method, because it turns out if your molecules are big enough (like a protein), these approximations to quantum behavior are "okay" so long as your force field is optimized for the property you're trying to reproduce.

But, of course, if your force field is bad, you'll never get the right answer. And because the electrostatic terms are fixed, there are limitations to the amount of behavior you can reproduce.
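
The pool ball picture maps onto a handful of simple energy terms. The Python sketch below shows three functional forms that are commonly used in classical force fields: a harmonic spring for bonds, a Lennard-Jones term for the short-range sticky/slippery behavior, and a Coulomb term for fixed charges. These are standard textbook forms in reduced units, shown as an assumption for illustration; nothing here is confirmed to be what SInFF uses, and the function and parameter names are hypothetical.

```python
import math


def harmonic_bond_energy(r, k, r0):
    """Spring between two bonded atoms: zero at the rest length r0."""
    return 0.5 * k * (r - r0) ** 2


def lennard_jones_energy(r, epsilon, sigma):
    """Non-bonded term: strongly repulsive up close, weakly attractive further out."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb_energy(r, qi, qj, coulomb_constant=1.0):
    """Electrostatics between two fixed point charges (reduced units by default)."""
    return coulomb_constant * qi * qj / r


# The Lennard-Jones minimum sits at r = 2^(1/6) * sigma with depth -epsilon.
r_min = 2.0 ** (1.0 / 6.0)
well_depth = lennard_jones_energy(r_min, epsilon=1.0, sigma=1.0)
```

In this picture the force field is just the table of `k`, `r0`, `epsilon`, `sigma`, and charge values, which is why a poor parameter set cannot be rescued by a better integrator.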

Design Rationale, Goals, AI, and Roadmap

Goals/Architecture/Design

There are 3 major, orthogonal goals for this software:

Enable molecular simulations at scale (both in terms of very large systems, and running many small systems at once).
Advance molecular simulations by incorporating internally trainable AI into the design.
And, really, showcase how trivial Chapel makes all of this.

To that end, I propose the following architecture:

Master - the central bit of logic that decides how scaling should be done (how many tasks/executors to launch), as well as when AI should be hooked in. If it makes sense, it would be nice if this was implemented as a Chapel/Python hybrid such that it's straightforward to add AI hooks in (that is, when they should be launched, not that they will be launched within this process).

Executors - designed as pure Chapel tasks/subprocesses, they'll take a set of atoms in a given system and perform the appropriate calculations. They are ultimately responsible for deciding X(t+1) given X(t) (where X is the current state of N atoms) and communicating that information back. The calculations are performed "in-house"; that is, whether they are launched as a task or subprocess, they perform the calculations here. These run calculations every time step, dt. They should be adaptable in the sense that they have logic to take commands both directly via hard-coded Chapel and through a set of pipes or ZMQ (if serialization isn't a problem).

AI Core - designed to be launched as a separate process, this is given a set of coordinates and other chemical parameters (charges, in particular) and is responsible for executing Python code on the given coordinates/charges. These can be launched at every n*dt, where n is any real, positive number.

An AI core must be able to do two things: train itself, and execute a model. In this way, we can make adjustments to our model within the program itself. Ultimately, during an actual simulation, we would not be training the model, but using one we have already trained.
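
The control flow described above can be sketched in a few lines. This Python example runs everything in a single process for clarity, whereas the proposal calls for Chapel tasks and a separately launched AI process. It also simplifies the AI interval to a whole number of steps. All names (`run_master`, `drift_executor`, `recording_ai_core`, the `state` dictionary layout) are hypothetical.

```python
def run_master(state, dt, n_steps, executor_step, ai_core, ai_every):
    """Advance the state n_steps times, calling the AI hook every ai_every steps."""
    for step in range(1, n_steps + 1):
        state = executor_step(state, dt)   # executor: X(t) -> X(t + dt)
        if step % ai_every == 0:
            state = ai_core(state)         # AI core sees coordinates and charges
    return state


def drift_executor(state, dt):
    """Toy executor: move each particle at constant velocity (no forces)."""
    new_positions = [x + v * dt
                     for x, v in zip(state["positions"], state["velocities"])]
    return {**state, "positions": new_positions}


ai_calls = []


def recording_ai_core(state):
    """Toy AI core: records what it was shown and leaves the state unchanged."""
    ai_calls.append(list(state["positions"]))
    return state


initial = {"positions": [0.0, 1.0], "velocities": [1.0, -1.0], "charges": [0.5, -0.5]}
final = run_master(initial, dt=0.1, n_steps=10, executor_step=drift_executor,
                   ai_core=recording_ai_core, ai_every=5)
```

Passing the executor and AI core in as functions keeps the master free of any physics, so swapping a trained model for a training hook would not change the loop.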

Roadmap

Enable file I/O, and the building of an appropriate chemical system object that stores everything we need it to (such as the charge distribution, coordinates, and atom types). This includes creating an internal "atom type" set (e.g., H = 1, etc) that is modular; it should be straightforward to add new atom types.
Implement basic Executor, Master, and AI core. These should scale appropriately and have functional communication and hooks for launching different types of tasks.
Construct functional internal force field parameters. We'll start with a simple BD model using a Go-type potential.
Implement a very basic force integration scheme. Verlet is a popular choice for this purpose.
Switch from Go to actual MD; this is simply changing the type of forces we calculate, and the amount of information we need to store in the topology (switching from 1 force type to about 5 or so).
For purposes of training, implement a previously constructed force field (such as an AMBER force field). This will allow us to run dynamics to generate structures that we can then train the AI on.
Using the structures generated from the dynamics runs, train an AI model to predict how the charge distribution should change given the new set of coordinates.
Implement periodic boundary conditions (oh what fun).
???
Profit!
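
For the integration step in the roadmap, one widely used member of the Verlet family is velocity Verlet. The Python sketch below applies it to a single particle on a spring in one dimension. It is an illustration under stated assumptions (unit mass, unit spring constant, reduced units), not Legion's integrator, and the function names are hypothetical.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one particle in 1D with the velocity Verlet scheme."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half kick with the old force
        x = x + dt * v_half                # full drift
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # half kick with the new force
    return x, v


def spring_force(x, k=1.0):
    """Hooke's law restoring force toward the origin."""
    return -k * x


# Start at x = 1 with zero velocity; the exact solution is x(t) = cos(t).
x_end, v_end = velocity_verlet(x=1.0, v=0.0, force=spring_force,
                               mass=1.0, dt=0.01, n_steps=1000)
energy_end = 0.5 * v_end ** 2 + 0.5 * x_end ** 2   # initial energy was 0.5
```

The total energy stays close to its starting value over the run, which is the main reason this family of integrators is a common default for MD.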

On Fixed Charge Force Fields

The location of an electron in a molecule (or, more appropriately, how a single electron has distributed its charge over a molecule) is highly variable, depending both on the external environment a molecule is placed in (a molecule in vacuum is going to have a different charge distribution than one in an electric field) and on the internal configuration of atoms (as the atoms move about, the energy function for the electrons changes). Classical MD sticks with "fixed charge force fields", where the charge distribution is decided by the selected force field, as updating that distribution normally requires a substantial set of quantum calculations and is far too expensive to do on the fly (see scaling of QM/MM, or ab initio MD). As a fixed charge is an approximation, it brings all the problems approximations tend to bring (simulation errors, incorrect properties, and such).

A non fixed charge force field is one that is able to update itself on the fly; that is, it will perturb the charge distribution over a set of molecules given a new structure. These are less expensive than ab initio MD and may result in more accurate simulation results, but are still expensive: they take time to converge.

Glossary

Coordinates: An atomic chemical system consists of atom types and their coordinates in space. These may also include the charge parameters (as electrons are not evenly distributed within a molecule), although that is often stored in the topology.
Topology: Typically considered the parameters for how the specific system should evolve. If you treat the atomic system as being atoms connected by springs, how strong are the springs? This is the type of information that is stored in the topology (often referred to as the force field).
Force field: Includes information that is important for calculating forces, such as the strength of the "springs" (different bond types) and how the electron is distributed amongst a given set of atoms.

Potential name collisions

There are a couple of other software projects called Legion in related namespaces:

Legion, a programming model that competes with Chapel
Legion simulation and modeling software, infrastructure simulation software

This probably won't ever matter, unless we were to ever care about SEO for this project.

Topology Development Superissue

The topology-related classes are currently very bare-bones implementations that do almost nothing; right now, they consist of a few functions.

We want to take a given system and create a set of topology parameters for it; namely, SInFF should provide a set of methods/functions to generate a topology given a System. The output of this should be functions paired to a particle or particle pair (atoms are currently the only particle available) which a dynamics integrator can act upon (Lamia).

Two ideas present themselves as to how to store the resulting information:

Graph-based method: in this instance, an Atom (or the System) contains information about its bonded neighbors in space (more formally, pairwise interactions between particles); an integrator merely has to iterate through the atom's neighbor list to figure out where to go next. This may introduce a lot of complexity that results in very little benefit for a dynamics integrator like Lamia, but could be very useful for splitting up space for Behemoth.
Store everything on the system, iterate through all functions in the topology space and apply to the relevant atoms.

Method 2 is the easiest and sufficient for something like Lamia; method 1 will certainly be more fun to code, although it will require more planning, which will come later in this thread.
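
To compare the two storage ideas side by side, here is a small Python sketch using three particles on a line joined by two springs. Method 2 keeps a flat list of (function, particle pair) entries on the system; method 1 hangs the same functions off each particle's neighbor list. All names are hypothetical, the spring function is a stand-in for whatever SInFF would generate, and it assumes the two particles never sit at exactly the same position.

```python
def harmonic_pair_force(xi, xj, k=1.0, r0=1.0):
    """1D spring force on particle i due to particle j (j feels the negative)."""
    d = xj - xi
    r = abs(d)
    return k * (r - r0) * (d / r)   # pulls i toward j when the spring is stretched


def forces_from_term_list(positions, terms):
    """Method 2: the system holds a flat list of (function, (i, j)) terms."""
    forces = [0.0] * len(positions)
    for func, (i, j) in terms:
        f = func(positions[i], positions[j])
        forces[i] += f
        forces[j] -= f              # equal and opposite reaction
    return forces


def forces_from_neighbor_graph(positions, neighbors):
    """Method 1: each particle lists its bonded neighbors and the function to use."""
    forces = [0.0] * len(positions)
    for i, bonded in neighbors.items():
        for j, func in bonded:
            forces[i] += func(positions[i], positions[j])
    return forces


positions = [0.0, 1.5, 2.5]
terms = [(harmonic_pair_force, (0, 1)), (harmonic_pair_force, (1, 2))]
neighbors = {
    0: [(1, harmonic_pair_force)],
    1: [(0, harmonic_pair_force), (2, harmonic_pair_force)],
    2: [(1, harmonic_pair_force)],
}

flat_forces = forces_from_term_list(positions, terms)
graph_forces = forces_from_neighbor_graph(positions, neighbors)
```

Both layouts give the same forces here. The graph version stores each bond twice (once per endpoint) but lets a particle's forces be computed from local information only, which is the property that would matter when splitting space across nodes.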