
sampling-uncertainty's Introduction

Sampling-Uncertainty

Best Practices article for LiveCoMS (Living Journal of Computational Molecular Science) [http://www.livecomsjournal.org/] ACCEPTED ON 17 OCT 2018

Link to the published version: https://doi.org/10.33011/livecoms.1.1.5067

This is an article motivated by the Best Practices workshop organized by Michael Shirts on behalf of MolSSI, held at NIST on Aug 24-25, 2017.

The article is intended as a guide for beginning molecular simulators (and beyond) to understand and assess sampling quality, 'convergence', and statistical uncertainty.

Comments and suggestions from the community are welcome. It is hoped that this article will be regularly updated to maximize its utility for the community. Significant contributions will be considered as a potential basis for authorship.

The initial draft scope of the article: This article is geared toward studies attempting to quantify observables and derive reliable error bars based on “standard” canonical sampling methods (e.g., molecular dynamics and Monte Carlo) and associated “enhanced” sampling methods. Some problems and systems may be better studied with cruder techniques and analyses, which will not be covered here.

sampling-uncertainty's People

Contributors

agrossfield, ajschult, dmzuckerman, drroe, dwsideriusnist, mangiapasta


sampling-uncertainty's Issues

Checklist

Per reviewers

In the checklist on page 6 the authors say "Consider publishing unprocessed simulation data…" Perhaps the authors would like to be a little stronger here? Strongly urge? I think this is important and perhaps they do too.

ACF / Block Analysis

Per Reviewers:

"p4: At "correlation time", maybe add a footnote for the MC case (as there is no "time" there); later, a footnote 11 says this, but it is a bit late"

"p11/footnotes 14,15: Very often, one encounters TCFs that have a damped-oscillatory form. It is not clear at all to me how to extract a "tau" from these. And the time integral of C(tau) is only expected to deliver tau if C(tau) is exponential. What if it is not? Finally, if I agree that the effective number of sample is proportional to N_config/tau, is the appropriate prefactor really one? (I have never worked out the math, but I am sure this has been done)"

"p16/footnote 16: This footnote is unclear to me. Seems to me the blocks are non-overlapping and tiling segments of the entire time series (the possibility that it may be otherwise was never mentioned in the text; and I don't think anyone would do this in practice"

Block averaging citation

I was trying to write up a bit about block averaging and cite Flyvbjerg and Petersen. I was going to mention Kolafa's approach to correcting the uncertainty for correlation [1], but noticed that Kolafa's paper from 1986 (which takes block averaging as a given) predates Flyvbjerg and Petersen's paper [2] (from 1989) by three years. Frenkel & Smit cite Flyvbjerg and Petersen, while Allen & Tildesley (published in 1987) cite a number of articles, the earliest of which is Friedberg and Cameron [3] from 1970, which sticks to squared errors but otherwise describes block averaging and its utility.

Is there some reason to cite Flyvbjerg and Petersen as the originator of block averaging? Did they add something I'm not noticing? I'm tempted to just cite Frenkel & Smit or Allen & Tildesley.

  1. http://doi.org/10.1080/00268978600102561
  2. http://doi.org/10.1063/1.457480
  3. http://doi.org/10.1063/1.1672907
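For anyone landing on this thread later, the blocking idea itself is only a few lines; a minimal sketch (variable names and the toy data are mine, not taken from Flyvbjerg-Petersen or the other cited papers):

    import numpy as np

    def block_average_error(x):
        """Standard error of the mean of x as a function of block size,
        using successive pairwise blocking (the "blocking transformation").

        The naive SEM (first row) underestimates the true uncertainty for
        correlated data; the estimate should rise and plateau once blocks
        are longer than the correlation time.
        """
        x = np.asarray(x, dtype=float)
        block_size = 1
        table = []
        while x.size >= 4:
            sem = np.sqrt(np.var(x, ddof=1) / x.size)
            table.append((block_size, x.size, sem))
            m = 2 * (x.size // 2)
            x = 0.5 * (x[0:m:2] + x[1:m:2])   # merge neighboring blocks pairwise
            block_size *= 2
        return table

    # toy example: uncorrelated data, so the SEM column should stay roughly flat;
    # real MD data would show a rising-then-plateauing SEM
    data = np.random.default_rng(1).normal(size=4096)
    for block_size, n_blocks, sem in block_average_error(data):
        print(block_size, n_blocks, sem)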

Definition of correlation time is uncommon

"the correlation time (denoted here as τ ) is the longest separation time ∆t over which x(t)
and x(t+∆t) remain (linearly) correlated."
This definition is very intuitive and easy to calculate, but in practice, it is rarely used and inconsistent with popular definitions.

For example, the rotational correlation time τ is defined such that R(t) = R(0) exp(-t/τ), where R is analogous to the correlation coefficient. It can be seen that R(τ) > 0; in fact, R(t) > 0 for any t, even though τ is finite.
The above definition is the exponential correlation time. Another definition is the integrated correlation time, and in many simple cases we again have R(τ) != 0. In the case of exponential decay, the two definitions are equivalent.
More proper definitions are: (1) the time scale at which the correlation is, on average, reduced by some factor (e.g. e). (2) When the statistical error calculated with all samples over a time period of T is equal to the statistical error of N independent samples, the (integrated) correlation time τ is T/N.
There are certainly other definitions, but in the context of this review, I think the integrated correlation time is the most relevant.
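To spell out the two candidates in the notation used above (my wording, not the article's): the exponential correlation time tau_exp is defined by fitting C(t) ~ exp(-t/tau_exp), while the integrated correlation time is tau_int = integral of C(t) dt from 0 to infinity (a sum over discrete lags in practice). The two coincide for a purely exponential C(t). For the error-bar purposes of this article, the relevant statement is roughly N_eff = T / (2 tau_int) for a run of length T, which is definition (2) above.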

Global sampling section

I figured it makes sense to create issue threads for each section. Hopefully this will keep some of the discussion a bit more self contained. A few opening thoughts on this section.

  1. First paragraph, the text, "sampling quality, which can be framed most simply in the context of single-trajectory data:``Among the very large number of simulation frames (snapshots), how many are statistically independent?'' From a dynamical perspective, which also applies to Monte Carlo data, how long must one wait before the system completely loses memory of its prior configuration?"

This just sounds like an autocorrelation analysis to me, but reading more of the text, it seems that the method is subtly distinct. Can the original authors say something that would better distinguish this from autocorrelation analysis, or is it just the latter in disguise? I guess I'm asking for a definition of what "quality" means.

  2. Key caveat subsection, the sentence, "The discussion here will focus largely on biomolecular systems, or more precisely, on systems for which it is straightforward to define a meaningful scalar distance between configurations." I guess I'm lost as to what these configurations actually are. As someone who operates more in materials science, it would be helpful to have some idea of how these configurations are defined. Are they just phase-space distances? Those are easy to define even for material systems.

  3. The decorrelation analysis in Section 6.2 seems like it's more about assessing that the phase-space has been well sampled. I left a few related comments in the text. Am I misunderstanding something here?

I left other questions scattered throughout the section. Partly this may just be my lack of knowledge about biomolecular sims.

PDF Rendering Issues

From a reviewer: "The pdf i received had substantial formatting issues. It appears that the letters "fi" often was reproduced as "!" and many other such curious formatting issues appeared. For example, there where "squares" where letters should be and "Definitions" has an á with no fi. I encourage the editors / authors make sure these formatting issues do not appear in the final version."

Can anyone confirm this issue? The PDF opens fine in three viewers on my Linux PC and in two viewers on my Mac.

Catch all / Remaining Reviewer Comments

Reviewer comments:

"Some pitfalls could be mentioned. Students often take the average from a trajectory and then the standard deviation and assume this is the "error bar"! They don't realize that the standard deviation here is just a measure of the fluctuations, which is just related to system size. It might be worth mentioning somewhere."

"The density of information contained in different sections can vary strongly. That makes the manuscript partially harder to read. While sections 1-5 are rather stretched out and new information is presented slowly, section 6 contains much more information in much less text. Moreover, while in sections 1-5 also rather basic concepts are introduced with care, in section 6 more advanced concepts are just introduced very quickly or not even explained at all. Examples for this are Voroni cells mentioned at the beginning of section 6.1 on page 12 or PCA mentioned on page 13. This makes section 6 partially hard to understand. Figure 6 helps to understand this section, but is not referenced in the text."

"p2: Maybe one should also include software/programming errors (see e.g. Ref 1, and a number of other recent references reporting "bugs" in heavily-used MD codes like GROMACS)"

Terminology / NIST internal review

When it comes to language about uncertainties, I think it would be helpful if we can settle on a common terminology and definitions. From the NIST perspective, this may be a necessity. The document will have to go through an internal review that takes about a month before it can be published with NIST authors. Our reviewers will check language, units, etc. to make sure that they comply with our standards (we are, after all, the National Institute of Standards and Technology...). I will look into what the requirements are for a document of this type.

As an example of something I'd like to flag, near the beginning of the document on specific observables uncertainty, I found the following line:

"The standard error is the standard deviation of the distribution of the results that would be obtained by repeating the simulation."

I've seen standard error used a few different ways in various communities. I've always taken it to be the square root of the sample variance divided by the number of samples, which characterizes the spread of the sample mean about the true mean. (I'm pretty sure this is consistent with the NIST definition, but I'll double check.)
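For the record, the textbook version I have in mind (standard definition, not a quote from the draft): with N independent samples x_1,...,x_N, sample mean xbar, and sample variance s^2 = sum_i (x_i - xbar)^2 / (N - 1), the standard error of the mean is SE = s / sqrt(N); for correlated data, N is replaced by the effective number of independent samples.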

I'm not sure the best way to handle this, but perhaps as we go along, folks can flag various terms in this Issue thread and we can settle on common definitions...?

Fig 2 explanation

Per reviewer:
The lower panel of Fig 2 could use a little more explanation. The caption should define the colors, which I think are RMSD values in angstroms.

Specific observables section

@mangiapasta and @ajschult - you guys got a great draft going for the specific observables section. I have some questions/comments that hopefully you could address in the next few days:

  • I think this is for Paul: The first subsection addresses 'quality of data', which really means 'overall qualitative behavior of data', I think. You give an example of material failure with increasing strain. I think you should point out that this is a single observable measured (I think) over many different simulations. So we will want a separate example for a single observable measured from a single simulation ... if you put a stub paragraph in for that, I can fill it in.
  • Re propagation of uncertainty: I was wondering about the example of block averages for free energy estimation. If the individual blocks are pathological, isn't that itself an indicator of questionable sampling? Also, it seems that the later caveat, as well as the subsection on 'synthetic data', calls into question using a Taylor expansion for a nonlinear derived quantity. So is free energy a good example here?
  • Re bootstrapping: Is there an implicit assumption that the original data samples are statistically independent? Please clarify
  • Synthetic data: looks like this should go right after propagation of error since they discuss the same issue if I understand correctly
  • Dark uncertainty: Wish I had thought of that name!!

As I said, this is off to a great start. I'm nitpicking so it can be even better.

Enhanced sampling: replica exchange section and WE

@drroe thanks for all the material here. I think we will need to re-categorize and re-arrange some of it

  • 'before running' - these great items should probably go in our pre-simulation sanity checks section
  • 'when running' - leave this for now, but it probably should go in a separate article on best practices for enhanced-sampling REMD. (You may want to collaborate with Yuji Sugita and/or Adrian Roitberg to write it.) These points don't address data analysis per se.
  • 'assess the results' - here's what we want! Could you just expand this a bit to be clearer - perhaps give a concrete example of an observable, and be clear on whether a trajectory refers to a fixed parameter set or a configurationally continuous, parameter-wandering trajectory.

Repeated simulations

@ajschult @agrossfield thanks for raising this point. I started a new issue on this because I think it's important enough that I'd like to be sure that we mention it in multiple places. I already put it into my discussion of assessing global sampling using enhanced-sampling methods.

Another point I'd like to make somewhere is that repeated simulations can provide statistically meaningful observables even when one is far from equilibrium. For example, one can measure the distribution of RMSD values between 8 and 10 ns after starting from structure X by running, say, 20 regular MD simulations starting from X. That distribution is well-defined in a statistical sense and might be a way to contrast, say, apo vs. holo or WT vs. mutant conditions. This is something we're trying to exploit in my group, since I don't believe there's any chance of reaching equilibrium in any system of significant size and interest. This study-the-transient approach finesses the unsolvable problem of finding 'different enough' starting structures.
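To make this concrete, a minimal sketch of the kind of comparison I mean, with toy RMSD traces standing in for real data (array names, window choice, and the two-sample test are all illustrative, not prescriptive):

    import numpy as np
    from scipy import stats

    # rmsd_apo[i], rmsd_holo[i]: RMSD trace of independent repeat i, all on the
    # same time grid; here filled with toy data for illustration only
    rng = np.random.default_rng(2)
    rmsd_apo  = [0.2 + 0.05 * rng.standard_normal(500) for _ in range(20)]
    rmsd_holo = [0.3 + 0.05 * rng.standard_normal(500) for _ in range(20)]

    # one number per repeat: mean RMSD over the chosen transient window
    # (here, the last 100 frames as a stand-in for the 8-10 ns window)
    apo_window  = np.array([r[-100:].mean() for r in rmsd_apo])
    holo_window = np.array([r[-100:].mean() for r in rmsd_holo])

    # each repeat contributes ONE independent observation, so ordinary statistics apply
    print("apo :", apo_window.mean(), "+/-", apo_window.std(ddof=1) / np.sqrt(len(apo_window)))
    print("holo:", holo_window.mean(), "+/-", holo_window.std(ddof=1) / np.sqrt(len(holo_window)))
    print(stats.ks_2samp(apo_window, holo_window))

The point is that each repeat contributes one independent observation of the transient distribution, so ordinary statistics apply even though no single trajectory is equilibrated.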

Use of "random" in the glossary

The word "random" is used a bunch of times in the definitions of key terms in a way that makes no sense to me. For example, "independent observables" are defined as "Random quantities that
are uncorrelated". These quantities are presumably calculated from coordinates in a deterministic way, so referring to them as random is confusing.

I understand that when we do statistics on stuff we're treating the snapshots as random samples, but this nomenclature is not going to be helpful when dealing with new scientists eager to learn.

Qualitative / semiquantitative Checks

Per Reviewer:

"p10: I think that it is dangerous to oppose a "strategy 1" where you do N "independent" repeats and use stddev/sqrt(N) to a "strategy 2" where you do a single simulation and need to worry about correlation times. As a beginner, I would go for strategy 1 because it seems you have much fewer things to worry about. And if I am a "greedy" beginner and I can do 1000 potential evaluations, I would go for 1000x1 timestep, because then my error is reduced. The only problem is that my probabily distribution is entirely determined by the arbitrary way I select my "independent" samples. In other words: you also have very much to worry about correlation times in strategy 1 !!! And it may turn out that if you add sufficient equilibration time for all your replicas, the multi-repeat approach ends up as complicated and more expensive than the single-simulation way."

The checklist itself

Have people reviewed this recently?

One question I have is whether we need to advocate that starting structures be 'representative of the true ensemble' as this suggests we need to know the answer in advance. Is it better to say 'as diverse as possible'? What do you think @agrossfield and @drroe ?

Equilibration and Quantification of Global Sampling

Per reviewers:

"In section 5, the authors suggest to remove non-equilibrated trajectory regions from the beginning of simulations to improve simulation results. In my experience, this can indeed be very important and improve convergence of results a lot. Two references (27,28) for the determination of the time needed for equilibration are given, which both discuss algorithms for automated determination. I personally also find another, probably simpler approach proposed by Klimovich , Shirts & Mobley (Guidelines for the analysis of free energy calculations. J Comput Aided Mol Des. 2015) very useful. Simply speaking, in this approach forward and reverse cumulative averages are compared and data is discarded from the beginning of the trajectory until both match. With this approach, also visualization of convergence is nicely possible."

"Section 6.1 contains great information but I was left wanting more information and maybe a practical example to help solidify concepts."

"p11: "Relaxation should therefore…" -> I don't understand the sentence."

Vocabulary: 'Confidence Interval' should be changed.

Learn something every day...

Per the VIM and GUM, the term 'confidence interval' (See VIM, definition 2.36 [coverage interval], Note 2; or GUM 6.2.2) should not be used. This would involve a lot of revisions to the paper.

Substitutions: 'coverage interval', 'expanded measurement uncertainty', or (not in GUM, but suggested by a NIST statistician) 'uncertainty interval'.

Definition of coverage factor k

Unless I'm missing it, we don't really define what a coverage factor is, and since it's not a common term (I'd never heard of it), we need to be very clear. I think the best solution is an entry in the glossary, plus a line or two in the text where it's first used.
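For the glossary entry, the standard GUM usage is: expanded uncertainty U = k * u_c, where u_c is the combined standard uncertainty and k is the coverage factor; the common choice k = 2 corresponds to roughly 95% coverage when the result is approximately Gaussian. Exact wording is of course up to the authors.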

Propagation of Uncertainty

Per reviewers:

"In section 7.4 (propagation of uncertainty) expressions for the uncertainty of derived quantities are derived. While it is nice to see the derivation, I personally don’t think it is very necessary as the aim of the manuscript is to give basic orientation in the field of uncertainty analysis. It is the only derivation in the whole manuscript."

"However, section 7.4 is completely omitting the important topic of error propagation through multiple steps of simulations/experiments. I think it is very important to properly quantify errors if multiple simulation steps are needed calculate a certain quantity (this happens often in the field of free energy calculations). While the propagation of the experimental standard deviation of the mean might be straightforward, the authors should explain how such error propagation works if the uncertainty is quantified with confidence intervals, as the authors suggest."

Question about uncertainty propagation section

I understand making it clear that there are limitations to the use of the Taylor expansion approach (this is important and NOT obvious to many people!).

However, I don't see any reason why you couldn't directly estimate the uncertainty in dA via block averaging. Why would averaging exp(-beta dU) in blocks be OK, but not taking the log of it? I know it's not exactly the usual block-averaging formula (which has the average as the last operation), but I can't think of a reason why it wouldn't work. You compute dA using blocks of the data and watch the variance change as the block size changes. If I'm missing something (which wouldn't surprise me), we need to explain it, because it's not obvious. If by rare chance I'm right, we should change the section.
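To make the proposal concrete, here is the kind of calculation I have in mind (toy dU/kT samples and the function name are mine; this is not something that exists in the draft): compute dA = -ln<exp(-dU/kT)> within each block and watch how the spread of the block estimates behaves as the block size grows.

    import numpy as np

    def block_dA_uncertainty(du_over_kT, block_sizes):
        """For each block size, compute dA = -ln<exp(-dU/kT)> within each block
        and report the mean and standard error of the block estimates (in kT)."""
        du = np.asarray(du_over_kT, dtype=float)
        out = []
        for b in block_sizes:
            nblocks = len(du) // b
            blocks = du[: nblocks * b].reshape(nblocks, b)
            dA_blocks = -np.log(np.exp(-blocks).mean(axis=1))
            out.append((b, dA_blocks.mean(), dA_blocks.std(ddof=1) / np.sqrt(nblocks)))
        return out

    # toy dU/kT samples; real data would be correlated, so block sizes should
    # span well beyond the correlation time
    du = np.random.default_rng(4).normal(loc=1.0, scale=1.0, size=100_000)
    for b, dA, err in block_dA_uncertainty(du, [100, 500, 2000, 10000]):
        print(b, dA, err)

The per-block log does introduce a bias that shrinks with block size, which may be exactly the Taylor-expansion caveat in disguise; happy to be corrected.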

Bootstrap Section

Per reviewer

"Section 7.6 is key, and again I was wanting to see more details - this topic is confusing to students and if the authors could provide more details with an example, it would be great."

NOTE: In email discussion with @mangiapasta, we agreed that this section has to rely on references; to do a full work up of bootstrap/resampling would be a paper by itself.
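Even if the section leans on references, a few-line illustration might satisfy the reviewer; something along these lines (a plain nonparametric bootstrap of the mean, which assumes the input samples are effectively independent; for a correlated time series one would resample blocks instead):

    import numpy as np

    def bootstrap_sem(data, n_resamples=2000, seed=0):
        """Nonparametric bootstrap estimate of the uncertainty of the mean.

        Assumes the entries of `data` are (effectively) independent samples;
        for correlated data, resample whole blocks longer than the correlation
        time rather than individual frames.
        """
        rng = np.random.default_rng(seed)
        data = np.asarray(data, dtype=float)
        means = np.array([
            rng.choice(data, size=len(data), replace=True).mean()
            for _ in range(n_resamples)
        ])
        return means.mean(), means.std(ddof=1)   # bootstrap mean and its standard error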

Definitions section

I think we need to define "bias" in the glossary. This term appears multiple times in the equilibration section.

Multiple hypothesis testing and p-values

I had an interesting discussion today at lunch with Dave Mathews that I thought might deserve discussion here. We were talking about the way that analysis of MD simulations is done iteratively, and he pointed out that this is the way p-hacking occurs.

Imagine we have two related simulations (or sets of simulations) we're comparing: we try one thing, and the p-value says they're not significantly different, so we try another, and another, and we focus on the quantities that say the two simulations are different. How is that not p-hacking? I know there are ways to handle this correctly (I think you have to essentially lower the p-value for significance with each successive test, though I don't know the theory), but I've never heard of anyone actually doing this in the context of MD.

Does someone with a stronger stats background than mine have a suggestion for what the best thing to do is?
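For what it's worth, the two corrections I'm aware of are Bonferroni (test each of m hypotheses at level alpha/m) and the slightly less conservative Holm step-down procedure. A small sketch with made-up p-values (not from any actual simulation comparison):

    import numpy as np

    def holm_bonferroni(p_values, alpha=0.05):
        """Return a boolean array: which hypotheses are still rejected after
        the Holm step-down correction for multiple comparisons."""
        p = np.asarray(p_values, dtype=float)
        order = np.argsort(p)
        m = len(p)
        reject = np.zeros(m, dtype=bool)
        for rank, idx in enumerate(order):
            if p[idx] <= alpha / (m - rank):
                reject[idx] = True
            else:
                break   # once one test fails, all larger p-values fail too
        return reject

    print(holm_bonferroni([0.001, 0.02, 0.04, 0.30]))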

Deposit PDF in repo?

Any chance you guys would want to deposit the latest version of the PDF in the repo with commits? I just stumbled across this and would love to be able to see the PDF without having to compile. Typically what we do is just update the PDF with any commit which touches any of the content.

Title of Section 7

I propose that the title of section 7 be changed to "Computing error bars in specific observables", or, maybe better, "Computing uncertainty in specific observables".
Computing the error (in the VIM sense) would require knowing the true value...

Figures - need reference/permission or credit

Any figure/data shown should have a reference if it's published (in which case we need to get permission for it - I will find out more about this). If it's not published just put in a ref to yourself - e.g., [P Patrone, unpublished].

Notation / Definitions

Per reviewers:

"p3: In a didactical sense, maybe it would be worth stating (footnote?) that a probability distribution P(x) has units which are inverse of those of x, and that it should be normalized (i.e. integrate to one); "probability" graphs which are not normalized or/and reported in "arbitrary units" or in "percent" are a very common "bad practice", I think"

"p3/footnote 3: "As such, we can only…" I agree that it is not possible to fully characterize a P(x) based on finite sampling. But if you say "estimate", I don't know why you should limit the statement to expectation value and variance (one can "estimate" higher moments as well, and P(x) itself as a histogram approximation)"

"p3: The notation s(x_j) is mathematically unclean (especially for a "best practice" paper), as it suggests that "s" is a function of "x_j". What is meant is more like s({x_j}) or s(xv) where xv is the x_j-vector"

"p3: "which is a stronger condition…" - there is a beautiful counter example (maybe for a footnote?): data that is perfectly aligned on a circle! (perfect correlation, but zero linear correlation)"

Uncertainty Propagation Addendum

I'm late to this discussion, but I'm a strong supporter of the topic and the project, and I have a couple of comments on Section 7.4.

I think that the section is a little "thin". Although the density example of Eqs. (20) and (21) is fine, it's a bit trivial, and more examples could be given. Here are two.

  1. The example of the propagation of the uncertainty in enthalpy to that in Cp is given in our paper in JCP, 147, 034508 (2017).

Over a given T range at fixed P, one can fit h(T) = a + bT + cT^2 and then differentiate to get Cp. The uncertainty in Cp arises from the uncertainties in the regression coefficients b and c. The linear approximation gives
s^2(Cp) = s^2(b) + 4T^2 s^2(c) + 4T cov(b,c).
The covariance matrix is available from the output of a typical linear-regression routine (e.g., in Excel); a numerical sketch of this propagation is given after the second example below.

  2. The example of the uncertainty in the solubility in an aqueous solution at a given (T,P), which is calculated by equating the solute solid chemical potential mu_s to its chemical potential in solution, mu(m), where m is the molality. This is described in Eqs. (21)-(24) of our review paper in Molec. Phys., 114, 1665-1690 (2016).

This is an example of uncertainty propagation via the solution of a nonlinear equation. In this particular case, the goal is to estimate the uncertainty in the solution m_s of the nonlinear equation

mu(m) = mu_s

Both mu_s and mu(m) have simulation uncertainties in this case.
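As promised above, a minimal numerical sketch of the first (Cp) example, using toy enthalpy data; numpy's polyfit is used only because it returns the covariance matrix of the fitted coefficients needed in the propagation formula:

    import numpy as np

    # toy h(T) data with noise; polyfit's coefficient order is [c, b, a]
    T = np.linspace(280.0, 360.0, 9)
    h = 10.0 + 0.05 * T + 1e-4 * T**2 + np.random.default_rng(6).normal(0, 0.02, T.size)

    coeffs, cov = np.polyfit(T, h, deg=2, cov=True)
    c, b, a = coeffs

    T0 = 320.0
    Cp = b + 2.0 * c * T0                      # Cp = dh/dT at T0
    grad = np.array([2.0 * T0, 1.0, 0.0])      # dCp/d[c, b, a]
    s2_Cp = grad @ cov @ grad                  # = s^2(b) + 4 T0^2 s^2(c) + 4 T0 cov(b,c)
    print(Cp, np.sqrt(s2_Cp))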

Update with DOI of published article?

It'd be nice to make it slightly easier to navigate to the published version of the article from the repo. Perhaps update the main README.md to link to it by DOI?

Qualitative section - formerly 'quick and dirty'

Moreover, if I understand correctly, it's literally the same thing as the Lyman and Zuckerman cluster-population analysis, except that there they do it as a function of block size instead of just breaking the trajectory into two pieces.

Couple quick comments

I'm scanning through this as I prepare a lecture, and happened to notice a couple things:

  • one occurrence of "uncertainy" rather than "uncertainty"
  • The checklist has this point, which is quite helpful: "This practice is ethically questionable and, at a minimum, can significantly bias your conclusions. Use all of the available data unless there is an objective and compelling reason not to, e.g. the simulation setup was incorrect or a sampling metric indicated that the simulation was not equilibrated." However, I wonder if this should be slightly expanded to add the idea that, if you're discarding something due to bad sampling or poor equilibration, you should be applying the same analysis to ALL of your simulations. In my experience it's easy to closely scrutinize any simulations which give bad results while ignoring simulations which give good results. If you, say, carefully check your "bad" simulations for equilibration/poor sampling but don't do the same for your "good" simulations, you're also cherry-picking, and this could perhaps be made clearer.

Quantitative comparison between autocorrelation function and block averaging

In section 7.3, there is a discussion of autocorrelation analysis and block averaging as methods for estimating the number of independent samples, but the discussion does not make any recommendations on whether it is better to use either method in specific cases.

Have there been any studies to quantitatively compare these two measures? For example, testing the minimum number of samples before each method becomes unreliable, or whether the extra information in a block averaging scheme makes a difference in uncertainty. We are working on a best practices document for property calculation from MD and are interested in the effect of choice of method.

N vs N-1

From a reviewer: For Eqn 8, the authors may wish to discuss when N and N-1 are needed. I find this is a source of confusion with students.
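A one-line answer that may help here: dividing by N gives the maximum-likelihood variance estimate, which is biased low because the sample mean is itself fit to the same data; dividing by N-1 (Bessel's correction) removes that bias. The distinction is negligible for large N, but not for the small effective sample sizes typical of correlated simulation data.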

REMD

Per Reviewer:

"p19: The authors might want to add local-elevation, which predates the two cited methods."

Global (but not large) changes

Are we at the point in this document where we can do full read-throughs and suggest smaller changes to the content and structure of the document? I'm reading through the whole thing today and have suggestions for revisions to sections I've not yet worked on.
