Comments (17)

dmzuckerman commented on August 9, 2024

@mangiapasta I have just pushed some tweaks based on a full read-through. A few parts still need adjustment.

mangiapasta commented on August 9, 2024

After having read the full document, I wonder about our perspective on correlated data, especially for time-series, which are often output as raw data. Overall I think the document does a good job of explaining that more "effectively uncorrelated data" implies better sampling and averages. On the other hand, I think there is some misinformation / a biased perspective about how problematic correlations actually are.

For example, in section 7.2, we state, "Both block averaging and autocorrelation analysis will produce different effective sample sizes...." I guess that technically this is true, but their respective arithmetic means are identical, so the variances about the true mean are also identical; the fact that we get different effective sample sizes is something of a moot point. We also spend a lot of time talking about block averaging. Again, that's not necessarily wrong (and many people use it), but the whole point of that analysis is basically to avoid time correlations, and we don't really say when or why one would want to do that, or whether it is even necessary (oftentimes it's not).
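
To make this concrete, here is a minimal sketch of block averaging on synthetic data (an AR(1) process standing in for a correlated observable; none of this is from the manuscript):

```python
import numpy as np

def block_sem(x, block_size):
    """SEM estimated from the scatter of non-overlapping block means.

    Once blocks are longer than the correlation time, the block means are
    nearly independent and the estimate plateaus at the true SEM.
    """
    n_blocks = len(x) // block_size
    blocks = x[:n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)
    return blocks.std(ddof=1) / np.sqrt(n_blocks)

# Synthetic correlated "trajectory": AR(1) with integrated
# correlation time (1 + 0.9) / (1 - 0.9) = 19 steps.
rng = np.random.default_rng(0)
x = np.empty(50_000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()

# The naive SEM ignores correlations and is much too small.
print("naive SEM:", x.std(ddof=1) / np.sqrt(len(x)))
for b in (1, 10, 100, 1000):
    print(f"block size {b:4d}: SEM = {block_sem(x, b):.4f}")
```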

Overall, I think the community has fallen into the trap of believing that correlations are bad. To a certain extent, I agree that we should avoid them, e.g. when initializing multiple copies of a simulation so that we can get better sampling. But when we are dealing with raw data and time-series, correlations are a natural part of the dynamics and usually something that we actually want. Stat-mech basically tells us to expect stationary autocorrelations, so as a sanity check, it's nice to confirm that we got what we expected. Moreover, autocorrelation analysis will give similar (if not identical) results to block averaging done correctly, and the former involves fewer choices.
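
And here is the autocorrelation route to the same error bar, on the same kind of synthetic data; the truncation rule below (stop summing at the first negative autocorrelation value) is just one simple choice among several:

```python
import numpy as np

def integrated_autocorr_time(x, max_lag=2000):
    """tau_int = 1 + 2 * sum_t C(t), truncated at the first negative C(t)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dx = x - x.mean()
    lags = min(max_lag, n // 2)
    c = np.array([np.dot(dx[:n - t], dx[t:]) / (n - t) for t in range(lags)])
    c /= c[0]  # normalize so that C(0) = 1
    neg = np.where(c < 0)[0]
    cutoff = neg[0] if neg.size else lags
    return 1.0 + 2.0 * c[1:cutoff].sum()

# Same AR(1) process as above, with known tau_int = 19.
rng = np.random.default_rng(0)
x = np.empty(50_000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()

tau = integrated_autocorr_time(x)
n_eff = len(x) / tau                  # effective sample size
sem = x.std(ddof=1) / np.sqrt(n_eff)  # should match the block plateau
print(f"tau_int ~ {tau:.1f}, N_eff ~ {n_eff:.0f}, SEM ~ {sem:.4f}")
```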

This is all to say, I think we should make an effort to connect correlations to the underlying physics and dispel what I see as a mythology about avoiding correlations. To that end, I think it makes sense to elevate the autocorrelation section relative to block averaging. I also see the latter as being useful mainly when data storage is a severe problem, which (if others agree) we should perhaps say. I also think we should generally advocate for saving all raw data when possible and/or not preprocessing into block averages.

I realize others may feel differently, so I'm open to suggestions / discussion on this.

mangiapasta commented on August 9, 2024

More comments on structure and the overall document.

  1. I think the title should be changed to "Best Practices for Sampling and Uncertainty Quantification in Molecular Simulations." I generally think of sampling as preceding UQ, and the paper is also structured that way.

  2. The introduction and scope need a bit more discussion. For example, I think the first two paragraphs of section 3 can go in the intro. I think we should also give a high-level overview of what the document recommends. For example, we should highlight the global picture of what best practice looks like, i.e. starting with careful planning, then testing for adequate sampling, and then UQ. (This shouldn't step on the toes of the checklist, though.) If folks agree, we should also discuss where our recommendations deviate from common practice, e.g. that we should not avoid correlations in time-series data.

dmzuckerman commented on August 9, 2024

@mangiapasta thanks for the thoughtful comments. Here are my thoughts:

  1. I like the title the way it is because our main focus is indeed uncertainty quantification. Of course, we have to discuss sampling, but we certainly don't say too much about how to do it.
  2. I agree with you that the introduction can be fleshed out along the lines you suggest. However, I don't think we should remove any text from Sec. 3. It's ok to repeat key points and we want our sections to be semi-free-standing.
  3. Regarding correlations, I don't think we should change what we have too much. Of course, if there are genuinely misleading statements, we should correct them, e.g., if we suggest correlations can be avoided. But it IS true that correlations are the main reason why UQ is challenging for molecular simulation. I agree with you 100% that they are physical and in MD represent true dynamics. In fact, although it's not an accepted 'best practice', I am trying to advocate that people simply try to account for them in simulation analysis, as I discuss a bit in a hot-off-the-press blog post ... which of course was motivated by our manuscript.

With all that being said, @mangiapasta why don't you make edits as you see fit to the intro ... and perhaps do anti-anti-correlation edits as a pull request?

mangiapasta commented on August 9, 2024

In the section on Pre-simulation sanity checks, I don't know what the following paragraph is actually saying:

"If you read this guide through \emph{before} performing a simulation, you will have a much better sense of the criteria applicable to your data -- and which indeed \emph{should} be applied by knowledgeable reviewers of your work. Thus we strongly advise understanding the concepts presented here as well as in related reviews \cite{Grossfield2009,JCGM:GUM2008}."

Specifically, what does "criteria applicable to your data" mean? Should this be something like "expected properties of your data"? Also, how would criteria be applied by reviewers to a work? Is the idea of this paragraph to say something about what decisions can be made on the basis of a given dataset?

Also, it seems to me that the content of that paragraph really applies to the document as a whole, not just the pre-simulation checks. Am I missing something here? This paragraph feels like it should go elsewhere.

dmzuckerman commented on August 9, 2024

@mangiapasta thanks for catching that flabby writing - I am the guilty author.

I think what I was trying to say belongs in the planning section. I intended to communicate that authors should be aware of the issues involved with doing good uncertainty quantification - and also be aware that the protocols we suggest may not cast the most favorable light on data obtained from a poorly planned or inadequately sampled study.

I think one of the ambitions of the LiveCoMS journal is that reviewers (from any/all journals) will use the LiveCoMS Best Practices articles as implicit or explicit criteria in evaluating manuscripts. Thus, authors have a selfish interest in following our suggestions to the extent they're used by reviewers. And I just thought that awareness of all this at the planning stage would be most beneficial.

I guess there's no reason some of the same things couldn't also be mentioned in the general introduction, though I'm not sure how much we can presume that reviewers (for other journals) will take our recommendations seriously.

If you care to revise along these lines, please feel free. If not, let me know and I'll revisit.

mangiapasta commented on August 9, 2024

Ahh, this makes much more sense now. Thanks for clarifying. I was slowly going through sections tagging things that didn't make sense to me. If you're okay with me editing, I'll try to rephrase a bit along the lines of your post just now. Alternatively if you want first crack at revising your own words, I'm fine with that too.

If you get a chance, let me know your thoughts on the intro. I added several paragraphs in an effort to foreshadow the document's overall structure and (hopefully) make the reader start thinking about issues that we raise.

dmzuckerman commented on August 9, 2024

@mangiapasta please edit, and I'll review. Thanks a lot. I should have some time tomorrow to go through whole doc.

dmzuckerman commented on August 9, 2024

@mangiapasta please let me know when you've had a chance to do this. I want to go over the doc after you're finished. Thanks!!

mangiapasta commented on August 9, 2024

mangiapasta commented on August 9, 2024

dmzuckerman commented on August 9, 2024

thanks!

mangiapasta commented on August 9, 2024

In the quick-and-dirty section, I left a big red note near the end. I'm still trying to hash out in my mind the distinction between two related concepts. Input from folks here would be useful. In that section, it seems to me that we start by discussing convergence à la the Law of Large Numbers / Central Limit Theorem. That is, more sampling makes an estimator converge to its true value. This seems like convergence in a true mathematical sense, insofar as I can state in what sense the convergence occurs and roughly how fast.
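
For the first notion, the Central Limit Theorem pins down the rate: the mean of N effectively independent samples fluctuates about the true value with standard deviation sigma / sqrt(N). A quick numerical check on synthetic independent data, purely illustrative:

```python
import numpy as np

# CLT convergence rate: the SD of the sample mean shrinks as sigma / sqrt(N).
rng = np.random.default_rng(1)
sigma = 2.0
for n in (100, 1_000, 10_000):
    # 500 replicate "experiments", each estimating the mean from n samples
    means = rng.normal(0.0, sigma, size=(500, n)).mean(axis=1)
    print(f"N = {n:6d}: empirical SD of mean = {means.std(ddof=1):.4f}, "
          f"sigma/sqrt(N) = {sigma / np.sqrt(n):.4f}")
```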

We then pivot to "convergence" as characterized by overlap of error bars. This seems to me more like a value-judgement proxy for the first type of convergence. That is, given a certain amount of overlap in error bars, do I feel comfortable concluding that my estimator has converged "sufficiently" in the mathematical sense? I'm not aware of any sense in which overlap of error bars provides a rigorous assessment of convergence (although I could be wrong here).
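
To spell out what that heuristic amounts to in practice, here is a minimal sketch with hypothetical independent runs; the statistics produce the numbers, while the verdict we draw from the overlap is the value judgement:

```python
import numpy as np

def mean_and_halfwidth(x, tau_int=1.0):
    """Mean and ~95% half-width, with N discounted by the correlation time."""
    n_eff = len(x) / tau_int
    return x.mean(), 2.0 * x.std(ddof=1) / np.sqrt(n_eff)

# Two hypothetical independent runs of the same system (synthetic data)
rng = np.random.default_rng(2)
run1 = rng.normal(0.00, 1.0, 20_000)
run2 = rng.normal(0.02, 1.0, 20_000)

(m1, e1), (m2, e2) = mean_and_halfwidth(run1), mean_and_halfwidth(run2)
print(f"run1: {m1:+.4f} +/- {e1:.4f}")
print(f"run2: {m2:+.4f} +/- {e2:.4f}")
print("error bars overlap:", abs(m1 - m2) < e1 + e2)
# Overlap is necessary but not sufficient: both runs could be stuck in
# the same metastable state and still agree beautifully.
```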

Anyway, I want to avoid giving the illusion that statistical / probabilistic conclusions are the same as the value judgements and "policy" decisions we make on the basis of information gleaned from statistics. To say that we're comfortable with the level of convergence is not a mathematical statement. I feel like we're veering a little close to conflating these ideas, however.

dmzuckerman commented on August 9, 2024

@mangiapasta thanks, as always, for the thorough and thoughtful comments. Let us know when you're done going through what you want to do, and then others can revise accordingly. Leaving comments in the manuscript is the best way to ensure things get addressed.

Regarding 'overlap of error bars' as indicative of convergence, I think you're referring to the combined clustering subsection written by @drroe . He can give his thoughts on that issue, but bear in mind this is the 'quick and dirty' (now qualitative, semi-quant) section. So I think the whole section should be read as providing a necessary-but-not-sufficient test. I think elsewhere in the paper we say that there is no absolute test for convergence.

That being said, clarifying the language in places where you find that it could be read the wrong way would absolutely be helpful.

mangiapasta commented on August 9, 2024

Thanks for the heads up on authorship. I'm done editing that section now.

I agree that the section is really about necessary but not sufficient tests that should be easy to perform. My concern is more that I think there is some confusing (and possibly mathematically incorrect) language. More generally, I also think it's important to draw a clear distinction between mathematical statements (there is x-amount of overlap) and corresponding value judgements ("I'm okay with that amount of overlap," or "Uncertainties are too large; I need to revisit the simulations.").

At any rate, I'll wait to hear back from @drroe before changing anything else in that section.

mangiapasta commented on August 9, 2024

I've got to take a break for a few days and do some other work. I put in a conclusion section and went through the whole manuscript. I didn't get a chance to do much with the bootstrap section, and in some places I left comments littered throughout instead of making edits.

I rewrote large parts of the linear propagation section (sorry, original author). Happy to discuss / reinsert some of the original text, but I found some incorrect statements in there and wanted to give a more complete description of how the process works.
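
For reference, the standard first-order (linear) propagation result for uncorrelated inputs is sigma_f^2 ≈ sum_i (∂f/∂x_i)^2 sigma_i^2. A minimal numerical sketch with a made-up observable (none of the numbers or functions below come from the manuscript):

```python
import numpy as np

def propagate_linear(f, x, sigma, h=1e-6):
    """First-order propagation for uncorrelated inputs:
    sigma_f^2 ~ sum_i (df/dx_i)^2 * sigma_i^2,
    with partial derivatives from central finite differences.
    """
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = h * max(abs(x[i]), 1.0)
        grad[i] = (f(x + dx) - f(x - dx)) / (2 * dx[i])
    return np.sqrt(np.sum((grad * np.asarray(sigma)) ** 2))

# Made-up example: f = -kT * ln(a / b) with independent uncertainties on a, b.
kT = 0.596  # kcal/mol at ~300 K
f = lambda v: -kT * np.log(v[0] / v[1])
x = [2.0, 5.0]
sigma_f = propagate_linear(f, x, sigma=[0.1, 0.2])
print(f"f = {f(np.asarray(x)):.3f} +/- {sigma_f:.3f} kcal/mol")
```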

dmzuckerman commented on August 9, 2024

Thank you @mangiapasta!
@dwsideriusNIST if you can finish checking the notation etc., that would be great.
