phylogenetic_biology's People

Contributors

brevans, caseywdunn

phylogenetic_biology's Issues

small graph caption error

For figure 2.10, I suppose b) should be "soft polytomy" and c) should be "hard polytomy" instead of "with polytomy" --- probably just a typo.

Bookdown options?

Dear Prof. Dunn,

These are minor questions, but are there options to highlight text in the bookdown output, and to read it as a two-pager (viewing "two pages" at the same time)? That is, more like reading a book...

Best,

maybe suggestion?

To decide whether to use LRT, AIC, or BIC model selection criteria, you need to run a model selection criterion selection analysis. Just kidding. To decide which to use, you should apply your knowledge of phylogenetic methods critically. In many cases, these different approaches will lead to very similar decisions about model selection. As an example of a fairly typical approach, the program iqtree runs AIC and BIC analyses, but proceeds under the model selected by BIC unless you explicitly step in to apply a different model. If, on inspection of the analysis results, you find that AIC selects a very different model, it would be prudent to run your analyses under that model as well to see if it leads to differences that are relevant to the questions that motivate your project.

Not sure if it is too much to mention that AIC and BIC can deal with non-nested models (as they are not based on a test statistic). Maybe it's too complicated, but for students who are interested, Burnham and Anderson is a great resource.
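
As a rough illustration of how the two criteria can disagree, here is a minimal Python sketch. The log likelihoods and the alignment length are made-up numbers, and only substitution-model parameters are counted in `k` (in a real analysis the edge lengths would add to it). It uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: lower is better; n is the sample size."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fits: model -> (log likelihood, free substitution-model parameters)
fits = {"JC69": (-5230.0, 0), "HKY85": (-5105.0, 4), "GTR": (-5100.0, 8)}
n_sites = 1000  # hypothetical alignment length

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
bic_scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in fits.items()}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print(best_aic, best_bic)
```

With these invented numbers the small likelihood gain of GTR over HKY85 is enough for AIC but not for BIC, which penalizes extra parameters more heavily when `n` is large. Neither comparison requires the models to be nested.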

Another comment more than an issue

Because of the linear relationship between the number of replacements and the product $\mu t$, rate ($\mu$) and time ($t$) are conflated. In many scenarios you can't estimate them independently. If there are a small number of replacements, for example, you can't tell if there is a low rate over a long time interval, or a high rate over a short interval. Both would give the same resulting number of changes $n$. Because rate ($\mu$) and time ($t$) are so often confounded in phylogenetic questions, often the rate is essentially fixed at one and the unit of time for edge lengths is given as the number of expected evolutionary change rather than absolute time (years, months, etc). You will often see this length as the scale bar of published phylogenies (Figure \@ref(fig:sim-tree-cnid)). The exception is when you have external information, such as dated fossils, that allow you to independently estimate rates and edge lengths in terms of actual time.

I love the way you explain this fundamental concept of phylogenetics, but this is precisely one concept that I often see completely misunderstood by students and even by phylogeneticists. It might be worth adding some emphasis to highlight the critical importance of this central "problem" of phylogenetics (and how, as you mention at the end, dating with fossils or tip dating with viruses helps disentangle time and rate).
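
One way to make the confounding concrete is a small numerical check. The sketch below assumes a JC69-style model in the "random replacement" form, where the probability of ending in the starting state is 1/4 + 3/4 exp(-μt); the function name and the specific rates and times are illustrative only.

```python
import math

def p_same(mu, t):
    """Probability that a site is in its starting state after time t,
    assuming replacement events at rate mu that draw uniformly from 4 states."""
    return 0.25 + 0.75 * math.exp(-mu * t)

# A low rate over a long interval and a high rate over a short interval
slow_and_long = p_same(mu=0.01, t=100.0)
fast_and_short = p_same(mu=1.0, t=1.0)
print(slow_and_long, fast_and_short)
```

Both calls depend on the data only through the product μt, so sequence data alone cannot separate the two scenarios without external information such as dated fossils.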

Missing word?

The most widely used DNA sequence evolution models include the General Time Reversible model and its derivatives (Section \@ref(expanding-the-models)). The GTR model 11 parameters (Figure \@ref(fig:evaluation-models-nested)). These include the global rate $\mu$ used to tune the overall rate of evolution. Next come the relative rate parameters $a,b,c,d,e,f$ that modify the rates of change between particular nucleotide states, so that they can differ from each other. Finally we have the equilibrium frequencies $\pi_A,\pi_C,\pi_G,\pi_T$.

Here: "The GTR model 11 parameters" should maybe be "The GTR model has 11 parameters"?

Word missing

How do we select a starting state at random? We could draw the starting state from a big with equal frequencies of each nucleotide, but our model allows us to make a more informed selection than that. We implemented $\mathbf{\Pi}$ because we wanted to describe cases where the nucleotides do not occur at uniform frequencies, so let's draw from that distribution instead. For the toy mammal model we made above, that is 0.205 G, 0.205 C, 0.295 T and 0.295 A. We just sample a single nucleotide from this probability distribution.

A word is missing here after "big": "We could draw the starting state from a big"
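
For reference, the draw described in the quoted paragraph can be sketched in a few lines of Python. The frequencies are the ones given in the text; the function name and the seed are arbitrary choices for the illustration.

```python
import random

# Equilibrium frequencies from the toy mammal model in the text
pi = {"A": 0.295, "C": 0.205, "G": 0.205, "T": 0.295}

def draw_start_state(pi, rng):
    """Sample one nucleotide with probability given by its equilibrium frequency."""
    nucleotides = list(pi)
    weights = [pi[n] for n in nucleotides]
    return rng.choices(nucleotides, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded only so the sketch is repeatable
draws = [draw_start_state(pi, rng) for _ in range(10000)]
freq_A = draws.count("A") / len(draws)
print(freq_A)
```

Over many draws the observed frequencies should approach the values in `pi`, which is the sense in which this is a more informed choice than drawing uniformly.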

maybe a typo?

Second, let's consider the case where the data are just as likely under the hypothesis is they are in general. This would be like having a medical test for a specific condition where the test was so bad that the data (test result) didn't depend in any way on the hypothesis (the presence of the condition). For example, the test returns a positive 10% of the time regardless of whether you have the condition or not. In that situation, $P(D|H)=P(D)$. Since $P(D|H)$ is in the numerator and $P(D)$ is in the denominator, they cancel out. That leaves $P(H|D)=P(H)$ -- the posterior probability of the hypothesis is the same as the prior on the hypothesis. The data didn't change our understanding of the hypothesis at all. This is the behavior we want when the data have no information relevant to the hypothesis.

This sentence does not read well. Maybe a typo or missing words?
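
The cancellation in the quoted paragraph can be checked numerically. This is a minimal sketch: the 10% positive rate comes from the text, while the prior of 0.02 and the 95% rate for the "informative" comparison are invented for the example.

```python
import math

def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Bayes' rule for a binary hypothesis: P(H|D) = P(D|H) P(H) / P(D)."""
    p_data = p_data_given_h * prior + p_data_given_not_h * (1 - prior)
    return p_data_given_h * prior / p_data

prior = 0.02  # hypothetical prevalence of the condition

# Test returns a positive 10% of the time regardless of the condition
uninformative = posterior(prior, 0.10, 0.10)

# For contrast, a test whose result does depend on the condition
informative = posterior(prior, 0.95, 0.10)
print(uninformative, informative)
```

When P(D|H) equals P(D), the posterior comes back equal to the prior; only when the data depend on the hypothesis does the posterior move.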

Typo

Here we have built up the conceptual, mathematical, statistical, and computational machinery to simulating DNA evolution on a tree. Sequence simulation is useful for a variety of things, including generating datasets under known conditions to test tools. A major value, though, is to to think in a full explicit way about how you are modeling evolution. This probabilistic model framework is the exact same one we will use as we move to our text task, inferring phylogenies from sequence data.

Change "text" to "next".

Typo

The GTR model, and its nested derivatives, capture only a small subset of evolutionary processes and therefore can't describe many of patterns in observed data. One of the most obvious patterns you will notice when inspecting multiple sequence alignments is that the amount of variation is very different across sites. Home columns will be highly variable, while others are nearly constant. This pattern is due to extensive heterogeneity across sites in rate of evolution.

Change "Home columns will be" to "Some columns will be".

Meaning of parameters

you will get a line. $m$ and $b$ are model parameters. $m$ is the slope of the line, and $b$ is the intercept.

I think that adding one sentence about what the parameters "mean" could help to create an even better intuition for what parameters do in models. In this specific example, you could add a sentence explaining that $m$ is the value that captures how much $y$ changes for each unit change in $x$ (this is evident from the equation, but it might help), and that $b$ is the value of $y$ when $x$ is 0. Not sure if you want to add that to the plot as well.
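
A tiny numerical example may help make both meanings visible. The values of `m` and `b` below are arbitrary.

```python
def linear_model(x, m, b):
    """Predicted y for a given x under the line y = m*x + b."""
    return m * x + b

m, b = 2.0, 1.0  # arbitrary slope and intercept for illustration
ys = [linear_model(x, m, b) for x in range(4)]
print(ys)  # [1.0, 3.0, 5.0, 7.0]
```

The first value is `b` (the value of y at x = 0), and each step of one unit in x adds `m` to y.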

typo and maybe rewording?

- The six rate parameters are constrained such that $a+b+c+d+e+f=6$. If they were all free to vary, than values other than 6 would lead to changes in the global rate rather than the relative rates. Imagine setting them all to $10$, for example. This would be equivalent to setting them all to 1 and setting $\mu=10$. This mathematical relationship between the relative rate parameters means that only five of them can vary independently, because the sixth will depend on the other five and the fact that they all sum to 6. This means that five of the rate parameters are stochastic, and one is deterministic. For our purposes it doesn't matter which one is deterministic, only that one of them is, so I'll treat $f$ as the deterministic rate parameter.

In "to vary, than values other than 6", change "than" to "then".

Also, I wonder whether, in the following sentences, the explanation of the relationship between the relative rates and the global rate is clear enough. I mean the equivalence between setting the rates to 1 or to 10, and the constraint that they sum to 6. Maybe it is just the grammatical structure of the sentences?
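
The equivalence might be easier to see with numbers. The sketch below assumes the usual GTR construction in which the instantaneous rate from state i to state j is μ × (relative rate for that pair) × π_j. The order in which the six relative rates are paired with nucleotide pairs is an assumption made for the example, not necessarily the book's labeling of $a$ to $f$.

```python
import math
from itertools import combinations

NUCS = "ACGT"

def off_diagonal_rates(mu, rel_rates, pi):
    """Off-diagonal rates q[(i, j)] = mu * r_ij * pi[j] for a GTR-style model."""
    pairs = list(combinations(NUCS, 2))  # AC, AG, AT, CG, CT, GT (assumed order)
    r = dict(zip(pairs, rel_rates))
    q = {}
    for i in NUCS:
        for j in NUCS:
            if i != j:
                key = (i, j) if (i, j) in r else (j, i)
                q[(i, j)] = mu * r[key] * pi[j]
    return q

def complete_rates(five_free_rates):
    """Given five free relative rates, the sixth is fixed by the sum-to-6 constraint."""
    return list(five_free_rates) + [6 - sum(five_free_rates)]

uniform = {n: 0.25 for n in NUCS}

# All relative rates 10 with mu = 1, versus all relative rates 1 with mu = 10
q_scaled_rates = off_diagonal_rates(1.0, [10.0] * 6, uniform)
q_scaled_mu = off_diagonal_rates(10.0, [1.0] * 6, uniform)
same = all(math.isclose(q_scaled_rates[k], q_scaled_mu[k]) for k in q_scaled_rates)
print(same)
```

Because the two parameterizations give identical rates, the data cannot distinguish them. Fixing the sum of the relative rates at 6 removes that redundancy, which is why only five of them are free.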

rules of probability?

If we aren't clamping the internal node states as well, how can we calculate the probability of just the tip node states? The key is to consider all possible internal states. Each configuration of internal node states represents one possible history that gave rise to the observed tip states. We can sum the probabilities of each of these different ways to get the tip states to find the probability of the tip states over all possible histories. We are summing the probabilities because these are mutually exclusive histories that could give rise to the observed data. For example, if we want to find the probability of getting a total of seven when rolling two dice, we need to add up the probability of each way to get seven (1+6 *or* 2+5 *or* 3+4 *or* ... 6+1). This is different from when we multiplied probabilities to find the joint probabilities of multiple events occurring together (*e.g.*, the probability of rolling a 4 *and* another 4).

Not sure if you want to add here (a few lines above) the "rules of probability": when events are joined by *and*, multiply; when (mutually exclusive) events are joined by *or*, sum. This mnemonic works for me.
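
The dice example from the quoted paragraph can be worked through by enumeration. This is a small sketch using exact fractions; the variable names are arbitrary.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))
p_each = Fraction(1, len(outcomes))

# "or": mutually exclusive ways to total seven are summed
p_seven = sum(p_each for a, b in outcomes if a + b == 7)

# "and": independent events occurring together are multiplied
p_double_four = Fraction(1, 6) * Fraction(1, 6)
print(p_seven, p_double_four)  # 1/6 1/36
```

Summing over internal node states works the same way as summing over the six ways to roll a seven: each configuration is a distinct history, and the histories cannot co-occur.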

Misc comments

A phylogenetic graph is an abstraction, and for it to be useful it is important to keep in mind what features of biology we are attempting to represent. The nodes are entities that can evolve, like organisms or genes. The edges indicate evolutionary relationships between those entities. You could imagine as, an extreme case, a graph that showed every single individual that ever existed in your group of interest, say mammals. Each edge would connect literal parents and offspring. That would be a big phylogeny, and you would never have enough information to know it all, but it does exist even if unknowable and unwieldy given our current tools. A phylogeny is a subset of that graph, where we often retain a single individual per species as the tip nodes, and retain nodes immediately preceding divergence events as the internal nodes. In this respect, a phylogeny is a subgraph of the entire history of life on Earth.

As part of the section on abstraction, it might be worth mentioning that parent nodes are rarely the immediate parents like they are in a genealogy. They are some ancestral lineage in common to the two child nodes, and could be thousands or millions of years removed.

A few minor things in chapter 2

Hi, I just want to raise a few minor suggestions:

  1. In section 2.1, under figure 2.3, this sentence has two "it" that are a bit confusing
    " Because it makes it easier to learn from adjacent fields when using mathematical conventions that are shared across fields, I will tend to use mathematical notation for phylogenies rather than the classical botanical nomenclatures."

  2. Right after figure 2.5
    "Rectangular layouts are the most common, because the entire edge length is along one axis of the plot. In a rectangular tree, each node is depicted as a line that is orthogonal to the edges. The confusing thing is that, because this line has the same width and color as the edges, it looks as if it is part of the edge. It isn’t though– its length is arbitrary, and it just shows which edges attach to that node. It also adds right-degree elbows where the ends of the node lines connect to the edges, forming a corner. "

I am an undergrad who has never taken a phylogenetics class before. I found the paragraph rather confusing. What is "this line" referring to? I understand the node and the edge are perpendicular to each other, but which one is which? Perhaps it would help to have arrows on figure 2.6!

Gladys Fang

Typo

Note that I am not using the term "significance" when referring to bootstrap support. This is because bootstraps don't have a clear statistical interpretation. It is a scale that varies from $0$ to $1$, but is not itself a significance. It jsut indicates how frequently an edge is recovered when the data columns are resampled. This gives a sense of how broad support is for the edge across characters, but is quite complicated in reality. For example, some variation across bootstrap replicates is due to resampling, but sometimes it is just due to the stochastic nature of heuristic maximum likelihood searches.

"It jsut indicates" --> "It just indicates"

Minor typo

This relationship between simulation and inference is widely used in a variety of fields. The probability of the observed data given a hypothesis is referred to as the Likelihood. Searching for the most likely hypothesis as referred to as Maximum Likelihood (ML).

Change "as" to "is" in "likely hypothesis as referred to".

Typo in Chapter 4.3

We can calculate joint log probabilities as sums of log probabilities for each event, rather than as the log of products of the probabilities. Addition is much faster than multiplication for computers (since multiplication is a series of addition operations), so this speeds things up. For these reasons you will almost always see log likelihoods, rather than just likelihoods, published in the literature. Note that becuase likelihoods are probabilities and therefore range from 0--1, the log likelihoods will range from $-\infty$ (for probabilities very close to 0) to $0$ (for probabilities close to 0). Since likelihoods tend to be small, they end up as log likelihoods that are negative numbers with large absolute values.

"the log likelihoods will range from $-\infty$ (for probabilities very close to 0) to $0$ (for probabilities close to 0)" should be

"the log likelihoods will range from $-\infty$ (for probabilities very close to 0) to $0$ (for probabilities close to 1)'

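The range of log likelihoods, and one practical reason for working in log space, can be seen in a short sketch. The per-site likelihood of 1e-5 and the count of 100 sites are invented for the illustration.

```python
import math

# Hypothetical per-site likelihoods for 100 independent sites
site_likelihoods = [1e-5] * 100

# Multiplying the raw probabilities underflows to zero in floating point
raw_product = math.prod(site_likelihoods)

# Summing the logs gives a finite, large negative number
log_likelihood = sum(math.log(p) for p in site_likelihoods)

print(raw_product)      # 0.0
print(log_likelihood)   # about -1151.3
print(math.log(1.0))    # 0.0, the upper end of the range
```

A probability of 1 maps to a log likelihood of 0, and probabilities approaching 0 map toward negative infinity, which matches the corrected sentence.
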
Chapter 4 - add worked ML

Optimize edge lengths on each topology, then show optimized likelihood of each topology and indicate ML tree. May need to add more than one site for this to be interesting.

Chapter 6 miscellany

Comments on chapter 6:

  • Haven't read ahead yet, but will you talk about amino acid models? (Maybe just mention there are comparable sets of models for AAs?)
  • The GTR model 11 parameters. Missing the verb.
  • so that they can differ from each other. You go straight into how a-f have to add to 6, but a simple example like you do for π (Just something simple like 0.5 means equal probability of transition from X to Y while 0.2 means ...) would really help explain
  • In typical use, mu=1 Should that be µ?
  • That means that the best possible likelihood under HKY85 is also available under GTR -- good message.
  • there are challenges far short of this extreme example - little bit awkward?
  • When talking about I and G maybe use the Γ symbol and put (gamma) in parentheses? G is usually a lazy shortcut for finding the gamma key, right?
  • the topology we are evaluating is the focal topology -- should be as the -- and maybe bold focal topology.
  • focal tree as a whole be asking how frequent -- by asking
  • It jsut indicates how frequently an edge is recovered --> just

Pi

\mathbf{\Pi} not rendering in chapter 3.

Typo and suggestions

````
```{r evaluation-models-nested, echo=FALSE, fig.cap="A hierarchical view of DNA substitution models. The number of degrees of freedom is determined by the number of independent stochastic parameters (boxes with rounded corners). All other parameters are either constant (set to a specific value ahead of the analysis) ir deterministic (their value depends on the value of other parameters according to specified relationships). Here $\\mu=1$, such that the edge lengths in the phylogeny are the expected amount of evolutionary change. The models are listed from top to bottom by increasing nestedness. Rates are ordered so that transitions and transversions are adjacent. Any model could be realized as a subset of the possible parameter space of the models above it. The visual nomenclature is inspired by Hohna et al. (2014). See the [iqtree DNA model documentation](http://www.iqtree.org/doc/Substitution-Models#dna-models) for a longer list of models."}
````

All other parameters are either constant (set to a specific value ahead of the analysis; boxes with straight corners) or deterministic (their value depends on the value of other parameters according to specified relationships; dashed boxes)

Comments more than issues

Imagine that when the DNA is being replicated, most of the time the appropriate nucleotide is incorporated. Some fraction of the time, at rate $\mu$, an event occurs where the appropriate nucleotides is replaced with a random nucleotide instead. In our model, the probability of selecting any of the nucleotides during one of these random replacement events is uniform (picking a C is just as probably as picking a G, for example), and the new nucleotide doesn't depend in any way on what nucleotide was there before. It is as if you had a bag containing a large number of C, G, T, and A nucleotides at equal frequencies. As you built the new DNA strand, every so often you would replace the nucleotide you should be adding with one you instead select by reaching into the bag and pick at random.

Two things:

  • In the figure you don't have reversals to A after there has been a substitution to another nucleotide - is this on purpose?

  • It's interesting you start this explanation in the context of DNA replication. When I talk about how this works, I explain it in the context of changes from the "ancestral" nucleotide site to the current site. "given that the ancestral sequence was A and today we see a C, through time it could have changed to A (no change), C, G, T, back to A, etc..." I mean, it's the same after all because DNA replication & inheritance from ancestor to descendant is the underlying process.

Add branch lengths?

We will start be calculating the probability of a single history of evolution for a single site on a single tree. This history is the full set of states at all nodes. These are added to the toy phylogeny in Figure \@ref(fig:inference-internal-states). I want to emphasize that this isn't a history we have any particular reason to believe, it is just one possible history of states randomly chosen from all the possible histories.

Not sure if you want to add "and branch lengths" at the end of this sentence, as this is the other relevant parameter needed to specify the "complete" history of a site: "This history is the full set of states at all nodes"

add linear model to chapter 1

Add linear model to chapter 1 as example of relationship between observed values (y), unobserved values (x), model, and model parameters.

Typo

We can consider each edge in the focal phylogeny independently. For a given focal edge, we count the fraction of phylogenies in the sample that have the same edge, as determined by producing an identical split in taxa. We consider this frequency as the support for the focal edge. If the frequency is 1, the edge was in all the phylogenies in the sample. Of it is zero, it wasn't present in any of the phylogenies in the sample. The interpretation of these frequencies, which are often reported as percents, depends on the method that was used to generate the sample.

Change "Of it is zero" to "If it is zero".
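
The counting described in the quoted paragraph can be sketched as follows. Each tree is represented here simply as a list of its non-trivial splits, with each split given as the set of taxa on one side; that representation, the taxon names, and the four sample trees are assumptions made for the example.

```python
def canonical_split(side, taxa):
    """Return a split as the side that does NOT contain a fixed reference taxon,
    so that the two ways of writing the same split compare equal."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side

def edge_support(focal_split, sample_trees, taxa):
    """Fraction of trees in the sample that contain the focal split."""
    focal = canonical_split(focal_split, taxa)
    hits = sum(
        1
        for tree in sample_trees
        if focal in {canonical_split(s, taxa) for s in tree}
    )
    return hits / len(sample_trees)

taxa = {"A", "B", "C", "D", "E"}
sample_trees = [
    [{"A", "B"}, {"D", "E"}],
    [{"A", "B"}, {"C", "D"}],
    [{"A", "C"}, {"D", "E"}],
    [{"C", "D", "E"}, {"D", "E"}],  # {C, D, E} is the same split as {A, B}
]

support_ab = edge_support({"A", "B"}, sample_trees, taxa)
support_bc = edge_support({"B", "C"}, sample_trees, taxa)
print(support_ab, support_bc)  # 0.75 0.0
```

How to interpret a value such as 0.75 still depends on how the sample was generated (bootstrap replicates, a posterior sample, and so on), as the paragraph notes.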

typo and suggestion

The GTR model has $df=8$. There are $5$ stochastic relative rate parameters and $3$ stochastic equilibrium frequencies (Figure \@ref(fig:evaluation-models-nested)). The other models we have seen are nested within this. By nested I mean that they can take on a smaller subset of the values that the more complex model can. Models that are nested within other models have a smaller degree of freedom.

In "a smaller subset of the values that the more complex model can", change "that" to "than".

I would also change "values" to "parameters".

typos

Instead, we will forego dealing with $P(D)$ at all by approximating the posterior with Markov Chain Monte Carlo (MCMC) sampling[@metropolis1953]. MCMC is a widely used to approximate complex probability distributions that are two complex to calculate analytically. MCMC is implemented by proposing a series of hypothesis that are either rejected or accepted based on specially formulated test statistic $R$ and criteria for evaluating this statistic, such that the accepted hypotheses form a sample that is drawn from the distribution of interest. In our case, that distribution of interest is the posterior distribution.

distributions that are two complex --> distributions that are too complex

- Likelihood is the probability of the data given the phylogenetic hypothesis. The probability of the data under the ML hypothesis will be exceptionally low, often far less than one in a million, since the ML hypothesis could generate many other data as well. This is because we are evaluating the probability of the data given the hypothesis, but often what we really want to know is the probability of the hypothesis given the data. They two are quite different, but are related. It would be nice to be more explicit about that relationship.

They two are quite different --> The two are quite different

- Selecting the appropriate model. Many of hte same considerations we reviewed in the context of likelihood apply here.

Many of hte same considerations --> Many of the same considerations
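
To illustrate why MCMC lets us sidestep $P(D)$, here is a minimal Metropolis sketch over three discrete hypotheses. The unnormalized weights stand in for prior × likelihood and are invented. The proposal simply picks a hypothesis uniformly at random, which is a simplification and not what phylogenetic software does.

```python
import random

def metropolis(unnormalized, n_steps, rng):
    """Sample hypotheses in proportion to their unnormalized posterior weights."""
    states = list(unnormalized)
    current = states[0]
    counts = dict.fromkeys(states, 0)
    for _ in range(n_steps):
        proposal = rng.choice(states)  # symmetric proposal
        # Ratio R of unnormalized posteriors; P(D) would cancel here
        r = unnormalized[proposal] / unnormalized[current]
        if rng.random() < r:  # always accept if R >= 1, else accept with probability R
            current = proposal
        counts[current] += 1
    return {s: c / n_steps for s, c in counts.items()}

# Hypothetical prior * likelihood values; they do not need to sum to 1
weights = {"H1": 1.0, "H2": 2.0, "H3": 7.0}
freqs = metropolis(weights, 50000, random.Random(1))
print(freqs)
```

The sampled frequencies should approach 0.1, 0.2, and 0.7, the normalized posterior, even though the normalizing constant was never computed.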

references duplicated

References are in both the references section and at the end of each chapter. Need to remove them at the end of each chapter.

Root tree on node

phy_text = "(((Species_A:0.5,Species_B:0.5):0.5,Species_C:1.0):0.2,Species_D:1.2);"

Root this tree on an internal node rather than edge. Then add demonstration that re-rooting does not impact likelihood despite asymmetry of P. This builds on explanation from @mtholder:

'The probability matrix doesn't have to be symmetric. The constraint is similar to the "detailed balance" constraint in MCMC, namely:

$\pi_i \Pr(\text{destination}=j \mid \text{start}=i) = \pi_j \Pr(\text{destination}=i \mid \text{start}=j)$

So if the ratio of the probabilities of the "forward" and "reverse" substitutions is identical to the ratio of the equilibrium frequencies of the destination and source states, then placing the root at any point on the tree will result in the same likelihood. So, you can't infer the root position from the character state data alone.'
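
A possible starting point for the demonstration is a two-tip check with an F81-style model, chosen here only because its transition probabilities have a simple closed form. The swap-in of F81, the equilibrium frequencies (borrowed from the toy mammal model elsewhere in the book), and the two-tip tree are all assumptions of this sketch, not the tree in `phy_text`.

```python
import math

PI = {"A": 0.295, "C": 0.205, "G": 0.205, "T": 0.295}

def p_f81(i, j, t):
    """F81-style transition probability with mu = 1: with probability exp(-t)
    no replacement occurs, otherwise the end state is drawn from PI."""
    decay = math.exp(-t)
    return (decay if i == j else 0.0) + PI[j] * (1 - decay)

def two_tip_likelihood(x, y, t_x, t_y):
    """Likelihood of tip states x and y, summing over the unknown root state."""
    return sum(PI[r] * p_f81(r, x, t_x) * p_f81(r, y, t_y) for r in PI)

# P is not symmetric...
forward = p_f81("A", "C", 1.0)
reverse = p_f81("C", "A", 1.0)

# ...but detailed balance holds
balance_lhs = PI["A"] * forward
balance_rhs = PI["C"] * reverse

# Slide the root along a path of total length 1.0 between the two tips
likelihoods = [two_tip_likelihood("A", "C", d, 1.0 - d) for d in (0.0, 0.2, 0.5, 0.9)]
print(forward, reverse, likelihoods)
```

The four likelihoods agree to floating-point precision, which is the root-invariance described in the quoted explanation. The same check could be extended to an internal-node rooting of the four-species tree.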

clarify outgroup rooting

There was confusion in class about how placing the root in the outgroup gives you the root of the ingroup, since you are then discussing two roots in one tree but only one point can be the oldest. Talk more about how each subtree (clade) has a root, and how we can talk about those roots in the context of a single tree.

Misc typos

Section 0.1 - guthub
Figure 1.1 Estimate
Silhouettes

Table suggestion

I think it might be prudent to use a box or a table to quickly summarize monophyly, polyphyly, and paraphyly in a manner that students could easily find during a study period, i.e.,

monophyly-group including common ancestor and all descendants
paraphyly-group including common ancestor but only some descendants
polyphyly-group including tips but not the common ancestor nor all descendants of the common ancestor shared by the included tips

Minor edit

We can now calculate the probability of a specific end state given a start state. Now what? Let's use these tools to simulate evolution along a single edge at a time, as in Figure \@ref(fig:sim-application)B.

I would change the order of the sentences here. For example, I would start with "Now what?"

Typo

We are using a mammal tree, so let's pick some parameter values that roughly approximate what we see in mammals. Rather than set all the parameters independently, let's set up an HKY85 model, which accommodates non-uniform base frequencies and different transition/ tansversion ratios. First, we can clamp $\mu=1$. This basically just means the branch lengths will be in units of expected evolutionary change. Transitions (captured by parameters $b$ and $e$) are on the order of 4 times more frequent than transversions (captured by parameters $a$, $c$, $d$, and $f$) in mammals [@rosenberg2003]. So we will clamp $b=e=2$ and $a=c=d=f=0.5$. I picked these particular values (rather than others, such as 4 and 1) because the keep the average off-diagonal entries in $\mathbf{R}$ to 1.

At the end of this paragraph, there seems to be a typo: "because the keep the average"
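
The arithmetic behind the choice of values in the quoted paragraph can be checked directly. The dictionary keys follow the parameter names in the text; the comparison against the alternative values 4 and 1 is added for illustration.

```python
# Relative rates from the text: transitions b, e and transversions a, c, d, f
rates = {"a": 0.5, "b": 2.0, "c": 0.5, "d": 0.5, "e": 2.0, "f": 0.5}

mean_rate = sum(rates.values()) / len(rates)
ts_tv_ratio = rates["b"] / rates["a"]

# The alternative mentioned in the text keeps the same ratio but a different mean
alt = {"a": 1.0, "b": 4.0, "c": 1.0, "d": 1.0, "e": 4.0, "f": 1.0}
alt_mean = sum(alt.values()) / len(alt)

print(mean_rate, ts_tv_ratio, alt_mean)  # 1.0 4.0 2.0
```

Both sets give transitions four times the transversion rate, but only the first keeps the six rates summing to 6, so the overall rate stays with $\mu$.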

Change main font

Would like to change the main font to Gyre Schola.

Info on font selection in bookdown here - https://bookdown.org/yihui/rmarkdown-cookbook/latex-variables.html

Output specifics are set in https://github.com/caseywdunn/phylogenetic_biology/blob/master/_output.yml . There you can see that I am using latex engine xelatex. So font should be set with mainfont. That is how I am setting it now, but it doesn't work. Maybe this font is not installed on the local system?

In https://github.com/caseywdunn/phylogenetic_biology/blob/master/docker/Dockerfile, I do install it with tlmgr install tex-gyre. But that installs it as a tex package. I probably need to install it at a system level.

Typo

The most widely used approach to assessing confidence in maximum likelihood phylogenetic inference is the bootstrap [@felsenstein1985confidence]. Given a data matrix where rows are taxa and columns are characters (nucleotide sites in the case of DNA) with $n$ columns, bootstrapping generates a new matrix, also with $n$ columns, be resampling from the the original matrix with replacement. Some columns from the original matrix won't be sampled at all, some will be sampled once, and some will be sampled multiple times.

"be resampling" --> "by resampling"
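
The column resampling described in the quoted paragraph can be sketched in Python. The tiny alignment, the taxon names, and the function name are made up for the example.

```python
import random

def bootstrap_columns(alignment, rng):
    """Build a new matrix with the same number of columns by sampling
    column indices from the original matrix with replacement."""
    n = len(next(iter(alignment.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    resampled = {
        taxon: "".join(seq[c] for c in cols) for taxon, seq in alignment.items()
    }
    return resampled, cols

# Hypothetical alignment: rows are taxa, columns are nucleotide sites
alignment = {
    "Species_A": "ACGTACGT",
    "Species_B": "ACGTTCGA",
    "Species_C": "ACCTTCGA",
}

resampled, cols = bootstrap_columns(alignment, random.Random(0))
print(cols)
print(resampled)
```

The same column indices are applied to every taxon, so each resampled column is an intact copy of an original column. Some indices will typically be absent from `cols` and others repeated.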

problems knitting to pdf

Through chapter 6, I had only run Build Book with gitbook output. Now I want to Build Book to pdf and ensure both output formats work moving forward. Resolved some issues, but these still remain:

  • Table of contents not populated

  • citation, equation, and maybe other references are broken. They show up as ?? in text, and phylogenetic_biology.log has many warnings about these. Since they work fine for gitbook, it seems like there is some global latex problem

  • There is a math formatting problem that aborts pdf build on page 52. This can be seen in tail of log file:

LaTeX Warning: Reference `eq:jc69' on page 53 undefined on input line 980.

! Missing $ inserted.
<inserted text> 
                $
l.983 
       
Here is how much of TeX's memory you used:
 18584 strings out of 479465
 323958 string characters out of 5881418
 785041 words of memory out of 5000000
 37582 multiletter control sequences out of 15000+600000
 541685 words of font info for 88 fonts, out of 8000000 for 9000
 14 hyphenation exceptions out of 8191
 84i,7n,118p,1251b,568s stack positions out of 5000i,500n,10000p,200000b,80000s

Output written on phylogenetic_biology.pdf (52 pages).

Some relevant links on this error:

https://www.overleaf.com/learn/latex/Errors/Missing%20$%20inserted

https://tex.stackexchange.com/questions/52804/missing-inserted-inserted-text (suggests that it could be a math character pulled in from bib)

Suggestion

Two additional model features are often added to address this rate heterogeneity - `I` and `G`. The `I` parameter is the fraction of sites that are invariant and effectively have a rate of zero. `G` refers to the discrete Gamma model of rate heterogeneity [@yang1994maximum], where each site is assigned to one of a set number (usually 4) rate categories. The distribution of rates across these categories is modeled with a parameter referred to as `\alpha`.

I would describe Gamma slightly differently.

G refers to the actual rate heterogeneity across variant sites. This is modeled with a continuous Gamma distribution that is discretized, usually into 4 rate categories. Sites are assigned to these different rate categories. To specify the shape of the Gamma distribution, we use a parameter referred to as \alpha.

the discrete Gamma model of rate heterogeneity (Yang 1994), where each site is assigned to one of a set number (usually 4) rate categories. The distribution of rates across these categories is modeled with a parameter referred to as \alpha

Add a table showing relationships of all analyses

Rows are analysis types, columns are features (tree topology, tree edge lengths, tip node states, internal node, etc...) and cells are how each of these features is handled in the analysis (clamp, marginalize, estimate, etc...)

for example:

What are we doing?

Sometimes estimates are nuisances

  • Clamping.

  • Estimate

    • Point estimate.
    • Distribution.
    • Optimize.
  • Marginalizing.

and here is a start at the table:

Goal,tree topology,tree edge lengths,tip node states,internal node
inference,estimate,estimate,clamp,marginalize,clamp,estimate
simulate data,clamp,clamp,estimate,estimate,clamp,clamp
independent contrast,clamp,clamp,clamp,estimate,clamp,estimate
simulate tree,estimate,estimate,,,,

