Comments (18)
Along these lines, what about having augur tree label nodes in the output Newick as NODE_0000071,
that was the initial idea, but that doesn't work since rerooting and polytomy resolution changes topology/adds/removes nodes.
I had initially been reading the command as augur timetree and expecting it to do the one Unix-y task of taking a ML subs tree and outputting a timetree. I would then have augur ancestral to do ancestral state inference. This streamlines both modules and surfaces the hidden ancestral state functionality.
yes and no. timetree inference via treetime will in most cases do ancestral inference. One could do it twice, but then it would be crucial to do it after timetree inference.
node_data.json is opaque, but if augur timetree produced node_dates.json and augur ancestral produced node_nt_seqs.json
initially node_data.json was the only one. agree that the name doesn't make much sense anymore. but we also don't want a separate file for each individual attribute. if you call it dates, then branch lengths should be somewhere else and we get a massive proliferation of files.
augur tree is appropriately agnostic about the underlying program used to build the tree. It could be RAxML, FastTree, etc...
but this only works if we basically hide all method specific commands from the user. Otherwise, we get a big confusion with the hundreds of options that these different tools take. In this spirit, the current treetime module is just a wrapper around treetime that happens to produce output compatible with augur.
from augur.
I strongly agree with what you've proposed @trvrb. Unlike other augur commands, the treetime command is modal and bundles several things into one. It'd be good to unbundle those.
from augur.
Oh... and I thought of two other things that bothered me here:
-
Calling
augur treetime --timetree
is just weird. -
augur tree
is appropriately agnostic about the underlying program used to build the tree. It could be RAxML, FastTree, etc...augur tree
is about going from alignment to subs tree, irrespective of method to get there. However,augur treetime
breaks this and specifies program in the title. We could very well eventually want to allow LSD of some other method to infer timetrees for completeness.
from augur.
I agree with @rneher . Also, moving ancestral
out separately is only going to remove 2 (3 if you count the output vcf) arguments, so may not really change the complexity of the treetime
call.
I think given what treetime does it's acceptable that it has a lot of arguments, but I'd agree that they're not so straightforward at the moment and could be made clearer, possibly by renaming some, better help info, and mostly, good documentation somewhere.
I also agree that timetree/treetime is confusing. I mess this up on the regular π
from augur.
but this only works if we basically hide all method specific commands from the user. Otherwise, we get a big confusion with the hundreds of options that these different tools take.
I think this is a false dilemma. It's possible to expose only the most common method-specific options, if any, and also provide a general mechanism for passing any arbitrary option into specific methods without enumerating all of them.
There are several approaches we could take. Using augur tree
as an example, one approach might be:
-
Providing common method-specific options like
--raxml-foo
or--fasttree-bar-baz
which map to--foo
and--bar-baz
in RAxML and FastTree. These would be "supported" method options. -
Providing a
--method-options
option which is used like--method-options=--foo
. This provides people the rope they'd need to do what they wanted, but also implies that something custom is going on which requires you to know what you're doing ("you break it, you keep it").
Again, there are other ways to approach the syntax here, but the general idea is the same.
initially node_data.json was the only one. agree that the name doesn't make much sense anymore. but we also don't want a separate file for each individual attribute. if you call it dates, then branch lengths should be somewhere else and we get a massive proliferation of files.
On what order of magnitude would "massive" be?
from augur.
the number of files is ultimately not the issue (would be probably just 2-3 more). The more central issue is how to address nodes across steps.
from augur.
the number of files is ultimately not the issue (would be probably just 2-3 more). The more central issue is how to address nodes across steps.
Nod. It didn't seem like the number of files would actually be an issue, that's why I asked.
For the core issue of internal nodes, it seems that node annotations must happen after the topology is fixed, whichever step that may be. Is that correct?
from augur.
what we could do are simple consistency checks on load:
- name internal nodes in
augur tree
(these might change later) - if
read_node_data
is called, that implies a previous augur module requiring a tree was called. We could simply pass the tree file name to that function and check for consistency of names. - if an augur module is called with a 'raw' tree without internal labels, name them and rewrite the tree with nothing changed but the labels (kind of violates the in/output rules, but would enable somebody to go into the pipeline with an ML tree and a metadata table and perform
traits
. result would be atraits.json
and a tree with internal node labels, the latter would be a surprise but could be communicated in all caps...)
from augur.
Sounds reasonable, @rneher, except for one bit:
rewrite the tree with nothing changed but the labels (kind of violates the in/output rulesβ¦)
Overwriting an input file in-place is a huge taboo. Even with an all caps warning, it could easily be a nasty surprise. I strongly recommend not doing this, but finding another way of allowing on-the-fly node naming if a tree without them is provided. This could be, for example, augur traits
automatically running an augur assign-node-names
command with the "raw" tree as input and saving a new "named" tree as output and then automatically using that as the actual input to augur traits
.
But also, augur traits
could just require a tree with named nodes and tell you to run augur assign-node-names
if you have a tree without them.
from augur.
@rneher: I definitely agree on finding ways to do consistency checks.
@tsibley: I agree that silently rewriting files sounds like a bad idea.
from augur.
Overwriting an input file in-place is a huge taboo. Even with an all
caps warning, it could easily be a nasty surprise.
I agree. (but hey, gzip does it...)
but finding another way of allowing on-the-fly node
naming if a tree without them is provided. This could be, for example,
automatically running an |augur assign-node-names| command with the
"raw" tree as input and saving a new "named" tree as output and then
using that as the new input.
yes, I was thinking something similar.
But also, |augur traits| could just require a tree with named nodes and
tell you to run |augur assign-node-names| if you have a tree without them.
it sort of strikes me as clunky to have such an book-keeping step explicit.
from augur.
re: creating an ancestral
command (to be run after timetree temporal
or whatever it's called):
- π it is unnecessary compute, as it has to be done within timetree anyways
- π it's clearer what's going on
- π it'd be super useful to add ancestral states to a BEAST tree, where we don't want to mess with topology or branches. (This isn't meant to turn this thread into "how to deal with BEAST", but clearly it's something i've been doing and will move over to modular augur at some point)
For what it's worth, I like the requirement for "new" layers to simply write a flat JSON with the nodes
object and have that end up in the tree.JSON
from augur.
Okay. Latest thoughts...
It's clear that we need to support at least 3 big use cases.
- Going from alignment to ML tree and time tree via IQ-Tree + TreeTime and then traits, etc... + export
- Going from alignment to ML tree via IQ-Tree and then traits, etc... + export
- Going from BEAST tree to ML branch lengths and then traits, etc... + export
The current setup is actually not so bad for these three aims.
- Is accomplished with
augur tree
followed byaugur treetime
to producetree.nwk
andnode_data.json
that then form the basis of downstream operations. - Is accomplished with
augur tree
followed byaugur treetime
but asking treetime not to make a timetree. - Is accomplished with
augur treetime
but asking treetime to use the input tree as timetree.
I think the core operations could be made more obvious by a simple rename of augur treetime
to augur refine
. This makes it clear (at least to me) that you would produce a raw tree tree_raw.nwk
with augur tree
than then needs to be refined by running augur refine
before doing any downstream operations. This seems semantically logical to me. You would still need to "refine" a BEAST tree even if it doesn't need timing done.
In one example... there was the issue of wanting to add traits to an ML tree (option 2 above). In this case, I'd recommend to enforce augur traits
to take node_data.json
as input. This way it's clear that you go augur tree
, augur refine
, augur traits
. Additionally, augur refine
could handle rerooting the ML tree via --root NODENAME
as it currently does. augur tree
wouldn't expose rooting options of IQ-Tree / RAxML.
I know this is basically exactly as things currently exist. Would just work on the exact interface to augur treetime
/ augur refine
.
from augur.
an argument in favor of pulling sequences out of node_data.json
and putting them into a separate file:
this would align the file I/O pattern between the vcf/fasta workflows. I am fine with splitting sequence reconstruction from rerooting/timetree if this makes is more transparent. the computational overhead is small.
from augur.
392c17c checks whether names in node_data
match nodes in tree if provided. We should more generally discuss how we want to handle errors and warnings.
from augur.
I hadn't noticed that VCF version of augur treetime
produces three file outputs https://github.com/nextstrain/augur/blob/master/builds/tb/Snakefile#L68. To me, this is one output too many and would push for separating augur refine
from augur ancestral
. augur ancestral
would produce a single output file (likeaugur traits
) that's something like nt_muts.json
(or I guess nt_muts.vcf
).
Looking at tb/results/node_data.json
I see things like "mutations": [["A", 600688, "C"]
, so I don't quite understand why treetime.vcf
needs to exist (I see a corresponding line that reads MTB_anc 600688 . A C
, but this is a separate issue and likely due to my own naivety with what's going on with the VCF pipeline.
Edit: I see... The difference is that node_data.json
gets a sequence
attribute. Vs VCF where this attribute is not supplied.
from augur.
I agree this needs streamlining. sequences
in node_data.json
was really just a temporary fix when I realized I need them in translate
.
from augur.
I'm going to close this now given that #145 is merged. There were some issues surfaced above (like passing arguments to IQTREE
and sequences
in node data), but I think these exist better as their own issues.
from augur.
Related Issues (20)
- frequencies: error with `--region` flag HOT 3
- Improve validation output to identify problematic nodes / properties
- `parse` silently removes spaces from record ids in the sequence output but not in the metadata output HOT 1
- `measurements export` does not consistently allow the strain column to be used as a grouping column
- Export schema wrongly fails on gene names starting with 'nuc' due to lookahead
- align: error message when reading a reference sequence does not completely explain the root issue
- Add schema for node-data JSONs HOT 1
- Allow custom date column name to be specified in `refine` - similar to `metadata-id-column` HOT 1
- Add docs regarding bootstraps
- Clarification on augur tree --exclude-sites masking HOT 4
- Make command line option headings linkable
- Augur export error HOT 2
- `augur align --method nextclade` should wrap `nextclade run` HOT 1
- pip/conda dependency version constraints not guaranteed in all environments
- export: Add option to extend the default lat/longs HOT 3
- Number of Nt changes is different from number of mutations (divergence)
- Use PyPI's pyright? HOT 2
- Augur 24.4.0 release
- Review Pyright rule exceptions HOT 2
- Support pandas version 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from augur.