
Proposal for tree JSONs rather than Newick + node JSONs about augur HOT 6 CLOSED

nextstrain commented on June 4, 2024

Proposal for tree JSONs rather than Newick + node JSONs

from augur.

Comments (6)

rneher commented on June 4, 2024

I understand the impetus to 'get-it-right-now', and I appreciate all the other suggestions/changes over the last couple of days. But I am not convinced we know what is "right" with sufficient certainty to justify the complete re-org.

I do think we can find a good solution for the node_label/node_data consistency (if only by checking that both have identical sets of names and failing noisily if they don't).
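A minimal sketch of the name check suggested above. The function name `check_node_names` and the `{"nodes": {name: {...}}}` layout of the node JSON are assumptions for illustration, not something confirmed in this thread; the tree's node names are passed in as a plain iterable so that no particular Newick parser is implied.

```python
def check_node_names(tree_names, node_data):
    """Raise ValueError unless the tree and the node JSON name the same nodes.

    tree_names: iterable of node names read from the Newick tree.
    node_data:  parsed node JSON, assumed to look like {"nodes": {name: {...}}}.
    """
    tree_names = set(tree_names)
    json_names = set(node_data["nodes"])
    only_in_tree = sorted(tree_names - json_names)
    only_in_json = sorted(json_names - tree_names)
    if only_in_tree or only_in_json:
        # Fail noisily and show a few offending names from each side.
        raise ValueError(
            f"node name mismatch: {len(only_in_tree)} only in tree "
            f"{only_in_tree[:5]}, {len(only_in_json)} only in node data "
            f"{only_in_json[:5]}"
        )
```

A step that reads both files could call this first, so that a mismatch stops the pipeline instead of silently dropping annotations.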


rneher commented on June 4, 2024

We had such a pattern in augurlinos, but abandoned it to make it more suitable for snakemake or similar. If you have a tree.json that gets updated by different steps, snakemake can't determine the order in which rules are run or whether a particular rule needs to be rerun because input data changed. To avoid this, we organized it such that each rule has explicit in- and outputs.

If you call that tree.json differently after each rule (tree.json, tree_aa.json, tree_aa_traits.json, tree_aa_traits_titers.json), you end up with multiple sources of truth -- something that we are trying to avoid.

The node_data.json, traits.json etc. with the common structure that can be globbed together clearly is a compromise that has its downsides. But the individual flat JSONs can be readily inspected and the Newick can be looked at in any tree viewer. The hierarchical tree JSON is much messier to look at.
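To illustrate what "globbed together" could mean in practice, here is a rough sketch that merges several flat node JSONs by node name. The `{"nodes": {...}}` layout, the function names, and the rule that later files win on a key clash are all assumptions made for this example.

```python
import json
from pathlib import Path


def merge_node_dicts(node_dicts):
    """Merge parsed node JSONs (assumed layout {"nodes": {name: attrs}})."""
    merged = {}
    for data in node_dicts:
        for name, attrs in data["nodes"].items():
            # Later inputs overwrite earlier ones on a key clash (an assumption).
            merged.setdefault(name, {}).update(attrs)
    return merged


def load_and_merge(directory, pattern="*.json"):
    """Glob a directory for flat node JSONs and merge them in sorted order."""
    paths = sorted(Path(directory).glob(pattern))
    return merge_node_dicts(json.loads(p.read_text()) for p in paths)
```

Each input file stays small and flat, so it can still be inspected on its own before merging.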


trvrb commented on June 4, 2024

> We had such a pattern in augurlinos, but abandoned it to make it more suitable for snakemake or similar. If you have a tree.json that gets updated by different steps, snakemake can't determine the order in which rules are run or whether a particular rule needs to be rerun because input data changed. To avoid this, we organized it such that each rule has explicit in- and outputs.

I'm sorry Richard, but I still disagree. We currently use a pattern of sequences.fasta --> filtered.fasta. This pattern is common in bioinformatics pipelines. I would suggest that a Snakefile that reads in a linear top-to-bottom fashion, where files go tree.json --> tree_dates.json --> tree_traits.json --> tree_aa.json, would be an entirely familiar pattern and closer to how most people approach bioinformatics. I.e., take a set of augur commands, one after the other, and encode this in snakemake form. Each output becomes the next rule's input.

> If you call that tree.json differently after each rule (tree.json, tree_aa.json, tree_aa_traits.json, tree_aa_traits_titers.json), you end up with multiple sources of truth -- something that we are trying to avoid.

No more than filtered.fasta and sequences.fasta are multiple sources of truth. It will be clear which tree.json is farthest down the line.

> The node_data.json, traits.json etc. with the common structure that can be globbed together clearly is a compromise that has its downsides. But the individual flat JSONs can be readily inspected and the Newick can be looked at in any tree viewer. The hierarchical tree JSON is much messier to look at.

Again, I'd suggest that looking at results in auspice is a big win. With split Newick + nodes there is no way to visualize results (from for example augur traits) until you get all the way to augur export.


trvrb commented on June 4, 2024

Another plug for linear Snakemake flow... In the current setup, it's not at all obvious that you couldn't run augur traits on tree_raw.nwk + metadata.tsv. This will blow up because tree_raw.nwk does not have node labels. However, it seems entirely reasonable for someone to want to build an ML tree and then infer ancestral traits on the ML tree.

Tree JSONs that become increasingly annotated seem like a much more composable direction.
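As a rough illustration of the "increasingly annotated" idea, the sketch below treats each step as a function that takes a nested tree dict and returns an annotated copy. The `name`/`children` keys and the `annotate` helper are hypothetical and do not reflect the actual auspice JSON schema.

```python
import copy


def annotate(tree, key, values):
    """Return a copy of a nested tree dict with values[name] stored under key.

    The tree layout {"name": ..., "children": [...]} is assumed for illustration.
    """
    out = copy.deepcopy(tree)  # leave the input untouched, like a new output file
    stack = [out]
    while stack:
        node = stack.pop()
        if node["name"] in values:
            node[key] = values[node["name"]]
        stack.extend(node.get("children", []))
    return out


# Each step consumes the previous step's output, mirroring
# tree.json --> tree_dates.json --> tree_traits.json.
tree = {"name": "root", "children": [{"name": "A"}, {"name": "B"}]}
tree_dates = annotate(tree, "date", {"A": 2017.1, "B": 2016.5})
tree_traits = annotate(tree_dates, "region", {"root": "asia", "A": "asia"})
```

Every intermediate dict is a complete tree, which is what would make each stage viewable on its own.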


rneher commented on June 4, 2024

There is no fundamental reason why it can't be done this way, but it doesn't come without drawbacks. I disagree with our workflow being linear -- and it doesn't need to be linear either. Traits, translations, titers, frequencies all live independently of each other and can be run in parallel. There is no reason to string them together. Tools like snakemake are supposed to handle exactly this case: generate an acyclic (not necessarily linear) graph of dependencies.
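The point about an acyclic but non-linear dependency graph can be sketched with Python's standard `graphlib` module. The rule names and the dependencies between them below are assumptions loosely based on the steps named in this thread, and this is not how snakemake itself schedules jobs.

```python
from graphlib import TopologicalSorter

# Hypothetical rule graph: each rule maps to the rules whose outputs it reads.
deps = {
    "tree": set(),
    "timetree": {"tree"},
    "traits": {"timetree"},
    "translate": {"timetree"},
    "titers": {"timetree"},
    "export": {"traits", "translate", "titers"},
}


def parallel_batches(deps):
    """Group rules into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

With this graph, `traits`, `translate`, and `titers` land in the same batch: none of them reads another's output, so they could run in parallel once the tree is final.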

Pros:

- Being able to view things in auspice is nice.
- It would get around the labeling issue, but we still need to respect the order of rules, i.e. traits etc. can only be run once the topology and branch lengths are final. The labeling issue in the current pipeline could be avoided by putting the labels into the Newick right away.

Cons:

- It will be opaque what the individual steps added to the JSON.
- Hierarchical JSONs are hard to inspect.
- Non-standard output files.
- We are stuck with the arcane JSON format of auspice (attr etc., but that of course could be changed as well).
- The files will be quite bloated: the initial one is (tree, meta), then (tree, meta, sequences), (tree, meta, ancestral + sequences), (tree, meta, ancestral + sequences, traits), etc. Certainly a messier data duplication than sequences.fasta -> filtered.fasta.

Either way, all of this can be done. But I have to say that I am somewhat annoyed that we are reopening this box now. We sketched out the basic pipeline in April in that google doc and that got implemented pretty much the way it was envisioned. Having spent a number of weeks to make it work (after having a previous prototype of this), restructuring the entire data flow is not exactly what I am looking forward to.


trvrb commented on June 4, 2024

Sorry that this is so late. I had an attack of we-have-to-get-this-right before encouraging others to work in the system. Thank you for humoring me 🙂

I very much take the point about bloated uninspectable files and making it difficult to see what's been added by a step. I hadn't thought that part through enough.

This came from trying to look at the snakefile through fresh eyes and finding node_data.json and its dependence on a fully annotated Newick confusing.

I'm happy to close this issue now, but I would like to try to think of ways to make it more obvious what's happening in the pipeline and to make steps more composable. Adding taxon labels in augur tree seems helpful there. But we need to flag things if there's an operation that changes Newick structure so that Newick and JSON become incompatible. Rerooting in augur timetree does complicate things. Maybe we could do something like add a topo_version attribute to the top-level of the node JSON. Not 100% sure how this would work...
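One possible reading of the topo_version idea, offered only as a sketch: derive a short fingerprint from the rooted topology and store it at the top level of the node JSON, so that a later step can detect that the tree was rerooted or otherwise restructured. The nested `name`/`children` tree layout, the helper names, and the use of a truncated SHA-256 digest are all assumptions.

```python
import hashlib


def canonical(node):
    """Canonical string for a rooted topology; child order and branch lengths are ignored."""
    children = node.get("children", [])
    if not children:
        return node["name"]
    return "(" + ",".join(sorted(canonical(c) for c in children)) + ")"


def topo_version(tree):
    """Short fingerprint of the rooted topology (hypothetical scheme)."""
    return hashlib.sha256(canonical(tree).encode()).hexdigest()[:12]


def check_compatible(tree, node_data):
    """Raise ValueError if the node JSON was written for a different topology."""
    expected = node_data.get("topo_version")
    actual = topo_version(tree)
    if expected != actual:
        raise ValueError(f"topology changed: node data has {expected}, tree has {actual}")
```

Because the fingerprint is computed on the rooted tree, rerooting changes it, while merely reordering children does not. Internal node names are not part of the fingerprint in this sketch.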

I'll try to clean things up in minor ways.

I would still like to think through #133.

