
Comments (6)

rneher commented on June 4, 2024

I understand the impetus to 'get-it-right-now'... and I appreciate all the other suggestions/changes over the last couple of days. But I am not convinced we know what is "right" with sufficient certainty to justify the complete re-org.

I do think we can find a good solution for the node_label/node_data consistency (if only by checking that both have identical sets of names and failing noisily if they don't).
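A minimal sketch of such a check, assuming the node labels from the tree are available as a list of names and the node data as a dict keyed by node name. The function name and the example names are hypothetical, not part of augur:

```python
def check_node_names(tree_names, node_data):
    """Raise ValueError unless tree and node data name exactly the same nodes."""
    tree_set = set(tree_names)
    data_set = set(node_data)
    missing = sorted(tree_set - data_set)  # labelled in the tree, absent from node data
    extra = sorted(data_set - tree_set)    # present in node data, absent from the tree
    if missing or extra:
        raise ValueError(
            f"node name mismatch: missing from node data {missing}, "
            f"not in tree {extra}"
        )


# Hypothetical example: both sides agree, so nothing is raised.
tree_names = ["NODE_0000001", "sample_A", "sample_B"]
node_data = {"NODE_0000001": {}, "sample_A": {}, "sample_B": {}}
check_node_names(tree_names, node_data)
```

Raising an exception rather than logging a warning is what makes the failure noisy: a step that consumes mismatched inputs stops instead of silently dropping annotations.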

from augur.

rneher commented on June 4, 2024

We had such a pattern in augurlinos, but abandoned it to make it more suitable for snakemake or similar. If you have a tree.json that gets updated by different steps, snakemake can't determine the order in which rules are run or whether a particular rule needs to be rerun because input data changed. To avoid this, we organized it such that each rule has explicit inputs and outputs.

If you call that tree.json differently after each rule (tree.json, tree_aa.json, tree_aa_traits.json, tree_aa_traits_titers.json), you end up with multiple sources of truth -- something that we are trying to avoid.

The node_data.json, traits.json etc. with the common structure that can be globbed together is clearly a compromise that has its downsides. But the individual flat JSONs can be readily inspected and the Newick can be looked at in any tree viewer. The hierarchical tree JSON is much messier to look at.


trvrb commented on June 4, 2024

> We had such a pattern in augurlinos, but abandoned it to make it more suitable for snakemake or similar. If you have a tree.json that gets updated by different steps, snakemake can't determine the order in which rules are run or whether a particular rule needs to be rerun because input data changed. To avoid this, we organized it such that each rule has explicit inputs and outputs.

I'm sorry Richard, but I still disagree. We currently use a pattern of sequences.fasta --> filtered.fasta. This pattern is common in bioinformatics pipelines. I would suggest that a Snakefile that reads in a linear top-to-bottom fashion, where files go tree.json --> tree_dates.json --> tree_traits.json --> tree_aa.json, would be an entirely familiar pattern and closer to how most people approach bioinformatics. I.e. take a set of augur commands, one after the other, and encode this in snakemake form. Each output becomes the next rule's input.

> If you call that tree.json differently after each rule (tree.json, tree_aa.json, tree_aa_traits.json, tree_aa_traits_titers.json), you end up with multiple sources of truth -- something that we are trying to avoid.

No more than filtered.fasta and sequences.fasta are multiple sources of truth. It will be clear which tree.json is farthest down the line.

> The node_data.json, traits.json etc. with the common structure that can be globbed together is clearly a compromise that has its downsides. But the individual flat JSONs can be readily inspected and the Newick can be looked at in any tree viewer. The hierarchical tree JSON is much messier to look at.

Again, I'd suggest that looking at results in auspice is a big win. With split Newick + nodes there is no way to visualize results (from augur traits, for example) until you get all the way to augur export.


trvrb commented on June 4, 2024

Another plug for linear Snakemake flow... In the current setup, it's not at all obvious that you couldn't run augur traits on tree_raw.nwk + metadata.tsv. This will blow up because tree_raw.nwk does not have node labels. However, it seems entirely reasonable for someone to want to build an ML tree and then infer ancestral traits on the ML tree.

Tree JSONs that become increasingly annotated seem like a much more composable direction.


rneher commented on June 4, 2024

There is no fundamental reason why it can't be done this way, but it doesn't come without drawbacks. I disagree with our workflow being linear -- and it doesn't need to be linear either. Traits, translations, titers, frequencies all live independently of each other and can be run in parallel. There is no reason to string them together. Tools like snakemake are supposed to handle exactly this case: generate an acyclic (not necessarily linear) graph of dependencies.
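To illustrate the acyclic-but-not-linear point, here is a rough sketch using Python's standard-library graphlib (Python 3.9+) rather than snakemake itself. The step names and their dependencies are assumptions made for illustration, not the actual augur rule graph:

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each maps to the set of steps it depends on.
dependencies = {
    "tree": set(),
    "timetree": {"tree"},
    "traits": {"timetree"},
    "translations": {"timetree"},
    "titers": {"timetree"},
    "frequencies": {"timetree"},
    "export": {"traits", "translations", "titers", "frequencies"},
}


def parallel_batches(deps):
    """Group steps into batches whose members could run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(dependencies)
for batch in batches:
    print(batch)
```

Under these assumed dependencies, traits, translations, titers and frequencies land in the same batch, which is the sense in which they need not be strung together.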

Pros:

  • being able to view things in auspice is nice
  • it would get around the labeling issue, but we still need to respect the order of rules, i.e. traits etc. can only be run once the topology and branch lengths are final. The labeling issue in the current pipeline could be avoided by putting the labels into the Newick right away.

Cons:

  • it will be opaque what the individual steps added to the json
  • hierarchical JSONs are hard to inspect.
  • non-standard output files
  • we are stuck with the arcane json format of auspice (attr etc, but that of course could be changed as well).
  • the files will be quite bloated: the initial one is (tree, meta), then (tree, meta, sequences), (tree, meta, ancestral + sequences), (tree, meta, ancestral+sequences, traits), etc. Certainly a messier data duplication than sequences.fasta -> filtered.fasta

Either way, all of this can be done. But I have to say that I am somewhat annoyed that we are reopening this box now. We sketched out the basic pipeline in April in that google doc and that got implemented pretty much the way it was envisioned. Having spent a number of weeks making it work (after having a previous prototype of this), restructuring the entire data flow is not exactly what I am looking forward to.


trvrb commented on June 4, 2024

Sorry that this is so late. I had an attack of we-have-to-get-this-right before encouraging others to work in the system. Thank you for humoring me 🙂

I very much take the point about bloated uninspectable files and making it difficult to see what's been added by a step. I hadn't thought that part through enough.

This came from trying to look at the snakefile through fresh eyes and finding node_data.json and its dependence on a fully annotated Newick confusing.

I'm happy to close this issue now, but I would like to try to think of ways to make it more obvious what's happening in the pipeline and to make steps more composable. Adding taxon labels in augur tree seems helpful there. But we need to flag things if there's an operation that changes Newick structure so that Newick and JSON become incompatible. Rerooting in augur timetree does complicate things. Maybe we could do something like add a topo_version attribute to the top-level of the node JSON. Not 100% sure how this would work...
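One possible reading of the topo_version idea, offered only as a sketch since the thread does not settle how it would work: fingerprint the rooted topology by hashing the set of clades (each clade being the set of tip names below a node). The nested-dict tree representation and the function names here are hypothetical, and tip names are assumed not to contain "," or "|":

```python
import hashlib


def collect_clades(node, clades):
    """Return the tips below `node`, recording every clade seen in `clades`."""
    children = node.get("children", [])
    if not children:
        tips = frozenset([node["name"]])
    else:
        tips = frozenset().union(*(collect_clades(c, clades) for c in children))
    clades.add(tips)
    return tips


def topo_version(tree):
    """Short hash that changes when the rooted topology changes."""
    clades = set()
    collect_clades(tree, clades)
    # Sort tips within each clade and the clades themselves, so child order
    # and internal node names do not affect the result.
    canonical = sorted(",".join(sorted(clade)) for clade in clades)
    return hashlib.sha256("|".join(canonical).encode()).hexdigest()[:12]


def leaf(name):
    return {"name": name}


original = {"children": [{"children": [leaf("A"), leaf("B")]}, leaf("C")]}
reordered = {"children": [leaf("C"), {"children": [leaf("B"), leaf("A")]}]}
rerooted = {"children": [leaf("A"), {"children": [leaf("B"), leaf("C")]}]}
```

A step that reads both the Newick and a node JSON could then compare the stored value against one recomputed from the tree and refuse to proceed on a mismatch. This fingerprint ignores branch lengths, so it would flag rerooting but not a change in branch lengths alone.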

I'll try to clean things up in minor ways.

I would still like to think through #133.
