Comments (6)
I understand the impetus to 'get-it-right-now', and I appreciate all the other suggestions and changes over the last couple of days. But I am not convinced we know what is "right" with sufficient certainty to justify the complete re-org.
I do think we can find a good solution for the node_label/node_data consistency (if only by checking that both have identical sets of names and failing noisily if they don't).
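A minimal sketch of such a check, under stated assumptions: the tree's node names have already been extracted into a list, and the node data is a dict with a top-level "nodes" key keyed by node name. The function name `check_node_names` and the example names are hypothetical, not part of augur.

```python
def check_node_names(tree_names, node_data):
    """Raise ValueError unless tree and node data name exactly the same nodes."""
    tree_set = set(tree_names)
    data_set = set(node_data["nodes"])  # assumed layout: {"nodes": {name: {...}}}
    only_in_tree = sorted(tree_set - data_set)
    only_in_data = sorted(data_set - tree_set)
    if only_in_tree or only_in_data:
        # Fail noisily: report counts plus a few example names from each side.
        raise ValueError(
            f"node name mismatch: {len(only_in_tree)} only in tree "
            f"{only_in_tree[:5]}, {len(only_in_data)} only in node data "
            f"{only_in_data[:5]}"
        )


tree_names = ["NODE_0000001", "A", "B"]
node_data = {"nodes": {"NODE_0000001": {}, "A": {}, "B": {}}}
check_node_names(tree_names, node_data)  # identical name sets: no error
```

Comparing sets in both directions catches an unlabeled tree as well as stale entries left in the node data.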
from augur.
We had such a pattern in augurlinos, but abandoned it to make it more suitable for Snakemake or similar. If you have a tree.json that gets updated by different steps, Snakemake can't determine the order in which rules are run or whether a particular rule needs to be rerun because input data changed. To avoid this, we organized it such that each rule has explicit inputs and outputs.
If you call that tree.json differently after each rule (tree.json, tree_aa.json, tree_aa_traits.json, tree_aa_traits_titers.json), you end up with multiple sources of truth -- something that we are trying to avoid.
The node_data.json, traits.json etc. with the common structure that can be globbed together is clearly a compromise that has its downsides. But the individual flat JSONs can be readily inspected and the Newick can be looked at in any tree viewer. The hierarchical tree JSON is much messier to look at.
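To illustrate what "globbed together" could look like, here is a minimal sketch. It assumes every flat JSON has a top-level "nodes" key mapping node names to attributes; the function name `merge_node_data` is hypothetical, and on conflicting keys the later file simply wins.

```python
import json
from pathlib import Path


def merge_node_data(paths):
    """Combine flat per-node JSON files into one {node_name: {attr: value}} dict."""
    merged = {}
    for path in paths:
        with open(path) as fh:
            nodes = json.load(fh)["nodes"]  # assumed layout, see lead-in
        for name, attrs in nodes.items():
            # Later files overwrite earlier ones when attribute keys collide.
            merged.setdefault(name, {}).update(attrs)
    return merged
```

It could be called as `merge_node_data(sorted(Path("results").glob("*.json")))`, where the `results` directory is only an example. Each input file stays small and inspectable on its own, and the merge happens only at the point where a combined view is needed.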
from augur.
> We had such a pattern in augurlinos, but abandoned it to make it more suitable for Snakemake or similar. If you have a tree.json that gets updated by different steps, Snakemake can't determine the order in which rules are run or whether a particular rule needs to be rerun because input data changed. To avoid this, we organized it such that each rule has explicit inputs and outputs.
I'm sorry Richard, but I still disagree. We currently use a pattern of sequences.fasta --> filtered.fasta. This pattern is common in bioinformatics pipelines. I would suggest that a Snakefile that reads in a linear top-to-bottom fashion, where files go tree.json --> tree_dates.json --> tree_traits.json --> tree_aa.json, would be an entirely familiar pattern and closer to how most people approach bioinformatics. I.e., take a set of augur commands, one after the other, and encode this in Snakemake form. Each output becomes the next rule's input.
> If you call that tree.json differently after each rule (tree.json, tree_aa.json, tree_aa_traits.json, tree_aa_traits_titers.json), you end up with multiple sources of truth -- something that we are trying to avoid.
No more than filtered.fasta and sequences.fasta are multiple sources of truth. It will be clear which tree.json is farthest down the line.
> The node_data.json, traits.json etc. with the common structure that can be globbed together is clearly a compromise that has its downsides. But the individual flat JSONs can be readily inspected and the Newick can be looked at in any tree viewer. The hierarchical tree JSON is much messier to look at.
Again, I'd suggest that looking at results in auspice is a big win. With split Newick + nodes there is no way to visualize results (from, for example, augur traits) until you get all the way to augur export.
from augur.
Another plug for linear Snakemake flow... In the current setup, it's not at all obvious that you couldn't run augur traits on tree_raw.nwk + metadata.tsv. This will blow up because tree_raw.nwk does not have node labels. However, it seems entirely reasonable for someone to want to build an ML tree and then infer ancestral traits on the ML tree.
Tree JSONs that become increasingly annotated seem like a much more composable direction.
from augur.
There is no fundamental reason why it can't be done this way, but it doesn't come without drawbacks. I disagree with our workflow being linear -- and it doesn't need to be linear either. Traits, translations, titers, and frequencies all live independently of each other and can be run in parallel. There is no reason to string them together. Tools like Snakemake are supposed to handle exactly this case: generate an acyclic (not necessarily linear) graph of dependencies.
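As a rough illustration of an acyclic but non-linear dependency graph, here is a sketch using Python's standard `graphlib`. The step names and their dependencies are assumptions made for the example (the four annotation steps each depend only on a finished tree and all feed a final export); this is not how Snakemake itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical steps; each maps to the set of steps it depends on.
deps = {
    "tree": set(),
    "traits": {"tree"},
    "translations": {"tree"},
    "titers": {"tree"},
    "frequencies": {"tree"},
    "export": {"traits", "translations", "titers", "frequencies"},
}


def parallel_batches(deps):
    """Return lists of steps; steps within one list do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(deps)
# → [['tree'], ['frequencies', 'titers', 'traits', 'translations'], ['export']]
```

The middle batch is the point being made: nothing forces those four steps into a chain.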
Pros:
- being able to view things in auspice is nice
- it would get around the labeling issue, but we still need to respect the order of rules, i.e. traits etc. can only be run once the topology and branch lengths are final. The labeling issue in the current pipeline could be avoided by putting the labels into the Newick right away.
Cons:
- it will be opaque what the individual steps added to the json
- hierarchical JSONs are hard to inspect
- non-standard output files
- we are stuck with the arcane json format of auspice (attr etc, but that of course could be changed as well).
- the files will be quite bloated: the initial one is (tree, meta), then (tree, meta, sequences), (tree, meta, ancestral + sequences), (tree, meta, ancestral+sequences, traits), etc. Certainly a messier data duplication than sequences.fasta -> filtered.fasta
Either way, all of this can be done. But I have to say that I am somewhat annoyed that we are reopening this box now. We sketched out the basic pipeline in April in that Google doc and that got implemented pretty much the way it was envisioned. Having spent a number of weeks making it work (after having a previous prototype of this), restructuring the entire data flow is not exactly what I am looking forward to.
from augur.
Sorry that this is so late. I had an attack of we-have-to-get-this-right before encouraging others to work in the system. Thank you for humoring me 🙂
I very much take the point about bloated uninspectable files and making it difficult to see what's been added by a step. I hadn't thought that part through enough.
This came from trying to look at the Snakefile through fresh eyes and finding node_data.json and its dependence on a fully annotated Newick confusing.
I'm happy to close this issue now, but I would like to try to think of ways to make it more obvious what's happening in the pipeline and to make steps more composable. Adding taxon labels in augur tree seems helpful there. But we need to flag things if there's an operation that changes Newick structure so that Newick and JSON become incompatible. Rerooting in augur timetree does complicate things. Maybe we could do something like add a topo_version attribute to the top level of the node JSON. Not 100% sure how this would work...
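One way the topo_version idea might work, sketched under assumptions: the tree is represented as nested tuples of tip names rather than parsed Newick, and the version is a short hash of a canonical form of the rooted topology. The function names and the placement of the key are hypothetical, and the sketch ignores internal node labels and branch lengths.

```python
import hashlib


def canonical(node):
    """Canonical string for a rooted tree: a tip name (str) or a tuple of children."""
    if isinstance(node, str):
        return node
    # Sorting the children makes the result independent of child order.
    return "(" + ",".join(sorted(canonical(child) for child in node)) + ")"


def topo_version(tree):
    """Short hash of the rooted topology, ignoring child order and branch lengths."""
    return hashlib.sha256(canonical(tree).encode()).hexdigest()[:12]


original = (("A", "B"), ("C", "D"))
reordered = (("D", "C"), ("B", "A"))  # same topology, children listed differently
rerooted = ("A", ("B", ("C", "D")))   # root moved onto the branch leading to A

# Hypothetical placement: store the version at the top level of the node JSON.
node_json = {"topo_version": topo_version(original), "nodes": {}}
```

A step that reads both files could recompute the hash from the Newick and refuse to continue if it differs from the stored value, which would flag a reroot while tolerating a mere reordering of children.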
I'll try to clean things up in minor ways.
I would still like to think through #133.
from augur.