ktmeaton / ncov-recombinant
Reproducible workflow for SARS-CoV-2 recombinant sequence detection.
License: MIT License
So I have a new idea for identifying the parent lineage.
England/MILK-3796834/2022 is an XM recombinant, with regions predicted by sc2rf:
44:17410|Omicron/21K
21618:29510|Omicron/21L
From nextclade, the mutations by region are as follows:
44:17410
C241T,C2470T,A2832G,C3037T,T5386G,G8393A,C10029T,C10449A,A11537G,C12513T,T13195C,C14408T,C15240T
21618:29510
C21618T,G21987A,T22200G,G22578A,C22674T,T22679C,C22686T,A22688G,G22775A,A22786C,G22813T,T22882G,G22992A,C22995A,A23013C,A23040G,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23854A,G23948T,A24424T,T24469A,C25000T,C25416T,C25584T,C26060T,C26270T,C26577G,G26709A,C26858T,A27259C,G27382C,A27383T,T27384C,C27807T,A28271T,C28311T,G28487A,G28881A,G28882A,G28883C,A29510C
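Grouping the nextclade substitutions by sc2rf region can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the helper names `position` and `split_by_region` are hypothetical, region bounds are assumed to be inclusive, and `A20000G` is a made-up substitution used only to show that sites outside both regions are dropped.

```python
import re

# Regions as reported by sc2rf in the example above: (start, end, label).
REGIONS = [
    (44, 17410, "Omicron/21K"),
    (21618, 29510, "Omicron/21L"),
]


def position(sub: str) -> int:
    """Extract the genomic coordinate from a substitution such as 'C241T'."""
    match = re.fullmatch(r"[A-Z](\d+)[A-Z]", sub)
    if match is None:
        raise ValueError(f"Not a nucleotide substitution: {sub!r}")
    return int(match.group(1))


def split_by_region(subs, regions=REGIONS):
    """Group substitutions by the region (inclusive bounds) containing them."""
    grouped = {f"{start}:{end}": [] for start, end, _ in regions}
    for sub in subs:
        pos = position(sub)
        for start, end, _ in regions:
            if start <= pos <= end:
                grouped[f"{start}:{end}"].append(sub)
                break  # a site belongs to at most one region
    return grouped


# 'A20000G' is hypothetical and falls between the two regions.
subs = ["C241T", "C14408T", "C21618T", "A29510C", "A20000G"]
for region, region_subs in split_by_region(subs).items():
    print(region, ",".join(region_subs))
```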
And if we query these mutations in cov-spectrum with LAPIS...
44:17410
Parent 1 is most likely BA.1.1.10 (72%, 647/894).
{
"errors":[],
"info": {
"apiVersion":1,
"dataVersion":1656461191,
"deprecationDate":null,
"deprecationInfo":null,
"acknowledgement":null
},
"data":[
{"pangoLineage":"B.1.1","count":1},
{"pangoLineage":"BA.1.1.18","count":30},
{"pangoLineage":"BA.1.1.12","count":3},
{"pangoLineage":"BA.1.1.10","count":647},
{"pangoLineage":"BA.1.1","count":186},
{"pangoLineage":"BA.1","count":27}
]
}
21618:29510
Parent 2 is most likely BA.2 (83%, 38/46). However, there is only one runner-up, BA.2.12.1 (17%), which falls within BA.2.
{
"errors": [],
"info":{
"apiVersion":1,
"dataVersion":1656461191,
"deprecationDate":null,
"deprecationInfo":null,
"acknowledgement":null
},
"data":[
{"pangoLineage":"BA.2","count":38},
{"pangoLineage":"BA.2.12.1","count":8}
]
}
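The percentages quoted above come from dividing each lineage count by the total across all lineages in the `data` array. A small sketch of that tally, using the Parent 1 response copied from above (the function name `lineage_proportions` is hypothetical, and no request to LAPIS is made here):

```python
def lineage_proportions(data):
    """Rank lineages by count and attach each lineage's share of the total."""
    total = sum(row["count"] for row in data)
    if total == 0:
        return []
    ranked = sorted(data, key=lambda row: row["count"], reverse=True)
    return [
        (row["pangoLineage"], row["count"], row["count"] / total)
        for row in ranked
    ]


# The "data" array from the 44:17410 response shown above.
parent_1 = [
    {"pangoLineage": "B.1.1", "count": 1},
    {"pangoLineage": "BA.1.1.18", "count": 30},
    {"pangoLineage": "BA.1.1.12", "count": 3},
    {"pangoLineage": "BA.1.1.10", "count": 647},
    {"pangoLineage": "BA.1.1", "count": 186},
    {"pangoLineage": "BA.1", "count": 27},
]

for lineage, count, proportion in lineage_proportions(parent_1):
    print(f"{lineage}\t{count}\t{proportion:.1%}")
```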
There are a couple of options to resolve the proportions:
Lineage | Count | Proportion | Note |
---|---|---|---|
BA.1.1.10 | 647 | 72% | Report |
BA.1.1 | 186 | 21% | Not sub-lineage |
BA.1.1.18 | 30 | 3% | Not sub-lineage |
BA.1 | 27 | 3% | Not sub-lineage |
BA.1.1.12 | 3 | <1% | Exclude |
B.1.1 | 1 | <1% | Exclude |
Lineage | Count | Proportion | Note |
---|---|---|---|
BA.2 | 38 | 83% | Ignore, has sub-lineage |
BA.2.12.1 | 8 | 17% | Report, is sub-lineage |
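The "sub-lineage" notes in the tables depend on knowing whether one lineage descends from another. A naive way to test that is to compare dotted prefixes, as sketched below. This is an assumption-laden shortcut, not the pipeline's method: it does not resolve Pango aliases (for example, it cannot tell that BA.1 descends from B.1.1), and the function name `is_sublineage` is hypothetical.

```python
def is_sublineage(child: str, parent: str) -> bool:
    """Return True if `child` is a strict descendant of `parent` by name.

    Only dotted-prefix matching is used, so aliased lineages are NOT
    resolved. The trailing dot prevents 'BA.1.1.10' from matching a
    parent such as 'BA.1.1.1'.
    """
    return child.startswith(parent + ".")


# Lineage pairs taken from the two tables above.
print(is_sublineage("BA.2.12.1", "BA.2"))       # BA.2.12.1 falls within BA.2
print(is_sublineage("BA.1.1", "BA.1.1.10"))     # BA.1.1 is the parent, not a child
print(is_sublineage("BA.1.1.18", "BA.1.1.10"))  # siblings under BA.1.1
```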
privateNucMutations.unlabeledSubstitutions or privateNucMutations.reversionSubstitutions
This is the list of implemented positive controls:
I'd like to give RIPPLES another try, now that I'm more proficient with the matUtils commands.
Steps:
Applies to rule nextclade_recombinants
Ideally, make all columns filters (e.g. lineage_usher).
This became apparent once I hit the 400+ sample range.
The following rules currently produce empty log files. If checked, log content has been added:
nextclade
sc2rf_recombinants
faToVcf
usher_metadata
summary
plot
One option is to have alternative rules for plot and report:
plot_historical
report_historical
This is reported in the column cov-spectrum_query in the linelists. The substitutions listed must be shared by all samples within the cluster.
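The "shared by all samples" requirement amounts to a set intersection over each sample's substitutions. A minimal sketch follows; the function name `shared_substitutions` and the sample names are hypothetical, and the substitutions are a small subset taken from the XM example earlier on this page.

```python
def shared_substitutions(samples):
    """Return substitutions present in every sample, sorted by position.

    `samples` maps a sample name to an iterable of substitutions
    written like 'C241T' (reference base, position, alternate base).
    """
    sub_sets = [set(subs) for subs in samples.values()]
    if not sub_sets:
        return []
    shared = set.intersection(*sub_sets)
    # Strip the first and last characters (the bases) to sort by coordinate.
    return sorted(shared, key=lambda sub: int(sub[1:-1]))


# Hypothetical cluster of three samples.
cluster = {
    "sample_A": ["C241T", "C3037T", "C14408T", "A23403G"],
    "sample_B": ["C241T", "C3037T", "A23403G", "G28881A"],
    "sample_C": ["C3037T", "C241T", "A23403G"],
}

print(",".join(shared_substitutions(cluster)))
```

A comma-separated list like this one could then be used as the query string, assuming the query format matches the mutation lists shown earlier.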
Recombinant lineages change extremely rapidly. Include program versions for:
Include dataset versions for:
The reason is to simplify maintenance, as I need to update a large number of profiles each time I want to update the base protobuf.
Fix by specifying plt.rcParams["svg.fonttype"] = "none"
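The note above does not say which symptom this fixes. In matplotlib, setting `svg.fonttype` to `"none"` makes the SVG backend write labels as `<text>` elements instead of converting glyphs to paths, so text stays editable and searchable. A small sketch, assuming matplotlib is installed; the figure and its title are placeholders rather than the pipeline's real plot:

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt

# Keep text as <text> elements in SVG output instead of outlined paths.
plt.rcParams["svg.fonttype"] = "none"

fig, ax = plt.subplots()
ax.set_title("Breakpoints")  # placeholder title

# Render the figure to an in-memory SVG so the markup can be inspected.
buffer = io.BytesIO()
fig.savefig(buffer, format="svg")
plt.close(fig)
svg = buffer.getvalue().decode("utf-8")

print("<text" in svg)
```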
Plotting was not handled properly when no breakpoints were identified. This has been fixed in scripts/plot_breakpoints.py, as validated with the controls-negative dataset.
gisaid_epi_isl shouldn't be a default metadata column; I need to remove it from the extra_cols param.
Something has changed in the implementation of UShER subtree extraction, because an initial upgrade from v0.5.3 to v0.5.6 crashed the pipeline. But v0.5.6 has some interesting new options for subtrees, so I'll look into it!
genbank_accession or gisaid_epi_isl).
There is too much content in a changelog to fit in the small text box.
XAK has been flagged as a VUM by the ECDC:
I think it looks cleaner
The tutorial strains have high levels of ambiguity, maybe I should replace these with different sequences? But on the other hand, these are good examples of how the pipeline handles this ambiguity.
XM-example-2: Ns around breakpoints
proposed467-example-2: Ns around breakpoints
miscBA1BA2Post17k-example-1: Ns around breakpoints and IUPAC ambiguity
19955,20055 are common sites to be ambiguous.
Datasets:
The script create_profile.sh will error out in some cases when there is no empty line between fasta sequences.
We can parse out the gisaid accessions (EPI_ISL_*) from the strain names into a new column.
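A sketch of that parsing step, using a regular expression on the strain name. The strain name layouts below are invented for illustration, and the function name `add_gisaid_column` and the default column names are assumptions; only the `EPI_ISL_` prefix comes from the note above.

```python
import re

# GISAID accessions look like 'EPI_ISL_' followed by digits.
EPI_ISL_PATTERN = re.compile(r"EPI_ISL_\d+")


def add_gisaid_column(rows, strain_col="strain", new_col="gisaid_epi_isl"):
    """Copy the first EPI_ISL accession found in each strain name to a new column.

    Rows without an accession get an empty string.
    """
    for row in rows:
        match = EPI_ISL_PATTERN.search(row[strain_col])
        row[new_col] = match.group(0) if match else ""
    return rows


# Hypothetical metadata rows; real strain names may be laid out differently.
rows = [
    {"strain": "hCoV-19/Example/AAA-1/2022|EPI_ISL_0000001|2022-01-01"},
    {"strain": "Example/BBB-2/2022"},
]

for row in add_gisaid_column(rows):
    print(row["strain"], row["gisaid_epi_isl"], sep="\t")
```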