ncov-recombinant issues from ktmeaton

Lineage counts and statuses in report slides are incorrect

Identify parent lineages with LAPIS cov-spectrum

So I have a new idea for identifying the parent lineage.

England/MILK-3796834/2022 is an XM recombinant, with regions predicted by sc2rf:

44:17410|Omicron/21K
21618:29510|Omicron/21L

From nextclade, the mutations by region are as follows:

44:17410
- C241T,C2470T,A2832G,C3037T,T5386G,G8393A,C10029T,C10449A,A11537G,C12513T,T13195C,C14408T,C15240T
21618:29510
- C21618T,G21987A,T22200G,G22578A,C22674T,T22679C,C22686T,A22688G,G22775A,A22786C,G22813T,T22882G,G22992A,C22995A,A23013C,A23040G,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23854A,G23948T,A24424T,T24469A,C25000T,C25416T,C25584T,C26060T,C26270T,C26577G,G26709A,C26858T,A27259C,G27382C,A27383T,T27384C,C27807T,A28271T,C28311T,G28487A,G28881A,G28882A,G28883C,A29510C

And if we query these mutations in cov-spectrum with LAPIS...

Parent 1 | `44:17410`

Parent 1 is most likely BA.1.1.10 (72%, 647/894).

https://lapis.cov-spectrum.org/open/v1/sample/aggregated?fields=pangoLineage&nucMutations=C241T,C2470T,A2832G,C3037T,T5386G,G8393A,C10029T,C10449A,A11537G,C12513T,T13195C,C14408T,C15240T

{
  "errors":[],
  "info": {
    "apiVersion":1, 
    "dataVersion":1656461191,
    "deprecationDate":null,
    "deprecationInfo":null,
    "acknowledgement":null
  },
  "data":[
    {"pangoLineage":"B.1.1","count":1},
    {"pangoLineage":"BA.1.1.18","count":30},
    {"pangoLineage":"BA.1.1.12","count":3},
    {"pangoLineage":"BA.1.1.10","count":647},
    {"pangoLineage":"BA.1.1","count":186},
    {"pangoLineage":"BA.1","count":27}
  ]
}

Parent 2 | `21618:29510`

Parent 2 is most likely BA.2 (83%, 38/46). However, the only runner-up is BA.2.12.1 (17%, 8/46), which is itself a sub-lineage of BA.2.

https://lapis.cov-spectrum.org/open/v1/sample/aggregated?fields=pangoLineage&nucMutations=C21618T,G21987A,T22200G,G22578A,C22674T,T22679C,C22686T,A22688G,G22775A,A22786C,G22813T,T22882G,G22992A,C22995A,A23013C,A23040G,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23854A,G23948T,A24424T,T24469A,C25000T,C25416T,C25584T,C26060T,C26270T,C26577G,G26709A,C26858T,A27259C,G27382C,A27383T,T27384C,C27807T,A28271T,C28311T,G28487A,G28881A,G28882A,G28883C,A29510C

{
  "errors": [],
  "info":{
    "apiVersion":1,
    "dataVersion":1656461191,
    "deprecationDate":null,
    "deprecationInfo":null,
    "acknowledgement":null
  },
  "data":[
    {"pangoLineage":"BA.2","count":38},
    {"pangoLineage":"BA.2.12.1","count":8}
  ]
}
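The two queries above follow the same pattern, so they can be sketched as a small helper. This is a minimal sketch, not the pipeline's code: the function names `build_lapis_url` and `lineage_proportions` are made up for illustration, the endpoint is copied from the URLs above and may have changed since, and no network request is made here (the Parent 2 response is pasted in by hand).

```python
from urllib.parse import urlencode

# Endpoint copied from the URLs in this issue; it may no longer be current.
LAPIS_AGGREGATED = "https://lapis.cov-spectrum.org/open/v1/sample/aggregated"


def build_lapis_url(mutations):
    """Build an aggregated-by-lineage query for a list of nucleotide mutations."""
    params = {"fields": "pangoLineage", "nucMutations": ",".join(mutations)}
    # safe="," keeps the commas unescaped, matching the URLs shown above.
    return LAPIS_AGGREGATED + "?" + urlencode(params, safe=",")


def lineage_proportions(response):
    """Return (lineage, count, proportion) tuples, sorted by descending count."""
    data = response["data"]
    total = sum(row["count"] for row in data)
    if total == 0:
        return []
    rows = [(r["pangoLineage"], r["count"], r["count"] / total) for r in data]
    return sorted(rows, key=lambda r: r[1], reverse=True)


# The "data" part of the Parent 2 response shown above.
parent_2 = {
    "data": [
        {"pangoLineage": "BA.2", "count": 38},
        {"pangoLineage": "BA.2.12.1", "count": 8},
    ]
}

for lineage, count, proportion in lineage_proportions(parent_2):
    print(f"{lineage}\t{count}\t{proportion:.0%}")
```

For the Parent 2 data this prints BA.2 at 83% and BA.2.12.1 at 17%, the same proportions quoted above.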

Resolving

There are a couple of options to resolve the proportions:

1. Exclude lineages by a hard cut-off (<1%, <10%, etc.).
2. Take the highest proportion lineage.
3. Consider lineages in descending order, and report a lineage if it is a sub-lineage of the one with the highest proportion.

Lineage	Count	Proportion	Note
BA.1.1.10	647	72%	Report
BA.1.1	186	21%	Not sub-lineage
BA.1.1.18	30	3%	Not sub-lineage
BA.1	27	3%	Not sub-lineage
BA.1.1.12	3	<1%	Exclude
B.1.1	1	<1%	Exclude

Lineage	Count	Proportion	Note
BA.2	38	83%	Ignore, has sub-lineage
BA.2.12.1	8	17%	Report, is sub-lineage
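The resolution rules in the two tables can be sketched as follows. This is one possible reading, under stated assumptions: the names `is_sublineage` and `resolve_parent` are hypothetical, the 1% cut-off is just one of the thresholds mentioned above, sub-lineage membership is tested by name prefix only (Pango aliases such as BA = B.1.1.529 are not expanded), and the check is applied against the currently reported lineage, so a chain of nested sub-lineages would be followed to its end.

```python
def is_sublineage(child, parent):
    """True if child is a strict descendant of parent, judged by name prefix.

    Assumption: aliases are not expanded, so B.1.1.529 and BA.* are unrelated here.
    The trailing "." stops BA.1.1.10 from matching as a child of BA.1.1.1.
    """
    return child.startswith(parent + ".")


def resolve_parent(counts, min_proportion=0.01):
    """Pick one lineage to report from a {lineage: count} mapping."""
    total = sum(counts.values())
    if total == 0:
        return None
    # Hard cut-off: drop lineages below the minimum proportion.
    kept = {lin: n for lin, n in counts.items() if n / total >= min_proportion}
    if not kept:
        return None
    # Descending order of count; start from the most common lineage.
    ranked = sorted(kept, key=kept.get, reverse=True)
    reported = ranked[0]
    for lineage in ranked[1:]:
        # Prefer a more specific lineage if it sits inside the reported one.
        if is_sublineage(lineage, reported):
            reported = lineage
    return reported


parent_1 = {
    "B.1.1": 1, "BA.1.1.18": 30, "BA.1.1.12": 3,
    "BA.1.1.10": 647, "BA.1.1": 186, "BA.1": 27,
}
parent_2 = {"BA.2": 38, "BA.2.12.1": 8}

print(resolve_parent(parent_1))
print(resolve_parent(parent_2))
```

With the counts from the LAPIS responses above, this reports BA.1.1.10 for Parent 1 and BA.2.12.1 for Parent 2, matching the Note columns.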

Relax Nextclade filtering

Samples are being excluded because they have no labelled private mutations.
Experiment with keeping samples that also have privateNucMutations.unlabeledSubstitutions or privateNucMutations.reversionSubstitutions
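A relaxed filter along these lines might look like the sketch below. It is an illustration only: `has_private_mutations` and `filter_nextclade` are hypothetical names, the sample rows are made up, and it assumes the Nextclade TSV has a `seqName` column plus the three `privateNucMutations.*` columns, with an empty cell meaning "none".

```python
import csv
import io

# Columns that count as evidence of private mutations in the relaxed filter.
PRIVATE_COLUMNS = [
    "privateNucMutations.labeledSubstitutions",
    "privateNucMutations.unlabeledSubstitutions",
    "privateNucMutations.reversionSubstitutions",
]


def has_private_mutations(row, columns=PRIVATE_COLUMNS):
    """True if any of the given columns is non-empty for this sample."""
    return any((row.get(col) or "").strip() for col in columns)


def filter_nextclade(handle, columns=PRIVATE_COLUMNS):
    """Return the seqName of every row that passes the relaxed filter."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["seqName"] for row in reader if has_private_mutations(row, columns)]


# Made-up rows, for illustration only.
header = ["seqName"] + PRIVATE_COLUMNS
rows = [
    ["sample_labeled", "C21618T", "", ""],
    ["sample_unlabeled", "", "A123G", ""],
    ["sample_none", "", "", ""],
]
tsv = "\n".join("\t".join(fields) for fields in [header] + rows)

kept = filter_nextclade(io.StringIO(tsv))
strict = filter_nextclade(io.StringIO(tsv), columns=PRIVATE_COLUMNS[:1])
print(kept)
print(strict)
```

Passing only the first column reproduces the stricter, labelled-only behaviour, which makes the two filters easy to compare on the same file.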

Add cumulative counts of recombinant lineages and sequences

Positive Controls

This is the list of implemented positive controls:

Error in nextclade_dataset

Downloading the nextclade dataset can fail because of security issues (certificate validation).
There is a patch for a similar issue in nextstrain: nextstrain/ncov#875

Identify parent lineages with RIPPLES

I'd like to give RIPPLES another try, now that I'm more proficient with the matUtils commands.

Steps:

1. Create a text file of cluster ids. This will be the first observed sequence for each recombinant lineage.
2. Run ripples on those cluster ids.
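Step 1 can be sketched as below. This is a rough sketch under assumptions: the record layout (strain, lineage, ISO date), the function names, and the output filename are all hypothetical, and the strains are made up. Step 2 is not sketched because the ripples command line is not given in this issue.

```python
import tempfile
from pathlib import Path


def first_observed(records):
    """Map each lineage to its earliest strain.

    records: iterable of (strain, lineage, date) with ISO 8601 date strings,
    which sort correctly as plain text.
    """
    first = {}
    for strain, lineage, date in records:
        if lineage not in first or date < first[lineage][1]:
            first[lineage] = (strain, date)
    return {lineage: strain for lineage, (strain, _date) in first.items()}


def write_cluster_ids(records, path):
    """Write one cluster id per line and return the ids that were written."""
    ids = sorted(first_observed(records).values())
    Path(path).write_text("\n".join(ids) + "\n")
    return ids


# Made-up records, for illustration only.
records = [
    ("strain_b", "XM", "2022-03-02"),
    ("strain_a", "XM", "2022-02-15"),
    ("strain_c", "XAK", "2022-06-01"),
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "cluster_ids.txt"
    ids = write_cluster_ids(records, out)
    written = out.read_text()

print(ids)
```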

Relax clade exclusion filter

Applies to rule nextclade_recombinants

Remove subtrees

Consider removing steps to extract subtrees

Experiment with breakpoint motifs for sc2rf

Add better filters to the Auspice JSON output

Ideally, make all columns filters.

ex. lineage_usher

Collapsing subtrees is inefficient and slow

This became apparent once I hit the 400+ sample range.

Separate recombinant clusters grouped into one cluster_id

Rules with empty log files

The following rules currently produce empty log files. If checked, log content has been added:

Add list of strains for each subtree

Plot historical data of recombinants

One option is to have alt rules for plot and report.

plot_historical
report_historical

Report cov-spectrum queries for each recombinant lineage

This is reported in the column cov-spectrum_query in the linelists. The subs listed must be shared by all samples within the cluster.
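The "shared by all samples" requirement amounts to a set intersection. The sketch below assumes each sample's substitutions are available as a list of strings like `C241T`; the function name and the sample data are hypothetical, and how the result is embedded in the final cov-spectrum query is not shown.

```python
import re


def shared_substitutions(samples):
    """Return substitutions present in every sample, sorted by genome position.

    samples: dict mapping strain name to a list of substitutions such as "C241T".
    """
    sets = [set(subs) for subs in samples.values()]
    if not sets:
        return []
    shared = set.intersection(*sets)
    # Strip the reference and alternate bases to sort numerically by position.
    return sorted(shared, key=lambda sub: int(re.sub(r"\D", "", sub)))


# Made-up cluster, for illustration only.
cluster = {
    "sample_1": ["C241T", "C3037T", "A2832G"],
    "sample_2": ["C3037T", "C241T", "G8393A"],
}

query = ",".join(shared_substitutions(cluster))
print(query)
```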

Add versions to reporting

Recombinant lineages change extremely rapidly. Include program versions for:

ncov-recombinant
nextclade
sc2rf
usher

Include dataset versions for:

nextclade-data
public-latest

Create documentation on ReadTheDocs

Sync breakpoint plotting to the reporting period if plot

Use latest protobufs rather than time-stamped

The reason is to simplify maintenance, as I need to update a large number of profiles each time I want to update the base protobuf.

Restore the growth calculation that reports gain since the previous week

Issue number is a float rather than a string in linelists and reports

Compare output to previous pipeline run to identify dropouts or changes in lineage assignments
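One way to sketch this comparison, assuming each run can be reduced to a mapping of strain to lineage (for example, from a linelist). The function name `compare_runs` and the data are hypothetical.

```python
def compare_runs(previous, current):
    """Compare two {strain: lineage} mappings from consecutive pipeline runs."""
    # Strains present before but missing now.
    dropouts = sorted(set(previous) - set(current))
    # Strains that appear for the first time.
    new = sorted(set(current) - set(previous))
    # Strains in both runs whose lineage assignment differs: (old, new).
    changed = {
        strain: (previous[strain], current[strain])
        for strain in previous
        if strain in current and previous[strain] != current[strain]
    }
    return {"dropouts": dropouts, "new": new, "changed": changed}


# Made-up runs, for illustration only.
previous_run = {"s1": "XM", "s2": "XE", "s3": "XAK"}
current_run = {"s1": "XM", "s2": "XM", "s4": "XM"}

diff = compare_runs(previous_run, current_run)
print(diff)
```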

Create an alternate report that has sensitive columns removed

Vector graphics output (svg) has font issues on import to Affinity

Fix by specifying plt.rcParams["svg.fonttype"] = "none"

plot: AttributeError: 'float' object has no attribute 'split'

The script did not handle the case where no breakpoints were identified. This has been fixed in scripts/plot_breakpoints.py and validated with the controls-negative dataset.
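This error typically appears when a missing value is read as a float NaN and then treated as a string. The guard below is a generic sketch of that pattern, not the actual change made in scripts/plot_breakpoints.py; the function name, the comma separator, and the example values are assumptions.

```python
def parse_breakpoints(value, sep=","):
    """Return a list of breakpoint strings, or [] when none were identified.

    Missing values often arrive as float("nan") rather than "", so anything
    that is not a string is treated as "no breakpoints".
    """
    if not isinstance(value, str):
        return []
    return [bp.strip() for bp in value.split(sep) if bp.strip()]


print(parse_breakpoints(float("nan")))
print(parse_breakpoints("100:200, 300:400"))
```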

CI is crashing at usher_subtree now that usher metadata is added

column "gisaid_epi_isl" not existed in file

gisaid_epi_isl shouldn't be a default metadata column; I need to remove it from the extra_cols param.

Upgrade UShER to v0.5.6

Something has changed in the implementation of UShER subtree extraction because an initial upgrade from v0.5.3 to v0.5.6 crashed the pipeline. But v0.5.6 has some interesting new options for subtrees, so I'll look into it!

Downstream steps fail if Nextclade finds no recombinants

Detect duplicate sequences in subtrees

I'd like the option to detect duplicate strains based on a matching column (ex. genbank_accession or gisaid_epi_isl), and then label or remove them from the subtree JSON.
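The detection half could be sketched like this, assuming the metadata is available as a list of dicts with a `strain` key. The function name and records are hypothetical, and labelling or removing nodes in the subtree JSON is not shown.

```python
from collections import defaultdict


def find_duplicates(records, column):
    """Group strains by the value in `column` and keep groups with more than one.

    Records with an empty or missing value are skipped, since a blank
    accession should not make two strains count as duplicates.
    """
    groups = defaultdict(list)
    for record in records:
        key = record.get(column)
        if key:
            groups[key].append(record["strain"])
    return {key: strains for key, strains in groups.items() if len(strains) > 1}


# Made-up metadata, for illustration only.
metadata = [
    {"strain": "strain_a", "genbank_accession": "ACC0001"},
    {"strain": "strain_b", "genbank_accession": "ACC0001"},
    {"strain": "strain_c", "genbank_accession": "ACC0002"},
    {"strain": "strain_d", "genbank_accession": ""},
]

duplicates = find_duplicates(metadata, "genbank_accession")
print(duplicates)
```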

Remove CHANGELOG from report slides

There is too much content in a changelog to fit in the small text box.

Add XAK to controls

XAK has been flagged as a VUM by the ECDC:

https://www.ecdc.europa.eu/en/covid-19/variants-concern

Create validate rule for controlled datasets

Try setting linewidth to 0 for stacked bar charts

I think it looks cleaner

Update tutorial strains

The tutorial strains have high levels of ambiguity, so maybe I should replace them with different sequences? On the other hand, they are good examples of how the pipeline handles this ambiguity.

XM-example-2: Ns around breakpoints
proposed467-example-2: Ns around breakpoints
miscBA1BA2Post17k-example-1: Ns around breakpoints and IUPAC ambiguity
Sites 19955 and 20055 are commonly ambiguous.

Detect recombination with BA.5

Datasets:

BA.2* and BA.5.1* | proposed771
BA.5.2.1 and (BA.4* or BA.2*) | proposed820
