ncov-recombinant's Introduction

ncov-recombinant

❗❗❗
Note: ncov-recombinant will be deprecated soon as SARS-CoV-2 recombination has evolved beyond the scope of this pipeline's design. The new tool will be rebar, which is under active development.
❗❗❗

License: MIT

Reproducible workflow for SARS-CoV-2 recombinant sequence detection.

  1. Align sequences and perform clade/lineage assignments with Nextclade.
  2. Identify parental clades and plot recombination breakpoints with sc2rf.
  3. Create tables, plots, and PowerPoint slides for reporting.

Please refer to the documentation for detailed instructions on installing, running, developing and much more!

Credits

ncov-recombinant is built and maintained by Katherine Eaton at the National Microbiology Laboratory (NML) of the Public Health Agency of Canada (PHAC).


Katherine Eaton

💻 📖 🎨 🤔 🚇 🚧

Thanks goes to these wonderful people (emoji key):


Nextstrain (Nextclade)

🔣 🔌

Lena Schimmel (sc2rf)

🔌

Yatish Turakhia (UShER)

🔣 🔌

Angie Hinrichs (UShER)

🔣 🔌

Benjamin Delisle

🐛 ⚠️

Vani Priyadarsini Ikkurthi

🐛 ⚠️

Mark Horsman

🤔 🎨

Jesse Bloom Lab

🔣 🔌

Dan Fornika

🤔 ⚠️

Tara Newman

🤔 ⚠️

This project follows the all-contributors specification. Contributions of any kind welcome!

ncov-recombinant's Issues

Relax Nextclade filtering

  • Samples are being excluded because they have no labelled private mutations.
  • Experiment with keeping samples that also have privateNucMutations.unlabeledSubstitutions or privateNucMutations.reversionSubstitutions
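
The relaxed filter could be prototyped along these lines. This is a minimal sketch, not the pipeline's actual rule: it assumes the Nextclade TSV has a `seqName` column and the three `privateNucMutations.*` columns named below, and the sample rows are hypothetical placeholders.

```python
import csv
import io

# Nextclade TSV columns holding private nucleotide substitutions (assumed names).
PRIVATE_COLUMNS = [
    "privateNucMutations.labeledSubstitutions",
    "privateNucMutations.unlabeledSubstitutions",
    "privateNucMutations.reversionSubstitutions",
]


def keep_sample(row, columns=PRIVATE_COLUMNS):
    """Return True if any of the given private mutation columns is non-empty."""
    return any((row.get(column) or "").strip() for column in columns)


def filter_nextclade(tsv_text, columns=PRIVATE_COLUMNS):
    """Return the seqName of every row that passes the filter."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["seqName"] for row in reader if keep_sample(row, columns)]


# Hypothetical table: the cell contents are placeholders, not real Nextclade output.
example_tsv = "\t".join(["seqName"] + PRIVATE_COLUMNS) + "\n" + "\n".join([
    "sample_A\tC241T\t\t",   # labeled private mutation only
    "sample_B\t\tG8393A\t",  # unlabeled private mutation only
    "sample_C\t\t\t",        # no private mutations at all
]) + "\n"

strict = filter_nextclade(example_tsv, PRIVATE_COLUMNS[:1])  # labeled only
relaxed = filter_nextclade(example_tsv)                      # any of the three
```

Here `strict` keeps only `sample_A`, while `relaxed` also keeps `sample_B`, which is the kind of sample the current filter excludes.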

Positive Controls

This is the list of implemented positive controls:

  • XA | Alpha recombinant (poor lineage accuracy)
  • XB | Conflicting designation issue
  • XC | Alpha recombinant (poor lineage accuracy)
  • XD
  • XE
  • XG
  • XH
  • XJ
  • XK | No public genomes
  • XL
  • XM
  • XN
  • XP
  • XQ
  • XR
  • XS
  • XT | No public genomes | Restricted to South Africa
  • XU | No public genomes | Restricted to India, Japan, Australia
  • XW | ...
  • XY | ...
  • XZ | ...
  • XAA | ...
  • XAB | ...
  • XAC | ...
  • XAD | ...
  • XAE | ...
  • XAF | ...
  • XAG | ...
  • XAH | ...
  • XAJ | ...
  • XAK | ...
  • XAL | ...
  • XAM | ...
  • XAN | ...
  • XAP | ...
  • XAQ | ...
  • XAR | ...
  • XAS | ...
  • XAT | ...
  • XAU | ...
  • XAV | ...
  • XAW | ...
  • XAY | ...
  • XAZ | ‼ Priority | Large Lineage

Update tutorial strains

The tutorial strains have high levels of ambiguity, so maybe I should replace them with different sequences. On the other hand, they are good examples of how the pipeline handles this ambiguity.

  • XM-example-2: Ns around breakpoints
  • proposed467-example-2: Ns around breakpoints
  • miscBA1BA2Post17k-example-1: Ns around breakpoints and IUPAC ambiguity
  • Positions 19955 and 20055 are commonly ambiguous sites.

Add versions to reporting

Recombinant lineages change extremely rapidly. Include program versions for:

  • ncov-recombinant
  • nextclade
  • sc2rf
  • usher

Include dataset versions for:

Identify parent lineages with RIPPLES

I'd like to give RIPPLES another try, now that I'm more proficient with the matUtils commands.

Steps:

  1. Create a text file of cluster ids. This will be the first observed sequence for each recombinant lineage.
  2. Run ripples on those cluster ids.
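
Step 1 might look something like the sketch below. It assumes a tab-separated metadata table with `strain`, `lineage`, and `date` columns (hypothetical names) and ISO-formatted dates, so that plain string comparison orders them correctly. It does not run ripples itself.

```python
import csv
import io


def first_observed(metadata_text):
    """Map each lineage to the strain with the earliest date."""
    earliest = {}
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    for row in reader:
        best = earliest.get(row["lineage"])
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if best is None or row["date"] < best["date"]:
            earliest[row["lineage"]] = row
    return {lineage: row["strain"] for lineage, row in sorted(earliest.items())}


def cluster_id_lines(metadata_text):
    """Format the cluster ids as one strain name per line."""
    return "".join(strain + "\n" for strain in first_observed(metadata_text).values())


# Hypothetical metadata, for illustration only.
example_metadata = (
    "strain\tlineage\tdate\n"
    "strain_1\tXM\t2022-03-10\n"
    "strain_2\tXM\t2022-02-21\n"
    "strain_3\tXQ\t2022-03-01\n"
)

cluster_ids = first_observed(example_metadata)
```

Writing the result of `cluster_id_lines` to a text file would give the input for step 2.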

Plot substitutions as ticks on breakpoints figure

This helps in interpreting lineages that have been split into multiple clusters (ex. XM).

Figure: breakpoints_clade

This helps in interpreting lineages with the same breakpoint and parents (but different substitutions). For example, XQ and XR differ by one substitution (around 17500).

Figure: breakpoints_lineage

Detect duplicate sequences in subtrees

  • I'd like the option to detect duplicate strains based on a matching column (ex. genbank_accession or gisaid_epi_isl).
  • And then label or remove from the subtree JSON.
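
One way to approach the detection step, as a rough sketch: group records by the matching column and report any value shared by more than one strain. The record layout (a list of dicts with a `strain` key) is an assumption for illustration and is not the subtree JSON structure.

```python
from collections import defaultdict


def find_duplicates(records, column):
    """Group strain names that share the same non-empty value in `column`."""
    groups = defaultdict(list)
    for record in records:
        value = (record.get(column) or "").strip()
        if value:  # missing accessions are never treated as duplicates
            groups[value].append(record["strain"])
    return {value: strains for value, strains in groups.items() if len(strains) > 1}


# Hypothetical records, for illustration only.
example_records = [
    {"strain": "strain_1", "genbank_accession": "ACC0001"},
    {"strain": "strain_1_resubmitted", "genbank_accession": "ACC0001"},
    {"strain": "strain_2", "genbank_accession": "ACC0002"},
    {"strain": "strain_3", "genbank_accession": ""},
]

duplicates = find_duplicates(example_records, "genbank_accession")
```

The returned groups could then drive either labelling or removal in the subtree JSON.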

Upgrade UShER to v0.5.6

Something has changed in the implementation of UShER subtree extraction because an initial upgrade from v0.5.3 to v0.5.6 crashed the pipeline. But v0.5.6 has some interesting new options for subtrees, so I'll look into it!

Rules with empty log files

The following rules currently produce empty log files. If checked, log content has been added:

  • nextclade
  • sc2rf_recombinants
  • faToVcf
  • usher_metadata
  • summary
  • plot?

Identify parent lineages with LAPIS cov-spectrum

So I have a new idea for identifying the parent lineage.

England/MILK-3796834/2022 is an XM recombinant, with regions predicted by sc2rf:

  • 44:17410|Omicron/21K
  • 21618:29510|Omicron/21L

From nextclade, the mutations by region are as follows:

  • 44:17410
    • C241T,C2470T,A2832G,C3037T,T5386G,G8393A,C10029T,C10449A,A11537G,C12513T,T13195C,C14408T,C15240T
  • 21618:29510
    • C21618T,G21987A,T22200G,G22578A,C22674T,T22679C,C22686T,A22688G,G22775A,A22786C,G22813T,T22882G,G22992A,C22995A,A23013C,A23040G,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23854A,G23948T,A24424T,T24469A,C25000T,C25416T,C25584T,C26060T,C26270T,C26577G,G26709A,C26858T,A27259C,G27382C,A27383T,T27384C,C27807T,A28271T,C28311T,G28487A,G28881A,G28882A,G28883C,A29510C
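
Splitting the substitutions by region can be sketched as follows. The region strings use the `start:end|parent` form shown above. The function names are hypothetical, and `G18000A` is a made-up substitution added only to show what happens to a position that falls between the two regions.

```python
import re

# A substitution such as "C241T": reference base, position, alternate base.
SUB_PATTERN = re.compile(r"^[ACGTN](\d+)[ACGTN]$")


def parse_region(region):
    """Turn '44:17410|Omicron/21K' into (44, 17410, 'Omicron/21K')."""
    coords, parent = region.split("|", 1)
    start, end = (int(value) for value in coords.split(":"))
    return start, end, parent


def group_by_region(substitutions, regions):
    """Assign each substitution to the first region containing its position."""
    parsed = [parse_region(region) for region in regions]
    grouped = {region: [] for region in regions}
    unassigned = []
    for sub in substitutions:
        match = SUB_PATTERN.match(sub)
        if match is None:
            raise ValueError(f"Not a substitution: {sub}")
        position = int(match.group(1))
        for region, (start, end, _parent) in zip(regions, parsed):
            if start <= position <= end:
                grouped[region].append(sub)
                break
        else:
            unassigned.append(sub)  # e.g. inside the breakpoint interval
    return grouped, unassigned


regions = ["44:17410|Omicron/21K", "21618:29510|Omicron/21L"]
# A subset of the substitutions above, plus the hypothetical G18000A.
subs = ["C241T", "C15240T", "G18000A", "C21618T", "A29510C"]

grouped, unassigned = group_by_region(subs, regions)
```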

And if we query these mutations in cov-spectrum with LAPIS...

Parent 1 | 44:17410

Parent 1 is most likely BA.1.1.10 (72%, 647/894).

https://lapis.cov-spectrum.org/open/v1/sample/aggregated?fields=pangoLineage&nucMutations=C241T,C2470T,A2832G,C3037T,T5386G,G8393A,C10029T,C10449A,A11537G,C12513T,T13195C,C14408T,C15240T

{
  "errors":[],
  "info": {
    "apiVersion":1, 
    "dataVersion":1656461191,
    "deprecationDate":null,
    "deprecationInfo":null,
    "acknowledgement":null
  },
  "data":[
    {"pangoLineage":"B.1.1","count":1},
    {"pangoLineage":"BA.1.1.18","count":30},
    {"pangoLineage":"BA.1.1.12","count":3},
    {"pangoLineage":"BA.1.1.10","count":647},
    {"pangoLineage":"BA.1.1","count":186},
    {"pangoLineage":"BA.1","count":27}
  ]
}
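
A small sketch of how the query URL and the proportions could be produced programmatically. It only builds the URL and summarizes a response that has already been fetched, so no network request is made. The helper names are hypothetical, and the response below is the one shown above with the `info` block omitted.

```python
from urllib.parse import urlencode

LAPIS_URL = "https://lapis.cov-spectrum.org/open/v1/sample/aggregated"


def build_query(mutations):
    """Build the aggregated-by-lineage query URL for a list of substitutions."""
    params = {"fields": "pangoLineage", "nucMutations": ",".join(mutations)}
    # Keep commas unescaped so the URL matches the form used in this issue.
    return LAPIS_URL + "?" + urlencode(params, safe=",")


def lineage_proportions(response):
    """Return (lineage, count, proportion) tuples, highest count first."""
    data = response["data"]
    total = sum(entry["count"] for entry in data)
    ranked = sorted(data, key=lambda entry: entry["count"], reverse=True)
    return [(e["pangoLineage"], e["count"], e["count"] / total) for e in ranked]


# The "data" section of the Parent 1 response shown above.
parent_1_response = {
    "errors": [],
    "data": [
        {"pangoLineage": "B.1.1", "count": 1},
        {"pangoLineage": "BA.1.1.18", "count": 30},
        {"pangoLineage": "BA.1.1.12", "count": 3},
        {"pangoLineage": "BA.1.1.10", "count": 647},
        {"pangoLineage": "BA.1.1", "count": 186},
        {"pangoLineage": "BA.1", "count": 27},
    ],
}

url = build_query(["C241T", "C2470T"])
proportions = lineage_proportions(parent_1_response)
```

The first entry of `proportions` is BA.1.1.10 with 647 of 894 samples, which is the 72% quoted above.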

Parent 2 | 21618:29510

Parent 2 is most likely BA.2 (83%, 38/46). However, there is only one runner-up, BA.2.12.1 (17%), which falls within BA.2.

https://lapis.cov-spectrum.org/open/v1/sample/aggregated?fields=pangoLineage&nucMutations=C21618T,G21987A,T22200G,G22578A,C22674T,T22679C,C22686T,A22688G,G22775A,A22786C,G22813T,T22882G,G22992A,C22995A,A23013C,A23040G,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23854A,G23948T,A24424T,T24469A,C25000T,C25416T,C25584T,C26060T,C26270T,C26577G,G26709A,C26858T,A27259C,G27382C,A27383T,T27384C,C27807T,A28271T,C28311T,G28487A,G28881A,G28882A,G28883C,A29510C

{
  "errors": [],
  "info":{
    "apiVersion":1,
    "dataVersion":1656461191,
    "deprecationDate":null,
    "deprecationInfo":null,
    "acknowledgement":null
  },
  "data":[
    {"pangoLineage":"BA.2","count":38},
    {"pangoLineage":"BA.2.12.1","count":8}
  ]
}

Resolving

There are a couple of options to resolve the proportions:

  • Exclude lineages by a hard cut-off (<1%, <10%, etc.)
  • Take the highest proportion lineage.
  • Consider lineages in descending order, and report a lineage if it is a sub-lineage of the one with the highest proportion.

Parent 1:

Lineage   | Count | Proportion | Note
BA.1.1.10 | 647   | 72%        | Report
BA.1.1    | 186   | 21%        | Not sub-lineage
BA.1.1.18 | 30    | 3%         | Not sub-lineage
BA.1      | 27    | 3%         | Not sub-lineage
BA.1.1.12 | 3     | <1%        | Exclude
B.1.1     | 1     | <1%        | Exclude

Parent 2:

Lineage   | Count | Proportion | Note
BA.2      | 38    | 83%        | Ignore, has sub-lineage
BA.2.12.1 | 8     | 17%        | Report, is sub-lineage
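
Combining the hard cut-off with the sub-lineage rule might look like the sketch below. It assumes a 1% cut-off and treats a lineage as a sub-lineage when its name extends the other's name by a dot-separated suffix. That simple prefix test does not resolve Pango aliases, so it is only a first approximation.

```python
def is_sublineage(child, parent):
    """Prefix test on Pango names, e.g. BA.2.12.1 is within BA.2.

    Assumption: aliases are not expanded, so this only works when both
    names share the same alias prefix.
    """
    return child.startswith(parent + ".")


def resolve_parent(counts, min_proportion=0.01):
    """Pick the lineage to report from a {lineage: count} mapping."""
    total = sum(counts.values())
    if total == 0:
        return None
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Hard cut-off: drop lineages below the minimum proportion.
    kept = [name for name, count in ranked if count / total >= min_proportion]
    if not kept:
        return None
    top = kept[0]
    # In descending order, prefer the first sub-lineage of the top lineage.
    for name in kept[1:]:
        if is_sublineage(name, top):
            return name
    return top


parent_1 = resolve_parent({
    "B.1.1": 1, "BA.1.1.18": 30, "BA.1.1.12": 3,
    "BA.1.1.10": 647, "BA.1.1": 186, "BA.1": 27,
})
parent_2 = resolve_parent({"BA.2": 38, "BA.2.12.1": 8})
```

With the counts from the two queries, this reports BA.1.1.10 for parent 1 and BA.2.12.1 for parent 2.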
