
sars-cov-2's Introduction

Global platform for the analysis of SARS-CoV-2 data: Genomics, Cheminformatics, and Proteomics
Using open source tools and public cyberinfrastructure for transparent, reproducible analyses of viral datasets.

Visit our new SARS-CoV-2 surveillance site!

As of January 2022 this site will no longer be updated. Please visit our new site at https://galaxyproject.org/projects/covid19/ for the latest information on our surveillance efforts!

The goal of this resource is to provide publicly accessible infrastructure and workflows for SARS-CoV-2 data analyses. We currently feature three different types of analyses: Genomics, Cheminformatics, and Proteomics.

The analyses have been performed using the Galaxy platform and open source tools from BioConda. Tools were run using XSEDE resources maintained by the Texas Advanced Computing Center (TACC), Pittsburgh Supercomputing Center (PSC), and Indiana University in the U.S., de.NBI, VSC cloud resources and IFB cluster resources on the European side, STFC-IRIS at the Diamond Light Source, and ARDC cloud resources in Australia.

Partner and funder logos: Galaxy Project, European Galaxy Project, Australian Galaxy Project, bioconda, XSEDE, TACC, de.NBI, ELIXIR, PSC, Indiana University, Galaxy Training Network, Bio Platforms Australia, Australian Research Data Commons, VIB, ELIXIR Belgium, Vlaams Supercomputer Center, EOSC-Life, Datamonkey, IFB, GalaxyP, NIAID, NHGRI

sars-cov-2's People

Contributors

afgane, beatrizserrano, bedroesb, bgruening, dannon, delphine-l, dependabot[bot], frederikcoppens, github-actions[bot], gmauro, harry-stark, hexylena, ieguinoa, jxtx, lecorguille, lynnlangit, mmiladi, mvdbeek, nekrut, olegzharkov, pvanheus, reskyner, simonbray, slugger70, spond, stevenweaver, subinamehta, tdudgeon, tnabtaf, wm75


sars-cov-2's Issues

Error while running gffread.

I am getting an error while running gffread.

No fasta index found for genomeref.fa. Rebuilding, please wait..
Error: sequence lines in a FASTA record must have the same length!
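This error typically comes from the FASTA indexing step: faidx-style indexing requires every sequence line within a record (except the last) to have the same length. If that is the problem here, rewrapping the reference to a uniform line width should fix it; a minimal Python sketch (output file name is a placeholder):

    # Rewrap a FASTA file to a uniform sequence line width so it can be indexed.
    WIDTH = 60

    with open("genomeref.fa") as src, open("genomeref.rewrapped.fa", "w") as dst:
        seq = []

        def flush():
            s = "".join(seq)
            for i in range(0, len(s), WIDTH):
                dst.write(s[i:i + WIDTH] + "\n")
            seq.clear()

        for line in src:
            if line.startswith(">"):
                flush()          # write out the previous record's sequence
                dst.write(line)  # keep the header unchanged
            else:
                seq.append(line.strip())
        flush()                  # flush the final record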

License

Is there a license that governs the use/distribution of these workflows?

Should variation analysis build on preprocessing?

My naive expectation was that I could use the output of the preprocessing workflow (at least the Illumina parts of it) as input to the variation analysis workflow, but this doesn't seem to be the case.

The public history seems to use input Illumina data that has been preprocessed (the stdout contains corresponding info), but it's unclear how exactly.

Is it possible to adjust the workflows to make them work together?

Wrong page included in /genomics/no-more-business-as-usual/

The "Deploy" section of the No more business as usual page repeats the same content as the "RecombinationSelection" section. Both use a <Content> component to include the page /genomics/no-more-business-as-usual/6-RecombinationSelection/, but I assume the "Deploy" section should include /genomics/no-more-business-as-usual/7-VariantsDescription/ instead.

## RecombinationSelection
<Content :page-key="$site.pages.find(p => p.path === '/genomics/no-more-business-as-usual/6-RecombinationSelection/').key"/>

## Deploy
<Content :page-key="$site.pages.find(p => p.path === '/genomics/no-more-business-as-usual/6-RecombinationSelection/').key"/>
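Presumably the fix is just to point the second include at the intended page; an untested sketch of the corrected "Deploy" section:

## Deploy
<Content :page-key="$site.pages.find(p => p.path === '/genomics/no-more-business-as-usual/7-VariantsDescription/').key"/>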

Consensus genome and variant calling from Illumina only

Hi everyone

I have been working with the National Institute for Communicable Diseases (NICD) on some of their SARS-CoV-2 sequencing. I adapted the "variation" workflow to call variants in Illumina data and also produce an inferred consensus genome. The current "Assembly" workflow assumes that you have access to both Illumina and Nanopore data for a sample, which is a pretty rare situation. My workflow (with some TODOs in the step annotations) is in a gist. Comments and additions are welcome!

If it is found to be useful perhaps it can be incorporated into the COVID-19 resources page.

P.S. For those with ARTIC amplicon data, I created a workflow for analysing it, as did Thanh le Viet. Pasting them here in case they are useful.

Source for number of variants and samples on main page

Currently, the main page states this under Results for Genomics:

These lists are updated daily. There are 397 sites showing intra-host variation across 33 samples (with frequencies between 5% and 95%). Twenty nine samples have fixed differences at 39 sites from the published reference.

This leaves two questions:

  1. when were these numbers last updated (and based on which samples)?
  2. when is a variant considered a fixed difference?

When I analyzed https://covid19.galaxyproject.org/genomics/4-Variation/variant_list.tsv today, I got this for the filter condition 0.95 >= float(af) >= 0.05:
Samples with variants: 378
Total number of variants observed: 260
Number of sites observed to carry variants: 259

For the fixed differences I tried float(af) == 1.0 giving:
Samples with variants: 55
Total number of variants observed: 27
Number of sites observed to carry variants: 27

and float(af) > 0.95 resulting in:
Samples with variants: 378
Total number of variants observed: 260
Number of sites observed to carry variants: 259
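For reference, this is roughly how such counts could be computed; a minimal sketch, assuming variant_list.tsv has Sample, POS, ALT, and AF columns (swap in float(af) == 1.0 or float(af) > 0.95 for the fixed-difference variants):

    import csv

    # Column names Sample, POS, ALT, AF are assumptions about the layout of
    # variant_list.tsv -- adjust to the actual header.
    samples, variants, sites = set(), set(), set()
    with open("variant_list.tsv") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            af = float(row["AF"])
            if 0.05 <= af <= 0.95:  # the intra-host variation filter from above
                samples.add(row["Sample"])
                variants.add((row["POS"], row["ALT"]))  # one plausible definition
                sites.add(row["POS"])

    print("Samples with variants:", len(samples))
    print("Total number of variants observed:", len(variants))
    print("Number of sites observed to carry variants:", len(sites))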

Should we rename MCRA to MRCA?

Unfortunately, I did not have time to read up on which acronym is correct for most recent common ancestor analysis before, but from a bit of research now, MRCA seems to be in much wider use (and also makes more sense). Wherever MCRA appears, it looks more like a typo, even on pages that also use MRCA.

It would require a bit of effort, though, to change all occurrences of MCRA, not only in this repo but also in the workflow and the notebook.

snpeff build issues with NC_045512.2 genbank record

I noticed today (after a user complaint) that apparently version 4.3 of SnpEff, which is used in the Variation workflows, fails to parse the NC_045512.2 genbank file correctly at the snpeff build step.

This leads to:

  1. missing annotations of effects on the mature peptides produced from the ORF1ab precursor
  2. wrong reporting of the ORF1ab AA length as 7101 instead of 7096 in snpeff's EFF field
  3. shifted amino acid numbers at ORF1ab positions past the ribosomal slippage site.
    All reported AA positions for variants downstream of this site seem to be shifted by 5 compared to the reference sequence.

None of this affects the tabular reports produced by SnpSift, because those reports do not include these bits of info, but it means that all VCFs produced by the workflows suffer from the annotation error in 3 (though 2 provides a hint at what happened).

The underlying bug in SnpEff has been fixed in the recent 4.5covid19 release of snpeff, which also provides a built-in NC_045512.2 genome file. Annotations with either that built-in genome or a genome rebuilt from the genbank file with the 4.5covid19 snpeff version yield identical and expected results.

Compare this output snippet of snpeff v4.3 for SRR10971381 taken from https://usegalaxy.org/datasets/bbd44e69cb8906b5ace4f1e83ab23509/display/?preview=True:

NC_045512	14268	.	G	A	103.0	PASS	DP=473;AF=0.019027;SB=1;DP4=179,284,3,7;EFF=SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T4673|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|A)
NC_045512	14272	.	G	T	71.0	PASS	DP=374;AF=0.016043;SB=3;DP4=141,226,4,3;EFF=STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E4675*|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|T)
NC_045512	14309	.	G	A	79.0	PASS	DP=334;AF=0.020958;SB=1;DP4=119,208,3,4;EFF=STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W4687*|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|A)
NC_045512	14310	.	G	A	86.0	PASS	DP=303;AF=0.023102;SB=13;DP4=114,181,0,7;EFF=STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W4687*|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|A)
NC_045512	14313	.	T	C	317.0	PASS	DP=301;AF=0.073090;SB=21;DP4=109,158,3,20;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D4688|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|C)

to the corresponding bit of output from v4.5covid19:

NC_045512.2	14268	.	G	A	103.0	PASS	DP=473;AF=0.019027;SB=1;DP4=179,284,3,7;EFF=SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T4668|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T276|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2	14272	.	G	T	71.0	PASS	DP=374;AF=0.016043;SB=3;DP4=141,226,4,3;EFF=STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E4670*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|T),STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E278*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|T|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2	14309	.	G	A	79.0	PASS	DP=334;AF=0.020958;SB=1;DP4=119,208,3,4;EFF=STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W4682*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W290*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2	14310	.	G	A	86.0	PASS	DP=303;AF=0.023102;SB=13;DP4=114,181,0,7;EFF=STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W4682*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W290*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2	14313	.	T	C	317.0	PASS	DP=301;AF=0.073090;SB=21;DP4=109,158,3,20;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D4683|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|C),SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D291|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|C|WARNING_TRANSCRIPT_NO_START_CODON)
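The five-residue offset is visible directly above (T4673 vs. T4668 at position 14268, and the 7101 vs. 7096 ORF1ab length). For existing v4.3 VCFs, one could in principle correct reported ORF1ab amino-acid positions post hoc; a minimal sketch, where the slippage-site residue is an assumption to be verified against the reference annotation:

    # Hypothetical post-hoc correction of ORF1ab AA positions reported by
    # SnpEff v4.3 (which treats ORF1ab as 7101 aa instead of 7096 aa).
    SHIFT = 5            # 7101 - 7096, matching the offset observed above
    SLIPPAGE_AA = 4401   # assumed residue of the ribosomal slippage site -- verify!

    def corrected_orf1ab_pos(reported: int) -> int:
        """Map a v4.3-reported ORF1ab AA position to reference numbering."""
        return reported - SHIFT if reported > SLIPPAGE_AA else reported

    assert corrected_orf1ab_pos(4673) == 4668  # matches the v4.5covid19 output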

Unfortunately, v4.5covid19 of snpeff is somewhat ill-defined: the conda-packaged version and the version on sourceforge differ in that the sourceforge version seems to be a complete SnpEff version (with covid19-related genomes included), while the bioconda version has only the covid19-related genomes, but lacks standard ones.

If we want to fix this annotation issue, @bgruening and I can think of four possibilities:

  1. Tweak the genbank file that we feed into snpeff build to have even v4.3 parse it correctly
    We haven't tested how much effort that would be, and it would mean that the WFs would be bound to this specific input, which feels a bit absurd.
  2. We take the built-in genome file from 4.5covid19, backport it to 4.3 (which means rewriting the version info inside the genome file), and offer it as a cached genome file on usegalaxy.* instances.
    I have already confirmed that this works, but, of course, you cannot then port the WFs to an instance that lacks the genome files.
  3. We fix the conda version of snpeff v4.5covid19, then release a new wrapper version of snpeff and have the WFs use that one.
    This is a bit more work and could cause ambiguities in the snpeff versioning, depending on when you installed it from conda.
  4. We create a new covid19-specific snpeff wrapper using the current bioconda package and add it to the toolshed alongside the current one.
    @bgruening has already prepared such a wrapper and could provide the PR for it.

If you're interested, please add your thoughts, ideas, and alternative suggestions here!

Little question about the datasets you used

Hello there!
Great repo, it's been really useful!
One question, though: we are downloading all the raw data available at SRA, especially project PRJNA607948, which provides sequences from both ONT and Illumina using the SISPA protocol.
We are finding split R1 and R2 files with different numbers of reads when we download them with fastq-dump. Have you faced this problem? I've seen you have them marked as successfully analyzed. Maybe we are making some stupid mistake.
We don't obtain the same number as stated by the authors here (https://openresearch.labkey.com/wiki/ZEST/Ncov/page.view?name=SARS-CoV-2%20Deep%20Sequencing) either.
Thank you very much!
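In case it helps with debugging: fastq-dump --split-files can produce unequal mate files when some spots lack one of the two reads, whereas the --split-3 option writes such unpaired reads to a separate third file. A quick parity check in Python (accession/file names are placeholders):

    import gzip

    def count_fastq_records(path):
        """Count FASTQ records (4 lines each) in a possibly gzipped file."""
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            return sum(1 for _ in f) // 4

    r1 = count_fastq_records("SRRXXXXXXX_1.fastq.gz")
    r2 = count_fastq_records("SRRXXXXXXX_2.fastq.gz")
    print(r1, r2, "OK" if r1 == r2 else "MISMATCH")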

vcf-vcf intersect error

Hi, I ran into this error during the ARTIC workflow:

index file localref.fa.fai not found, generating...
/Users/peerahemarajata/galaxy/database/jobs_directory/000/137/tool_script.sh: line 25: 48088 Segmentation fault: 11 vcfintersect -v -r 'localref.fa' -w "0" -i '/Users/peerahemarajata/galaxy/database/objects/4/9/1/dataset_491da1b9-10f9-42a5-8d32-75c302cb6dc1.dat' '/Users/peerahemarajata/galaxy/database/objects/f/e/5/dataset_fe5536e9-af7c-4279-91eb-3afd1a196ef8.dat' > '/Users/peerahemarajata/galaxy/database/objects/0/0/6/dataset_00634612-08a3-4a35-bb49-0184a8e303a4.dat'

I looked at the command, and there was no index file among the inputs for vcfintersect, only the reference FASTA. I copied that FASTA to the working history and specified it as the reference input for a re-run, but ran into the same problem. Thank you!

Missing Indels

My own analysis of Illumina (SRR11140750, bottom track) and nanopore (SRR11140751, top track) data from the same swab sample shows your variant analysis doesn't include indels:

[screenshot: genome-browser view of SRR11140751 (top, Nanopore) and SRR11140750 (bottom, Illumina) with indels visible]

You should probably include --call-indels in your call to lofreq call.

MultiQC does not accept genome results produced by QualiMap BamQC

At the very last step of the workflow, I encounter this error:

Module 'bamtools: 'Stats for BAM file(s)' not found in the file 'A12345_genome_results'

I looked at the data, and it did not contain the 'Stats for BAM file(s)' line that MultiQC seems to require. I tried changing tool versions but ended up with the same result. Any advice would be much appreciated! BTW, I am using Galaxy version 20.05.

Error while running Unicycler.

I am getting an error while running Unicycler.

tput: No value for $TERM and no -T specified
Error: the paired read input files have an unequal number of reads

How can I resolve it?

Setting Unicycler parameters for workflow.

Sir, I am building a workflow using Unicycler. I have to set two parameters, --min_fasta_length and --linear_seqs, to 10000 and 1 respectively. I set them while building the workflow and saved it so that both parameters would be reused the next time. But each time, it asks me to set the parameters at run time rather than using them as hardcoded values.

Error 404 on https://usegalaxy.eu/u/wolfgang-maier/w/covid-19-variation-analysis-on-se-data, linked from the https://covid19.galaxyproject.org/genomics/global_platform/#what-is-this page

SRR11313490 and SRR11313499

Hi,
You may have seen Jesse Bloom's recent preprint on the recovery of deleted SRA files. @jbloom managed to recover most of the data from Google Cloud, except for two accessions, SRR11313490 and SRR11313499. I Googled these IDs and found a dead link to your repository, but I could find the file again through the history (e.g. it is available as of commit f5b1766, 28 Apr 2020). Had you by any chance downloaded the corresponding SRA data, and do you still have them?
Thanks!

ARTIC workflow error during Realign Reads with samtools

Hi, I ran into the error below while running the ARTIC workflow, during the Realign Reads step:

sort: unrecognized option '--no-PG'
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT       Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT       Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n           Sort by read name
  -t TAG       Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE      Write final output to FILE rather than standard output
  -T PREFIX    Write temporary files to PREFIX.nnnn.bam
  --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
  --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form of OPTION or OPTION=VALUE
  --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]

Any suggestions would be highly appreciated. Thank you!

Explain outputs and choice of tree in MCRA analysis

  1. The MCRA README has an empty Outputs section

    With the rather large number of datasets produced by IQ-TREE, an overview of what to expect would be nice. BTW, why is the consensus tree dataset empty?

  2. I think it would be good to add a short rationale for why the ipynb uses IQ-TREE's BIONJ tree for further analysis rather than the maximum-likelihood tree. To me, at least, that seems like a surprising choice. The Outputs section (see 1.) seems like a good place for this.

  3. It would be great to mention where and how you can use the ipynb. It's probably in the manuscript, but it would be nice to have that info also in the repo itself.

Better cover page

@bgruening we need to update the front README with a summary of the main findings. I'll do the genomics part over the weekend. I want to prepare a new release by Monday morning, EU time.

Do we want workflow .ga files as part of the repo?

Pro:

  • each repo release and accompanying Zenodo snapshot would have an unchangeable workflow file in it, which is certainly a big plus for reproducibility
  • at least with pretty-printed JSON .ga files we gain some rudimentary diffing of workflow versions (see the sketch after this list)

Con:

  • extra work to keep repo versions and published workflow versions on Galaxy instances in sync
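Regarding the diffing point above: .ga files are JSON, so normalizing them makes textual diffs stable; a minimal sketch (file names are placeholders):

    import json

    # Normalize a Galaxy workflow file (.ga is JSON) for stable textual diffs:
    # sorted keys + fixed indentation make `git diff` output meaningful.
    with open("workflow.ga") as f:
        wf = json.load(f)
    with open("workflow.pretty.ga", "w") as f:
        json.dump(wf, f, indent=2, sort_keys=True)
        f.write("\n")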

Incompatibility between Pre-processing and Assembly wfs

Assembly is supposed to start with the pre-processing output, but @nekrut's assembly history shows that the input files there have been generated with Picard sam_to_fastq, while the pre-processing wf ends with samtools fastx.

In fact, when you try to combine the two workflows, you're getting a Unicycler error (at least on EU):
Error: the paired read input files have an unequal number of reads

Presumably, when using samtools fastx, one would have to exclude reads whose mate isn't mapped, or something similar. Alternatively, use the Picard conversion instead.
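As a stopgap for the unequal-read-count symptom, the two FASTQ files could also be re-paired by read ID before running Unicycler; a minimal sketch, assuming plain (uncompressed) FASTQ with standard /1 and /2 mate suffixes, and placeholder file names:

    # Naive in-memory sketch: keep only reads whose ID occurs in both files.
    # For real data a dedicated tool (e.g. seqkit pair) is a better choice.
    def fastq_records(path):
        """Yield (read_id, record_text) for each 4-line FASTQ record."""
        with open(path) as f:
            while True:
                record = [f.readline() for _ in range(4)]
                if not record[0]:
                    break
                rid = record[0].split()[0]
                if rid.endswith("/1") or rid.endswith("/2"):
                    rid = rid[:-2]  # drop the mate suffix, if present
                yield rid, "".join(record)

    r1 = dict(fastq_records("reads_1.fastq"))
    r2 = dict(fastq_records("reads_2.fastq"))
    with open("paired_1.fastq", "w") as out1, open("paired_2.fastq", "w") as out2:
        for rid, rec in r1.items():
            if rid in r2:
                out1.write(rec)
                out2.write(r2[rid])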
