galaxyproject / sars-cov-2

Ongoing analysis of COVID-19 using Galaxy, BioConda and public research infrastructures

Home Page: https://covid19.galaxyproject.org

License: MIT License


sars-cov-2's Introduction

home: true
heroText: Global platform for the analysis of SARS-CoV-2 data: Genomics, Cheminformatics, and Proteomics
description: Using open source tools and public cyberinfrastructure for transparent, reproducible analyses of viral datasets.

Visit our new SARS-CoV-2 surveillance site!

As of January 2022 this site will no longer be updated. Please visit our new site at https://galaxyproject.org/projects/covid19/ for the latest information on our surveillance efforts!

The goal of this resource is to provide publicly accessible infrastructure and workflows for SARS-CoV-2 data analyses. We currently feature three different types of analyses: Genomics, Cheminformatics, and Proteomics.

The analyses have been performed using the Galaxy platform and open source tools from BioConda. Tools were run using XSEDE resources maintained by the Texas Advanced Computing Center (TACC), Pittsburgh Supercomputing Center (PSC), and Indiana University in the U.S., de.NBI, VSC cloud resources and IFB cluster resources on the European side, STFC-IRIS at the Diamond Light Source, and ARDC cloud resources in Australia.

sars-cov-2's Issues

Error while running gffread.

I am getting an error while running gffread.

No fasta index found for genomeref.fa. Rebuilding, please wait..
Error: sequence lines in a FASTA record must have the same length!
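
The message indicates that the FASTA indexer expects every sequence line within a record (except the last) to have the same length. One possible workaround, sketched below under the assumption that uneven line wrapping is the only problem with the file, is to rewrap all sequences to a fixed width before running gffread. The function name `rewrap_fasta` and the default width of 60 are illustrative choices, not part of gffread or of the workflows.

```python
def rewrap_fasta(text: str, width: int = 60) -> str:
    """Return FASTA text with every sequence rewrapped to a fixed line width."""
    out = []
    seq_chunks = []

    def flush():
        # Emit the sequence collected so far in chunks of `width` characters.
        if seq_chunks:
            seq = "".join(seq_chunks)
            out.extend(seq[i:i + width] for i in range(0, len(seq), width))
            seq_chunks.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop blank lines, which can also confuse indexers
        if line.startswith(">"):
            flush()
            out.append(line)  # header lines are kept unchanged
        else:
            seq_chunks.append(line)
    flush()
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # "genomeref.fa" is the file name from the error message above.
    with open("genomeref.fa") as handle:
        fixed = rewrap_fasta(handle.read())
    with open("genomeref.rewrapped.fa", "w") as handle:
        handle.write(fixed)
```

Writing to a new file rather than overwriting the input keeps the original available for comparison.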

License

Is there a license that governs the use/distribution of these workflows?

Should variation analysis build on preprocessing?

My naive expectation was that I could use the output of the preprocessing workflow (at least the Illumina parts of it) as input to the variation analysis workflow, but this doesn't seem to be the case.

The public history seems to use input Illumina data that has been preprocessed (the stdout contains corresponding info), but it's unclear how exactly.

Is it possible to adjust the workflows to make them work together?

Wrong page included in /genomics/no-more-business-as-usual/

The "Deploy" section of the No more business as usual page repeats the same content as the "RecombinationSelection" section. Both use a <Content> component to include the page /genomics/no-more-business-as-usual/6-RecombinationSelection/, but I assume the "Deploy" section should include /genomics/no-more-business-as-usual/7-VariantsDescription/ instead.

## RecombinationSelection
<Content :page-key="$site.pages.find(p => p.path === '/genomics/no-more-business-as-usual/6-RecombinationSelection/').key"/>

## Deploy
<Content :page-key="$site.pages.find(p => p.path === '/genomics/no-more-business-as-usual/6-RecombinationSelection/').key"/>

Alignment section needs editing

Consensus genome and variant calling from Illumina only

Hi everyone

I have been working with the National Institute for Communicable Diseases (NICD) on some of their SARS-CoV-2 sequencing. I adapted the "variation" workflow to call variants in Illumina data and also produce an inferred consensus genome. The current "Assembly" workflow assumes that you have access to both Illumina and Nanopore genomes for a sample, which is a pretty rare situation. My workflow (with some TODOs in the step annotation) is in a gist. Comments and additions are welcome!

If it is found to be useful perhaps it can be incorporated into the COVID-19 resources page.

P.S. for those with ARTIC Amplicon data I created a workflow for analysing that as did Thanh le Viet. Pasting them here in case they are useful.

Fix links for the workflow badges for every instance

usegalaxy.org
usegalaxy.org.au
usegalaxy.eu

Source for number of variants and samples on main page

Currently, the main page states this under Results for Genomics:

These lists are updated daily. There are 397 sites showing intra-host variation across 33 samples (with frequencies between 5% and 95%). Twenty nine samples have fixed differences at 39 sites from the published reference.

This leaves two questions:

When were these numbers last updated (and based on which samples)?
When is a variant considered a fixed difference?

When I analyzed https://covid19.galaxyproject.org/genomics/4-Variation/variant_list.tsv today, I got this for the filter condition 0.95 >= float(af) >= 0.05:
Samples with variants: 378
Total number of variants observed: 260
Number of sites observed to carry variants: 259

For the fixed differences I tried float(af) == 1.0 giving:
Samples with variants: 55
Total number of variants observed: 27
Number of sites observed to carry variants: 27

and float(af) > 0.95 resulting in:
Samples with variants: 378
Total number of variants observed: 260
Number of sites observed to carry variants: 259
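
For reference, counts like the ones above could be produced with a sketch along the following lines. The column names `Sample`, `POS`, `REF`, `ALT` and `AF` are assumptions about the layout of variant_list.tsv and may need adjusting; the small inline table is made-up demonstration data, not taken from the real file.

```python
import csv
import io


def summarize(tsv_text, keep):
    """Count samples, distinct variants and distinct sites among rows whose AF passes `keep`."""
    samples, variants, sites = set(), set(), set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if keep(float(row["AF"])):  # assumed column name for allele frequency
            samples.add(row["Sample"])
            variants.add((row["POS"], row["REF"], row["ALT"]))
            sites.add(row["POS"])
    return len(samples), len(variants), len(sites)


# Made-up demonstration rows (hypothetical, for illustration only).
DEMO = (
    "Sample\tPOS\tREF\tALT\tAF\n"
    "S1\t100\tA\tG\t0.5\n"
    "S2\t100\tA\tT\t0.06\n"
    "S2\t200\tC\tT\t1.0\n"
    "S3\t300\tG\tA\t0.97\n"
)

if __name__ == "__main__":
    filters = {
        "intra-host (0.05 <= AF <= 0.95)": lambda af: 0.05 <= af <= 0.95,
        "fixed (AF == 1.0)": lambda af: af == 1.0,
        "fixed (AF > 0.95)": lambda af: af > 0.95,
    }
    for label, keep in filters.items():
        n_samples, n_variants, n_sites = summarize(DEMO, keep)
        print(label, n_samples, n_variants, n_sites)
```

Passing the filter as a function makes it easy to compare candidate definitions of a "fixed difference" on the same input.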

Should we rename MCRA to MRCA?

Unfortunately, I did not have time to read up on the question of which is the right acronym for most recent common ancestor analysis before, but from a bit of research now MRCA seems to be in much wider use (and also makes more sense). Whenever you find MCRA it looks more like a typo on pages that also use MRCA.

It would require a bit of effort, though, to change all occurrences of MCRA, not only in this repo but also in the workflow and the notebook.

snpeff build issues with NC_045512.2 genbank record

I noticed today (after a user complaint) that apparently version 4.3 of SnpEff, which is used in the Variation workflows, fails to parse the NC_045512.2 genbank file correctly at the snpeff build step.

This leads to:

1. missing annotations of effects on the mature peptides produced from the ORF1ab precursor
2. wrong reporting of the ORF1ab AA length as 7101 instead of 7096 in snpeff's EFF field
3. shifted amino acid numbers at ORF1ab positions past the ribosomal slippage site.
   All reported AA positions for variants downstream of this site seem to be shifted by 5 compared to the reference sequence.

None of this affects the tabular reports produced by SnpSift, because those do not include these bits of info, but it means that all VCFs produced by the workflows suffer from the annotation error in 3. (though 2. provides a hint at what happened).

The underlying bug in SnpEff has been fixed in the recent 4.5covid19 release of snpeff, which also provides a built-in NC_045512.2 genome file. Annotations with either that built-in genome or a genome rebuilt from the genbank file with the 4.5covid19 snpeff version yield identical and expected results.

Compare this output snippet of snpeff v4.3 for SRR10971381 taken from https://usegalaxy.org/datasets/bbd44e69cb8906b5ace4f1e83ab23509/display/?preview=True:

NC_045512	14268	.	G	A	103.0	PASS	DP=473;AF=0.019027;SB=1;DP4=179,284,3,7;EFF=SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T4673|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|A)
NC_045512	14272	.	G	T	71.0	PASS	DP=374;AF=0.016043;SB=3;DP4=141,226,4,3;EFF=STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E4675*|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|T)
NC_045512	14309	.	G	A	79.0	PASS	DP=334;AF=0.020958;SB=1;DP4=119,208,3,4;EFF=STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W4687*|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|A)
NC_045512	14310	.	G	A	86.0	PASS	DP=303;AF=0.023102;SB=13;DP4=114,181,0,7;EFF=STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W4687*|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|A)
NC_045512	14313	.	T	C	317.0	PASS	DP=301;AF=0.073090;SB=21;DP4=109,158,3,20;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D4688|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|C)

to the corresponding bit of output from v4.5covid19:

NC_045512.2	14268	.	G	A	103.0	PASS	DP=473;AF=0.019027;SB=1;DP4=179,284,3,7;EFF=SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T4668|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T276|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2	14272	.	G	T	71.0	PASS	DP=374;AF=0.016043;SB=3;DP4=141,226,4,3;EFF=STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E4670*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|T),STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E278*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|T|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2	14309	.	G	A	79.0	PASS	DP=334;AF=0.020958;SB=1;DP4=119,208,3,4;EFF=STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W4682*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W290*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2	14310	.	G	A	86.0	PASS	DP=303;AF=0.023102;SB=13;DP4=114,181,0,7;EFF=STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W4682*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W290*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2	14313	.	T	C	317.0	PASS	DP=301;AF=0.073090;SB=21;DP4=109,158,3,20;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D4683|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|C),SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D291|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|C|WARNING_TRANSCRIPT_NO_START_CODON)
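
The shift of 5 can be checked directly from the two snippets. The sketch below pulls the amino acid position out of the fourth `|`-separated subfield of an EFF entry and compares the v4.3 and v4.5covid19 values for the ORF1ab (GU280_gp01) annotations shown above. The helper name `aa_position` is illustrative, and the parsing assumes the classic `EFF=Effect(...)` layout seen in these lines.

```python
import re


def aa_position(eff_entry: str) -> int:
    """Extract the amino acid position from one EFF entry, e.g. 'T4673' gives 4673."""
    inner = eff_entry[eff_entry.index("(") + 1:eff_entry.rindex(")")]
    aa_change = inner.split("|")[3]  # 4th subfield holds the amino acid change
    return int(re.search(r"\d+", aa_change).group())


# (genome position, v4.3 entry, v4.5covid19 entry), copied from the output above.
PAIRS = [
    (14268,
     "SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T4673|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|A)",
     "SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T4668|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A)"),
    (14272,
     "STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E4675*|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|T)",
     "STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E4670*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|T)"),
    (14313,
     "SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D4688|7101|orf1ab|protein_coding|CODING|GU280_gp01|2|C)",
     "SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D4683|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|C)"),
]

# Difference between the v4.3 and v4.5covid19 amino acid positions per site.
SHIFTS = {pos: aa_position(old) - aa_position(new) for pos, old, new in PAIRS}

if __name__ == "__main__":
    for pos, shift in SHIFTS.items():
        print(pos, shift)
```

The same offset separates the two reported protein lengths (7101 versus 7096).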

Unfortunately, v4.5covid19 of snpeff is somewhat ill-defined: the conda-packaged version and the version on sourceforge differ in that the sourceforge version seems to be a complete SnpEff version (with covid19-related genomes included), while the bioconda version has only the covid19-related genomes, but lacks standard ones.

If we want to fix this annotation issue, @bgruening and I can think of four possibilities:

1. Tweak the genbank file that we feed into snpeff build so that even v4.3 parses it correctly.
   We haven't tested how much effort that would be, and it would mean that the WFs would be bound to this specific input, which feels a bit absurd.
2. Take the built-in genome file from 4.5covid19, backport it to 4.3 (which means rewriting the version info inside the genome file), and offer it as a cached genome file on usegalaxy.* instances.
   I confirmed already that this works, but, of course, you then cannot port the WFs to an instance that lacks the genome files.
3. Fix the conda version of snpeff v4.5covid19, then release a new wrapper version of snpeff and have the WFs use that one.
   This is a bit more work and could cause ambiguities in the snpeff versioning, depending on when you installed it from conda.
4. Create a new covid19-specific snpeff wrapper using the current bioconda package and add it to the toolshed alongside the current one.
   @bgruening has already prepared such a wrapper and could provide the PR for it.

If you're interested please add your thoughts, ideas, alternative suggestions here!

Little question about the datasets you used

Hello there!
Great repo, it's being really useful!
One question though: we are downloading all raw data available at SRA, especially the project PRJNA607948, which provides sequences for both ONT and Illumina using the SISPA protocol.
We are finding split R1 and R2 files with different numbers of reads when we download them with fastq-dump. Have you faced this problem? I've seen you have them marked as successfully analyzed. Maybe we are making some stupid mistake.
We don't obtain the same number as stated by the authors here (https://openresearch.labkey.com/wiki/ZEST/Ncov/page.view?name=SARS-CoV-2%20Deep%20Sequencing) either.
Thank you very much!
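
One way to quantify the mismatch is to compare the read names in the two files. The sketch below is a rough diagnostic that assumes standard four-line FASTQ records and that mates share a name up to an optional `/1` or `/2` suffix; the function names and file paths are illustrative.

```python
import gzip
from itertools import islice


def read_names(path):
    """Yield the read name (mate suffix removed) of every record in a 4-line FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for header in islice(handle, 0, None, 4):  # every 4th line is a header
            if not header.strip():
                continue  # tolerate a trailing blank line
            name = header[1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def compare_mates(r1_path, r2_path):
    """Report how many read names occur in only one file and how many in both."""
    r1, r2 = set(read_names(r1_path)), set(read_names(r2_path))
    return {"r1_only": len(r1 - r2), "r2_only": len(r2 - r1), "paired": len(r1 & r2)}


if __name__ == "__main__":
    # Hypothetical file names for a split fastq-dump download.
    print(compare_mates("sample_1.fastq", "sample_2.fastq"))
```

Nonzero `r1_only` or `r2_only` counts would point to reads whose mate is missing from the other file.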

vcf-vcf intersect error

Hi, I ran into this error during the ARTIC workflow:

index file localref.fa.fai not found, generating...
/Users/peerahemarajata/galaxy/database/jobs_directory/000/137/tool_script.sh: line 25: 48088 Segmentation fault: 11 vcfintersect -v -r 'localref.fa' -w "0" -i '/Users/peerahemarajata/galaxy/database/objects/4/9/1/dataset_491da1b9-10f9-42a5-8d32-75c302cb6dc1.dat' '/Users/peerahemarajata/galaxy/database/objects/f/e/5/dataset_fe5536e9-af7c-4279-91eb-3afd1a196ef8.dat' > '/Users/peerahemarajata/galaxy/database/objects/0/0/6/dataset_00634612-08a3-4a35-bb49-0184a8e303a4.dat'

I looked at the command and there was no index file as an input for vcfintersect, only the reference FASTA. I copied that FASTA to the working history and specified it as the reference input for a re-run, but ran into the same problem. Thank you!

Missing Indels

My own analysis of Illumina (SRR11140750, bottom track) and nanopore (SRR11140751, top track) data from the same swab sample shows your variant analysis doesn't include indels:

You probably should include --call-indels in your call to lofreq call
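
As a rough sketch of what that could look like outside Galaxy, the snippet below assembles the two LoFreq command lines involved. To the best of my understanding, LoFreq needs indel qualities in the BAM before it can call indels, which is why `lofreq indelqual --dindel` (intended for Illumina data) is run first; please check this against the LoFreq documentation. The function name and all file names are hypothetical.

```python
def lofreq_indel_commands(ref_fasta, in_bam, out_vcf, indelqual_bam="with_indelqual.bam"):
    """Build argv lists for adding indel qualities and then calling variants with indels."""
    add_qualities = [
        "lofreq", "indelqual", "--dindel",
        "-f", ref_fasta, "-o", indelqual_bam, in_bam,
    ]
    call = [
        "lofreq", "call", "--call-indels",
        "-f", ref_fasta, "-o", out_vcf, indelqual_bam,
    ]
    return add_qualities, call


if __name__ == "__main__":
    import subprocess

    # Hypothetical inputs; replace with real paths before running.
    for cmd in lofreq_indel_commands("NC_045512.2.fa", "SRR11140750.bam", "SRR11140750.vcf"):
        subprocess.run(cmd, check=True)
```

Building the commands as lists avoids shell quoting issues when file names contain spaces.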

MultiQC does not accept genome results produced by QualiMap BamQC

At the very last step of the workflow, I encounter this error:

Module 'bamtools: 'Stats for BAM file(s)' not found in the file 'A12345_genome_results'

I looked at the data and it did not have the 'Stats for BAM file(s)' line that seems to be required by MultiQC. I tried changing tool versions but got the same result. Any advice would be much appreciated! BTW, I am using Galaxy Version 20.05.

Error while running Unicycler.

I am getting an error while running Unicycler.

tput: No value for $TERM and no -T specified
Error: the paired read input files have an unequal number of reads

How to resolve it?

Setting Unicycler parameters for workflow.

Sir, I am building a workflow using Unicycler. I have to set two parameters, --min_fasta_length and --linear_seqs, to 10000 and 1 respectively. I set them while building the workflow and saved it so that both values would be used the next time. However, each time I run the workflow it asks me to set the parameters at run time rather than using them as hardcoded values.

Error 404 on https://usegalaxy.eu/u/wolfgang-maier/w/covid-19-variation-analysis-on-se-data links from https://covid19.galaxyproject.org/genomics/global_platform/#what-is-this page

Hi Amazing Galaxy world ;)

And a happy new year!

I just found a hyperlink giving an "error 404" message on this webpage: https://covid19.galaxyproject.org/genomics/global_platform/#what-is-this, in Table 1 / Workflow / EU (first row). It seems to be the same for all EU workflows (at least in this table).

Maybe:

Have a nice day,

Yvan

SRR11313490 and SRR11313499

Hi,
You may have seen Jesse Bloom's recent preprint on the recovery of deleted SRA files. @jbloom managed to recover most of the data from Google Cloud, except for two, SRR11313490 and SRR11313499. I Googled these IDs and found a dead link to your repository, but I was able to find it again through the history (e.g. it is available at Latest commit f5b1766 on 28 Apr 2020). Had you by any chance downloaded the corresponding SRA data, and do you still have them?
Thanks!

Docker image doesn't appear to contain hg38

Started Docker image as per:

https://covid19.galaxyproject.org/genomics/

with:

docker run -p 8080:80 quay.io/galaxy/covid-19-training

The initial workflow 1 - read pre-processing won't run because minimap2 expects a genome build, but can't find one. The error is: Parameter ref_file requires a value, but has no legal values defined.

Is hg38 included in the Docker image?

Thanks,

dave

ARTIC workflow error during Realign Reads with samtools

Hi, I ran into the error below while running the ARTIC workflow during the Realign Reads step:

sort: unrecognized option '--no-PG'
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-n Sort by read name
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]

Any suggestion would be highly appreciated. Thank you!
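
The usage text suggests that the installed samtools does not know the `--no-PG` option. To my knowledge this option was introduced in samtools 1.10, but treat that threshold as an assumption to verify against the samtools release notes. A minimal sketch for checking the installed version from the first line of `samtools --version` output:

```python
import re
import subprocess

# Assumed first release with --no-PG; verify against the samtools release notes.
MIN_VERSION_WITH_NO_PG = (1, 10)


def supports_no_pg(version_output: str) -> bool:
    """Return True if the `samtools --version` output reports at least the assumed minimum."""
    match = re.search(r"samtools (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a samtools version string")
    version = (int(match.group(1)), int(match.group(2)))
    return version >= MIN_VERSION_WITH_NO_PG


if __name__ == "__main__":
    result = subprocess.run(
        ["samtools", "--version"], capture_output=True, text=True, check=True
    )
    print("supports --no-PG:", supports_no_pg(result.stdout))
```

Comparing versions as integer tuples avoids the pitfall of "1.9" sorting after "1.10" as a string or a float.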

Explain outputs and choice of tree in MCRA analysis

1. The MCRA README has an empty Outputs section.
   With the rather large number of datasets produced by IQ-TREE, an overview about what to expect would be nice. BTW, why is the consensus tree dataset empty?
2. I think it would be good to add a short rationale why the ipynb uses iqtree's BIONJ tree for further analysis, and not the MaxLikelihood tree. To me, at least, that seems like a surprising choice. The outputs section (see 1.) seems like a good place for this.
3. It would be great to mention where and how you can use the ipynb. It's probably in the manuscript, but it would be nice to have that info also in the repo itself.

Metadata Update

Hello,
Your dataset was added to CoronaWhy (https://www.coronawhy.org/) Data Lake on Dataverse as a piece of common COVID-19 data https://datasets.coronawhy.org/dataset.xhtml?persistentId=doi:10.5072/FK2/ZNRXOE
Would you be willing to help with the maintenance of your dataset in Dataverse, e.g. adding the relevant metadata and keeping the dataset up-to-date? That will help to make the dataset findable and accessible for the medical science community.

Recombination workflow link needs fixing

The link in
https://github.com/galaxyproject/SARS-CoV-2/blob/master/Recombination/README.md#history-and-workflow

points to a history instead of to a workflow.

Better cover page

@bgruening we need to update the front README with a summary of the main findings. I'll do the genomics part over the weekend. I want to prepare a new release by Monday morning EU time.

MCRA workflow requires unspecified user input

The current workflow version at https://github.com/galaxyproject/SARS-CoV-2/blob/master/MCRA/Galaxy-Workflow-MCRA.ga (and also the one on usegalaxy.org) won't run with just the accession and dates file as described, but asks you to specify a List of fields for Step 4: Cut

It's not obvious what to set there.

Do we want workflow .ga files as part of the repo?

Pro:

each repo release and accompanying zenodo snapshot would have an unchangeable workflow file in it, which is certainly a big plus for reproducibility
at least with pretty-printed json .ga files we gain some rudimentary diffing of workflow versions

Con:

extra work to keep repo versions and published workflow versions on Galaxy instances in sync
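
To illustrate the diffing point: since .ga files are JSON, they can be normalized (fixed indentation, sorted keys) before committing so that line-based diffs only show real changes. The sketch below is one possible way to do this; the function names are illustrative and the key names in the example are made up, not the actual .ga schema.

```python
import difflib
import json


def normalize_workflow(ga_text: str) -> str:
    """Re-serialize workflow JSON with stable indentation and key order."""
    return json.dumps(json.loads(ga_text), indent=4, sort_keys=True) + "\n"


def diff_workflows(old_text: str, new_text: str) -> str:
    """Return a unified diff between two normalized workflow files."""
    old_lines = normalize_workflow(old_text).splitlines(keepends=True)
    new_lines = normalize_workflow(new_text).splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines, "old.ga", "new.ga"))


if __name__ == "__main__":
    # Hypothetical minimal documents; real .ga files have many more keys.
    old = '{"name": "MCRA", "steps": {"0": {"tool_version": "1.0"}}}'
    new = '{"steps": {"0": {"tool_version": "1.1"}}, "name": "MCRA"}'
    print(diff_workflows(old, new))
```

With sorted keys, two exports that differ only in key order normalize to identical text, so the diff stays empty.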

Is "COVID-19 genome" a correct description?

The repo README contains "COVID-19 genome".

As far as I understand "COVID-19" is the name for the disease. Should this be SARS-CoV-2 genome instead?

Incompatibility between Pre-processing and Assembly wfs

Assembly is supposed to start with the pre-processing output, but @nekrut's assembly history shows that the input files there have been generated with Picard sam_to_fastq, while the pre-processing wf ends with samtools fastx.

In fact, when you try to combine the two workflows, you get a Unicycler error (at least on EU):
Error: the paired read input files have an unequal number of reads

Presumably, if using samtools fastx, one would have to exclude reads the mate of which isn't mapped, or similar. Alternatively, use the Picard conversion instead.
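
As an illustration of the "exclude reads without a mate" idea, independent of any particular tool, the sketch below keeps only those FASTQ records whose read name occurs in both mate files. It works on in-memory records, assumes four-line FASTQ entries with an optional `/1` or `/2` name suffix, and is not how either workflow currently handles this; all names are illustrative.

```python
def base_name(header: str) -> str:
    """Read name without the leading '@', trailing comment, or /1 and /2 mate suffix."""
    name = header[1:].split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def parse_fastq(text: str):
    """Split FASTQ text into a list of 4-line record tuples."""
    lines = text.strip().splitlines()
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def keep_proper_pairs(r1_records, r2_records):
    """Drop records whose mate is missing; the input order of each file is preserved."""
    shared = {base_name(r[0]) for r in r1_records} & {base_name(r[0]) for r in r2_records}
    r1 = [r for r in r1_records if base_name(r[0]) in shared]
    r2 = [r for r in r2_records if base_name(r[0]) in shared]
    return r1, r2


if __name__ == "__main__":
    # Made-up records: read "b" has no mate in the second file.
    r1 = parse_fastq("@a/1\nAC\n+\nII\n@b/1\nAC\n+\nII\n@c/1\nAC\n+\nII\n")
    r2 = parse_fastq("@a/2\nGT\n+\nII\n@c/2\nGT\n+\nII\n")
    kept1, kept2 = keep_proper_pairs(r1, r2)
    print(len(kept1), len(kept2))
```

Because each file's order is preserved, the filtered files stay in sync only if the mates appeared in the same relative order in both inputs.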
