galaxyproject / sars-cov-2
Ongoing analysis of COVID-19 using Galaxy, BioConda and public research infrastructures
Home Page: https://covid19.galaxyproject.org
License: MIT License
Is there a license that governs the use/distribution of these workflows?
Unfortunately, I did not have time beforehand to read up on which acronym is right for most recent common ancestor analysis, but from a bit of research now, MRCA seems to be in much wider use (and also makes more sense). Wherever you find MCRA, it looks more like a typo on pages that otherwise use MRCA.
It would require a bit of effort, though, to change all occurrences of MCRA not only in this repo, but also in the workflow and the notebook.
I am getting an error while running Unicycler.
tput: No value for $TERM and no -T specified
Error: the paired read input files have an unequal number of reads
How to resolve it?
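For reference, that Unicycler message means the R1 and R2 files fell out of sync somewhere upstream. Below is a minimal sketch of re-pairing two FASTQ files by read ID; the file names are hypothetical, dedicated tools (e.g. seqkit's pairing command) do this more robustly, and the better fix is usually to repair the upstream conversion step:

```python
def read_fastq(path):
    """Return {read_id: full 4-line record text} for a FASTQ file.
    Strips a trailing /1 or /2 from the ID (assumes IDs contain no other '/')."""
    records = {}
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            rest = [fh.readline() for _ in range(3)]
            read_id = header[1:].split()[0].split("/")[0]
            records[read_id] = header + "".join(rest)
    return records

def repair(r1_in, r2_in, r1_out, r2_out):
    """Write only the reads present in both files, preserving R1 order.
    Returns the number of surviving pairs."""
    fwd = read_fastq(r1_in)
    rev = read_fastq(r2_in)
    shared = [rid for rid in fwd if rid in rev]
    with open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rid in shared:
            o1.write(fwd[rid])
            o2.write(rev[rid])
    return len(shared)
```

This loads both files into memory, which is fine for viral datasets but not for large genomes.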
The current workflow version at https://github.com/galaxyproject/SARS-CoV-2/blob/master/MCRA/Galaxy-Workflow-MCRA.ga (and also the one on usegalaxy.org) won't run with just the accession and dates file as described, but asks to specify a "List of fields" for Step 4: Cut.
It's not obvious what to set there.
Started Docker image as per:
https://covid19.galaxyproject.org/genomics/
with:
docker run -p 8080:80 quay.io/galaxy/covid-19-training
The initial workflow, 1 - read pre-processing, won't run because minimap2 expects a genome build but can't find one. The error is: Parameter ref_file requires a value, but has no legal values defined.
Is hg38 included in the Docker image?
Thanks,
dave
The link in
https://github.com/galaxyproject/SARS-CoV-2/blob/master/Recombination/README.md#history-and-workflow
points to a history instead of to a workflow.
Assembly is supposed to start with the pre-processing output, but @nekrut's assembly history shows that the input files there have been generated with Picard sam_to_fastq, while the pre-processing workflow ends with samtools fastx.
In fact, when you try to combine the two workflows, you get a Unicycler error (at least on EU):
Error: the paired read input files have an unequal number of reads
Presumably, if using samtools fastx, one would have to exclude reads whose mate isn't mapped, or similar. Alternatively, use the Picard conversion instead.
Hi,
You may have seen Jesse Bloom's recent preprint on the recovery of deleted SRA files. @jbloom managed to recover most of the data from Google Cloud, except for two runs, SRR11313490 and SRR11313499. I Googled these IDs and found a dead link to your repository, but I could find it back through the history (e.g. it is available at Latest commit f5b1766 on 28 Apr 2020). Had you by any chance downloaded the corresponding SRA data, and do you still have them?
Thanks!
Pro:
Con:
At the very last step of the workflow, I encounter this error:
Module 'bamtools: 'Stats for BAM file(s)' not found in the file 'A12345_genome_results'
I looked at the data, and it did not have the 'Stats for BAM file(s)' line that MultiQC seemed to require. I tried changing tool versions but ended up with the same result. Any advice would be much appreciated! BTW, I am using Galaxy version 20.05.
Hi everyone
I have been working with the National Institute for Communicable Diseases (NICD) on some of their SARS-CoV-2 sequencing. I adapted the "variation" workflow to call variants in Illumina data and also produce an inferred consensus genome. The current "Assembly" workflow assumes that you have access to both Illumina and Nanopore genomes for a sample, which is a pretty rare situation. My workflow (with some TODOs in the step annotation) is in a gist. Comments and additions are welcome!
If it is found to be useful perhaps it can be incorporated into the COVID-19 resources page.
P.S. for those with ARTIC amplicon data, I created a workflow for analysing that, as did Thanh le Viet. I'm pasting them here in case they are useful.
Hello,
Your dataset was added to CoronaWhy (https://www.coronawhy.org/) Data Lake on Dataverse as a piece of common COVID-19 data https://datasets.coronawhy.org/dataset.xhtml?persistentId=doi:10.5072/FK2/ZNRXOE
Would you be willing to help with the maintenance of your dataset in Dataverse, e.g. adding the relevant metadata and keeping the dataset up-to-date? That will help to make the dataset findable and accessible for the medical science community.
My naive expectation was that I could use the output of the preprocessing workflow (at least the Illumina parts of it) as input to the variation analysis workflow, but this doesn't seem to be the case.
The public history seems to use input Illumina data that has been preprocessed (the stdout contains corresponding info), but it's unclear how exactly.
Is it possible to adjust the workflows to make them work together?
Hi Amazing Galaxy world ;)
And a happy new year!
I just found a hyperlink, https://covid19.galaxyproject.org/genomics/global_platform/#what-is-this, giving an "error 404" message on this webpage: https://covid19.galaxyproject.org/genomics/global_platform/#what-is-this, in Table 1 / Workflow / EU (first line), and it seems to be the same for all EU workflows (at least in this table).
Maybe:
Have a nice day,
Yvan
The "Deploy" section of the No more business as usual page repeats the same content as the "RecombinationSelection" section. Both use a <Content> component to include the page /genomics/no-more-business-as-usual/6-RecombinationSelection/, but I assume the "Deploy" section should include /genomics/no-more-business-as-usual/7-VariantsDescription/ instead.
## RecombinationSelection
<Content :page-key="$site.pages.find(p => p.path === '/genomics/no-more-business-as-usual/6-RecombinationSelection/').key"/>
## Deploy
<Content :page-key="$site.pages.find(p => p.path === '/genomics/no-more-business-as-usual/6-RecombinationSelection/').key"/>
Is this possible, and if so can someone provide a short guide?
The repo README contains "COVID-19 genome".
As far as I understand, "COVID-19" is the name of the disease. Should this be "SARS-CoV-2 genome" instead?
Hi, I ran into this error during the ARTIC workflow:
index file localref.fa.fai not found, generating...
/Users/peerahemarajata/galaxy/database/jobs_directory/000/137/tool_script.sh: line 25: 48088 Segmentation fault: 11 vcfintersect -v -r 'localref.fa' -w "0" -i '/Users/peerahemarajata/galaxy/database/objects/4/9/1/dataset_491da1b9-10f9-42a5-8d32-75c302cb6dc1.dat' '/Users/peerahemarajata/galaxy/database/objects/f/e/5/dataset_fe5536e9-af7c-4279-91eb-3afd1a196ef8.dat' > '/Users/peerahemarajata/galaxy/database/objects/0/0/6/dataset_00634612-08a3-4a35-bb49-0184a8e303a4.dat'
I looked at the command and there was no index file as an input for vcfintersect, only the reference FASTA, which I copied to the working history and specified as the reference input for a re-run, but I ran into the same problem. Thank you!
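As background, the .fai index that the tool tries to generate is just a five-column table (sequence name, length, byte offset of the first base, bases per line, bytes per line), normally produced by samtools faidx. If pre-generating it helps with debugging, here is a minimal sketch under the assumptions of uniform line lengths within each record and Unix line endings (the function name is mine):

```python
def faidx(fasta_path):
    """Write fasta_path + '.fai' in samtools faidx format and return the entries."""
    entries = []
    name = None
    length = offset = linebases = linewidth = 0
    pos = 0  # running byte offset into the file
    with open(fasta_path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                if name is not None:
                    entries.append((name, length, offset, linebases, linewidth))
                name = line[1:].split()[0].decode()
                pos += len(line)
                offset = pos  # first base starts right after the header line
                length = linebases = linewidth = 0
            else:
                bases = len(line.rstrip(b"\n"))
                if linebases == 0:  # take line geometry from the first sequence line
                    linebases, linewidth = bases, len(line)
                length += bases
                pos += len(line)
    if name is not None:
        entries.append((name, length, offset, linebases, linewidth))
    with open(fasta_path + ".fai", "w") as out:
        for e in entries:
            out.write("\t".join(map(str, e)) + "\n")
    return entries
```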
Currently, the main page states this under Results for Genomics:
These lists are updated daily. There are 397 sites showing intra-host variation across 33 samples (with frequencies between 5% and 95%). Twenty nine samples have fixed differences at 39 sites from the published reference.
This leaves two questions:
When I analyzed https://covid19.galaxyproject.org/genomics/4-Variation/variant_list.tsv today, I got this for the filter condition 0.95 >= float(af) >= 0.05:
Samples with variants: 378
Total number of variants observed: 260
Number of sites observed to carry variants: 259
For the fixed differences I tried float(af) == 1.0, giving:
Samples with variants: 55
Total number of variants observed: 27
Number of sites observed to carry variants: 27
and float(af) > 0.95, resulting in:
Samples with variants: 378
Total number of variants observed: 260
Number of sites observed to carry variants: 259
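A sketch of the kind of counting used above; the column names (`sample`, `pos`, `af`) are assumptions about the variant_list.tsv layout:

```python
import csv

def count_variants(tsv_path, keep):
    """Count distinct samples, total variant rows, and distinct sites
    for rows whose allele frequency passes the keep() predicate."""
    samples, sites, total = set(), set(), 0
    with open(tsv_path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if keep(float(row["af"])):
                samples.add(row["sample"])
                sites.add(row["pos"])
                total += 1
    return len(samples), total, len(sites)

# Usage, mirroring the filter conditions quoted above:
# intra-host: count_variants("variant_list.tsv", lambda af: 0.05 <= af <= 0.95)
# fixed:      count_variants("variant_list.tsv", lambda af: af > 0.95)
```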
I noticed today (after a user complaint) that apparently SnpEff version 4.3, which is used in the Variation workflows, fails to parse the NC_045512.2 GenBank file correctly at the snpeff build step.
This leads to:
None of this affects the tabular reports produced by SnpSift, because those reports do not include these bits of info, but it means that all VCFs produced by the workflows suffer from the annotation error in 3. (though 2. provides a hint at what happened).
The underlying bug in SnpEff has been fixed in the recent 4.5covid19 release, which also provides a built-in NC_045512.2 genome file. Annotations with either that built-in genome or a genome rebuilt from the GenBank file with the 4.5covid19 version yield identical and expected results.
Compare this output snippet of snpeff v4.3 for SRR10971381 taken from https://usegalaxy.org/datasets/bbd44e69cb8906b5ace4f1e83ab23509/display/?preview=True:
NC_045512 | 14268 | . | G | A | 103.0 | PASS | DP=473;AF=0.019027;SB=1;DP4=179,284,3,7;EFF=SYNONYMOUS_CODING(LOW\|SILENT\|acG/acA\|T4673\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|A)
NC_045512 | 14272 | . | G | T | 71.0 | PASS | DP=374;AF=0.016043;SB=3;DP4=141,226,4,3;EFF=STOP_GAINED(HIGH\|NONSENSE\|Gag/Tag\|E4675*\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|T)
NC_045512 | 14309 | . | G | A | 79.0 | PASS | DP=334;AF=0.020958;SB=1;DP4=119,208,3,4;EFF=STOP_GAINED(HIGH\|NONSENSE\|tGg/tAg\|W4687*\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|A)
NC_045512 | 14310 | . | G | A | 86.0 | PASS | DP=303;AF=0.023102;SB=13;DP4=114,181,0,7;EFF=STOP_GAINED(HIGH\|NONSENSE\|tgG/tgA\|W4687*\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|A)
NC_045512 | 14313 | . | T | C | 317.0 | PASS | DP=301;AF=0.073090;SB=21;DP4=109,158,3,20;EFF=SYNONYMOUS_CODING(LOW\|SILENT\|gaT/gaC\|D4688\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|C)
to the corresponding bit of output from v4.5covid19:
NC_045512.2 14268 . G A 103.0 PASS DP=473;AF=0.019027;SB=1;DP4=179,284,3,7;EFF=SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T4668|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T276|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2 14272 . G T 71.0 PASS DP=374;AF=0.016043;SB=3;DP4=141,226,4,3;EFF=STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E4670*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|T),STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E278*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|T|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2 14309 . G A 79.0 PASS DP=334;AF=0.020958;SB=1;DP4=119,208,3,4;EFF=STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W4682*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W290*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2 14310 . G A 86.0 PASS DP=303;AF=0.023102;SB=13;DP4=114,181,0,7;EFF=STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W4682*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W290*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2 14313 . T C 317.0 PASS DP=301;AF=0.073090;SB=21;DP4=109,158,3,20;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D4683|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|C),SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D291|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|C|WARNING_TRANSCRIPT_NO_START_CODON)
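To compare such differences programmatically, one can split the EFF INFO field into its subfields. A small helper along these lines (the function name is mine; the subfield order follows the SnpEff EFF format shown in the snippets, where index 3 is the amino-acid change, e.g. T4673 vs T4668):

```python
import re

def parse_eff(info):
    """Return a list of (effect, subfields) tuples from a VCF INFO string
    carrying a SnpEff EFF= annotation; [] if no EFF field is present."""
    match = re.search(r"EFF=([^;]+)", info)
    if not match:
        return []
    effects = []
    for ann in match.group(1).split(","):  # one entry per annotated transcript
        effect, _, rest = ann.partition("(")
        effects.append((effect, rest.rstrip(")").split("|")))
    return effects
```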
Unfortunately, v4.5covid19 of SnpEff is somewhat ill-defined: the conda-packaged version and the version on SourceForge differ, in that the SourceForge version seems to be a complete SnpEff build (with the covid19-related genomes included), while the bioconda version has only the covid19-related genomes but lacks the standard ones.
If we want to fix this annotation issue, @bgruening and I can think of four possibilities:
If you're interested please add your thoughts, ideas, alternative suggestions here!
@bgruening we need to update the front README with a summary of the main findings. I'll do the genomics part over the weekend. I want to prepare a new release by Monday morning EU time.
Sir, I am building a workflow using Unicycler. I have to set two parameters, --min_fasta_length and --linear_seqs, to 10000 and 1 respectively. I set them while creating the workflow and saved it so that both parameters would be reused the next time. But each time it asks me to set the parameters at run time rather than using them as hardcoded values.
The MCRA README has an empty Outputs section.
Given the rather large number of datasets produced by IQ-TREE, an overview of what to expect would be nice. BTW, why is the consensus tree dataset empty?
I think it would be good to add a short rationale for why the ipynb uses IQ-TREE's BIONJ tree for further analysis rather than the maximum-likelihood tree. To me, at least, that seems like a surprising choice. The Outputs section (see 1.) seems like a good place for this.
It would be great to mention where and how you can use the ipynb. It's probably in the manuscript, but it would be nice to have that info also in the repo itself.
Hi, I ran into the error below while running the ARTIC workflow, during the Realign Reads step:
sort: unrecognized option '--no-PG'
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-n Sort by read name
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
Any suggestion would be highly appreciated. Thank you!
I am getting an error while running gffread.
No fasta index found for genomeref.fa. Rebuilding, please wait..
Error: sequence lines in a FASTA record must have the same length!
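That gffread message indicates that the sequence lines within a FASTA record have varying lengths, which its indexer rejects. One workaround is to rewrap the FASTA to a uniform width before running gffread; a minimal sketch (60 columns is an arbitrary, conventional choice, and tools like seqkit can do the same):

```python
def rewrap_fasta(in_path, out_path, width=60):
    """Rewrite a FASTA file so every sequence line (except possibly the
    last of each record) has exactly `width` bases."""
    with open(in_path) as fh, open(out_path, "w") as out:
        seq = []

        def flush():
            joined = "".join(seq)
            for i in range(0, len(joined), width):
                out.write(joined[i:i + width] + "\n")
            seq.clear()

        for line in fh:
            if line.startswith(">"):
                flush()           # emit the previous record's sequence
                out.write(line)   # keep the header untouched
            else:
                seq.append(line.strip())
        flush()                   # emit the final record
```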
Hello there!
Great repo, it's being really useful!
One question, though: we are downloading all raw data available at SRA, especially from project PRJNA607948, which provides sequences for both ONT and Illumina using the SISPA protocol.
We are finding split R1 and R2 files with different numbers of reads when we download them with fastq-dump. Have you faced this problem? I've seen you have them marked as successfully analyzed. Maybe we are making some stupid mistake.
We don't obtain the same numbers as stated by the authors here (https://openresearch.labkey.com/wiki/ZEST/Ncov/page.view?name=SARS-CoV-2%20Deep%20Sequencing) either.
Thank you very much!