scilifelab / ngi-rnaseq

This project forked from nf-core/rnaseq


Nextflow RNA-Seq Best Practice analysis pipeline, used at the SciLifeLab National Genomics Infrastructure.

Home Page: https://ngisweden.scilifelab.se/

License: MIT License

bioinformatics rna-seq rnaseq rna-sequencing pipeline nextflow

ngi-rnaseq's Introduction

NGI-RNAseq


This pipeline has moved!

This pipeline has been moved to the new nf-core project. You can now find it at https://github.com/nf-core/rnaseq

If you have any problems with the pipeline, please create an issue at the above repository instead.

To find out more about nf-core, visit http://nf-co.re/

This repository will be archived to maintain the released versions for future reruns, in the spirit of full reproducibility.

If you have any questions, please get in touch: [email protected]

// Phil Ewels, 2018-08-20


Introduction

NGI-RNAseq is a bioinformatics analysis pipeline used for RNA sequencing data.

It pre-processes raw data from FastQ inputs (FastQC, Trim Galore!), aligns the reads (STAR or HiSAT2), generates gene counts (featureCounts, StringTie) and performs extensive quality-control on the results (RSeQC, dupRadar, Preseq, edgeR, MultiQC). See the output documentation for more details of the results.

The pipeline is built using Nextflow, a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker / singularity containers making installation trivial and results highly reproducible.
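
For example, a minimal invocation (a sketch assuming Docker is available and using the bundled GRCh37 reference; the reads path is a placeholder) looks like this:

nextflow run SciLifeLab/NGI-RNAseq -profile docker \
    --reads 'path/to/*_R{1,2}.fastq.gz' \
    --genome GRCh37 \
    --outdir results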

The pipeline was written at the National Genomics Infrastructure at SciLifeLab Stockholm, Sweden.

Documentation

The NGI-RNAseq pipeline comes with documentation about the pipeline, found in the docs/ directory:

  1. Installation
  2. Pipeline configuration
  3. Running the pipeline
  4. Output and how to interpret the results
  5. Troubleshooting

Credits

These scripts were written at the National Genomics Infrastructure, part of SciLifeLab in Stockholm, Sweden. The pipeline was developed by Phil Ewels (@ewels) and Rickard Hammarén (@Hammarn). Docker and AWS integration was led by Denis Moreno (@Galithil) and Phil Ewels (@ewels).

Many thanks to others who have helped out along the way too, including (but not limited to): @pditommaso, @orzechoj, @apeltzer, @colindaven.

Participating Institutes

NGI-RNAseq is now used by a number of core sequencing and bioinformatics facilities. Some of these are listed below. If you use this pipeline too, please let us know in an issue and we will add you to the list.

National Genomics Infrastructure (NGI), Sweden https://ngisweden.scilifelab.se/
Quantitative Biology Center (QBiC), Germany https://portal.qbic.uni-tuebingen.de/portal/

SciLifeLab National Genomics Infrastructure


ngi-rnaseq's People

Contributors

chuan-wang, colindaven, ewels, galithil, hammarn, maxulysse, na399, pditommaso, rfenouil, senthil10, vezzi


ngi-rnaseq's Issues

Add optional fastq-screen step

Since we run NGI QC for all samples that we sequence, including RNA, it would save us CPU time if we could run fastq-screen as part of the pipeline. This would have to be optional, since most other users likely won't have a use for it.
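
A minimal sketch of what such an optional step might run, assuming a pre-built fastq_screen.conf listing the screening genomes:

# screen one FastQ file against the genomes configured in fastq_screen.conf
fastq_screen --conf fastq_screen.conf --aligner bowtie2 sample_R1.fastq.gz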

"Unable to load AWS credentials from any provider in the chain"

I ran NGI-RNAseq using Docker and it failed trying to get the genome files for STAR. It did run fastqc, cutadapt, etc. with no problem. Error below. I tried running it with a local fasta and gtf and it failed asking for a BED gene annotation file.

ERROR ~ Error executing process > 'star (null)'

Caused by:
Unable to load AWS credentials from any provider in the chain

Here's the command I ran:

nextflow run SciLifeLab/NGI-RNAseq -profile docker --singleEnd  \
--reads  '/data/NGS/NGS_Data/GEO/GSE85107_Armstrong2016_AML/fastq/SRR*fastq.gz' \
--genome GRCh37 --forward_stranded \
--outdir '/data/NGS/NGS_Data/GEO/GSE85107_Armstrong2016_AML/NGI-RNAseq.out' \

I'm running nextflow v 0.25.1 and docker v17.09.0-ce on CentOS 7.

Any thoughts? Since the AWS connection is in the docker VM, I shouldn't need any aws connection on my server, right? Thanks.
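
For reference, a run that supplies a full set of local annotation files (so nothing needs fetching from the iGenomes S3 bucket) would look something like this; all paths are placeholders:

nextflow run SciLifeLab/NGI-RNAseq -profile docker --singleEnd \
    --reads '/path/to/SRR*.fastq.gz' \
    --fasta /path/to/genome.fa \
    --gtf /path/to/genes.gtf \
    --bed12 /path/to/genes.bed12 \
    --outdir /path/to/results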

Add readme placeholder when not saving intermediate files

People are often thrown by the fact that there are no BAM files in the STAR / HISAT2 results folders by default. It would be great to save a placeholder README file there when --saveAlignedIntermediates is false, saying why the folder is empty and where to look for the BAM files.
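
A minimal sketch of what the placeholder step might do; the variable name and output path are assumptions, not existing pipeline code:

# only write the placeholder when intermediate BAMs are being discarded
if [ "$save_aligned_intermediates" = "false" ]; then
    echo "BAM files are not saved by default. Rerun with --saveAlignedIntermediates to keep them." \
        > results/STAR/README.txt
fi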

Plot dupRadar boxplots

The current MultiQC plot for dupRadar is a little tricky to interpret.

Instead of plotting the dotted line from this plot as is done currently:
dupradar_scatters

Plot the median values from this boxplot:
dupradar_boxplots

These values need extracting in the R script. They can then be plotted with the custom_content module if the output syntax is done properly.

error executing docker_test.sh - preseq library not found

Hi, I've installed the pipeline via Docker on my RedHat server machine. When running the Docker test script I get this error:

[23/fe7572] Submitted process > preseq (SRR4238359_subsampAlignedByCoord.out)
...
[ff/fdf462] Submitted process > preseq (SRR4238379_subsampAlignedByCoord.out)
ERROR ~ Error executing process > 'preseq (SRR4238355_subsampAlignedByCoord.out)'
Caused by:
  Process `preseq (SRR4238355_subsampAlignedByCoord.out)` terminated with an error exit status (127)
...
Command exit status:
  127
...
Command error:
  preseq: error while loading shared libraries: libgsl.so.0: cannot open shared object file: No such file or directory
  Signal 15 (TERM) caught by ps (3.3.12).

I've installed the required libgsl libraries outside Docker, but no change. Is this a server-specific issue, or a bug? I run the ChIPseq pipeline on the same machine flawlessly. Thanks.
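
As a first diagnostic it may help to check the library linkage inside the container rather than on the host, since installing libgsl outside Docker has no effect inside it; the image tag is a guess:

# run ldd against preseq inside the container to see which shared libraries resolve
docker run --rm scilifelab/ngi-rnaseq:latest bash -c 'ldd $(which preseq)'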

Show good / bad examples in documentation

It would be useful to supply examples of a good/expected output for all our supported programs, or we should at least specify whether the supplied results are from a good or bad experiment. featureCounts currently shows a library with rather low annotation rates.

GFF file input

Hello,
Is it possible to use a GFF file for the annotations instead of a GTF file?
Or do you know of any good means of converting between these two formats?

Kind regards,
John Hellgren
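
For reference, one common approach is gffread (shipped with the Cufflinks/StringTie suite), which can usually convert GFF3 annotation to GTF:

# convert GFF3 to GTF (-T selects GTF output)
gffread annotation.gff -T -o annotation.gtf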

Refactor/benchmark the RSeQC process

As we have discussed on multiple occasions, the scripts run in the RSeQC process have widely varying memory requirements. It would save resources if we could isolate the more memory-intensive one(s) into their own process. As it is now, the memory requirement for all of them is governed by the needs of the most intensive one.

Test data URL will go down soon

Currently, the test data for this and our other pipelines is hosted on the UPPMAX milou webexport resource. This is going to disappear pretty soon. UPPMAX have no plans to replace this service anywhere else, so we need to come up with an alternative solution.

Using a project on the SNIC Science Cloud could work (we have a project there). We could also set up a mini sandboxed web server on one of our local servers. Or we could try to use some other kind of public data hosting service (such as GitHub if the data isn't too big?).

Phil

Strandedness parameter? --strandRule '1++,1--,2+-,2-+' similar to --forward_stranded

I think those two parameters overlap. If we use the first strand to construct the library, the setting for HISAT2 is --rna-strandness fr-firststrand/RF; if instead we use the second strand, the setting is fr-secondstrand/FR. Liguo's RSeQC package uses infer_experiment.py to infer the strandedness: if the output is 1++,1--,2+-,2-+, I think this indicates the second strand was used to construct the library; if it is 1+-,1-+,2++,2--, the first strand was used (PE mode). In SE mode, first-strand should be +-,-+.

I am confused why you separate these similar outputs and create two independent parameter settings. Let me know if I am missing something here.
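
For reference, strandedness can be checked directly with RSeQC's infer_experiment.py, given an aligned BAM and a BED12 annotation:

# report the fraction of reads matching each strand rule
infer_experiment.py -i sample.bam -r genes.bed12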

Memory requirement for local jobs too high

The jobs run locally (i.e. on the login node) are get_software_versions and multiqc. These fail when I run them on bianca, where I can only get 7 GB of memory on the login node. Currently I use the pipeline with a base config file like this:

  $multiqc {
    memory = { check_max( 2.GB * task.attempt, 'memory' ) }
    executor = 'local'
    errorStrategy = { task.exitStatus in [143,137] ? 'retry' : 'ignore' }
  }
  $get_software_versions {
    memory = { check_max( 2.GB * task.attempt, 'memory' ) }
    executor = 'local'
    errorStrategy = 'ignore'
  }

Option for generic extra CL options for commands

Just a thought I had: it should be possible to pass extra options to certain processes in the pipeline without having to change the code, i.e. by supplying them on the command line. For example, supplying STAR with clip3pNbases could sometimes be useful (e.g. for the Clontech SMARTer PICO prep).
I guess we would have to set up a params entry for each process for supplying extra CL options; a sketch of the idea is below.
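
A hypothetical sketch of what that could look like on the command line; --star_extra_args is an invented name, not an existing pipeline option:

nextflow run SciLifeLab/NGI-RNAseq -profile docker \
    --reads '*_R{1,2}.fastq.gz' --genome GRCh37 \
    --star_extra_args '--clip3pNbases 3'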

Biotypes issues

Hi!

running the pipeline right now on some RNAseq samples (21).

.command.sh (looks good to me)

#!/bin/bash -euo pipefail
featureCounts -a Mus_musculus.GRCm38.90.gtf -g gene_id -o S634Nr25.1Aligned.sortedByCoord.out_gene.featureCounts.txt -p -s 0 S634Nr25.1Aligned.sortedByCoord.out.bam
featureCounts -a Mus_musculus.GRCm38.90.gtf -g gene_biotype -o S634Nr25.1Aligned.sortedByCoord.out_biotype.featureCounts.txt -p -s 0 S634Nr25.1Aligned.sortedByCoord.out.bam
cut -f 1,7 S634Nr25.1Aligned.sortedByCoord.out_biotype.featureCounts.txt > tmp_file
cat biotypes_header.txt tmp_file >> S634Nr25.1Aligned.sortedByCoord.out_biotype_counts_mqc.txt

However, the error.log shows something else:

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Mus_musculus.GRCm38.90.gtf ...                        ||
||    Features : 782519                                                       ||
||    Meta-features : 44                                                      ||
||    Chromosomes/contigs : 45                                                ||
||                                                                            ||
|| Process BAM file S634Nr25.1Aligned.sortedByCoord.out.bam...                ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 31659674                                              ||
||    Successfully assigned fragments : 25472400 (80.5%)                      ||
||    Running time : 2.95 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

cat: biotypes_header.txt: No such file or directory
.../work/fd/86270257e928dc71bfc94335dbfd9b/.command.stub: line 99: 114575 Terminated              nxf_trace "$pid" .command.trace

(which breaks that step apparently).

I cd into the respective directory and see this:

 ls -l
total 10
lrwxrwxrwx 1 iiipe01 ii       82 Mar 17 15:49 biotypes_header.txt -> /home/tu/iiipe01/.nextflow/assets/scilifelab/NGI-RNAseq/assets/biotypes_header.txt
lrwxrwxrwx 1 iiipe01 ii      108 Mar 17 15:49 Mus_musculus.GRCm38.90.gtf -> ../Reference/Ensembl_GRCm38/Annotation/Mus_musculus.GRCm38.90.gtf
lrwxrwxrwx 1 iiipe01 ii      124 Mar 17 15:49 S634Nr25.1Aligned.sortedByCoord.out.bam -> .../work/d6/fc1ba48497a4fda0562c2b6be94890/S634Nr25.1Aligned.sortedByCoord.out.bam
-rw-r--r-- 1 iiipe01 ii     1064 Mar 17 15:55 S634Nr25.1Aligned.sortedByCoord.out_biotype_counts_mqc.txt
-rw-r--r-- 1 iiipe01 ii 17936391 Mar 17 15:55 S634Nr25.1Aligned.sortedByCoord.out_biotype.featureCounts.txt
-rw-r--r-- 1 iiipe01 ii      325 Mar 17 15:55 S634Nr25.1Aligned.sortedByCoord.out_biotype.featureCounts.txt.summary
-rw-r--r-- 1 iiipe01 ii 19309041 Mar 17 15:52 S634Nr25.1Aligned.sortedByCoord.out_gene.featureCounts.txt
-rw-r--r-- 1 iiipe01 ii      325 Mar 17 15:52 S634Nr25.1Aligned.sortedByCoord.out_gene.featureCounts.txt.summary
-rw-r--r-- 1 iiipe01 ii     1064 Mar 17 15:55 tmp_file

(so it apparently also found its way there).

Resubmitting didn't change things and the way I called the pipeline should also be fine I guess:

nextflow run -profile binac scilifelab/NGI-RNAseq --reads 'RAW/*{1,2}.fastq.gz' --star_index ../Reference/star --fasta ../Reference/Ensembl_GRCm38/Sequence/Bowtie2Index/Mus_musculus.GRCm38.dna.toplevel.fa --gtf ../Reference/Ensembl_GRCm38/Annotation/Mus_musculus.GRCm38.90.gtf --bed12 ../Reference/Ensembl_GRCm38/Annotation/Mus_musculus.GRCm38.90.bed12 --saveReference -resume

(Note that I redacted the paths to the respective files accordingly, but wondered what might have gone wrong.)
Pipeline version 1.4dev / latest Docker image via Singularity with the binac profile.

I'll try to look into it myself and will update here, if I find something - but maybe somebody else spots an issue...

Merge featureCounts needs Python on UPPMAX

The merge_featurecounts step requires Python. Not everyone has this in their path by default, so it would be good to load the UPPMAX environment module for Py2.7 for consistency.
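
A minimal sketch of the fix inside the process script block, assuming the UPPMAX module system; the exact module name/version is an assumption:

# load a Python 2.7 environment module before calling the merge script
module load python/2.7.11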

Running with custom reference does not produce output, but claims successful run

Hej!

I am trying to run the NGI-RNAseq Nextflow pipeline. I am executing the command:
nextflow run SciLifeLab/NGI-RNAseq -r 1.0.4 -c my.config -w analysis/nextflow_workdir

The config file my.config looks like this:

params {
  project = 'b2016346'
  aligner = 'star'
  outdir = 'analysis/nextflow_out'
  reads = 'data/RNAseq/C.Ohlsson_16_01/P6373_*_R{1,2}.fastq.gz'
  reverse_stranded = true
  genome = 'GRCm38.dna.primary_assembly'
  genomes {
    'GRCm38.dna.primary_assembly' {
      fasta = 'data/RNAseq/ref/Mus_musculus.GRCm38.dna.primary_assembly.fa'
      gtf = 'data/RNAseq/ref/Mus_musculus.GRCm38.87.gtf'
    }
  }
}

The pipeline runs, submits jobs, and exits claiming success. However, looking at the output, only FastQC on raw and trimmed reads, plus building the STAR index, is carried out. Does anyone have any idea why the remaining steps of the pipeline are not carried out? I have not managed to extract any (for me useful) information from the log file.

Changing the run to use one of the pre-built genomes, i.e. --genome GRCm38, instead results in a successful run with all expected output files.

Add file merging step

RNA samples are often run spread across multiple lanes. Before Nextflow is started, the input files need to be merged so that each sample has a single FastQ file, or two if paired-end.

eg:

4_160412_AC8B7GANXX_P4701_1006_1.fastq.gz
4_160412_AC8B7GANXX_P4701_1006_2.fastq.gz
5_160412_AC8B7GANXX_P4701_1006_1.fastq.gz
5_160412_AC8B7GANXX_P4701_1006_2.fastq.gz
6_160412_AC8B7GANXX_P4701_1006_1.fastq.gz
6_160412_AC8B7GANXX_P4701_1006_2.fastq.gz
1_240512_DJ3B8OSJXV_P4701_1006_1.fastq.gz
1_240512_DJ3B8OSJXV_P4701_1006_2.fastq.gz

becomes:

P4701_1006_1.fastq.gz
P4701_1006_2.fastq.gz
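
A minimal sketch of the merge using the example files above; gzip streams can simply be concatenated:

# per-lane files start with a lane digit, so the merged output is not re-matched
cat ?_*_P4701_1006_1.fastq.gz > P4701_1006_1.fastq.gz
cat ?_*_P4701_1006_2.fastq.gz > P4701_1006_2.fastq.gz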

Container path not expanded properly with Singularity on Uppmax

As the title says. Log output:

Container      : docker://
Pulling Singularity image docker:// [cache /crex2/proj/sllstore2017079/nobackup/private/rickard/test_phil_PR/NGI-RNAseq/tests/work/singularity/.img]
ERROR ~ Error executing process > 'trim_galore (SRR4238351_subsamp)'

Caused by:
  Failed to pull singularity image
  command: singularity pull --name .img docker:// > /dev/null
  message:
  WARNING: pull for Docker Hub is not guaranteed to produce the
  WARNING: same image on repeated pull. Use Singularity Registry
  WARNING: (shub://) to pull exactly equivalent images.
  ERROR Could not parse image ""! Exiting.
  ERROR: pulling container failed!

Merge featureCounts results

Currently each sample has a single featureCounts results file.

It would be nice to have a pipeline process that merges all of these into a single file for easier downstream processing, with a counts column for each sample.
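
A minimal shell sketch of such a merge, assuming all files share an identical gene order and that the counts sit in column 7 of the featureCounts output:

files=(*_gene.featureCounts.txt)
cut -f 1 "${files[0]}" > merged_gene_counts.txt   # Geneid column
for f in "${files[@]}"; do
    # append each sample's count column to the merged table
    paste merged_gene_counts.txt <(cut -f 7 "$f") > tmp && mv tmp merged_gene_counts.txt
done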

threads management and single node of toy configuration of Torque in the local machine?

I have 18 human samples of paired-end RNA-seq data. I used the default --profile base setting to run the NGI-RNAseq pipeline. However, the program quickly consumed a lot of hardware resources: checking with the 'top' command, I found the load was above 70%, even though I had specified process.cpus = 40 to restrict the maximum CPU usage (as I understood it). That restriction failed; the number of CPUs used reached ~60, and my local machine has 64 cores.

I then used Torque (4.2.7) and Maui (3.3) to simulate a 'PBS' environment. I edited the base config file and set the executor to 'pbs' and cpus = 40. However, when I ran the script, qstat showed most of the submitted jobs waiting in the queue (status Q), and the top command showed that CPU and memory usage were both low.

I think I have missed something in the configuration, either in Torque or in the Nextflow config?

Here is the output from Torque (qmgr -c 'p s'):
torque.txt

Here is the nextflow config adapted from the base and modified according to my understanding.
wanglab.txt

Any suggestions? Thanks in advance.

Save BAM index file

We create a BAM index file during the RSeQC step, but don't save it to results. It would probably be better to move this index generation step to an earlier process (or a new process) and save it to results.
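
For reference, the index itself is a one-liner with samtools, so a small dedicated process would suffice:

# index a coordinate-sorted BAM; writes sample.sortedByCoord.bam.bai alongside it
samtools index sample.sortedByCoord.bam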

Problems without specific revision of pipeline when pulling via Singularity

Hi!
neither on local nor on our cluster (binac) I can download the appropriate image...

[warm up] executor > local
ERROR ~ Error executing process > 'fastqc (null)'

Caused by:
  Failed to pull singularity image
  command: singularity pull --name scilifelab-ngi-rnaseq-master.img docker://scilifelab/ngi-rnaseq:master > /dev/null
  status : 1
  message:
    WARNING: pull for Docker Hub is not guaranteed to produce the
    WARNING: same image on repeated pull. Use Singularity Registry
    WARNING: (shub://) to pull exactly equivalent images.
    ERROR MANIFEST_UNKNOWN: manifest unknown
    ERROR: pulling container failed!
-- Check '.nextflow.log' file for details

I think the problematic part is this code piece here in base.config:

container = { "scilifelab/ngi-rnaseq:${workflow.revision ? workflow.revision : 'latest'}" }

This defaults to something else, namely this command:

singularity pull --name scilifelab-ngi-rnaseq-master.img docker://scilifelab/ngi-rnaseq:master

Unfortunately, this doesn't work - remove the master tag and things are fine ;-)

I can send a pull request to fix this in case you'd be interested.

Bug report: address 0x6, cause 'memory not mapped' error in the sample_correlation process

Environment: irma4; nextflow version 0.25.7.4531

Command:

/lupus/ngi/production/v1.9.2/sw/nextflow/nextflow run /lupus/ngi/production/v1.9.2/sw//ngi-rnaseq/main.nf -c /lupus/ngi/production/v1.9.2/conf//ngi-rnaseq_sthlm.config --reads P8861*{R1,R2}.fastq.gz --genome GRCm38 --email xxx@xxx -name P8861RNABP --saveTrimmed --saveAlignedIntermediates

Error message:

Loading required package: limma

 *** caught segfault ***
Loading required package: limma

 *** caught segfault ***
address 0x6, cause 'memory not mapped'

Traceback:
 1: dyn.load(file, DLLpath = DLLpath, ...)
 2: library.dynam("limma", pkgname, libname)
 3: fun(libname, pkgname)
 4: doTryCatch(return(expr), name, parentenv, handler)
Loading required package: limma

 *** caught segfault ***
address 0x6, cause 'memory not mapped'

Traceback:
 1: dyn.load(file, DLLpath = DLLpath, ...)
 2: library.dynam("limma", pkgname, libname)
 3: fun(libname, pkgname)
 4: doTryCatch(return(expr), name, parentenv, handler)
 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 6: tryCatchList(expr, classes, parentenv, handlers)
 7: tryCatch(fun(libname, pkgname), error = identity)
 8: runHook(".onLoad", env, package.lib, package)
 9: loadNamespace(package, c(which.lib.loc, lib.loc))
10: doTryCatch(return(expr), name, parentenv, handler)
11: tryCatchOne(expr, names, parentenv, handlers[[1L]])
12: tryCatchList(expr, classes, parentenv, handlers)
13: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        msg <- conditionMessage(e)        sm <- strsplit(msg, "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && identical(getOption("show.error.messages"),         TRUE)) {        cat(msg, file = stderr())        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
14: try({    ns <- loadNamespace(package, c(which.lib.loc, lib.loc))    env <- attachNamespace(ns, pos = pos, deps)})
15: library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,     warn.conflicts = warn.conflicts, quietly = quietly)
16: doTryCatch(return(expr), name, parentenv, handler)
17: tryCatchOne(expr, names, parentenv, handlers[[1L]])
18: tryCatchList(expr, classes, parentenv, handlers)
19: tryCatch(library(package, lib.loc = lib.loc, character.only = TRUE,     logical.return = TRUE, warn.conflicts = warn.conflicts, quietly = quietly),     error = function(e) e)
20: require("limma")
aborting ...
/lupus/proj/ngi2016003/nobackup/NGI/ANALYSIS/P8861/rnaseq_ngi/work/e4/a848ee8d343c458fe762ea9a79f679/.command.sh: line 2: 16570 Segmentation fault      (core dumped) edgeR_heatmap_MDS.r "rlocation=/lupus/ngi/production/v1.9.2/sw//ngi-rnaseq/r_packages" P8861_120Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_101Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_136Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_117Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_133Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_126Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_102Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_123Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_135Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_125Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_109Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_108Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_111Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_114Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_127Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_113Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_105Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_115Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_124Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_119Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_132Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_134Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_118Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_107Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_103Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_112Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_122Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_116Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_131Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_128Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_106Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_110Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_104Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_121Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_129Aligned.sortedByCoord.out_gene.featureCounts.txt P8861_130Aligned.sortedByCoord.out_gene.featureCounts.txt

Optional merge process?

Could be nice to have an optional merge_regex param. If specified, the FastQ files would first run through a merge process, as done in Cluster Flow.

Single end command suggestions

To make the pipeline work with single end data as well as paired-end, we will need different commands for the pre-alignment steps. Once we have a bam file, everything should be the same. I think.

Current paired-end specific jobs are:

  • FastQC
    • fastqc -q ${read1} ${read2}
  • Trim Galore!
    • trim_galore --paired --gzip --fastqc_args "-q" $read1 $read2
  • STAR:
STAR --genomeDir $index \\
         --sjdbGTFfile $gtf \\
         --readFilesIn $trimmed_read1 $trimmed_read2 \\
         --runThreadN ${task.cpus} \\
         --twopassMode Basic \\
         --outWigType bedGraph \\
         --outSAMtype BAM SortedByCoordinate\\
         --readFilesCommand zcat\\
         --outFileNamePrefix \$prefix

FastQC is easy, that runs on a per-file basis anyway so we just remove read 2:

fastqc -q ${read1}

Trim Galore! isn't much more difficult:

trim_galore --gzip --fastqc_args "-q" $read1

And again for STAR I think we just drop read 2:

STAR --genomeDir $index \\
         --sjdbGTFfile $gtf \\
         --readFilesIn $trimmed_read1\\
         --runThreadN ${task.cpus} \\
         --twopassMode Basic \\
         --outWigType bedGraph \\
         --outSAMtype BAM SortedByCoordinate\\
         --readFilesCommand zcat\\
         --outFileNamePrefix \$prefix

This one will need testing as I'm not so familiar with STAR.

biotypes chart showing only last 7 entries

Hi,
Is this on purpose? It selects only the last seven biotypes from tmp_file, missing protein-coding genes and other important groups, leaving the chart quite meaningless (at least in my case).

cut -f 1,7 ${bam_featurecounts.baseName}_biotype.featureCounts.txt | tail -n 7 > tmp_file

I'd suggest adding something like
cut -f 1,7 ${bam_featurecounts.baseName}_biotype.featureCounts.txt | sort -k 2 -n -r | head -n 10 > tmp_file

thanks,
JK

Problems running on C3SE Hebbe

Hello,
I'm trying to run the pipeline on Hebbe (C3SE), but I have some problems with the configuration, which makes the pipeline fail at the genebody_coverage and markDuplicates processes due to memory problems.

I can tell you straight away that I have limited linux and programming skills, but I'll try to explain the problem anyway.

The pipeline gives me this error:

[35/e80538] NOTE: Error submitting process 'genebody_coverage (NC1_S1_L002_R1_001AlignedByCoord.out)' for execution -- Error is ignored

ERROR ~ Error executing process > 'markDuplicates (NC1_S1_L002_R1_001AlignedByCoord.out)'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:

  sbatch .command.run

Command exit status:
  1

Command output:
  sbatch: error: rejecting job, too much memory requested for 2 cores. You must use -C MEMXXX or increase number of requested cores instead of using --mem: pn_min_memory = 16384
  sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification"

This is my configuration file (from this guide)

config_TEMPLATE.txt

I've had a similar problem with starting Trim Galore, but that was fixed by adding cpus = 8 for trim_galore in the config file. I tried to do the same for dupradar (not sure whether it's connected to markDuplicates), but that gave the same error.
I also tried to use this config file instead, but then Trim Galore could not start due to memory problems.

Here's my nextflow script:
rna_seq_script.txt

I submitted the job using this command line:

sbatch -C "MEM64" rna_seq_script.sh

log files:
NGI-RNAseq_trace.txt
nextflow.log
mystdout.txt

Any suggestions?

Best regards,
John Hellgren

dupRadar error

Hi, awesome pipeline, thanks for making it available. I'm new to nextflow and R, so am not confident working this out, but I get the following error at the dupRadar step. Any ideas? Using R 3.3.0 (I note you have 3.2.3).

Command error:
  Loading required package: dupRadar
  Error in paste0(bam_md, "_intercept_slope.txt") : 
    object 'bam_md' not found
  Calls: cat -> paste0
  Execution halted

Cannot download/upload from s3 within nextflow

Not sure if this is a nextflow issue or not.

When I run the following command:

nextflow run SciLifeLab/NGI-RNAseq \
    --reads 's3://tx_analysis/fq/*{1,2}.fastq.gz' \
    --outdir 's3://tx_analysis/results/2017-10-02-initial' \
    --saveReference \
    --saveTrimmed \
    --saveAlignedIntermediates \
    --email '[email protected]' \
    --genome GRCh37 \
    -profile aws \
    -resume

it comes back as saying that no reads exist:

ERROR ~ Cannot find any reads matching: s3://tx_analysis/fq/*{1,2}.fastq.gz
NB: Path needs to be enclosed in quotes!
NB: Path requires at least one * wildcard!
If this is single-end data, please specify --singleEnd on the command line.

Doing aws s3 ls s3://tx_analysis/fq/ works. Also, just experimenting on my own, I was able to get Nextflow to find files on S3 using Channel.fromPath('s3://tx_analysis/fq'); but there is no wildcard used there.

I copied the AMI to US-east-2 in order to do this. Maybe that's the issue?

trim_galore cannot find fastq files

Hi
I followed the instructions to install the NGI-RNAseq pipeline and tried the following:

nextflow run NGI-RNAseq/main.nf --reads '/gpfs/software/nextflow/test_data/*_R{1,2}.fastq.gz' --genome hg19 --project test1

My fastq data is located in /gpfs/software/nextflow/test_data/, but trim_galore still cannot find the fastq files:

ERROR ~ Error executing process > 'trim_galore (test)'

Caused by:
  Failed to submit job to grid scheduler for execution

Command executed:

  bsub

Command exit status:
  255

Command output:
  (more omitted..)
  
         bsub -pack job_submission_file
  
         bsub -h[elp] [all] [description] [category_name ...]
         [-option_name ...]
  
         bsub -V
  
  Categories and options
         Use the keyword all to display all options and the keyword
         description to display a detailed description of the bsub com-
         mand. For more details on specific categories and options, spec-
         ify bsub -h with the name of the categories and options.
  
  	  Category: dc
  	  Dynamic Cluster only. Submit Dynamic Cluster jobs:
  	  -dc_chkpntvm, -dc_mtype, -dc_tmpl, -dc_vmaction.
  
  	  Category: io
  	  Specify input/output files and directories: -cwd, -e, -eo, -f,
  	  -i, -is, -o, -oo, -outdir, -tty.
  
  	  Category: limit
  	  Specify limits: -c, -C, -D, -F, -hl, -M, -p, -S, -T, -ul, -v,
  	  -W, -We.
  
  	  Category: notify
  	  Control user notification: -B, -K, -N, -u.
  
  	  Category: pack
  	  Submit job packs (single files containing multiple job
  	  requests): -pack.
  
  	  Category: properties
  	  Specify job submission properties: -app, -ar, -E, -env, -Ep,
  	  -g, -I, -Ip, -Is, -IS, -ISp, -ISs, -IX, -J, -Jd, -jsdl,
  	  -jsdl_strict, -k, -K, -L, -P, -q, -Q, -r, -rn, -rnc, -s, -sla,
  	  -sp, -wa, -wt, -XF, -Zs.
  
  	  Category: resource
  	  Specify resources and resource requirements: -clusters, -freq,
  	  -hostfile, -m, -n, -network, -R, -U.
  
  	  Category: schedule
  	  Control job scheduling and dispatch: -a, -b, -clusters, -ext,
  	  -G, -H, -Lp, -mig, -t, -ti, -U, -w, -x.
  
  	  Category: script
  	  Use job scripts: -Zs.
  

Work dir:
  /gpfs/software/nextflow/work/c3/334795be28241ac16775755acf3009

When running the .command.sh inside, it appears that trim_galore cannot find the fastq files.

STAR Index Generation

Hi everyone,

just found one potential thing that could be interesting to investigate:

In the STAR index generation phase, the option --sjdbOverhang 149 is used (the default is 100).
Is there a specific reason for that? In my case, the genome generate step failed in some cases - whereas it ran through with the exact same command with default settings.

The ideal value is apparently MateLength - 1, and the author suggests that for longer reads the default value of 100 should be fine, even stating that it will work fine in most cases; see the manual here (page 5) and the discussion here.

Maybe we should consider using the default here too?
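
For comparison, an index build with the default overhang might look like this; all paths are placeholders:

STAR --runMode genomeGenerate \
     --genomeDir star_index/ \
     --genomeFastaFiles genome.fa \
     --sjdbGTFfile genes.gtf \
     --sjdbOverhang 100 \
     --runThreadN 8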

Add warning to the top of MultiQC reports if samples are skipped due to low alignment

Although a warning is given in the Nextflow log when a sample is skipped, it's not so obvious after the run is finished and forgotten.

It would be good to add a section to the top of the MultiQC report highlighting any skipped samples. We should be able to do this using a custom content file and kind of trick MultiQC.

For example, call it alerts_mqc.yaml and have something like this (untested):

id: ngi-rnaseq-alerts
section_name: NGI-RNAseq Warnings
description: |
  <div class="alert alert-danger">
    <p><strong>Warning!</strong> 14 samples were halted after alignment due to very low alignment rates.</p>
    <ul>
      <li>Sample 1 (2.3% aligned)</li>
      <li>Sample 2 (0.4% aligned)</li>
    </ul>
  </div>

I'm not quite sure what happens if you don't supply any plot data to Custom Content, so if this doesn't work right away then I'll take a look into the inner-workings of MultiQC.

Look into adding RSEM

A few people have asked about using RSEM to generate gene counts instead of (in addition to?) StringTie. It would be interesting to look into adding this either as an addition or as an alternative to StringTie.

From e-mail thread:

I was wondering whether you plan to also include RSEM for transcript-level quantification in near future (Irizarry's group has shown that RSEM slightly outperforms other methods (Teng et al., Genome Biology, 2016)). StringTie was not included in the comparison though.
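
A minimal sketch of an RSEM quantification step, assuming an RSEM reference has already been built with rsem-prepare-reference; paths and names are placeholders:

# quantify one paired-end sample, using STAR as the internal aligner
rsem-calculate-expression --paired-end --star -p 8 \
    sample_R1.fastq sample_R2.fastq rsem_ref/GRCh38 sample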

Displaying version numbers in the MultiQC report

It would be useful to have the version number of each software used when running the pipeline.

Steps of the pipeline that contain the version number, and in which files:

  • FastQC - in the .html report: grep -e "version"
  • Trim Galore! - in the trimming report: Cutadapt version: 1.9.1
  • STAR - in Log.out, top line: STAR version=STAR_2.5.1b
  • featureCounts - in _biotype.featureCounts.txt: # Program:featureCounts v1.5.0

These processes currently lack information about their version number in their log files:

  • dupradar
  • MarkDuplicates
  • Preseq
  • stringtie
  • samplecorrelation
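
A minimal sketch of collecting the known ones at runtime into a single file that MultiQC could pick up; tool availability on the PATH is assumed:

fastqc --version       >  software_versions.txt
trim_galore --version  >> software_versions.txt
STAR --version         >> software_versions.txt
featureCounts -v      2>> software_versions.txt   # featureCounts prints its version to stderr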

IDEA: DOIs for repository

Hi everyone,

https://zenodo.org/ offers the possibility to create a DOI for a repository on a per-release scheme automatically. Is this something we might integrate here as well? Users could then cite the pipeline directly with a fixed version number and accompanying DOI.

I can offer to take care of it and add a badge to the main repository too, if there is some interest. At least I believe it doesn't harm ;-)

Collect software versions

It's nice to have all of the software versions in the summary HTML file.

I've just done this in the NGI-MethylSeq pipeline, see the stdout parsing here.

Look into nextflow clean

As of v0.22.0 of Nextflow, there are new command line options log and clean which can be used to delete working directories from pipeline runs.

It could be interesting to look into using these after successful pipeline executions? Then we don't use loads of space routinely but can rerun from cached files in case of an error.
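
For reference, the pairing might look like this: list previous runs with nextflow log, then delete the work files of a finished run with nextflow clean:

nextflow log                  # list run names and statuses
nextflow clean -f <run_name>  # remove that run's work directories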
