The ngi-exoseq from scilifelab

ngi-exoseq's Issues

Software Versions collection in a single process

Once we have a single container for each process, can we define an easy way to collect software versions more cleanly?
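One possible shape for this, as a minimal sketch only: each process writes its tool's version output to a small text file, and a single collecting step parses all of them. The `<tool>.version.txt` naming convention and the function names below are assumptions made for illustration, not something the pipeline currently does.

```python
import re
from pathlib import Path

# Assumed convention (hypothetical): every process writes "<tool>.version.txt"
# containing whatever the tool prints for its version.
VERSION_SUFFIX = ".version.txt"
VERSION_PATTERN = re.compile(r"(\d+(?:\.\d+)+)")


def extract_version(text):
    """Return the first dotted version number found in text, or 'unknown'."""
    match = VERSION_PATTERN.search(text)
    return match.group(1) if match else "unknown"


def collect_versions(directory):
    """Map tool name -> version for every version file in the directory."""
    versions = {}
    for path in sorted(Path(directory).glob("*" + VERSION_SUFFIX)):
        tool = path.name[: -len(VERSION_SUFFIX)]
        versions[tool] = extract_version(path.read_text())
    return versions
```

A single downstream process could then call `collect_versions(".")` on the staged files and write the result out in whatever format the report expects.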

JointDiscovery Workflow: Open Points

According to the GATK Best Practices (BP), VQSR requires a total of >30 exomes or 1 WGS sample.

Acquire 35 exome BAMs from 1000G
Generate gVCFs for these
Use them for jointDiscovery workflow (VQSR)

Try to generalise pipeline more

At the moment some of the commands are fairly tied to the NGI infrastructure (e.g. regexes that assume sample names look like P1234). It would be good to generalise this as much as possible, moving things out into param variables where required.
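To illustrate the idea of moving a hard-coded sample-name regex into a parameter, here is a small sketch. The default pattern, the named group `sample` and the function name are hypothetical; the actual regexes used in the pipeline are not shown in this issue.

```python
import re

# Hypothetical default that mimics NGI-style names such as "P1234".
# Other sites would override this via a pipeline parameter.
NGI_STYLE_PATTERN = r"(?P<sample>P\d+)"


def sample_id(filename, pattern=NGI_STYLE_PATTERN):
    """Extract the sample ID from a file name using a configurable regex.

    The pattern must define a named group called 'sample'.
    """
    match = re.search(pattern, filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match pattern {pattern!r}")
    return match.group("sample")
```

With this, `sample_id("P1234_R1.fastq.gz")` keeps the current behaviour, while a site with different naming could pass something like `r"^(?P<sample>[^_]+)_R[12]"` instead.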

Add Travis Tests to ExoSeq

We should have Travis test cases for ExoSeq ASAP.

To be implemented

The following still have to be implemented:

Build a fat container

Create Dockerfile
Try pushing that to quay.io in nf-core repository
Adapt config files to utilize this instead of different containers
Rename GATK 4 to gatk-launch (and all other tools, too)

Benefits: We can have a single process collecting metrics instead of having to do that in individual processes, which is a "cleaner" approach.

Waiting for @ewels to allow quay.io access to nf-core....

Documentation update

Ideally, we should have the same structure as in the NGI-RNAseq repository and as proposed by the CookieCutter module @ewels recently provided

Integrate Reports

I submitted a pull request to integrate MultiQC support in ExoSeq. This covers all tools in the pipeline, including:

FastQC
Picard MarkDuplicates
GATK VariantEval
QualiMap
SnpEff

Indel realignment obsolete

Hi!

Took some inspiration from your project for another exome pipeline I am working on. I noticed that you are still using the indel realignment step, as in the older GATK best practices. I just wanted to point out that this step is no longer considered best practice when HaplotypeCaller is used:
https://software.broadinstitute.org/gatk/blog?id=7847

Will cut down on the processing time significantly, I reckon.

Cheers,
Marc

Perform final code review and cleanup

We should at some point do a code review and clean the repo of leftovers that should no longer be there by then ;-)

Implement GATK 4 Support

GATK 4 will be published on January 9th 2018. We should consider moving most of the calls to support GATK 4 directly as it both speeds up important parts of the analysis and achieves higher sensitivity and specificity.

That mostly involves generating a new container for GATK 4, setting it up and checking whether the calls have changed (which they almost certainly haven't, at least not too much).

Support for mixed Capture Methods

The current version of the pipeline supports "only" a single kit for all samples. We have the situation that people come here with "mixed" datasets (e.g. Agilent v3, v4, v5) and then need everything analyzed together. While this is not optimal, there is certainly at least a partial overlap between samples and we should support setting things up.

I assume something like a tab-separated file with <ID>\t<kit_type> on each line should work as input if a mixed input is used.
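As a rough sketch of how such an input could be read, the snippet below groups sample IDs by kit from the `<ID>\t<kit_type>` layout suggested above. The function name, the handling of `#` comment lines and the example kit labels are assumptions for illustration.

```python
import csv
from collections import defaultdict


def group_samples_by_kit(handle):
    """Read '<ID>\\t<kit_type>' lines and return {kit_type: [sample IDs]}.

    Blank lines and lines starting with '#' are skipped (an assumption,
    the issue does not define a comment syntax).
    """
    groups = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(row)}")
        sample, kit = (field.strip() for field in row)
        groups[kit].append(sample)
    return dict(groups)
```

The pipeline could then pick the matching BED file per group, or compute the overlap between kits before joint analysis.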

Evaluate / Integrate Google's DeepVariant

Evaluate if an optional process can prepare Exome-Kit files

We should evaluate whether we can automatically prepare Exome-Kit files for the pipeline if users don't explicitly specify which BED files to use for a certain exome kit. Currently, we only have documentation that describes how to achieve that, but this process could potentially be automated, too.

Polish MultiQC Report

Ideally, do something similar to CAW: collect more information in a separate process, then create the appropriate YAML file for MultiQC.

Automatically check Reference Files are compatible

The pipeline should be able to check certain files for consistency, e.g. determine whether the reference genome is in the same order as the selected dbSNP files and exome BED file(s). Otherwise the pipeline will break at a later point, confusing users and annoying developers too ;-)
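A minimal sketch of such a check, assuming the contig order of the reference is taken from a tab-separated index (for example a `.fai` file) and compared against the first column of a BED or VCF file. The function names and the skipped header prefixes are assumptions, and a real check would also need to handle compressed files.

```python
def contigs_in_order(lines, header_prefixes=("#", "track", "browser")):
    """Return contig names (first tab-separated column) in order of first appearance."""
    ordered = []
    seen = set()
    for line in lines:
        if not line.strip() or line.startswith(header_prefixes):
            continue  # skip blank lines and VCF/BED header lines
        contig = line.split("\t")[0].strip()
        if contig not in seen:
            seen.add(contig)
            ordered.append(contig)
    return ordered


def check_order(reference_contigs, other_contigs):
    """Return a list of problems; an empty list means the files are compatible."""
    rank = {name: index for index, name in enumerate(reference_contigs)}
    problems = [
        f"contig {name!r} is missing from the reference"
        for name in other_contigs
        if name not in rank
    ]
    ranks = [rank[name] for name in other_contigs if name in rank]
    if ranks != sorted(ranks):
        problems.append("contigs are not in reference order")
    return problems
```

Running this early, e.g. `check_order(contigs_in_order(open("genome.fa.fai")), contigs_in_order(open("kit.bed")))` with hypothetical file names, would let the pipeline fail with a clear message before any alignment starts.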

Move kits to a base params definition (similar to igenomes)

I'll move the description of all kits and genome files to a base params definition and document this.

E.g. you only specify -genome "blagarbl" and the pipeline looks up in the cluster-specific configuration / base configuration where the files for the requested genome are located.

move paths to a certain "base dir" (and not hard link them)
documentation update to specify what is expected in such a path
test things
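The lookup described above can be sketched as a table of relative paths that is resolved against a configurable base directory. This is only an illustration in Python of the intended behaviour; the genome key, file names and directory layout are hypothetical, and the real definition would live in the pipeline's configuration files.

```python
from pathlib import Path

# Hypothetical base definition: paths are relative to a base dir,
# not hard-linked to one cluster's file system.
GENOMES = {
    "GRCh37": {
        "fasta": "GRCh37/genome.fa",
        "dbsnp": "GRCh37/dbsnp.vcf.gz",
    },
}


def resolve_genome(name, base_dir, genomes=GENOMES):
    """Return {file type: absolute-style path} for the requested genome."""
    if name not in genomes:
        known = ", ".join(sorted(genomes))
        raise KeyError(f"unknown genome {name!r}; known genomes: {known}")
    return {key: Path(base_dir) / relative for key, relative in genomes[name].items()}
```

A cluster-specific configuration would then only need to set the base directory, while the relative layout stays documented in one place.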
