vanheeringen-lab / seq2science

Automated and customizable preprocessing of Next-Generation Sequencing data, including full (sc)ATAC-seq, ChIP-seq, and (sc)RNA-seq workflows. Works equally well with public and local data.

Home Page: https://vanheeringen-lab.github.io/seq2science

License: MIT License

Python 77.03% AngelScript 0.57% Shell 11.01% R 10.88% Perl 0.51%

snakemake bioinformatics bioinformatics-pipeline atac-seq rna-seq chip-seq reproducible-research ngs pipeline sra

seq2science's Issues

Rule map_auto fails when tmp files are present

If the bwa mapping gets interrupted, the sort process leaves *tmp.*.bam files. Either sort needs to be updated, or these files need to be deleted before a run starts.
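
The second option (deleting leftovers before a run starts) could look roughly like the sketch below. It only reuses the `*tmp.*.bam` pattern from the description above; the function name and the `bam_dir` argument are hypothetical and not taken from the workflow.

```python
from pathlib import Path


def remove_stale_sort_tmp(bam_dir):
    """Delete leftover sort temporary files and return their names.

    The '*tmp.*.bam' pattern is copied from the issue text; bam_dir is a
    hypothetical argument, not a path taken from the workflow.
    """
    removed = []
    for tmp_file in sorted(Path(bam_dir).glob("*tmp.*.bam")):
        tmp_file.unlink()
        removed.append(tmp_file.name)
    return removed
```

Such a helper would presumably be called once before mapping starts, so that an interrupted earlier run cannot interfere with the new one.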

Add resources (e.g. mem usage) to each rule

Based on benchmark results?
#5

GSMs can contain multiple SRRs

example: GSM352203

Split based on PE or SE, remove splitsplot architecture

Allow for replicates in peak calling

In a cluster environment, every rule redoes the PE & SE lookup

Implement --consider_already_trimmed in trim_galore

Wait for a new release; see FelixKrueger/TrimGalore#64

Submit multiple rules together that are not multithreaded for cluster with no shared nodes (e.g. cartesius)

related: #8 #7

The dedup bam file should be indexed

samtools index can be run on the bam file that is generated by the rule mark_duplicates

SRA downloading improvements

https://github.com/vanheeringen-lab/snakemake-workflows/blob/7946b5b6bd36dd8e07ccd810b29a1d7a6d7d2bca/rules/get_fastq.smk#L2

In this rule, it might be worth adding a few extra checks and print/echo statements for logging purposes. When this rule fails, it is almost never clear why or at which point it failed.
Maybe add a retry for esearch or equery if these tools fail due to a timeout?
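
A minimal sketch of the retry idea, assuming the tools are invoked as shell commands from Python. The function name, the number of attempts and the waiting time are all hypothetical choices, not settings from the workflow. Each failed attempt is printed, which also helps with the logging concern above.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, wait_seconds=5.0, timeout=60.0):
    """Run a shell command, retrying on a timeout or a non-zero exit status."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                cmd,
                shell=True,
                check=True,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            return result.stdout
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as error:
            last_error = error
            # Log every failure so it is clear where and why the rule stopped.
            print(f"attempt {attempt}/{attempts} failed: {cmd!r} ({error})")
            if attempt < attempts:
                time.sleep(wait_seconds)
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}") from last_error
```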

Pipeline fails on samples with restricted access

See for instance GSM1127077, which links to https://www.ncbi.nlm.nih.gov/sra?term=SRX1157701. This is only accessible via dbGaP (which we should not bother supporting at the moment).

Replace hardcoded folder names with variables

Dynamic ascp

An elegant way of incorporating the installation of ascp:
https://github.com/vanheeringen-lab/snakemake-workflows/blob/54f7e5aaca2f410d120a8a1fcace7a4d18797102/rules/get_fastq.smk#L1-L16

Currently the ascp path and the key are hardcoded.

options:

- install ascp manually, and make sure the hardcoded paths are correct (current situation)
- install ascp manually, do a lookup
  - lookup once per rule
  - lookup once per workflow (e.g. in onstart)
- install ascp through a rule, so the hardcoded paths are enforced
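
The "do a lookup" option could be sketched as follows, assuming ascp is expected on PATH. The function name is hypothetical, and the location of the key file is deliberately left out because nothing above says where it lives. Caching the result means the lookup happens once per Python process; in a workflow that would roughly correspond to doing it once in onstart.

```python
import shutil
from functools import lru_cache


@lru_cache(maxsize=None)
def find_ascp(executable="ascp"):
    """Return the absolute path of the executable, looked up once and then cached."""
    path = shutil.which(executable)
    if path is None:
        raise FileNotFoundError(
            f"'{executable}' was not found on PATH; install it or set its path explicitly"
        )
    return path
```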

Alignment (splot dumped files)

Get it working

base pair shift

For completeness, we could add the base pair shift:
https://www.biostars.org/p/187204/#187206
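
As an illustration of what such a shift does, here is a minimal sketch on plain coordinates. The +4/-5 offsets are the convention commonly used for ATAC-seq and are an assumption here, not something fixed in this issue; the function name is hypothetical, and shifting the whole interval rather than only the 5' end is a simplification.

```python
def shift_read(start, end, strand, plus_shift=4, minus_shift=-5):
    """Shift a read interval by a strand-specific offset.

    Plus-strand reads move by plus_shift, minus-strand reads by minus_shift.
    The +4/-5 defaults are an assumed convention, not a workflow setting.
    Coordinates are clamped at zero.
    """
    if strand == "+":
        offset = plus_shift
    elif strand == "-":
        offset = minus_shift
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return max(0, start + offset), max(0, end + offset)
```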

MACS2: remove RNEXT flag for peak calling

MACS2 throws away half of the reads when using BAM mode. When using BAMPE the reads get 'interpolated' in between, which we do not want for ATAC-seq.

Ideally, align as paired-end data, but for peak calling remove the information that the reads are paired-end (the RNEXT field in the bam file)?

Split rules + environments into more logical groups

Have the pipeline work on local files (not downloaded)

Implement automated tests with travis

#14 related

Make threads scale automatically for cluster with no shared nodes (e.g. cartesius)

Rules that use a variable number of threads should use the maximum when run on e.g. cartesius.

Snakemake validate only sets defaults when a column is missing, not when values are missing

Ideally it would also set defaults when a value is missing.
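
One possible workaround is to fill in missing values after validation. The sketch below works on plain dictionaries, one per samples row, and treats `None` and the empty string as missing. The function name and the column names in the usage example are hypothetical, and NaN values (as a pandas table would contain) are not handled.

```python
def fill_missing_values(rows, defaults):
    """Return copies of rows in which absent or empty values take the default."""
    missing = (None, "")
    filled = []
    for row in rows:
        new_row = dict(row)
        for column, default in defaults.items():
            # Covers both a missing column and a present column with no value.
            if new_row.get(column) in missing:
                new_row[column] = default
        filled.append(new_row)
    return filled


rows = [{"sample": "a", "assembly": ""}, {"sample": "b"}]
filled_rows = fill_missing_values(rows, {"assembly": "hg38"})
```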

sra2fastq_PE dumping goes wrong if lowercase exists in result_dir

MACS2: combine more than two replicates

Currently only supported for 2 replicates, maybe PR?

Incremental configurations

It might be confusing to see all the parameters that do not apply. Maybe use one config.schema.yaml per workflow?

Keep logs in the workflow; they now migrate to subworkflows

When no log is specified, it is set automatically:
https://github.com/vanheeringen-lab/snakemake-workflows/blob/a58efbeb0cf5d7b92e669029a280f2c924d83709/schemas/config.schema.yaml#L27-L30

When working with local files, print better error messages when these cannot be found

Local files may not be found because they are not in result_dir or fastq_dir, or because their fqsuffix and/or fqext is wrong. The error

Checking if samples are single-end or paired-end...
CalledProcessError in line 59 of /home/sande/Dropbox/Studie/PhD/snakemake-workflows/rules/configuration.smk:
Command 'esearch -api_key ba36a74749126e0d9558b7e19967417c3407 -db sra -query GSM12345555555 | efetch -api_key ba36a74749126e0d9558b7e19967417c3407 | grep -Po "(?<=<LIBRARY_LAYOUT><)[^/><]*"' returned non-zero exit status 1.
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/workflows/atac_seq/Snakefile", line 11, in <module>
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/rules/configuration.smk", line 78, in <module>
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/rules/configuration.smk", line 78, in <dictcomp>
  File "/home/sande/anaconda3/envs/snakemake-workflows/lib/python3.7/multiprocessing/pool.py", line 657, in get
  File "/home/sande/anaconda3/envs/snakemake-workflows/lib/python3.7/multiprocessing/pool.py", line 121, in worker
  File "/home/sande/Dropbox/Studie/PhD/snakemake-workflows/rules/configuration.smk", line 59, in get_layout
  File "/home/sande/anaconda3/envs/snakemake-workflows/lib/python3.7/subprocess.py", line 395, in check_output
  File "/home/sande/anaconda3/envs/snakemake-workflows/lib/python3.7/subprocess.py", line 487, in run

is completely uninformative
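
A more informative failure could list exactly which paths were tried before any online lookup is attempted. The sketch below assumes a file naming pattern of `<sample>.<fqsuffix>.gz`, which is a guess and not the workflow's actual pattern; the function name and the `search_dirs` argument are hypothetical as well.

```python
from pathlib import Path


def require_local_fastq(sample, search_dirs, fqsuffix="fastq"):
    """Return the first existing fastq file for sample, or raise a clear error.

    The '<sample>.<fqsuffix>.gz' naming pattern is an assumption made for
    this sketch.
    """
    candidates = [Path(d) / f"{sample}.{fqsuffix}.gz" for d in search_dirs]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    tried = "\n".join(f"  - {candidate}" for candidate in candidates)
    raise FileNotFoundError(
        f"Sample '{sample}' was not found locally. Looked for:\n{tried}\n"
        "Check result_dir, fastq_dir, fqsuffix and fqext in the config."
    )
```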

Change subworkflow architecture into include architecture?

ATAC-seq

Move https://github.com/vanheeringen-lab/atac-seq to snakemake-workflows!

Config files are not inherited

Ternary operators as output seem to be allowed, remove PE & SE rules?

If the multiqc output files already exist (for instance, from a previous test run), then it will automatically create files with a _1 suffix. However, this means the workflow will fail as the "correct" files according to snakemake are not generated.
