sanjaynagi / ampseeker

A state-of-the-art snakemake workflow for amplicon sequencing

Home Page: https://sanjaynagi.github.io/AmpSeeker/

Python 18.72% Jupyter Notebook 81.28%

amplicon-sequencing gene-drive genomic-surveillance illumina insecticide-resistance snakemake-workflow

ampseeker's Introduction

Hello there 👋

I'm Sanjay Curtis Nagi, a researcher studying the major malaria mosquito Anopheles gambiae 🦟

ampseeker's People

Contributors

Watchers

Forkers

chabbytmd eddug poppy541

ampseeker's Issues

Replace TrimFastQs from bbmap to alternative tool

bbduk is not a dependency; let's avoid it.

create symlink to jupyter-book which is stored in main workflow directory

or results/ directory.

The symlink should open the book straight in the web browser when followed.

STDOUT message upon running pipeline

For example:

add quality control steps post variant-calling

Things like filtering on depth
Heterozygosity (Hardy-Weinberg)
Need inspiration for other thresholds, worth looking back at Ag1000G QC steps, or any other AmpSeq pipelines/analyses.
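As a starting point, the two checks listed above could be sketched in plain Python. This is only an illustration: the function names, the `min_depth` default, and the genotype-count inputs are assumptions, not part of AmpSeeker.

```python
def passes_depth(depth, min_depth=10):
    """Return True if a call has at least `min_depth` reads (threshold is an assumption)."""
    return depth >= min_depth


def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions
    at a biallelic site, given observed genotype counts."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return 0.0
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom_ref, n_het, n_hom_alt)
    # Skip classes with zero expectation (monomorphic sites).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

A site with a large statistic (for example a strong heterozygote deficit or excess) could then be flagged for review rather than dropped automatically.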

AgamDao modules

We will need to implement a species ID script for AgamDao.

This will be the first protocol-specific bit of analysis, and so we will need to think about the best way to approach this. There could simply be a series of options in the config for each protocol AmpSeeker supports.

Modules:
  AgamDao: True #anopheles
  GRC1: False #plasmodium

TODO

Species ID
KDR haplotypes/diplotypes
???

coverage by sample

add coverage by sample in the notebook

implement bcl to fastq conversion

We need a rule at the start of the workflow to convert and demultiplex BCL files from the Illumina MiSeq output directory to FASTQ.
The command should be something like:

bcl-convert --bcl-input-directory {illumina_out_dir} --output-directory resources/reads --sample-sheet {illumina_out_dir}/SampleSheet.csv
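A minimal sketch of how that command line could be assembled in Python, for instance from a helper called by the rule. The function name is hypothetical; the flags and the `resources/reads` output path are taken from the command above.

```python
import shlex
from pathlib import Path


def bcl_convert_command(illumina_out_dir, output_dir="resources/reads"):
    """Build the bcl-convert argument list for one Illumina run folder."""
    illumina_out_dir = Path(illumina_out_dir)
    return [
        "bcl-convert",
        "--bcl-input-directory", str(illumina_out_dir),
        "--output-directory", str(output_dir),
        "--sample-sheet", str(illumina_out_dir / "SampleSheet.csv"),
    ]


# Print the command as a single shell-quoted string.
print(shlex.join(bcl_convert_command("run_folder")))
```

Returning a list rather than a string keeps paths with spaces safe if the command is later passed to `subprocess.run`.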

CI does not test starting from Illumina BCL folder

We only test starting from reads.

Assess index read quality

This will involve getting bcl-convert to produce FASTQs for the index reads. However, I could only get bcl2fastq to do this, not bcl-convert, so we might want to switch to bcl2fastq, the older software.

We can then use fastqc or fastp on the index reads.

Heatmap of reads per well

A script which, after demultiplexing, counts reads per sample and makes a heatmap of reads per well for each input plate.
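The reshaping step behind such a heatmap might look like the sketch below. It assumes 96-well plates (rows A-H, columns 1-12) and well IDs such as `A1` or `B03`; neither assumption comes from the issue, and the function name is hypothetical.

```python
import re

ROWS = "ABCDEFGH"  # assumed 96-well layout
N_COLS = 12


def plate_matrix(reads_per_well):
    """Arrange {well_id: read_count} into an 8 x 12 grid; missing wells stay 0."""
    grid = [[0] * N_COLS for _ in ROWS]
    for well, count in reads_per_well.items():
        match = re.fullmatch(r"([A-H])(\d{1,2})", well.strip().upper())
        if match is None or not 1 <= int(match.group(2)) <= N_COLS:
            raise ValueError(f"unrecognised well ID: {well!r}")
        row = ROWS.index(match.group(1))
        col = int(match.group(2)) - 1
        grid[row][col] = count
    return grid
```

The resulting grid can be handed to any heatmap function, one plate at a time.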

igv-notebook

Use a jupyter notebook with papermill to explore read data in IGV

Add interactive sample map

using ipyleaflet as in https://github.com/anopheles-genomic-surveillance/selection-atlas

BCL Conversion Failing.

The workflow throws "ChildIOException: File/directory is a child to another output:" when provided with an Illumina data folder.

Implement coverage notebook

papermill / nb style
Coverage at each amplicon
Coverage at each SNP

add notebook to add 'remove-input' tag to papermill parameter cells

Implement allele frequency calculation and visualisation

Calculate allele frequencies
Plot as heatmap with plotly
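For the calculation step, a rough sketch for one biallelic site is shown below. The genotype encoding (tuples of 0 for reference, 1 for alternate, `None` for missing) and the function name are assumptions made for illustration.

```python
def alt_allele_frequency(genotypes):
    """Alternate allele frequency at one biallelic site.

    `genotypes` is a list of (allele1, allele2) tuples coded 0 = ref, 1 = alt,
    None = missing. Returns None when no alleles were called.
    """
    called = [a for gt in genotypes for a in gt if a is not None]
    if not called:
        return None
    return sum(1 for a in called if a == 1) / len(called)
```

Applying this per cohort and per SNP would give the frequency matrix that the heatmap needs.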

Implement gatk

Replace mpileup? eventually.
For now keep in gatk.smk

introduce snakemake checkpoint for samples with no data

Sometimes samples will have zero data, which makes the pipeline fail.

To handle this, the pipeline should start with a checkpoint which evaluates which samples actually have data and runs the rest of the pipeline on those samples only.
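The check itself could be a small helper that the checkpoint calls. The sketch below assumes gzipped FASTQ files named `{sample}_1.fastq.gz`; that naming, the function name, and the demo files are all hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def samples_with_reads(samples, reads_dir, suffix="_1.fastq.gz"):
    """Return the samples whose FASTQ file exists and holds at least one line."""
    keep = []
    for sample in samples:
        fastq = Path(reads_dir) / f"{sample}{suffix}"
        if not fastq.exists():
            continue
        with gzip.open(fastq, "rt") as handle:
            if handle.readline():
                keep.append(sample)
    return keep


# Small demonstration: one populated file, one empty file, one missing file.
with tempfile.TemporaryDirectory() as tmp:
    with gzip.open(Path(tmp) / "s1_1.fastq.gz", "wt") as fh:
        fh.write("@read1\nACGT\n+\nIIII\n")
    with gzip.open(Path(tmp) / "s2_1.fastq.gz", "wt") as fh:
        pass  # valid gzip file, but no reads
    kept = samples_with_reads(["s1", "s2", "s3"], tmp)
```

Only the first line is read, so the check stays cheap even for large files.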

Make tutorial video

on Docs book home-page

add multiqc rule to collect qc data

It's a really useful tool: https://multiqc.info/
Example usage in a snakemake workflow here: https://github.com/sanjaynagi/rna-seq-pop/blob/master/workflow/rules/qc.smk

make bwa index output files explicit

set standards for bed file, and accompanying tsv file

What columns are required?

Fix jupyter results book plots not showing

Need to use a different conda env (try the selection-atlas env).

Pin python version in conda environments

Pin to python=3.8/3.9

Download previous AgamDao data and evaluate genotyping accuracy

Need to download a larger subset of data (at least ~100-200 samples?)
Find the corresponding Ag1000G sample IDs so we can match them up
Evaluate genotype concordance

add links to fastp and fastqc htmls within AmpSeeker results book

Build private web page of all results with Jupyter book

@ChabbyTMD @eddUG

I was having a think, and it should be possible to use Jupyter-book within snakemake to build a private website containing all of the workflow results for each user's analysis.

This would be really cool: you wouldn't even need to look in the results folder, you could simply open the webpage and explore all the results that way. If we make the analyses in papermill/Jupyter notebooks, this shouldn't be too complicated!

(I'll explain in our next meeting, but jupyter-book can basically take a set of Jupyter notebooks as input and build a website from them.)

fix logo

add plasmodium/bacteria/virus and mosquito

Automatic merging of vcf files depending on sample size

Currently, the pipeline has a rule which splits the bcftools merge step into two groups, because bcftools merge fails when run on more than 1000 samples. This also means the pipeline can fail if there are fewer than 1000 samples.

We should automate this, i.e. merging is only done in two steps if there are more than 1000 samples.
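One way to automate the decision is to compute the merge batches from the sample list, as in the sketch below. The function name is hypothetical, and the 1000-sample limit is taken from the description above.

```python
def merge_batches(samples, max_per_merge=1000):
    """Split samples into batches small enough for a single bcftools merge.

    Returns one batch when everything fits, so no second merge step is needed.
    """
    samples = list(samples)
    return [
        samples[i:i + max_per_merge]
        for i in range(0, len(samples), max_per_merge)
    ]
```

If the function returns a single batch, the workflow can merge once; otherwise each batch is merged separately and the intermediate files are merged in a final step.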

move result book from docs/ to results/

It is not documentation...

IGV notebook not working in results book

bcl convert not available through conda

Need to change to bcl2fastq.

Make documentation webpage

As well as the results Jupyter book, we also want a public Jupyter book containing the documentation, for things like:

setting up AmpSeeker (config, input files)
contributing
troubleshooting

This will be hosted within the AmpSeeker repo using github pages.

We need to ensure we are implementing some sensible QC steps; I don't quite know what these would look like at the moment.

On called variants
Should be a jupyter notebook run by papermill, plotting done with plotly to be interactive

move input file list to common.smk

We should keep the snakefile tidy, and so have an input function for rule all, which resides in a rule file called common.smk. This is good practice in snakemake.

This function will determine which output files we want to produce, based on the config.yaml
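A sketch of what such an input function might look like follows. The `Modules`, `AgamDao`, and `GRC1` keys mirror the config idea from the AgamDao modules issue; the function name and every output path are hypothetical placeholders.

```python
def final_outputs(config):
    """Return the target files for `rule all`, given the parsed config.yaml."""
    # Hypothetical target that is always produced.
    targets = ["results/notebooks/coverage.ipynb"]

    modules = config.get("Modules", {})
    if modules.get("AgamDao", False):
        # Hypothetical protocol-specific output (Anopheles).
        targets.append("results/notebooks/species-id.ipynb")
    if modules.get("GRC1", False):
        # Hypothetical protocol-specific output (Plasmodium).
        targets.append("results/notebooks/grc1.ipynb")
    return targets
```

In the Snakefile, `rule all` would then only need `input: final_outputs(config)`, keeping the target logic in common.smk.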

change all plotly plots to white background

e.g. px.box(df, x='x', y='y', template='simple_white')

Looks a lot better than the grey.