
HBVouroboros automates sequencing-based HBV genotyping and expression profiling

License: GNU General Public License v3.0



HBVouroboros uses RNA-sequencing reads to infer HBV genotype, quantify HBV transcript expression, and perform variant calling of HBV genomes.

HBVouroboros, distributed under the GPL-3 license, is available at https://github.com/bedapub/HBVouroboros.

Installation and usage

Download the source code

git clone https://github.com/bedapub/HBVouroboros.git

Set up the conda environment

## setup conda environment
cd envs; conda env create; cd -
## in case it has been installed, use the command below to update
## conda env update
conda activate HBVouroboros

Run an example

An out-of-the-box example can be run by starting the snakemake pipeline.

snakemake -j 99 --configfile config/config_template.yaml --use-envmodules ## use --use-conda if no R module is present

Run the pipeline with your own data

Create a config file by copying the template.

cp config/config_template.yaml config/config.yaml

Next, modify the config/config.yaml file to specify a sample annotation file, and make other changes if necessary.

Run HBVouroboros using unmapped reads from a Biokit output directory

This feature is currently disabled. It may be reactivated in the future.

Validating the sensitivity and specificity of HBVouroboros with RNAsim2

We created RNAsim2, an RNA-seq simulator, to validate the sensitivity and specificity of HBVouroboros. See RNAsim2/README.md for details.

Known issues and solutions

What to do if conda environment initialization takes too long?

The instructions above use the default conda solver. If conda is too slow, consider using mamba, which is a drop-in replacement for conda.

If you encounter other issues, please report them using the Issues function of GitHub.


hbvouroboros's Issues

One needs to touch bowtie index file

Adding pytests and reproducible examples

Add pytests to make sure that the functions do what they are expected to do, and add reproducible examples to demonstrate how to run the pipeline.

Adding user-friendly report

Use R Markdown and/or Jupyter notebooks to add a user-friendly report that includes:

mapping statistics
genotyping
visualization of genomes and features
SNP/SV information

See the biokit pipeline for inspiration.

Add fastqc and multiqc report for FASTQ/BAM files

Following the example of mpsnake

Remove wrappers and expose the workflow as a repository

Currently, the Snakemake workflows are wrapped by Python wrappers. For instance, HBVouroboros_build_refgenomes.py wraps build_refgenomes/Snakefile in the HBVouroboros package. This works, but it causes significant development overhead, because the HBVouroboros package must be reinstalled after every change for the change to take effect.

An alternative is to drop the package and expose the workflow directly in a repository, as suggested by the Snakemake documentation. The internal mpsnake pipeline provides an example as well.

Merge AA variant calling results into Snakemake pipeline

Mapped read files are empty

I tried to run the pipeline with a dataset, but had several problems.

First I ran the pipeline without adjusting the config, except for the sample annotation file. After that I tried setting doPerSamp to True. With both configurations the following files are empty and the pipeline fails (see the error message below).

02_Sample_mapped_reads_1.fq.gz
02_Sample_mapped_reads_2.fq.gz

# running normalization on reads: $VAR1 = [
          [
            '/gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_1.fq.gz'
          ],
          [
            '/gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_2.fq.gz'
          ]
        ];


Tuesday, May 23, 2023: 16:10:39 CMD: /gpfs/scratchfs01/site/u/ferraing/conda/envs/HBVouroboros/opt/trinity-2.12.0/util/insilico_read_normalization.pl --seqType fq --JM 10G  --max_cov 200 --min_cov 1 --CPU 1 --output /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/perSamp_trinity/02_Sample/trinity/insilico_read_normalization --max_CV 10000  --left /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_1.fq.gz --right /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_2.fq.gz --pairs_together  --PARALLEL_STATS
-prepping seqs
Converting input files. (both directions in parallel)CMD: seqtk-trinity seq -A -R 1  <(gunzip -c /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_1.fq.gz) >> left.fa
CMD: seqtk-trinity seq -A -R 2  <(gunzip -c /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_2.fq.gz) >> right.fa
Error, no records were correctly parsed from /dev/fd/63Thread 1 terminated abnormally: Error, cmd: seqtk-trinity seq -A -R 1  <(gunzip -c /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_1.fq.gz) >> left.fa died with ret 1280 at /gpfs/scratchfs01/site/u/ferraing/conda/envs/HBVouroboros/opt/trinity-2.12.0/util/insilico_read_normalization.pl line 793.
Error, no records were correctly parsed from /dev/fd/63Thread 2 terminated abnormally: Error, cmd: seqtk-trinity seq -A -R 2  <(gunzip -c /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_2.fq.gz) >> right.fa died with ret 1280 at /gpfs/scratchfs01/site/u/ferraing/conda/envs/HBVouroboros/opt/trinity-2.12.0/util/insilico_read_normalization.pl line 793.
Error, conversion thread failed at /gpfs/scratchfs01/site/u/ferraing/conda/envs/HBVouroboros/opt/trinity-2.12.0/util/insilico_read_normalization.pl line 336.
Error, cmd: /gpfs/scratchfs01/site/u/ferraing/conda/envs/HBVouroboros/opt/trinity-2.12.0/util/insilico_read_normalization.pl --seqType fq --JM 10G  --max_cov 200 --min_cov 1 --CPU 1 --output /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/perSamp_trinity/02_Sample/trinity/insilico_read_normalization --max_CV 10000  --left /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_1.fq.gz --right /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/results/02_Sample_mapped_reads_2.fq.gz --pairs_together  --PARALLEL_STATS   died with ret 7424 at /home/ferraing/scratch/conda/envs/HBVouroboros/bin/Trinity line 2869.
        main::process_cmd("/gpfs/scratchfs01/site/u/ferraing/conda/envs/HBVouroboros/opt"...) called at /home/ferraing/scratch/conda/envs/HBVouroboros/bin/Trinity line 3422
        main::normalize("/gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FD"..., 200, ARRAY(0x55f6c69af7b0), ARRAY(0x55f6c69af7f8)) called at /home/ferraing/scratch/conda/envs/HBVouroboros/bin/Trinity line 3362
        main::run_normalization(200, ARRAY(0x55f6c69af7b0), ARRAY(0x55f6c69af7f8)) called at /home/ferraing/scratch/conda/envs/HBVouroboros/bin/Trinity line 1384
[Tue May 23 16:10:39 2023]
Error in rule run_trinity_perSamp:
    jobid: 125
    input: results/02_Sample_mapped_reads_1.fq.gz, results/02_Sample_mapped_reads_2.fq.gz
    output: results/perSamp_trinity/02_Sample/trinity/Trinity.fasta

I also tried to set doInputRef and doPerSamp to True, but then the pipeline couldn't start at all.

MissingInputException in rule get_ref_strain_gb_inpt in file /gpfs/scratchfs01/site/u/ferraing/projects/2023-05-HBV-SNV-FDA-PS-13785/test/HBVouroboros/workflow/rules/align_reads.smk, line 335:
Missing input files for rule get_ref_strain_gb_inpt:
    output: results/inpt/inpt_strain.gb
    affected files:
        AB064313

As a reference I used the sampleAnnotation file under the .test folder and with this file the pipeline always worked.
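
A quick way to confirm that the mapped-read files contain no records before Trinity is started is sketched below in Python. It is only an illustration and not part of the pipeline: the helper names `count_fastq_records` and `empty_fastq_files` are hypothetical, and the sketch assumes standard four-line FASTQ records in gzip-compressed files.

```python
import gzip


def count_fastq_records(path, limit=None):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record.

    If `limit` is given, stop reading once that many records were seen.
    """
    n_lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
            if limit is not None and n_lines >= 4 * limit:
                break
    return n_lines // 4


def empty_fastq_files(paths):
    """Return the paths (as strings) that hold no complete FASTQ record."""
    return [str(p) for p in paths if count_fastq_records(p, limit=1) == 0]
```

Calling `empty_fastq_files` on `results/02_Sample_mapped_reads_1.fq.gz` and `results/02_Sample_mapped_reads_2.fq.gz` would list whichever of the two files has no reads. That shows whether the failure comes from the mapping step or from the Trinity step.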

multiqc use `--force` option

To overwrite existing output files, use the --force option in the multiqc rules.

Use prespecified reference strain for variant calling and reporting

Currently, the reference strain is inferred in a data-driven manner.

For reporting to regulatory agencies, we also want to be able to specify a reference strain as the template against which variant calling is performed (instead of using only the inferred strain).

vcf files should remove/overlay duplicated genome parts

Refactoring based on version 0.8-0

Following commit 38d81d8:

Clean up code and imports
Lint
Document functions

Export unmapped reads as BAM files

Currently, we only export mapped reads as BAM files. For debugging purposes, we could also export unmapped reads as BAM files.
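
The distinction between mapped and unmapped reads rests on the SAM FLAG field, where bit 0x4 marks an unmapped segment. The Python sketch below illustrates that criterion on SAM text lines. It is an assumption-laden illustration rather than the pipeline's code: the function names are hypothetical, and a real export would more likely use a dedicated BAM tool.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(flag):
    """Return True if the SAM FLAG value has the unmapped bit set."""
    return bool(flag & UNMAPPED_FLAG)


def split_sam_records(lines):
    """Split SAM text lines into (header, mapped, unmapped) lists.

    Header lines start with '@'; for alignment lines the FLAG is the
    second tab-separated field.
    """
    header, mapped, unmapped = [], [], []
    for line in lines:
        if line.startswith("@"):
            header.append(line)
            continue
        flag = int(line.split("\t")[1])
        (unmapped if is_unmapped(flag) else mapped).append(line)
    return header, mapped, unmapped
```

Writing the `unmapped` records out together with the header would give the debugging file described above.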

Issues to be solved before publication

first priority

Convert the steps in build_refgenomes.smk to a separate Python script. The user only needs to run it when an update of HBVdb is needed.
Improve the visuals of the HTML report with CSS.
Make sure the Docker image is correctly built and pushed

second priority

rename biokit to bksnake
document correct_bam
check consistency: use shell directly if necessary, not run: shell(shell), for instance in varscan_vc.smk
separate snakemake (.smk) files from python functions (.py)
Update reference data in repository

Documenting the pipeline

We can use Sphinx to document the HBVouroboros pipeline.

Change Rplot.R to python code

Set up Docker environment for HBVouroboros

Set up a GitHub Action to automate the testing

The .test directory already includes test data. We can use it to automate the testing.

Check whether all required information is shown in the HTML report

If not, add what is missing.

Make RNAsim2 part of the snakemake pipeline

multiqc with --force

I would use the --force option in multiqc. If one runs the same data set several times without this option, multiqc creates a new folder with a different name each time, which makes it impossible to fulfill the rule for the multiqc HTML report.
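
As a rough illustration of the suggestion, the Python sketch below assembles a multiqc command line with `--force` so that repeated runs overwrite the same report. The function name `multiqc_command` and the directory names are hypothetical. In the pipeline itself the flag would simply be added to the shell command of the multiqc rules.

```python
import shlex


def multiqc_command(input_dir, output_dir, force=True):
    """Build a multiqc command as an argument list.

    With force=True, '--force' is appended so that existing reports in
    output_dir are overwritten instead of a renamed copy being created.
    """
    cmd = ["multiqc", input_dir, "--outdir", output_dir]
    if force:
        cmd.append("--force")
    return cmd


# Hypothetical directories, shown only to display the resulting command.
print(shlex.join(multiqc_command("results/fastqc", "results/multiqc")))
```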

Allow genotype inference on either the study level or the sample level

Currently, genotype inference is done at the study level. For clinical samples, it makes sense to allow inference at the sample level.

We would like to specify this option in the config file, giving the users a choice to do it either in a per-study way or in a per-sample way.

Use synthesized data to demonstrate sensitivity and specificity of the analysis using HBVDB

We start with read files of human hepatocytes. Then we add synthesized read data by randomly drawing from HBV reference genomes (potentially with a fixed probability of point mutations). We can validate the sensitivity and specificity of both gene expression and genotyping in this way.
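
The read-drawing step could look roughly like the Python sketch below. It is a minimal illustration under stated assumptions, not the RNAsim2 implementation: the names `mutate` and `draw_reads` are hypothetical, reads are single-end and error-free apart from the point mutations, and the genome is treated as circular so that reads may wrap around its end.

```python
import random

BASES = "ACGT"


def mutate(seq, mut_prob, rng):
    """Replace each base by a different base with probability mut_prob."""
    out = []
    for base in seq:
        if rng.random() < mut_prob:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def draw_reads(genome, n_reads, read_len, mut_prob=0.0, seed=0):
    """Draw reads from uniformly random starts on a circular genome.

    Returns a list of (start, read) tuples; keeping the start position
    records the ground truth for later evaluation.
    """
    if not genome:
        raise ValueError("genome must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    size = len(genome)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(size)
        # Modular indexing wraps around the end of the circular genome.
        template = "".join(genome[(start + i) % size] for i in range(read_len))
        reads.append((start, mutate(template, mut_prob, rng)))
    return reads
```

Mixing such reads into the hepatocyte read files at known proportions gives a ground truth against which detected expression and inferred genotype can be compared.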
