mjmlab / pyinseq Goto Github PK

View Code? Open in Web Editor NEW

6.0 6.0 6.0 113.99 MB

Python pipeline for analyzing INSeq Insertion Sequencing data

License: BSD 3-Clause "New" or "Revised" License

Python 99.61% Makefile 0.39%

inseq tn-seq snakemake-workflow genbank-files

pyinseq's Introduction

Lightweight python package to map transposon insertion sequencing (INSeq) data in bacteria.

Quick start

This section is meant for users who know their way around terminal and conda. To use pyinseq, create a virtual environment with python 3.7 and install pyinseq using conda.

$ conda install -n base -c conda-forge mamba
$ conda create -n pyinseq-py37 python=3.7
$ conda activate pyinseq-py37
(pyinseq-py37) $ conda install -c bioconda bowtie
(pyinseq-py37) $ pip install pyinseq

Verify your installation with --test

(pyinseq-py37) $ pyinseq --test

Now you can run pyinseq!

(pyinseq-py37) $ pyinseq -i <input file> -s <sample file> -g <genbank file> -e <experiment name>

Overview of Command Line Operation
Output description
Notes on output
Background
User guide
Installation
FAQ
License

Overview of Command Line Operation

Basic operation and a short description of the files are listed here. Below are detailed descriptions and links to example input files.

$ pyinseq -i <input file> -s <sample file> -g <genbank file> -e <experiment name>

-i / --input

Illumina reads file in FASTQ or gzipped-FASTQ format.

-s / --samples

Sample list where each line is the sample_name <tab> barcode (4bp).

-g / --genome

Concatenated GenBank File for all contigs for the genome.

-e / --experiment

All results files will be created in a subfolder with this name.

Snakemake

--get_default_config

Creates a default configuration file for running pyinseq

--config_format

File format for configuration file (yaml or json)

-c/ --config

Configuration file for running a pyinseq workflow. Every other argument will be ignored.

--test

Runs pytest on installed pyinseq software.

Optional arguments

--gff

Generate genome output file in gff3 format.

-d / --disruption

Five-prime fraction of gene (0.0 - 1.0) that must be disrupted for the hit to be counted in the summary_gene_table. Often insertions at the 3' end of a gene do not disrupt function so it may be of interest to run the pipeline with a disruption value of 0.8 or 0.9. [default: 0.9]

--min_count

Minimum number of reads per insertion site. [default: 3]

--max_ratio

Maximum ratio of left:right or right:left reads per insertion site. [default: 10]

--barcode_length

Length of barcode index that is expected in samples. [default: 4]

--transposon_seq

DNA sequence of transposon that flanks reads. [default: 'ACAGGTTG']

-t / --threads

Number of cores to use for execution [default: the CPU count of the computer]

--snakemake_params

Additional parameters that will get passed to snakemake.

Output description

File	Description
`<experiment name>-config.yml`	Configuration file with run parameters
`results/summary_gene_table.txt`	summary for entire experiment (values in counts-per-million, cpm)
`results/<sample>_sites.txt` (for each sample)	Counts of each insertion in each sample
`results/<sample>_genes.txt` (for each sample)	Counts of each insertion mapped to genes
`results/<sample>_bowtie.txt` (for each sample)	Bowtie mapping results
`results/<sample>_trimmed.fastq` (for each sample)	Demultiplexed fastq reads trimmed for the chromosome sequence only
`results/log.txt`	text printed to console
`results/samples_info_yml`	basics stats for each sample
`results/genome_lookup/genome.fna`	genome fasta nucleotide file
`results/genome_lookup/genome.ftt`	genome feature table
bowtie indexes	index files created from genome by bowtie
`results/raw_data/<sample>.fastq` (for each sample)	demultiplexed files for each sample/barcode
`results/raw_data/_other.fastq`	demultiplexed files for unrecognized barcodes

Notes on output

T50

The minimum number of transposon insertion sites in the sample that account for at least 50% of the samples's reads. Used as a crude measure to detect bottlenecks, when comparing output T50 to the library input T50, or when comparing biological or technical replicates.

samples_info.yml

Contains basic information of each sample in the experiment.

log.txt

Log file from pyinseq, including the messages from the snakemake execution. This includes the order of steps taken during the pyinseq execution.

<experiment name>-config.yaml

Configuration file with run parameters.

Background

pyinseq was inspired by the software published by Goodman et al. (2011). There are a number of differences, some of which are noted here. Most are fairly superficial in that they are intended to increase automation and reproducibility but do not materially affect the results. One exception is the first point, which does affect the resulting output.

As was conducted in Brooks et al. (2014) transposon site data are normalized across all replicons (chromosomes/plasmids/contigs) for calculation of a counts-per-million (cpm) per sample.
The user's file of reads is demultiplexed into separate .fastq files for each barcoded read.
The user provides a GenBank file, and pyinseq generates a fasta nucleotide file (.fna) and gene feature table (.ftt) that are formatted and named properly for bowtie.
The gene feature table (.ftt) is comparable to the protein feature table (.ptt) but it also includes RNA genes.
If pyinseq is installed from conda, then bowtie is installed from conda as a dependency.
At the end of the analysis, results are aggregated into a tab-delimited table and sample info is summarized.
pyinseq is written primarily in Python and uses snakemake as the workflow manager. You probably figured that out already.

User guide

Input files description

Illumina sequencing reads [-i] for the experiment. File can be uncompressed (.fastq or .fq) or gzip-compressed (.fastq.gz or .fq.gz).

Example input file

@DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0:
GAAGCGACACCGACGACGGTAACAGGTTGGATGATAAGTCCCCGGTCTTCG
+
CCCFFFDDHHHHHCGHHHIJDHIIJJFHHJIJJJJIJJDHIJJHIAGIJJJ
...

Sample file [-s] describing the sample names and barcodes. Sample names should be restricted to letters, numbers, dash (-), and underscore (_), with a tab between the sample and the barcode in each row of a text file. It is recommended that the file be prepared in a text editor to ensure that additional hidden characters are not introduced. BBEdit is one option for Mac.

Microsoft Excel can export tab-delimited files (.tsv), but do not use Microsoft Word for this purpose.

Example sample file

E001_01 GAAG
E001_02 CTTT

GenBank file [-g] listing the features and DNA sequence for the organism. If the organism has multiple chromosomes/contigs in the sequence the file they should be concatenated into a single file. Ensure that the double slash // at the end of the file remains to separate each contig.

Files from NCBI GenBank often work where the corresponding files from NCBI RefSeq do not. Feel free to contact us with any questions here.

Example GenBank file

LOCUS       CP000020             2897536 bp    DNA     circular BCT 02-APR-2008
DEFINITION  Vibrio fischeri ES114 chromosome I, complete sequence.
ACCESSION   CP000020
VERSION     CP000020.2  GI:171902228
KEYWORDS    .
SOURCE      Vibrio fischeri ES114
...
FEATURES             Location/Qualifiers
     source          1..2897536
                     /organism="Vibrio fischeri ES114"
                     /mol_type="genomic DNA"
                     /strain="ES114"
                     /db_xref="taxon:312309"
                     /chromosome="I"
     gene            complement(313..747)
                     /gene="mioC"
                     /locus_tag="VF_0001"
                     /old_locus_tag="VF0001"
     CDS             complement(313..747)
                     /dnas_title="FMN-binding protein MioC"
                     /gene="mioC"
                     /locus_tag="VF_0001"
                     /old_locus_tag="VF0001"
                     /codon_start=1
                     /transl_table=11
                     /product="FMN-binding protein MioC"
                     /protein_id="AAW84496.1"
                     /db_xref="GI:59478709"
                     /translation="MKKVSIITGSTLGGAEYVGDHLADLLEEMDFSTDIHNQPNLDDI
                     DIDSLWLLVVSTHGAGDYPDNIKPFIQQLESVTQPLSSVEFAVVAIGDSSYDTFCAAG
                     KSLQNTLKEHGAIEKYPLLEIDVTQNSIPEEPAELWLKQHIC"
...
ORIGIN
        1 aagatcactt aatatatata agatctttta aagagatctt ttattagatc tattatatag
       61 atcgtcgatc tctgtggata agtgataaat gatcaatagg atcatatact ttagatggat
      121 ccaaagttgt tatctttctt tgatcttcga tcggacagct tgaggacaaa agagttagtt
      181 atccacaagg ggggagggcg ttagatctta ttcaatggat aactataact tgatcactgg
      241 atctttctat agttatccac atagtaggta tcatctattt aataactttt atagatcgga
      301 caacactttt tattaacaaa tgtgttgttt tagccacaat tctgctggtt cttcagggat
      361 actattttga gttacatcta tctctaatag agggtacttt tcgatagcgc catgctcttt
      421 taaggtattt tgaagtgatt ttcctgctgc gcagaaagtg tcataacttg aatcaccgat
      481 agcaacaaca gcaaattcaa cgctcgataa tggctgagtc acgctttcta actgttgaat
      541 aaatggttta atgttgtcag ggtaatcacc agcaccgtga gttgatacaa cgagtaacca
      601 taagctatca atatcaatat catctaaatt tggttgatta tgaatatcgg tggaaaaatc
      661 catttcttct aataaatcag caagatggtc accaacatac tcagcaccac ctagagtgct
      721 tcctgtaata atagatactt ttttcatgaa tttatcctat aaaaatataa aaaatgggcc
      781 tacataggcc cattattaat cttattaata ttggttttat ttaccaatac agaatgaagt
...
//
LOCUS       CP000021             1330333 bp    DNA     circular BCT 02-APR-2008
DEFINITION  Vibrio fischeri ES114 chromosome II, complete sequence.
...

General Usage

Pyinseq uses snakemake to execute the following workflow:

pyinseq

Demultiplex a file of FASTQ reads.
Write separate trimmed versions of the files (no barcodes, no transposon sequence).
Map the trimmed reads to the genome.
Quantify insertions per site and per ORF in the genome.
Output summary gene table and report files describing the dataset.

The benefit of using snakemake is that it allows for parallel execution if more than one thread is provided. The modularity of snakemake enables future feature implementation.

Run Parameters (Config File)

pyinseq writes a configuration file (<experiment name>-config.yaml) into the working directory, which holds the parameters for the run. For the example command below, the pyinseq-config.yaml file will store the parameters used for this run.

$ pyinseq -i reads.fastq -s samples.txt -g genome.gbk -e pyinseq --threads 4

Example Config File

$ cat demo-run-config.yaml
snakemake_params: []
barcode_length: 4
command: pyinseq
config: false
config_format: yaml
disruption: 0.9
experiment: demo-run
genome: genome.gbk
get_default_config: false
gff3: false
input: reads.fastq.gz
max_ratio: 10
min_count: 3
samples: samples.txt
threads: 4
transposon_seq: ACAGGTTG

If you installed conda in your terminal, you can use the option --use-conda which will install bowtie during the executuion of the workflow.

$ pyinseq -i reads.fastq -s samples.txt -g genome.gbk -e demo-run --threads 4 --snakemake_params --use-conda

You can also get a default configuration file by using --get_default_config and modify it using a text-editor.

$ pyinseq --get_default_config
$ ls
default-config-pyinseq.yaml

Make sure that all file paths in the configuration are correct

Specialized tasks

These commands are useful when combining samples from multiple Illumina runs in a single pyinseq analysis. Both commands will also create a configuration file specific to each subcommand.

pyinseq demultiplex

Demultiplex a file of Illumina reads.
Writes separate trimmed versions of the files (no barcode, no transposon sequence) unless the optional --notrim flag is added.

$ pyinseq demultiplex -i <input file> -s <sample file> -e <experiment name>

pyinseq genomeprep

Prepare the fasta nucleotide (.fna) and feature table (.ftt) files from a GenBank file.
Is a good quick test to run on new GenBank files.
Generates bowtie indexes unless the --noindex flag is added.
Optionally create GFF3 file with the -gff flag (for use in other programs).

$ pyinseq genomeprep -g <genbank file> -e <experiment name>

Similar to pyinseq main usage, you can run these commands using a configuration file and pass additional snakemake options.

Installation

Requirements

pyinseq was written and tested using a MacOS (or Linux-based) operating system.

MacOS or Linux-based operating systems.
Python 3.7
bowtie (v1.0.0)
snakemake

pyinseq has not being tested on Windows operating systems, but as of Windows 10 there is support for terminals with Ubuntu

Also note that pyinseq uses bowtie; not bowtie2, which is a different software.

Installation in a conda environment (recommended)

Conda is a command-line package manager that can create virtual environments with all the necessary dependencies to run pyinseq. You can acquire conda by installing Anaconda and all the packages it brings, or by installing its lightweight version called miniconda. We recommend miniconda since it installs a minimal number packages, and is actually all you need to run pyinseq

Once conda is installed, you can verify it by running:

$ conda --help

Installing mamba

mamba is now required for snakemake. Install it system-wide with this command. The subsequent steps are shown using conda since that is more common for the scientific audience, but can largely be accomplished in mamba if the user desires.

conda install -n base -c conda-forge mamba

Creating a virtual environment

A virtual environment is an isolated computational space where you can install dependencies and software without affecting the base operating system's configuration. We can use conda to create a virtual environment with python 3.7

$ conda create -n pyinseq-py37 python=3.7

To activate your environment:

$ conda activate pyinseq-py37

You should see the name of your environment surrounded by parentheses in your terminal prompt.

(pyinseq-py37) $

Install `bowtie`:

(pyinseq-py37) $ conda install -c bioconda bowtie

Install `pyinseq`:

Install the stable version:

(pyinseq-py37) $ pip install pyinseq

Or, install the most recent version from GitHub:

(pyinseq-py37) $ pip install git+https://github.com/mjmlab/pyinseq

Verify your installation with --test

(pyinseq-py37) $ pyinseq --test


Verify that pyinseq installed correctly by running:

```bash
(pyinseq-py37) $ pyinseq --help
2021-05-26 13:10 - INFO - pyinseq - Process command line arguments
usage: pyinseq [-h] [--get_default_config] [--config_format CONFIG_FORMAT]
               [-c CONFIG] [-t THREADS] [--snakemake_params ...] [-v]
               [-i INPUT] [-s SAMPLES] [-e EXPERIMENT] [-g GENOME]
               [-d DISRUPTION] [--min_count MIN_COUNT] [--max_ratio MAX_RATIO]
               [--barcode_length BARCODE_LENGTH]
               [--transposon_seq TRANSPOSON_SEQ] [--gff3]
               {demultiplex,genomeprep} ...
...

Now you are ready to run pyinseq!

Testing

You can test your installation of pyinseq by using the option --test.

(pyinseq-py37) $ pyinseq --test

If all tests pass then you are good to go!

FAQ

Why do I get errors in processing the GenBank file?

Ensure that the file is in GenBank and not RefSeq format.

How do I uninstall pyinseq?

You can do this two ways:

uninstalling pyinseq from the virtual environment

pip uninstall pyinseq

or completely remove the conda virtual environment.

conda env remove -n pyinseq-py37

Note: will need to conda deactivate first to leave the environment.

How can I notify of an issue with `pyinseq`

Please use the GitHub Issues.

License

Pyinseq is an open-source software licensed under BSD-3.

pyinseq's People

Contributors

Stargazers

Watchers

Forkers

rebeldroid12 ellarotman camillescott biobenkj eburgoswisc mandel01

pyinseq's Issues

`filterSortCounts()` sorting as strings

filterSortCounts() appears to be sorting nucleotides as strings (i.e., 150000 before 180). This is not a problem when mapping to the gene. This is a noteworthy bug when looking at output files so I'm marking it here. However, I think the whole architecture of the pipeline needs to be redesigned and therefore this should get examined at that point.

Automatically deploy docs to gh-pages

Consider travis-sphinx

Running the pipeline without sample list.

Hi,

Is it possible to run the pyinseq pipeline without sample list? My TnSeq data is already demultiplexed after the sequencing run so I cannot provide additional barcode sequences.

Bests,
Jan

Summary of results and logging

Need overall strategy for logging results into a text file and into the terminal.

Items that need to be documented or that need better documentation

Issue #20 changed to focus on appearance. This issue is for content that needs (better) documentation.

Barcodes are not being filtered out properly

Currently all 4 bp barcodes are being carried through the pipeline. Only barcodes in the sample list should be analyzed. Check: assign.py

Improve the modular design of the pipeline

Typical use case is that a new set of Illumina reads arrives. They immediately need to be demultiplexed. In some cases samples from a single run are analyzed; in other cases a combination of old and new samples are compared for analysis. The current setup requires manual steps to first demultiplex the new data and to then select which samples for the combined analysis. Goal of this issue is to further modularize the pipeline to fully separate demultiplexing, mapping, and analysis. Although these are somewhat separate currently, the code to run them is still too intermingled. For example, they use one strain list that is not passed from function to function but instead just runs in a long script. Goal here is to have these as completely separate functions in pyinseq.py.

Here is a scaffold I have started playing with on a separate branch called modular (will push soon).

For a basic analysis in which all samples are from the same run, first the sample list and paths to the relevant files will be written to a config file (by the function tentatively called pipeline_config()). This will be copied into the beginning of a log for the run as mentioned previously (https://github.com/mandel01/pyinseq/issues/7).

The samples will then be demultiplexed and the results of that will be written to separate files that accompany each demultiplexed read (and added to the log).

The mapping will be conducted on each demultiplexed file against the genome file specified in the initial configuration file. The results of the mapping will be added as separate files with the mapping outputs (and added to the log).

Finally, analyses will be conducted as specified in the config file.

This framework should form the basis for specifying samples for analysis and for combining samples from different Illumina runs in future workflows.

def pipeline_config():
    """organize list of samples; write dictionary-like file (yaml?) that other functions will read."""
    pass

def pipeline_demultiplex():
    """demultiplex samples"""
    pass

def pipeline_mapping():
    """inseq mapping"""
    pass

def pipeline_analysis():
    """analysis"""
    pass

New directory generator based on experiment name

Would it be worth automating a way to add like a '_1 or _2' (or even the date?) to the end of the experiment directory? Just warn the user that it exists and that the new one was created.

Unknown argument(s): {'colour': 'green'}

(pyinseq) naseersangwan@Naseers-MBP Bt_5951 % pyinseq -i GTAC-134320-RG_1.HL2C3DSX5_CCGCGGTTAT-ACAGCGCTAG_L002_R1.fastq -s bla.txt -g PROKKA_09032022.gbk -e BLA
2023-03-01 20:02 - INFO - pyinseq - Process command line arguments
2023-03-01 20:02 - INFO - utils - Writing configuration file from provided arguments
2023-03-01 20:02 - INFO - utils - Directory BLA already exists. Snakemake will create files that are missing.
2023-03-01 20:02 - INFO - pyinseq - Workflow selected: pyinseq
2023-03-01 20:02 - INFO - pyinseq - ***** BEGIN PYINSEQ WORKFLOW *****
(SNAKEMAKE INFO)
(SNAKEMAKE INFO)
2023-03-01 20:02 - INFO - demultiplex - Preparing to demultiplex reads
Error in job demultiplex while creating output files results/BLA/raw_data/Sample_1.fastq, results/BLA/Sample_1_trimmed.fastq.
RuleException:
Warning in line 84 of /Users/naseersangwan/Desktop/Conda/envs/pyinseq/lib/python3.6/site-packages/pyinseq/workflows/PyinseqWorkflow/Snakefile:
Unknown argument(s): {'colour': 'green'}
File "/Users/naseersangwan/Desktop/Conda/envs/pyinseq/lib/python3.6/site-packages/pyinseq/workflows/PyinseqWorkflow/Snakefile", line 84, in __rule_demultiplex
File "/Users/naseersangwan/Desktop/Conda/envs/pyinseq/lib/python3.6/site-packages/pyinseq/demultiplex.py", line 62, in demultiplex_fastq
File "/Users/naseersangwan/Desktop/Conda/envs/pyinseq/lib/python3.6/site-packages/pyinseq/logger.py", line 78, in init
File "/Users/naseersangwan/Desktop/Conda/envs/pyinseq/lib/python3.6/site-packages/tqdm-4.7.2-py3.6.egg/tqdm/_tqdm.py", line 427, in init
File "/Users/naseersangwan/Desktop/Conda/envs/pyinseq/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Exiting because a job execution failed. Look above for error message
2023-03-01 20:02 - INFO - pyinseq - ***** PYINSEQ WORKFLOW FINISHED *****

Re-running previously demultiplexed samples

Add a couple new options (--checkpoint or --restartdir?) to be able to go into the experiment folder and grab the samples.yml file for paths to demultiplexed data

Overall optimization

There are (at least!) two overall issues needed for optimization. First is more efficient file reading and writing. Second is a better overall plan for not going over the same data multiple times (likely involving numpy and/or pandas).

Will return to both of these once the full pipeline is available here and can be run with test data and timed.

Use Exception Class for Pyinseq specific errors

Many errors that Pyinseq outputs are specific to the computational workflow. To aid in debugging and development, a separate python class should be written that will inherit from Exception.

three_primeness math

The -d parameter filters by the location of the insertion within the gene.

Here is the current relevant code in process_mapping.py

if strand == "+":
    three_primeness = (nucleotide - start) / (end - start)
if strand == "-":
    three_primeness = (end - nucleotide) / (end - start)

First problem: denominator should be the length of the gene (i.e., 1 more than the current calculation). Suggested solution:

if strand == "+":
    three_primeness = (nucleotide - start) / length
else:
    three_primeness = (end - nucleotide) / length

Second problem: See example below for a transposon insertion at the 4th bp of a 9 bp gene:

+ strand gene:

5'-ATGTACTAG
      ^

^ first nt in the TA site

Current math = ((3 + start) - start) / 9 = 3/9. However, this should be 4/9, so need to add 1.

What about - strand genes? Note this is the same sequence just flipped to the other strand.

- strand gene:
5'-CTAGTACAT    (gene starts over here)
       ^

Current math = ((end - (end - 4) / 9) = 4/9. If we look at the sample gene and count from the 5' end (i.e., the reverse complement of the ATG on the right side), then it should be 5/9. However, for the sake of biology, we want insertions in the same position to be called the same way whether the gene is on the + or - strand. Therefore, I think that 4/9 is the right call here.

Proposal is for the insertion nucleotide to continue to represent the first position of a TA site (i.e., lowest genome coordinate) regardless of what direction the gene(s) are.

5'-TA
3'-AT
   ^
Always call the "first" nucleotide regardless of gene annotations.

However, when we are processing the site data to collect gene-level data, we will use the nt in the TA site that is closest to the 5' end of the gene.

+ strand gene:
5'-ATGTACTAG
      ^
      *

- strand gene:
5'-CTAGTACAT
       ^*

^ = site-level insertion point
* = gene-level three-primeness call

Therefore... proposed solution:

if strand == "+":
    three_primeness = (1 + nucleotide - start) / length
else:
    three_primeness = (end - nucleotide) / length

pip install for requirements.txt error

Running:
$ pip install requirements.txt
Collecting requirements.txt
Could not find a version that satisfies the requirement requirements.txt (from versions: )
No matching distribution found for requirements.txt

Results in the above error. sudo does not fix the issue either.

Handling of reads that begin with TA site

While analyzing OSU data, we noticed that this Tn-Seq library contains reads that begin with TA. We believe this presents an issue when pyinseq map the bowtie alignments to the sites sincepyinseq was designed for reads that end with TA sites. Here is an example of how two test reads map to the genome:

5 - AGTTGAACCTCGTCGTA - 3
5 - TCTCTTCAATGAAGTTA - 3

(AGTTGAACCTCGTCG) [TA]   ACTTCATTGAAGAGA
 TCAACTTGGAGCAGC  [AT]  (TGAAGTAACTTCTCT)

Bowtie reports the insertion start location (0-based) for the reads and pyinseq calculates the coordinate of the T in TA site with:

(+) insertion_NT = insertion_NT + read_length - 1
(-) insertion_NT = insertion_NT + 1

This algorithm holds if reverse alignment is located to the right of TA site or if forward alignment is on the left.

So for these site mappings:

Read 1 (+) : insertion_NT = 13407 + 17 - 1 = 13423
Read 2 (-): insertion_NT = 13422 + 1 = 13423

For both reads, TA site index would be 13423, which pyinseq uses to count left and right reads that map around this site.

However, when the reads start TA site, the code used to find this coordinate does not work.

Example using reverse complement of same reads:

5 - TACGACGAGGTTCAACT - 3
5 - TAACTTCATTGAAGAGA - 3

 AGTTGAACCTCGTCG  [TA]  (ACTTCATTGAAGAGA)
(TCAACTTGGAGCAGC) [AT]   TGAAGTAACTTCTCT

Read 1 is now aligned to the reverse strand, but its insertion coordinate (which points to the beginning of reading) does not flank the targeted TA site

Read 1 (-) : insertion_NT = 13407 + 1 = 13408
Read 2 (+): insertion_NT = 13422 + 17 - 1 = 13438

Sites file from a stable pyinseq run on reverse-complement fastq files confirms this because it identifies these two locations as independent sites, but in reality, they are not. In addition, sites have either only left or right flanking reads.

$ cat E001_01_reversed_sites.txt
CP000020	13408	0	11	11	43650.793650793654
CP000020	13438	1	0	1	3968.253968253968

Github displays wrong language

Github lists pyinseq as an HTML project (in searches and in the "bar" at the top of the main page) due to the documentation in the packaged bowtie files. According to this link we should "consider moving the files into one of the paths for vendored code, or use the manual overrides feature to ignore them."

How about changing the packages/ directory to be named /dependencies ? Is that enough?

Would also need to tell bowtie where to find itself in config.py.

Documentation

It would be ideal to have a user tutorial/development documentation setup so that we can quickly sort out who is working on what, progress, etc. Github also renders IPython notebooks now, so we could use that. Otherwise I suggest using readthedocs.org.

I like how the khmer project is set up.

See: http://khmer.readthedocs.org/en/v1.4.1/

Thoughts?

Writing more informative tests and reorganizing the repo so tests can query individual functions.

[This isn't as much as issue as some notes that are relevant for this repo. Will put into another form later!]

First, looking at coverage to guide where tests are needed:

$ coverage run scripts/pyinseq.py -i data/example/example01.fastq -s data/example/example01.txt -g data/example/ES114v2.gb -e example01

...

$ coverage report
Name                        Stmts   Miss  Cover
-----------------------------------------------
scripts/config.py              14      3    79%
scripts/demultiplex.py         98     26    73%
scripts/gbkconvert.py         106     10    91%
scripts/mapReads.py            29      2    93%
scripts/processMapping.py     101      4    96%
scripts/pyinseq.py            129      9    93%
scripts/utils.py               20      5    75%
-----------------------------------------------
TOTAL                         497     59    88%

Improve appearance of documentation

e.g. File formats (gbk and how it can have multiple contigs; ftt format).

Need pypi description

https://pypi.org/project/pyinseq/

Code testing and test datasets

Generate unit tests and attempt to catch potential errors within functions. Download test data from GEO?

Summary gene table output is incorrect, only includes last entry for each gene from *genes.txt

The summary_gene_table.txt output is not tabulating all of the relevant results from the *genes.txt files. In particular, the totals for the columns should be close to 1E6 (total of the counts per million, minus intergenic regions and regions excluded by -d). However, in some cases the total was much lower, ~15,000. It appears that instead of tabulating results in the summary gene table, only the last entry for each gene from each *genes.txt is being listed. For example:

SPL01_genes.txt:

contig	nucleotide	left_counts	right_counts	total_counts	cpm	three_primeness	locus_tag
CM001400	2279	11	6	17	3.611702084716934	0.996309963099631	VFSR5_0003
CM001400	4041	2	1	3	0.6373591914206354	0.4573643410852713	VFSR5_0004
CM001400	4919	6	2	8	1.6996245104550278	0.8577235772357723	VFSR5_0007
CM001400	4921	100	141	241	51.20118837745771	0.8550135501355014	VFSR5_0007
CM001400	5390	8	4	12	2.5494367656825414	0.21951219512195122	VFSR5_0007
CM001400	5833	28	5	33	7.010951105626989	0.5758928571428571	VFSR5_0008
CM001400	5872	6	5	11	2.336983701875663	0.5178571428571429	VFSR5_0008
CM001400	5907	11	5	16	3.3992490209100557	0.46577380952380953	VFSR5_0008
CM001400	6071	2	3	5	1.0622653190343923	0.22172619047619047	VFSR5_0008
CM001400	6176	6	1	7	1.4871714466481492	0.06547619047619048	VFSR5_0008

summary_gene_table:

Synonym	Code	COG	Product	SPL01
VFSR5_0001	-	-	FMN-binding protein MioC	0
VFSR5_0002	-	-	tRNA modification GTPase TrmE	0
VFSR5_0003	-	-	membrane protein insertase	3.61170208
VFSR5_0004	-	-	YidD	0.63735919
VFSR5_0005	-	-	ribonuclease P	0
VFSR5_0006	-	-	50S ribosomal protein L34	0
VFSR5_0007	-	-	cystine transport ATP-binding protein	2.54943677
VFSR5_0008	-	-	cystine transport system permease	1.48717145

Note that the genes with no entries or only one entry are correct. However, VFSR5_0007 should sum the total from the 3 entries (1.70, 51.20, 2.55); instead only 2.55 is reported in the table. Similar for VFSR5_0008.

Add dev branch

On my 4th Travis CI build and with my commits last week that should have been squashed first, I have made a mess of the master branch. I'm starting a dev branch for unfinished work and will work on having the master branch be a more linear record from here out.

Excel-generated tsv files don't work for demultiplex.py

Only the first barcode was read in.

Add to bioconda

disruption fraction (i.e., threePrimeness) is not working

Specifying the command line argument -d 0.9 should map all hits but then only count those for a particular gene if they fall in the 5'-most 0.9 (i.e., 90%) of the gene. This is not working right now.

Images for README.md

Initial thoughts for contributors guide.

First should look at how others do this, but wanted to get these thoughts down as I look through others' Python code.

Python PEP8 style -- not fanatical, but makes the code more readable and easier for others to contribute and debug.
Single quote for strings except for docstrings.
4 spaces (no tabs) - Python 3 requires consistency here.
Add documentation for a feature as you commit the feature.

CPM are binning at 10/100/1000

In the first few runs have not seen a gradual distribution of normalized counts (CPM), instead have samples binning near 10, 100, or 1000. Is this a float() error perhaps? Better tests (Issue https://github.com/mandel01/pyinseq/issues/3) should help address this.

File reading in demultiplex is slow

Reading fastq files into pyinseq is one of the most slow and memory-heavy processes. However, using yield keyword allows us to create a generator that provides a line by line loop without storing it into virtual memory. This is also related to using Tqdm progress bar since we need a quick way to count lines for setting progression timeline.

Bundle bowtie directly in the package

Installation instructions for snakelike need to include mamba

Rework the mapping so that it scales to the size of the real datasets

The example01_bowtieOutput.txt file contains data like this.

$ head example01/temp/example01_bowtieOutput.txt 
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   +   CP000020    11136   CGACACCGACGACGGTA   FFDDHHHHHCGHHHIJD   0   
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   -   CP000020    11151   TACAGGTCTTCACCACA   DJIHHHGCHHHHHDDFF   0   
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   -   CP000020    13422   TAACTTCATTGAAGAGA   DJIHHHGCHHHHHDDFF   0   
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   +   CP000020    13407   AGTTGAACCTCGTCGTA   FFDDHHHHHCGHHHIJD   0   
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   +   CP000020    13572   CAATTTATCTCAACTTA   FFDDHHHHHCGHHHIJD   0   
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   -   CP000020    13587   TAGAATTTAGATATATG   DJIHHHGCHHHHHDDFF   0   
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   +   CP000020    13572   CAATTTATCTCAACTTA   FFDDHHHHHCGHHHIJD   0   
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   -   CP000020    13587   TAGAATTTAGATATATG   DJIHHHGCHHHHHDDFF   0   
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   +   CP000020    13572   CAATTTATCTCAACTTA   FFDDHHHHHCGHHHIJD   0   
DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0://example01:GAAG:17_Y_I   -   CP000020    13587   TAGAATTTAGATATATG   DJIHHHGCHHHHHDDFF   0

The subsequent steps count up the insertions and do some basic math on them (contig CP000020, nucleotide 11136, orientation +). Goal of this issue is to find a much better way to do this simple task! Currently all of the barcoded samples sit together, but I am working to split that so that each barcode is handled separately (https://github.com/mandel01/pyinseq/issues/13).

Here the goal is to simplify what is currently a daisy-chained set of function calls in processMapping.py. The functions go over the data way many times. Need to streamline this.

A simple first step might be to rework how the table of genome features is handled. Right now that table is rebuilt on every feature lookup (multifasta2dict(fna)). Perhaps this should be built up front and then retained in memory. This is a dictionary but conceptually is a table of 5000 rows and less than 15 columns. I will do some reading to see if I should consider pandas here; could name the columns in the dataframe and then refer to them as needed.

test_pyinseq_genomeprep_script doesn't seem to work

Test passes even if the comparison directory is removed

License

I suggest either a CC-BY 4.0 license or another completely open license (e.g. BSD) for increasing the user base, more eyes on the code to make it better (once it goes public) etc.

See: http://ivory.idyll.org/blog/2015-on-licensing-in-bioinformatics.html

Improve testing

Currently we have some very basic tests that test whether the output of the example run worked. We need a much more robust testing setup.

In general:

Currently the tests only test the output. I think that tests should test the exact functions!! Getting this done right would allow us to set up TravisCI for the repo.
There are many ways to screw up the code and have it pass these simple tests. Below I'll start brainstorming some of these ways to inform future tests.

Also, we should consider moving from unittest to pytest as many new projects seem to be using it and the asserts seem quite easy to write and read.

ReadTheDocs documentation build is failing

Currently we use Sphinx, but let's port them to MkDocs since we typically write in Markdown. Here are some links...
https://www.mkdocs.org
https://mkdocs-alabaster.ale.sh

Process reads file (.fastq) by demultiplexing

Rework beginning of pipeline. User provides the experiment name and the reads file. First demultiplex into individual samples to be processed separately. Using the sample file that connects sample names to barcodes, demultiplex the reads file into:

directory (user provided the experiment name) = (pyinseq)/samples/experiment

If folder already exists then abort and notify user.
sample names should not contain spaces or special characters. If they do notify user and have them resubmit (is this already in the demultiplex.py barcodes_prep() code?)
some reads will not match any barcode - put these with the file prefix _other
instead of my slow code for going through .fastq files, considering using other code. Consider adapting from:
- screed
- khmer fastq-to-fasta

File directory structure will look something like this (with multiple experiments). The user could then proceed to analyze samples from just this experiment or across multiple experiments. Major advantage is that downstream steps can grab & process one file at a time.

Question - store as gzipped or not? I expect this will be often run on a laptop, so it likely makes sense to compromise on speed to save on storage. May want to revisit this or give an option here.

No module named 'importlib.metadata'

Hi I am trying to run, the below but getting the error, please help

% pyinseq -i GTAC-134320-RG_1.HL2C3DSX5_CCGCGGTTAT-ACAGCGCTAG_L002_R1.fastq -s bla.txt -g PROKKA_09032022.gbk -e BLA
2023-03-01 18:05 - INFO - pyinseq - Process command line arguments
2023-03-01 18:05 - INFO - utils - Writing configuration file from provided arguments
2023-03-01 18:05 - INFO - utils - Directory BLA already exists. Snakemake will create files that are missing.
2023-03-01 18:05 - INFO - pyinseq - Workflow selected: pyinseq
2023-03-01 18:05 - INFO - pyinseq - ***** BEGIN PYINSEQ WORKFLOW *****
ModuleNotFoundError in file /Users/naseersangwan/Desktop/Conda/envs/pyinseq-py37/lib/python3.7/site-packages/pyinseq/workflows/PyinseqWorkflow/Snakefile, line 18:
No module named 'importlib.metadata'
File "/Users/naseersangwan/Desktop/Conda/envs/pyinseq-py37/lib/python3.7/site-packages/pyinseq/workflows/PyinseqWorkflow/Snakefile", line 18, in
File "/Users/naseersangwan/Desktop/Conda/envs/pyinseq-py37/lib/python3.7/site-packages/pyinseq/demultiplex.py", line 12, in
File "/Users/naseersangwan/Desktop/Conda/envs/pyinseq-py37/lib/python3.7/site-packages/screed/init.py", line 39, in

Matplotlib from conda missing baseline images when `pyinseq --test` is called

The --test option of pyinseq, which uses pytest in the rootdir, throws this exception when called:

$ pyinseq --test
2021-08-09 14:55 - INFO - pyinseq - Process command line arguments
=============================================================================================== test session starts ===============================================================================================
platform darwin -- Python 3.6.13, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /Users/emanuelburgos/opt/miniconda3/envs/test_pyinseq_py36/lib/python3.6/site-packages
plugins: cov-2.12.1
collected 0 items / 1 error                                                                                                                                                                                       

===================================================================================================== ERRORS ======================================================================================================
__________________________________________________________________________________________ ERROR collecting test session __________________________________________________________________________________________
../importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:994: in _gcd_import
    ???
<frozen importlib._bootstrap>:971: in _find_and_load
    ???
<frozen importlib._bootstrap>:941: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:219: in _call_with_frames_removed
    ???
<frozen importlib._bootstrap>:994: in _gcd_import
    ???
<frozen importlib._bootstrap>:971: in _find_and_load
    ???
<frozen importlib._bootstrap>:955: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:665: in _load_unlocked
    ???
<frozen importlib._bootstrap_external>:678: in exec_module
    ???
<frozen importlib._bootstrap>:219: in _call_with_frames_removed
    ???
matplotlib/tests/__init__.py:7: in <module>
    'The baseline image directory does not exist. '
E   OSError: The baseline image directory does not exist. This is most likely because the test data is not installed. You may need to install matplotlib from source to get the test data.
============================================================================================= short test summary info =============================================================================================
ERROR  - OSError: The baseline image directory does not exist. This is most likely because the test data is not installed. You may need to install matplotlib from source to get the test data.

I believe this happens because calling pytest.main() does not correctly append the test directories to the PYTHONPATH. For now, you can clone the source code (or cd into site-packages/pyinseq in your python architecture) to verify the software. However, this will not cause issues since TravisCI testing frameworks are implemented in the pyinseq github

Bug in PyPi project description

Pyinseq 0.3.0 on Pypi has a bug in its project description that does not allow the logo to render properly. To fix this, the README should have a link that points to the Github repository's image which allows PyPi to find it. This fix should be included in pyinseq 0.3.1 including the removal of the disclaimer since it should be available on bioconda. Otherwise, this new version of the program will work just fine.