cONTent

Description

cONTent is a toolbox for analysing the length and quality of ONT (Oxford Nanopore Technologies) long reads.

cONTent is composed of 3 sub-programs:

extract : parse a read library (/!\ MUST BE AN UNCOMPRESSED FASTQ FILE, NOT A FASTQ.GZ) and extract, for each read, its identifier, length and mean quality (Phred score). The results are saved as a tab-separated file with a '.content' extension.
distribution (command: distrib) : subsample one or more read libraries and plot read quality as a function of read length. Also compute basic statistics for these two measurements. NB : if several libraries are provided, an individual plot and statistics are generated for each library, in addition to a global plot and table.
coverage : compute genome coverage for different length and quality cut-offs and display the results as a heatmap. This program can be useful for choosing minimal read length and quality cut-offs that reach a target genome coverage. NB : the output table only displays rows for which the coverage obtained with the given minimal read length and quality satisfies the required coverage.
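To illustrate the three values that extract reports per read, here is a minimal Python sketch. It is not the C++ parser shipped with cONTent: it assumes 4-line FASTQ records, Phred+33 quality encoding, and a plain arithmetic mean of the per-base Phred scores. The function names are hypothetical.

```python
from statistics import mean


def read_summary(header, sequence, quality, offset=33):
    """Return (read_id, length, mean_phred) for one FASTQ record.

    Assumes Phred+33 encoding and an arithmetic mean of the scores;
    the real tool may compute the average differently.
    """
    read_id = header[1:].split()[0]  # drop the leading '@', keep the first token
    scores = [ord(char) - offset for char in quality]
    return read_id, len(sequence), mean(scores)


def summarize_fastq(lines):
    """Yield one summary per 4-line FASTQ record from an iterable of lines."""
    stripped = (line.rstrip("\n") for line in lines)
    # Group the stream four lines at a time: header, sequence, '+', quality.
    for header, sequence, _plus, quality in zip(*[stripped] * 4):
        yield read_summary(header, sequence, quality)


example = ["@read1 runid=abc\n", "ACGT\n", "+\n", "!+5?\n"]
for read_id, length, mean_quality in summarize_fastq(example):
    # One tab-separated line per read: identifier, length, mean quality.
    print(f"{read_id}\t{length}\t{mean_quality:.2f}")
```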

Program usage and outputs are described in detail in the 'Usage' section below.

Installation guide

Create a dedicated virtual environment (OPTIONAL)

We advise creating a dedicated virtual environment (here we use Conda) in which to install cONTent. The following lines will

create a Python 3.10.4 conda environment named 'content_env'
activate the 'content_env' environment

conda create -n content_env -y python=3.10.4
conda activate content_env

Clone the github repository

Go to the desired location and clone the repository.

cd <INSERT_PATH_OF_DESIRED_LOCATION>
git clone https://github.com/DjampaKozlowski/cONTent.git
cd cONTent/

Install cONTent.

cONTent relies on several Python scripts and a C++ program (used to extract information from the fastq file). You can install cONTent in one of two ways :

automatic
manual

Automatic installation

Execute the following line :

make install

This will :

install the python dependencies and install cONTent as a python module
create a directory named build in content
compile the C++ program content/fastq_processor.cpp and generate an executable stored in content/build.

Manual installation

echo "Installing python requirements & installing the content module"
pip install -r requirements.txt
pip install -e .
echo "Building the fastq parser"
mkdir -p content/build/
g++ -std=c++11 content/fastq_processor.cpp -o content/build/fastq_processor

Usage

Activate the environment (if not already done)

conda activate content_env

NB : parameters between brackets are optional parameters with default values.

extract

Run cONTent.py extract with :

python cONTent.py extract [-h] -i INPUTFILEPATH -o OUTPUTFILEDIR

where:

< INPUTFILEPATH > : path to the input fastq file (/!\ MUST BE AN UNCOMPRESSED FASTQ FILE, NOT A FASTQ.GZ) [mandatory]
< OUTPUTFILEDIR > : output directory path [mandatory]. The output file is named after the input file, with the extension .content.
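Because extract only accepts uncompressed FASTQ files, a compressed library has to be decompressed first. The following sketch does this with the Python standard library; the helper name decompress_fastq is hypothetical and is not part of cONTent.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fastq(gz_path, out_dir):
    """Decompress a .fastq.gz file into out_dir and return the new path.

    Hypothetical helper, not part of cONTent. The data is streamed,
    so large libraries are not loaded into memory.
    """
    gz_path = Path(gz_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # 'library.fastq.gz' becomes 'library.fastq'
    out_path = out_dir / gz_path.name.removesuffix(".gz")
    with gzip.open(gz_path, "rb") as source, open(out_path, "wb") as target:
        shutil.copyfileobj(source, target)
    return out_path
```

The returned path can then be passed to the -i option of cONTent.py extract.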

Information extraction is a time-consuming process. If possible, we advise running one process per library rather than concatenating the libraries and running a single process. The other tools of the cONTent suite are designed to merge information from multiple cONTent.py extract output files.
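One way to follow this advice is to build one extract command per library and start them as separate processes. The sketch below assumes the libraries are '.fastq' files in a single directory and that cONTent.py is in the working directory; the function names are hypothetical, and only the -i and -o options documented above are used.

```python
import subprocess
from pathlib import Path


def build_extract_commands(fastq_dir, out_dir, script="cONTent.py"):
    """Return one 'extract' command (as an argument list) per .fastq file."""
    commands = []
    for fastq in sorted(Path(fastq_dir).glob("*.fastq")):
        commands.append(
            ["python", script, "extract", "-i", str(fastq), "-o", str(out_dir)]
        )
    return commands


def run_all(commands):
    """Start every command as its own process, then wait for all of them.

    Returns the list of exit codes, in the same order as the commands.
    """
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]
```

Starting all processes at once is only reasonable for a handful of libraries; with many libraries, the commands would need to be run in smaller batches.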

distrib

Run cONTent.py distrib with :

python cONTent.py distrib [-h] -input INPUTPATH -outdir OUTPUTPATH -prefix PREFIX [-fraction FRACTION]

< INPUTPATH > : Input directory/file path. If the path points to a directory, all the '.content' files it contains will be analysed (individually and together). [mandatory]
< OUTPUTPATH > : Output directory path. NB : if the output directory does not exist, it will be created along with its parent directories. If only a directory name is provided, the directory will be created in the execution directory. In any case, if the directory already exists, it will be overwritten, along with any files in it that have the same names as the output files. [mandatory]
< PREFIX > : Prefix used to name the output files and as the plot title (for the global analysis). Spaces will be replaced with '_' in the file names. [mandatory]
< FRACTION > : Fraction of reads to subsample per analysed library (distribution plot only). The larger the fraction, the longer the analysis will take. (default : 0.01)
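The effect of FRACTION can be sketched as follows. This is an illustration rather than the tool's own code: it assumes each '.content' line holds a read identifier, a length and a mean quality separated by tabs, with no header line, and it draws a uniform random sample without replacement. All function names are hypothetical.

```python
import random
from statistics import mean, median


def parse_content_lines(lines):
    """Parse tab-separated (id, length, quality) lines; no header is assumed."""
    rows = []
    for line in lines:
        read_id, length, quality = line.rstrip("\n").split("\t")
        rows.append((read_id, int(length), float(quality)))
    return rows


def subsample(rows, fraction=0.01, seed=0):
    """Return a random sample holding about `fraction` of the rows."""
    # Keep at least one row when the library is not empty.
    size = min(len(rows), max(1, round(len(rows) * fraction)))
    return random.Random(seed).sample(rows, size)


def basic_stats(values):
    """Return a few summary statistics for a non-empty list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }
```

With the default fraction of 0.01, a library of 1,000 reads yields a sample of 10 reads under these assumptions.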

coverage

Run cONTent.py coverage with :

python cONTent.py coverage [-h] -input INPUTPATH -outdir OUTPUTPATH -prefix PREFIX -genomesize GENOMESIZE [-n N] [-m M] [-mincoverage MINCOV] [-minquality MINQ] [-minlength MINL]

< INPUTPATH > : Input directory/file path. If the path points to a directory, all the '.content' files it contains will be analysed (individually and together). [mandatory]
< OUTPUTPATH > : Output directory path. NB : if the output directory does not exist, it will be created along with its parent directories. If only a directory name is provided, the directory will be created in the execution directory. In any case, if the directory already exists, it will be overwritten, along with any files in it that have the same names as the output files. [mandatory]
< PREFIX > : Prefix used to name the output files and as the plot title (for the global analysis). Spaces will be replaced with '_' in the file names. [mandatory]
< GENOMESIZE > : Genome size (bp). Required to compute genome coverage. [mandatory]
< N > : Number of intervals to create in read length space (optimization plot only; used to compute coverage). Increasing n makes the coverage length/quality trade-off analysis more precise but also more time-consuming. (default : 100)
< M > : Number of intervals to create in read quality space (optimization plot only; used to compute coverage). Increasing m makes the coverage length/quality trade-off analysis more precise but also more time-consuming. (default : 100)
< MINCOV > : Minimal coverage to represent (optimization plot only). (default : 20)
< MINQ > : Minimal quality to represent (optimization plot only). (default : 12)
< MINL > : Minimal length of sequences to represent (optimization plot only). (default : 1000)
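The length/quality trade-off can be sketched as a small grid computation. The sketch assumes that coverage is the total length of the retained reads divided by the genome size and that cut-offs are inclusive; neither detail is confirmed by this document, and the function names are hypothetical.

```python
def coverage(reads, genome_size, min_length, min_quality):
    """Coverage from reads passing both cut-offs (assumed inclusive).

    `reads` is an iterable of (length, mean_quality) pairs.
    """
    retained_bases = sum(
        length
        for length, quality in reads
        if length >= min_length and quality >= min_quality
    )
    return retained_bases / genome_size


def coverage_grid(reads, genome_size, length_cutoffs, quality_cutoffs, min_coverage):
    """Return (min_length, min_quality, coverage) rows reaching min_coverage."""
    reads = list(reads)  # allow several passes over the reads
    rows = []
    for min_length in length_cutoffs:
        for min_quality in quality_cutoffs:
            value = coverage(reads, genome_size, min_length, min_quality)
            if value >= min_coverage:
                rows.append((min_length, min_quality, value))
    return rows


# Three toy reads on a 1,000 bp genome.
toy_reads = [(1000, 10.0), (2000, 15.0), (3000, 20.0)]
for row in coverage_grid(toy_reads, 1000, [1000, 2000], [12, 16], min_coverage=4):
    print(row)
```

In this toy example, only the cut-off pairs with a minimal quality of 12 reach the requested coverage, which mirrors how the output table keeps only the rows that satisfy the required coverage.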
