cONTent is a toolbox for analysing the length and quality of ONT long reads.
cONTent is composed of 3 sub-programs:
- extract : parse a read library (/!\ SHOULD BE A FASTQ FILE AND NOT A FASTQ.GZ) and extract each read's identifier, length and mean quality (Phred score). The results are saved as a tab-separated file with a '.content' extension.
- distribution : subsample one or more read libraries and plot read quality as a function of read length. Also compute basic statistics for these two measurements. NB : if several libraries are provided, an individual plot and statistics are generated for each library, in addition to a global plot and table.
- coverage : compute genome coverage using different length and quality cut-offs and display the results as a heatmap. This program can be useful to choose the minimal read length and quality cut-offs needed to reach a target genome coverage. NB : the output table only displays the rows (combinations of minimal read length and quality) whose coverage satisfies the required coverage.
Program usage and outputs are described in detail in the 'Usage' section below.
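To make the three extracted values concrete, here is a minimal Python sketch of what a per-read extraction could look like. It is not the C++ parser shipped with cONTent: it assumes 4-line FASTQ records, Phred+33 encoding, and a plain arithmetic mean of the per-base Phred scores. The function names and the output column order (identifier, length, mean quality) are illustrative assumptions.

```python
import io


def mean_phred(quality_line, offset=33):
    """Arithmetic mean of the per-base Phred scores (assumes Phred+33)."""
    if not quality_line:
        return 0.0
    return sum(ord(char) - offset for char in quality_line) / len(quality_line)


def extract_records(handle):
    """Yield (read_id, length, mean_quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip blank lines between or after records
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        read_id = header[1:].split()[0]  # drop '@' and any description
        yield read_id, len(sequence), mean_phred(quality)


# Tiny in-memory library, used here instead of a real fastq file.
example = io.StringIO("@read1 runid=abc\nACGT\n+\nIIII\n@read2\nAC\n+\n+5\n")
rows = list(extract_records(example))
for read_id, length, quality in rows:
    print(f"{read_id}\t{length}\t{quality:.2f}")
```

The real tool may compute the mean quality differently (for example by averaging error probabilities), so treat the numbers from this sketch as indicative only.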
We advise creating a dedicated virtual environment (here we use Conda) to install cONTent. The following lines will :
- create a python 3.10.4 conda environment named 'content_env'
- activate the 'content_env'
conda create -n content_env -y python=3.10.4
conda activate content_env
Go to the desired location and clone the repository.
cd <INSERT_PATH_OF_DESIRED_LOCATION>
git clone https://github.com/DjampaKozlowski/cONTent.git
cd cONTent/
cONTent relies on various scripts written in Python and C++ (the C++ part extracts information from the fastq file). You can install cONTent following one of two strategies :
- automatic
- manual
Execute the following line :
make install
This will :
- install the python dependencies and install cONTent as a python module
- create a directory named build in content
- compile the C++ program content/fastq_processor.cpp and generate an executable stored in content/build.
echo "Installing python requirements & installing the content module"
pip install -r requirements.txt
pip install -e .
echo "Building the fastq parser"
mkdir -p content/build/
g++ -std=c++11 content/fastq_processor.cpp -o content/build/fastq_processor
Activate the environment (if not already done)
conda activate content_env
NB : parameters between brackets are optional parameters with default values.
Launch cONTent.py extract as follows :
python cONTent.py extract [-h] -i INPUTFILEPATH -o OUTPUTFILEDIR
where:
- < INPUTFILEPATH > : path to the input fastq file (/!\ SHOULD BE A FASTQ FILE AND NOT A FASTQ.GZ) [mandatory]
- < OUTPUTFILEDIR > : output directory path [mandatory]. The output file is named after the input file, with the extension .content.
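Since extract expects an uncompressed fastq file, a compressed library has to be decompressed first. The sketch below shows one way to do it with the Python standard library. The helper name decompress_fastq and the file names are hypothetical, and the demo writes a tiny gzip file into a temporary directory rather than touching real data.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_fastq(gz_path, out_dir):
    """Write an uncompressed copy of a .gz file into out_dir and return its path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / gz_path.stem  # 'library.fastq.gz' -> 'library.fastq'
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streamed copy, no full load in memory
    return out_path


# Demo on a throwaway file.
record = b"@read1\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    gz_file = Path(tmp) / "library.fastq.gz"
    with gzip.open(gz_file, "wb") as handle:
        handle.write(record)
    fastq_file = decompress_fastq(gz_file, Path(tmp) / "plain")
    restored_name = fastq_file.name
    restored_bytes = fastq_file.read_bytes()

print(restored_name)
```

The resulting plain fastq file can then be passed to extract with -i.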
Information extraction is a time-consuming process. If possible, we advise running one process per library rather than concatenating the libraries and running a single process. The other tools from the cONTent suite are designed to merge information from multiple cONTent.py extract output files.
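One way to follow the one-process-per-library advice is to launch the extract command once per fastq file from a small Python driver. This is only a sketch: the helper names, the library file names and the number of workers are assumptions, and the commands must be run from the repository root so that cONTent.py is found. With dry_run=True the commands are only built, not executed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_extract_command(fastq_path, output_dir):
    """Build the extract command line documented above for one library."""
    return ["python", "cONTent.py", "extract",
            "-i", str(fastq_path), "-o", str(output_dir)]


def run_extractions(fastq_paths, output_dir, max_workers=2, dry_run=False):
    """Run one extract process per library, max_workers at a time."""
    commands = [build_extract_command(path, output_dir) for path in fastq_paths]
    if dry_run:
        return commands  # nothing is executed
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # check=True raises CalledProcessError if a process fails
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return commands


# Hypothetical library names, dry run only.
planned = run_extractions(["lib1.fastq", "lib2.fastq"], "extracted", dry_run=True)
for command in planned:
    print(" ".join(command))
```

Threads are enough here because each worker only waits for an external process; the heavy work is done by the extract processes themselves.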
Launch cONTent.py distrib as follows :
python cONTent.py distrib [-h] -input INPUTPATH -outdir OUTPUTPATH -prefix PREFIX [-fraction FRACTION]
- < INPUTPATH > : Input directory/file path. If the path points to a directory, all the '.content' files will be analysed (individually and together). [mandatory]
- < OUTPUTPATH > : Output directory path. NB : if the output directory does not exist, it will be created along with its parent directories. If only a directory name is provided, the directory will be created in the execution directory. In any case, if the directory exists, it will be overwritten, as well as any files it contains that have the same name as an output file. [mandatory]
- < PREFIX > : Prefix used to name the output files and as the plots' title (for the global analysis). Spaces will be replaced with '_' in the file names. [mandatory]
- < FRACTION > : fraction of reads to subsample per analysed library (distribution plot only). The larger the fraction, the longer the analysis takes. (default : 0.01)
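To illustrate the effect of the fraction parameter, here is a rough Python sketch of subsampling a library and computing a few summary statistics. The exact sampling scheme and the statistics reported by cONTent are not described here, so both are assumptions: the sketch draws a fixed-size random sample without replacement and reports count, min, max, mean and median. The rows mimic (identifier, length, mean quality) records with made-up values.

```python
import random
import statistics


def subsample(rows, fraction, seed=0):
    """Draw round(fraction * len(rows)) rows without replacement (at least one)."""
    size = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, size)


def basic_stats(values):
    """Summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# Synthetic library of 200 reads: (identifier, length, mean quality).
library = [(f"read{i}", 1000 + 10 * i, 10 + i % 10) for i in range(200)]
sample = subsample(library, fraction=0.1)

print(basic_stats([length for _, length, _ in sample]))
print(basic_stats([quality for _, _, quality in sample]))
```

With the default fraction of 0.01, a library of one million reads would contribute about ten thousand points to the plot under this scheme.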
Launch cONTent.py coverage as follows :
python cONTent.py coverage [-h] -input INPUTPATH -outdir OUTPUTPATH -prefix PREFIX -genomesize GENOMESIZE [-n N] [-m M] [-mincoverage MINCOV] [-minquality MINQ] [-minlength MINL]
- < INPUTPATH > : Input directory/file path. If the path points to a directory, all the '.content' files will be analysed (individually and together). [mandatory]
- < OUTPUTPATH > : Output directory path. NB : if the output directory does not exist, it will be created along with its parent directories. If only a directory name is provided, the directory will be created in the execution directory. In any case, if the directory exists, it will be overwritten, as well as any files it contains that have the same name as an output file. [mandatory]
- < PREFIX > : Prefix used to name the output files and as the plots' title (for the global analysis). Spaces will be replaced with '_' in the file names. [mandatory]
- < GENOMESIZE > : Genome size (bp). Necessary to compute genome coverage. [mandatory]
- < N > : Number of intervals to create in read length space (optimization plot only; used to compute coverage). Increasing n makes the coverage length/quality trade-off analysis more precise but also more time-consuming. (default : 100)
- < M > : Number of intervals to create in read quality space (optimization plot only; used to compute coverage). Increasing m makes the coverage length/quality trade-off analysis more precise but also more time-consuming. (default : 100)
- < MINCOV > : Minimal coverage to represent (optimization plot only). (default : 20)
- < MINQ > : Minimal quality to represent (optimization plot only). (default : 12)
- < MINL > : Minimal length of sequences to represent (optimization plot only). (default : 1000)
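The length/quality trade-off can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used by cONTent: coverage is taken as the total length of the reads that pass both cut-offs divided by the genome size, the cut-offs are treated as inclusive, and the cut-off values are passed explicitly instead of being derived from n and m intervals. Only the combinations reaching the minimal coverage are kept, as in the output table.

```python
def coverage(reads, genome_size, min_length, min_quality):
    """Total length of reads passing both cut-offs, divided by the genome size."""
    kept_bases = sum(
        length
        for length, quality in reads
        if length >= min_length and quality >= min_quality
    )
    return kept_bases / genome_size


def coverage_grid(reads, genome_size, length_cutoffs, quality_cutoffs, min_coverage):
    """List (min_length, min_quality, coverage) for combinations reaching min_coverage."""
    rows = []
    for min_length in length_cutoffs:
        for min_quality in quality_cutoffs:
            value = coverage(reads, genome_size, min_length, min_quality)
            if value >= min_coverage:
                rows.append((min_length, min_quality, value))
    return rows


# Toy data: (length in bp, mean quality) for four reads and a 1000 bp genome.
reads = [(1000, 10.0), (2000, 15.0), (3000, 20.0), (4000, 12.0)]
table = coverage_grid(reads, 1000, [1000, 2500], [12, 18], min_coverage=4)
for min_length, min_quality, value in table:
    print(f"{min_length}\t{min_quality}\t{value:.1f}")
```

In this toy example, raising the quality cut-off from 12 to 18 drops the coverage to 3x for both length cut-offs, so those combinations are left out of the table.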