content's Introduction

cONTent

Description

cONTent is a tool-box allowing the analysis of ONT long-reads length and quality.

cONTent is composed of 3 sub-programs:

  • extract : parse a read library (/!\ MUST BE AN UNCOMPRESSED FASTQ FILE, NOT A FASTQ.GZ) and extract each read's identifier, length and mean quality (Phred score). The results are saved as a tab-separated file with a '.content' extension.
  • distribution : subsample the read librar(y/ies) and plot read quality as a function of read length. Basic statistics are also computed for these two measurements. NB : if several libraries are provided, an individual plot and statistics table are generated for each library, in addition to a global plot and table.
  • coverage : compute genome coverage using different length and quality cut-offs and display the results as a heatmap. This program can help set the minimal read length and quality cut-offs needed to reach a target genome coverage. NB : the output table only displays rows for which the coverage obtained with the corresponding minimal read length and quality satisfies the required coverage.

Program usage and outputs are described in detail in the 'Usage' section below.

Installation guide

Create a dedicated virtual environment (OPTIONAL)

It is advised to create a dedicated virtual environment (here we use Conda) to install cONTent. The following lines will:

  • create a python 3.10.4 conda environment named 'content_env'
  • activate the 'content_env'
conda create -n content_env -y python=3.10.4
conda activate content_env

Clone the github repository

Go to the desired location and clone the repository.

cd <INSERT_PATH_OF_DESIRED_LOCATION>
git clone https://github.com/DjampaKozlowski/cONTent.git
cd cONTent/

Install cONTent.

cONTent relies on various Python scripts and a C++ program (used to extract information from the FASTQ file). You can install cONTent following one of two strategies :

  • automatic
  • manual

Automatic installation

Execute the following line :

make install

This will :

  • install the python dependencies and install cONTent as a python module
  • create a directory named build in content
  • compile the C++ program content/fastq_processor.cpp and generate an executable stored in content/build.

Manual installation

echo "Installing python requirements & installing the content module"
pip install -r requirements.txt
pip install -e .
echo "Building the fastq parser"
mkdir -p content/build/
g++ -std=c++11 content/fastq_processor.cpp -o content/build/fastq_processor

Usage

Activate the environment (if not already done)

conda activate content_env

NB : parameters between brackets are optional parameters with default values.

extract

Launch cONTent.py extract with :

python cONTent.py extract [-h] -i INPUTFILEPATH -o OUTPUTFILEDIR

where:

  • < INPUTFILEPATH > : input FASTQ file path (/!\ MUST BE AN UNCOMPRESSED FASTQ FILE, NOT A FASTQ.GZ) [mandatory]
  • < OUTPUTFILEDIR > : output directory path [mandatory]. The output file is named after the input file, with the extension .content.
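
Since extract only accepts uncompressed FASTQ input, a gzipped library has to be decompressed first. The sketch below is one way to do that with the Python standard library. The function name and the file paths are illustrative assumptions, not part of cONTent.

```python
import gzip
import shutil


def decompress_fastq(gz_path, fastq_path):
    """Decompress a .fastq.gz file to a plain FASTQ file (hypothetical helper)."""
    # copyfileobj streams the data in chunks, so the whole library
    # is never held in memory at once.
    with gzip.open(gz_path, "rb") as source, open(fastq_path, "wb") as target:
        shutil.copyfileobj(source, target)
    return fastq_path
```

For example, `decompress_fastq("library.fastq.gz", "library.fastq")` would produce a file that can then be passed to `-i`.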

Information extraction is a time-consuming process. If possible, we advise running one process per library rather than concatenating the libraries and running a single process. The other tools of the cONTent suite are designed to merge information from multiple cONTent.py extract output files.
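
To illustrate what the extraction step computes, here is a minimal Python sketch that streams a FASTQ file and reports each read's identifier, length and mean Phred quality. It is not the C++ parser shipped with cONTent. It assumes 4-line records, a Phred+33 encoding and an arithmetic mean of the per-base scores; the column order and number formatting of real '.content' files may differ.

```python
import io


def iter_read_stats(handle, phred_offset=33):
    """Yield (read_id, length, mean_quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        id_fields = header[1:].split()
        if not header.startswith("@") or not id_fields or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        # Assumption: plain arithmetic mean of the per-base Phred scores.
        scores = [ord(char) - phred_offset for char in quality]
        mean_quality = sum(scores) / len(scores) if scores else 0.0
        yield id_fields[0], len(sequence), mean_quality


def write_stats(fastq_handle, out_handle):
    """Write one tab-separated line per read: id, length, mean quality."""
    for read_id, length, mean_quality in iter_read_stats(fastq_handle):
        out_handle.write(f"{read_id}\t{length}\t{mean_quality:.2f}\n")


if __name__ == "__main__":
    demo = io.StringIO("@read1 runid=x\nACGT\n+\nIIII\n@read2\nAC\n+\n5?\n")
    out = io.StringIO()
    write_stats(demo, out)
    print(out.getvalue(), end="")
```

Because records are read one at a time, memory use in this sketch stays roughly constant regardless of the library size.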

distrib

Launch cONTent.py distrib with :

python cONTent.py distrib [-h] -input INPUTPATH -outdir OUTPUTPATH -prefix PREFIX [-fraction FRACTION]

where:

  • < INPUTPATH > : Input directory/file path. If the path points to a directory, all the '.content' files it contains will be analysed (individually and together). [mandatory]
  • < OUTPUTPATH > : Output directory path. NB : if the output directory does not exist, it will be created along with its parent directories. If only a directory name is provided, the directory will be created in the execution directory. In any case, if the directory exists, it will be overwritten, as well as any files it contains that share a name with the output files. [mandatory]
  • < PREFIX > : Prefix used to name the output files and as the plot title (for the global analysis). Spaces are replaced with '_' in the file names. [mandatory]
  • < FRACTION > : fraction of reads to subsample per analysed library (distribution plot only). The larger the fraction, the longer the analysis will take. (default : 0.01)
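
The effect of FRACTION can be pictured with the following Python sketch, which keeps each read with a fixed probability and then summarises the retained values. The sampling scheme, the seed and the set of statistics are assumptions made for illustration; the source does not state how cONTent subsamples or which statistics it reports.

```python
import random
import statistics


def subsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [read for read in reads if rng.random() < fraction]


def basic_stats(values):
    """Return a few summary statistics for a non-empty list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
    }


if __name__ == "__main__":
    # Hypothetical (length, mean quality) pairs standing in for a library.
    reads = [(1000 + 10 * i, 10 + (i % 15)) for i in range(1000)]
    sampled = subsample(reads, fraction=0.01)
    if sampled:
        print(basic_stats([length for length, _ in sampled]))
```

With this scheme the number of retained reads is only approximately `fraction` times the library size, which is usually acceptable for plotting.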

coverage

Launch cONTent.py coverage with :

python cONTent.py coverage [-h] -input INPUTPATH -outdir OUTPUTPATH -prefix PREFIX -genomesize GENOMESIZE [-n N] [-m M] [-mincoverage MINCOV] [-minquality MINQ] [-minlength MINL]

where:

  • < INPUTPATH > : Input directory/file path. If the path points to a directory, all the '.content' files it contains will be analysed (individually and together). [mandatory]
  • < OUTPUTPATH > : Output directory path. NB : if the output directory does not exist, it will be created along with its parent directories. If only a directory name is provided, the directory will be created in the execution directory. In any case, if the directory exists, it will be overwritten, as well as any files it contains that share a name with the output files. [mandatory]
  • < PREFIX > : Prefix used to name the output files and as the plot title (for the global analysis). Spaces are replaced with '_' in the file names. [mandatory]
  • < GENOMESIZE > : Genome size (bp). Necessary to compute genome coverage. [mandatory]
  • < N > : Number of intervals to create in read length space (optimization plot only; used to compute coverage). Increasing N makes the coverage length/quality trade-off analysis more precise but also more time-consuming. (default : 100)
  • < M > : Number of intervals to create in read quality space (optimization plot only; used to compute coverage). Increasing M makes the coverage length/quality trade-off analysis more precise but also more time-consuming. (default : 100)
  • < MINCOV > : Minimal coverage to represent (optimization plot only). (default : 20)
  • < MINQ > : Minimal quality to represent (optimization plot only). (default : 12)
  • < MINL > : Minimal sequence length to represent (optimization plot only). (default : 1000)
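
As a rough illustration of how these parameters interact, the Python sketch below computes coverage as the total number of bases in reads passing both cut-offs divided by the genome size, then scans a grid of N length cut-offs by M quality cut-offs and keeps the combinations that reach the minimal coverage. The evenly spaced grid and all function names are assumptions; cONTent's actual interval construction is not described in the source.

```python
def coverage(reads, genome_size, min_length, min_quality):
    """Coverage from reads with length >= min_length and quality >= min_quality."""
    kept_bases = sum(
        length
        for length, quality in reads
        if length >= min_length and quality >= min_quality
    )
    return kept_bases / genome_size


def linspace(start, stop, count):
    """Return `count` evenly spaced values from start to stop (inclusive)."""
    if count == 1:
        return [float(start)]
    step = (stop - start) / (count - 1)
    return [start + index * step for index in range(count)]


def coverage_table(reads, genome_size, n=100, m=100,
                   min_coverage=20, min_quality=12, min_length=1000):
    """List (length_cutoff, quality_cutoff, coverage) rows meeting min_coverage.

    `reads` must be a non-empty list of (length, mean_quality) pairs.
    """
    max_length = max(length for length, _ in reads)
    max_quality = max(quality for _, quality in reads)
    rows = []
    for length_cutoff in linspace(min_length, max_length, n):
        for quality_cutoff in linspace(min_quality, max_quality, m):
            value = coverage(reads, genome_size, length_cutoff, quality_cutoff)
            if value >= min_coverage:
                rows.append((length_cutoff, quality_cutoff, value))
    return rows
```

This naive version rescans every read for each of the N x M grid cells, which is consistent with larger N and M values taking longer; a real implementation would likely sort or bin the reads first.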

content's People

Contributors

arthurpere, djampakozlowski

content's Issues

Error for coverage

Hello,

I get an error when running the coverage part. Here is the traceback:

Traceback (most recent call last):
  File "/home/tools/cONTent/bin/cONTent.py", line 14, in <module>
    import content.coverage as cvrg
  File "/home/tools/cONTent/content/coverage.py", line 6, in <module>
    sns.set_style("white")
NameError: name 'sns' is not defined. Did you mean: 'sys'?

The statement import seaborn as sns is missing from content/coverage.py.

Regards,
Arthur

Too much memory

Hello,

I have a problem with memory management in the extract part.

With a single 11 GB input file, memory usage is very high relative to the file size: when I stopped the run, the extraction alone was using 30 GB of RAM.

Regards,
Arthur

Warning for the plots

Hello,

I get a warning in the plotting part of the program:

RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  f = plt.figure(figsize=(height, height))

To solve this problem, close each figure once it is no longer needed:

plt.close(fig)

Regards,
Arthur
