RNAEditingIndexer

A tool for calculating RNA editing levels from RNA-seq data.

Installation and Requirements

Docker

A Docker file containing the tool is included in the package. We recommend using it where possible to avoid dependency conflicts and manual configuration. For more in-depth instructions, see the Docker.README file in the Docs directory. The image is based on the BioContainers base image.

Dependencies

SAMtools - version 1.8 or higher (tested on 1.8)
bedtools - version 2.26 or higher (tested on 2.26)
bamUtils
Java - any recent version (with javac, i.e. an SDK)
Python 2.7 (a clean installation is sufficient)

OS Requirements

The program currently supports only GNU/Linux operating systems (and probably works on any other POSIX OS).

(The binary executable was compiled using PyInstaller on a 64-bit CentOS 6.10 operating system. Older and/or 32-bit operating systems may not be able to run it properly; please use the Docker image in these cases.)

CPU and Memory

The program makes low demands on system resources (CPU and memory). Only the default resource requirements of SAMtools and bedtools are needed (they are run with default CPU and memory parameters to generate the CMPileups). For the rest of the processing (after the creation of the CMPileups), the program demands very little. Note, however, that the default thread number is high; it can easily be changed using command line parameters.

Disk Space

The installation requires a bit more than 12G of free disk space, almost all of which (~11.7G) is for the built-in resources (built-in genomes and tables, which are not mandatory for running; see below for details on installing and running without downloading them).

Local Installation

(Installation time on desktop computers should not exceed about 15 minutes. Downloading the data tables may take longer, depending on the internet connection.)
Prior to installation, you need to run a configuration bash script (configure.sh, see below). It includes tests for the various programs required, and initialization of variables for the installation. If any of the tests fail (except for bamUtils), the configuration is aborted.

Any of the used paths (to the resources directory and the programs) can be set at this stage. Run configure.sh -h to see all options.

NOTE: The installation will by default download the built-in genomes (unzipped) and tables (gzipped). This requires about 12G of disk space.
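
Since the download needs about 12G, it can be worth checking the free space before starting. The following is a minimal sketch, not part of the package: has_free_space is a hypothetical helper name, and it relies only on the POSIX output format of df.

```shell
# Hypothetical helper (not shipped with RNAEditingIndexer): succeeds if the
# file system holding DIR has at least REQUIRED_GB gigabytes available.
has_free_space() {
    local dir="$1"
    local required_gb="$2"
    local avail_kb
    # df -Pk prints one header line, then one line per file system, with the
    # available space (in 1K blocks) in the fourth column.
    avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}

# Example: warn before installing into the current directory.
# has_free_space . 12 || echo "Less than 12G free, consider skipping the downloads"
```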

#change working dir to the installation dir

cd ./RNAEditingIndexer

#configure installation environment variables

. ./configure.sh

make

Resources File

The installation creates a file named ResourcesPaths.ini at <InstallPath>/src/RNAEditingIndex/Configs (set with configure.sh), which specifies the default paths to the required programs and data files (such as genomes and tables). Modify this file after installation to change the defaults (for example, if the data files were not downloaded).

Running

Simply run RNAEditingIndex -h to see full help.

An example for a simple run:

<InstallPath>/RNAEditingIndex -d <BAMs directory> -f Aligned.sortedByCoord.out.bam -l <logs directory> -o <cmpileup output directory> -os <summary files directory> --genome hg38

Typical runtime

Typical runtime, taking parallelization into account, is around 20-30 minutes per sample on servers (for a BAM of 50 million reads). It can be up to four times longer on desktop computers, depending on BAM size (i.e. coverage).

Logging and flags

Under the chosen logging directory a flags directory is created. It contains a flag file for each sample name processed (of the format <sample name>.flg). To re-run samples, the flags belonging to them must be deleted; otherwise those samples will be skipped. This feature enables running several instances of the program in parallel, and re-running with the same parameters on only a subset of the samples (e.g. ones that failed). The logging directory also contains a main log (named EditingIndex.<timestamp>.log) with timestamps for each (internal) command and each sample processed. This is the place to check for progress and errors.
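
Clearing the flags of a few samples before a re-run can be sketched as below. This is an illustration under assumptions, not part of the tool: clear_flags is a hypothetical helper, and the exact name of the flags directory under your logs directory should be checked before use.

```shell
# Hypothetical helper (not shipped with RNAEditingIndexer): delete the flag
# files of the given samples so that the next run processes them again.
# Usage: clear_flags <flags directory> <sample name>...
clear_flags() {
    local flags_dir="$1"
    shift
    local sample
    for sample in "$@"; do
        # Flag files are named <sample name>.flg; -f ignores missing files.
        rm -f -- "${flags_dir}/${sample}.flg"
    done
}

# Example (the directory name "flags" is an ASSUMPTION, verify it locally):
# clear_flags "<logs directory>/flags" sampleA sampleB
```

Flags of samples that are not listed are left in place, so those samples are still skipped on the next run.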

Inputs

Alignments

The input is a directory containing alignment (BAM) files. The directory can be nested (i.e. folders within folders); the program looks for the BAM files recursively.
Note: alignments should be unique. (Non-unique alignment may create unpredictable, algorithm-dependent biases.)
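
To preview which files a given directory and suffix would pick up, a recursive search can be approximated with find. This is only a sketch of the idea: list_bams is a hypothetical helper, and the program's own matching logic may differ in details.

```shell
# Hypothetical helper (not shipped with RNAEditingIndexer): list all regular
# files under BAM_DIR, at any depth, whose names end with SUFFIX.
# Usage: list_bams <BAMs directory> <suffix>
list_bams() {
    local bam_dir="$1"
    local suffix="$2"
    find "$bam_dir" -type f -name "*${suffix}" | sort
}

# Example:
# list_bams "<BAMs directory>" Aligned.sortedByCoord.out.bam
```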

Genome and Annotations

You can use any of the built-in genomes (and their corresponding annotations) without providing any additional parameters (using the --genome option). However, any of the resources used (indexed regions BED, SNPs, gene annotations and expression levels, and genome) can be provided by the user instead. See the help and the documentation for details.

Outputs and Output Directories

Temporary Outputs - CMPileup and genome index files

CMPileups (pileup files converted into a numerical format; for more details see the full documentation) are created in the directory specified by the -o flag. Because they are usually very large, they are deleted after processing unless specified otherwise (with the keep_cmpileup flag). A genome index (with the suffix .jsd by default) is also created there and likewise deleted.

Summary file

A summary file is created in the directory specified by -os. The output of each run is appended, so several instances of the program may be run with the same output file (creating a single joined output). For a full explanation of the output see the documentation, but in a nutshell:

A2GEditingIndex is the signal (i.e. value) of the editing
C2TEditingIndex is the highest noise (in most cases)
(in verbose mode) use only lines where StrandDecidingMethod is "RefSeqThenMMSites" (in any organism with good gene annotations)
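
As a quick way to compare the signal with the noise, the two columns above can be divided per line. The sketch below makes several assumptions that should be checked against the documentation: that the summary is a comma-separated file with a header row, that the sample identifier is in the first column, and that the header names match A2GEditingIndex and C2TEditingIndex exactly. index_ratio is a hypothetical helper.

```shell
# Hypothetical helper (not shipped with RNAEditingIndexer): print the first
# column and the A2G/C2T ratio for every data line of a summary file.
# ASSUMPTIONS: comma-separated values, one header row, sample in column 1.
# Usage: index_ratio <summary file>
index_ratio() {
    awk -F',' '
        NR == 1 {
            # Locate the two columns by their header names.
            for (i = 1; i <= NF; i++) {
                if ($i == "A2GEditingIndex") a2g = i
                if ($i == "C2TEditingIndex") c2t = i
            }
            next
        }
        # Skip lines where the noise is zero to avoid dividing by zero.
        a2g && c2t && $c2t > 0 {
            printf "%s\t%.2f\n", $1, $a2g / $c2t
        }
    ' "$1"
}
```

If either header is not found, the function prints nothing, which is a hint that the assumed layout does not match the file.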

Test Run

To run the test, use the following command:

<InstallPath>/RNAEditingIndex -d <InstallPath>/TestResources/BAMs -f _sampled_with_0.1.Aligned.sortedByCoord.out.bam.AluChr1Only.bam -l <your wanted logs dir> -o <wanted cmpileup output dir> -os <wanted summary dir> --genome hg38 -rb <InstallPath>/TestResources/AnnotationAndRegions/ucscHg38Alu.OnlyChr1.bed.gz --refseq <InstallPath>/TestResources/AnnotationAndRegions/ucscHg38RefSeqCurated.OnlyChr1.bed.gz --snps <InstallPath>/TestResources/AnnotationAndRegions/ucscHg38CommonGenomicSNPs150.OnlyChr1.bed.gz --genes_expression <InstallPath>/TestResources/AnnotationAndRegions/ucscHg38GTExGeneExpression.OnlyChr1.bed.gz --verbose --stranded --paired

The test should finish within about 10 minutes. Reference results are in <InstallPath>/TestResources/CompareTo.
