Prediction of RNA-RNA contacts from RIC-seq data. Developed by Sergei Margasyuk ([email protected]) and Dmitri Pervouchine ([email protected]).
This package contains a pipeline for prediction of RNA-RNA contacts from RIC-seq data (Cai et al., 2020).
Clone this repository to your local system, into the place where you want to perform the data analysis.
git clone https://github.com/pervouchine/RIC-contacts.git
cd RIC-contacts
Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.
Install Snakemake using conda:
# install mamba package manager if you don't have it
conda install -n base -c conda-forge mamba
conda create -c bioconda -c conda-forge -n snakemake snakemake
For installation details, see the instructions in the Snakemake documentation.
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using $N cores, or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
See the Snakemake documentation for further details.
To make a test run, type
make download
make test
The script will download a toy dataset (sample sheet, truncated fastq files, genome, and genome annotation confined to the first 100MB of chr1), unpack it, update the config file, and execute the pipeline. The output files in results/test_hg19/test/contacts will then be compared to those provided in the archive.
Download the RIC-seq files for the HeLa cell line from the GEO repository (accession GSE127188). Download the control RNA-seq files from the ENCODE consortium webpage.
The files are as follows:
RNASeq_HeLa_total_rep1:
- fastq/ENCFF000FOM.fastq
- fastq/ENCFF000FOV.fastq
RNASeq_HeLa_total_rep2:
- fastq/ENCFF000FOK.fastq
- fastq/ENCFF000FOY.fastq
RIC-seq_HeLa_rRNA_depleted_rep1:
- fastq/SRR8632820_1.fastq
- fastq/SRR8632820_2.fastq
RIC-seq_HeLa_rRNA_depleted_rep2:
- fastq/SRR8632821_1.fastq
- fastq/SRR8632821_2.fastq
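Before launching the pipeline on the full dataset, it can help to confirm that all eight input files listed above are in place. The following is a minimal sketch, not part of the pipeline: the function name check_fastq is hypothetical, and it assumes the files are stored uncompressed under a single directory (fastq/ by default) with exactly the names shown above.

```shell
# Hypothetical helper (not part of RIC-contacts): report missing input files.
# Assumes uncompressed FASTQ files named as in the list above.
check_fastq() {
    dir="${1:-fastq}"   # directory holding the FASTQ files
    missing=0
    for f in ENCFF000FOM ENCFF000FOV ENCFF000FOK ENCFF000FOY \
             SRR8632820_1 SRR8632820_2 SRR8632821_1 SRR8632821_2; do
        # -s is true only for files that exist and are non-empty
        if [ ! -s "$dir/$f.fastq" ]; then
            echo "missing: $dir/$f.fastq"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every file was found
    [ "$missing" -eq 0 ]
}
```

Calling check_fastq fastq prints one line per missing or empty file and returns a non-zero status if any file is absent, so it can be chained in front of the snakemake command with &&.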
The output of the pipeline consists of the following files:
results/{genome}/{project}/{sample}/contacts is the list of contacts and their respective read counts in TSV format (columns 1-3 and 4-6 are the contacting coordinates, column 7 is the read count).
results/{genome}/{project}/views/global/contacts.bed is the BED12 file with contacts on the same chromosome and length less than the threshold defined in config. This file for the HeLa experiment is available at 10.5281/zenodo.6511343.
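Since the contacts file stores the read count in column 7, a simple post-processing step is to keep only the contacts supported by a minimum number of reads. The sketch below is illustrative only: the function name filter_contacts is hypothetical, and it assumes the file is tab-separated and has no header line, which should be checked against the actual output.

```shell
# Hypothetical helper (not part of RIC-contacts): keep contacts with at least MIN reads.
# Assumes a tab-separated file without a header; column 7 is the read count.
filter_contacts() {
    min="$1"    # minimum read count to keep
    file="$2"   # path to a contacts file
    # "+ 0" forces a numeric comparison in awk
    awk -F'\t' -v min="$min" '$7 + 0 >= min + 0' "$file"
}
```

For example, filter_contacts 5 results/{genome}/{project}/{sample}/contacts would print only the rows whose read count is 5 or more, with the placeholders replaced by the actual genome, project, and sample names.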