Git Product home page Git Product logo

sailfish-cir's Introduction

Introduction

Sailfish-cir is a computational tool to estimate the relative abundance of circular RNA transcripts from high-throughput RNA-seq data.

It accepts output of circRNA identification tools (CIRI, KNIFE, circRNA_finder) or a BED-format file to specify the reference set of circular RNA transcripts in RNA-seq data. Then, it transforms all circular transcripts to pseudo-linear transcripts. Finally, it estimates the expression of both linear and circular transcripts using Sailfish framework.

Prerequisites

The following three tools should be installed before running Sailfish-cir.

1. Sailfish
   Sailfish (http://www.cs.cmu.edu/~ckingsf/software/sailfish/) is used to quantify RNA expression. Sailfish 0.7.0 or above is recommended.

2. gffread
   gffread is part of Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/). This GFF/GTF parsing utility is used to extract sequence.

3. gffutils 
   gffutils (https://pypi.python.org/pypi/gffutils) is used to build a database for reference set of all RNA transcripts.

We assume these pre-installed tools are running under a Unix-like environment, and paths of gffread and sailfish binary executive should be added to your PATH variable.

Other than above third party tools, several data files are required for Sailfish-cir as well.

1. Raw sequencing reads file in fasta/fastq format;
2. A CIRI output file or a BED file;
3. Gene annotation file in .gtf format, and a single fasta file or multi-fasta files containing RNA sequences. 

Usage

python path/to/your/sailfish_cir.py -g path/to/your/genomic/sequence/foo.fa -a path/to/your/annotation/bar.gtf -1 path/to/your/reads/mate1.fastq -2 /path/to/your/reads/mate2.fastq -o /path/to/where/you/want/your/result -c /path/to/your/CIRI/output/file
-g  path to genomic sequence fasta file
-a  path to gene annotation file, ie, .gtf or .gff files
-r  path to single-end raw sequencing reads file.
-1  path to raw pair-end reads, mate 1
-2  path to raw pair-end reads, mate 2

-c  path to CIRI output file to specify circular RNA
--bed  path to bed file which contains circular RNA.
--circRNA_finder path to circRNA_finder output file.
--KNIFE_report_folder path to KNIFE output folder, make sure it has the subdirectory "circReads"

-o  output folder that contains the index built by sailfish and quantification results
-k  minimum match size used during sailfish quantification,   default is 21
--libtype   format string describing the library type of your reads. default is "IU", [read more on libtype of Sailfish](http://sailfish.readthedocs.org/en/master/library_type.html)
--mll mean library length, this option is to fix up the effective length.
-h/--help	print this help message

Output file

Sailfish-cir expression estimats can be found at a subdirectory named quant_circular under the path set by -o parameter.

Notes

  1. .gtf file

    In order to generate reference sequences of circular RNA transcripts, a database of your genomic annotation is needed. It is recommended to create the database file manually using the same filename with extension ".db" in the same directory. Since gffutils don't guess GTF file format, and GTF format was changed after GRCH37.75. It will be time-consuming when building the database using the same option for earlier gtf version.

  2. output file from circRNA identification tools

    This script accepts BED-format file or output files of circRNA identification tools (CIRI, circRNA_finder, KNIFE are supported) to specify the circular RNA transcripts.

    We also provide a small utility to convert those outputs to a BED-format file. usage as follows:

    python path/to/your/sailfish_cir.py convert_CIRI your/CIRI/output/file
    python path/to/your/sailfish_cir.py convert_KNIFE your/KNIFE/output/folder
    

    For CIRI, this will create a .bed file in the same folder. the 'name' (4th) column follows this "circular_transcript@host_gene" pattern. For KNIFE, this will create a "summarized_knife_junction.bed" under your KNIFE output folder. circRNA_finder output is actually BED-format.

    If you want to quantify circRNA from multiple circRNA detection results, convert those output files into BED format and merge them into a single .bed file, then use "--bed" option.

ChangeLog

Release v0.11

  • BED-file format support: use '--bed' option to input a bed file as the reference set of circular RNA transcripts
  • Convert utility: convert CIRI outputs into a BED-file format
  • Issue of effective length correction: fix the effective length correction when estimating circular RNA expression values in real rRNA-depleted RNA-seq data. Please use '--mll' to specify the estimated effective fragment length of RNA-seq data.
  • Add some filters to unclutter the GTF entries.

Release v0.10

  • the first version. Snapshot with documents is here.

Example files

Example files are available in MEGA

  1. Human gene Annotation file: hg19 gtf file OR binary database file by gffutils

  2. Genome sequence file: hg19 fasta file

  3. Sailfish (ver 0.9.0): Sailfish Linux binary

  4. RNA-seq simulation scripts simulation

  5. Simulated RNA-seq datasets 11 simulated datasets (uniform expresion distribution)

    one simulated dataset (empirical expression distribution)

    uniform sequencing error with error rate 0.01, script and other files

    uniform sequencing error with error rate 0.02, script and other files

    empirical error model: illumina 4, script and other files

    empirical error model: illumina 5, script and other files

    simulation size two fold, script and other files

    simulation size 4 fold, script and other files

    simulation using whole genome, script and other files

Reference

Musheng Li, Xueying Xie, Jing Zhou, Mengying Sheng, Xiaofeng Yin, Eun-A Ko, Tong Zhou* and Wanjun Gu*. Quantifying circular RNA expression from RNA-seq data using model-based framework. Bioinformatics, 2017, in press.

Contact

Musheng Li ([email protected])

sailfish-cir's People

Contributors

zerodel avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.