Git Product home page Git Product logo

alternative-splicing's Introduction

alternative-splicing

This repository contains a number of useful scripts to analyze alternative splicing. Most of these scripts were used to process outputs of MISO. Therefore, these scripts are most useful if you have outputs from MISO to process. No guarantees using these scripts, but I hope you can glean something useful from them.

Brief highlights of this repository, with example workflows:

Installation

Packages and libraries required

Python

  • Python 2.6 or 2.7
  • SciPy
  • NumPy
  • Bio (for UniProt/SwissProt annotations)

R

  • R 3.0.0 or R 3.0.1. May also work with R 2.14 and 2.15.
  • ggplot2
  • extrafont
  • grid
  • RColorBrewer
  • gplots

Computational biology software

Cloning repository

cd dir_to_place_repository
git clone https://github.com/jakeyeung/alternative-splicing.git
## T-test to find differentially spliced events between two groups of samples
  1. Perform t-test using t_test_miso_output.py.
  2. Adjust p-values for multiple-test correction (Benjamini-Hochberg method) using adjust_pvalues.R
  3. Append gene names to output file
## 1. Perform t-test on MISO outputs

Inputs for t_test_miso_output.py

  • --group1_file: textfile containing sample names in group 1
  • --group2_file: textfile containing sample names in group 2
  • --main_dir: directory containing sample names (should match group 1 and group 2). Inside each directory contains MISO raw outputs (MISO events catalogued by chromosomes)
  • --output_directory: directory for output
  • --output_filename: name of output file
  • --min_counts: minimum junction read counts in a sample to be considered into a t-test. This value depends on the depth of your RNA-Seq data. Best practices suggest min_counts of 10 for a depth in the order of 50 million mapped read pairs

Example

cd alternative_splicing_scripts/miso_scripts
python t_test_miso_output.py --group1_file /path/to/group_1_samples.txt \ 
							 --group2_file /path/to/group_2_samples.txt \
							 --main_dir /path/containing/miso/outputs \
							 --output_directory output/path/file.txt \ 
							 --min_counts 10
## 2. Adjust p-values for multiple-test correction
  1. Input file
  2. Output file

Example

cd R
Rscript adjust_pvalues.R output_from_t_test.txt pval_adjusted_output_from_t_test.txt
## 3. Append gene names to output

Positional arguments for append_genenames.py:

  1. Input file
  2. Annotations path (gff3 file containing MISO IDs and gene names). One can download the gff3 annotation file from the MISO website
  3. Output file

Example

cd alternative_splicing_scripts/miso_scripts
python append_gene_names_to_textfile.py pval_adjusted_file annotations.gff3 output
## Generate bed and fasta files from MISO outputs Currently, these scripts only work for MISO outputs for skipped exons/cassette exons.

Scripts used to create bed files:

  • alternative_splicing_scripts/miso_scripts/extract_coordinates_from_miso_summary.py
  • alternative_splicing_scripts/miso_scripts/split_beds_into_inclusion_exclusion.py Scripts and functions used to create fasta files:
  • fastaFromBed
  • alternative_splicing_scripts/fasta_scripts/remove_chr_from_bed_file.py (if human genome fasta file does not contain "chr" prefix in front of chromosome locations, you can remove them using this script.)

Quick and dirty bash script example gluing these scripts together: example_workflows/get_beds_fastas.full_pipeline.sh

## Generate heatmaps from MISO outputs. Works for MISO outputs for any types of alternative splicing. These scripts allow visualization of MISO outputs (e.g. differentially spliced events) in a heatmap. Works for both pairwise comparison as well as groupwise comparison, but reshaping the data into a matrix format differs. Workflows for both types of comparisons are described below.
  1. Reshape MISO output textfile into a matrix suitable for plotting in R.
  2. Read matrix file, plot in R.
## Reshaping MISO output depends on whether MISO output was generated from a [pairwise comparison (Bayes factor)](#reshapebf) or [groupwise comparison (t-test)](#reshapettest) ## 1a. Reshape MISO output to matrix: pairwise comparison (Bayes Factor) Inputs for alternative_splicing_scripts/miso_scripts/prepare_data_for_clustering_misobf.py:

Arguments from option flags:

  • --sample1_name: name of sample 1
  • --sample2_name: name of sample 2

Positional arguments:

  1. MISO filtered output for pairwise comparison using Bayes Factor
  2. Output file
### Plot heatmap using `R/plot_heatmap_psi_values.R`

Example: Rscript R/plot_heatmap_psi_values.R reshaped_miso_output.txt myfigure.eps

  • Input file for R/plot_heatmap_psi_values.R should be the output after reshaping.
## 1b. Reshape MISO output to matrix: groupwise comparison (t-test) Positional arguments:
  1. MISO output file generated from groupwise comparison via t-test
  2. Filename containing sample names for group 1
  3. Filename containing sample names for group 2
  4. Output file
Positional arguments for R/plot_heatmap_psi_values.R:
  1. MISO output in matrix format (reshaped)
  2. Output EPS file
## Run ANCHOR analysis on MISO outputs. Written for analysis of cassette exons with ANCHOR. ANCHOR predicts protein binding regions within disordered segments.
  1. Translate nucleotide sequences (stored as .fasta format) to amino acid sequences.
  2. Run ANCHOR on amino acid sequences.
  3. Plot enrichment of cassette exons compared to constitutive exons
## Index ENSEMBL annotations

Script: database_scripts/index_exon_info_to_pkl.py

Takes as input .gtf file containing gene annotations from ENSEMBL and stores data as a python dictionary (.pkl file)

Positional arguments:

  1. .gtf file of gene annotations from ENSEMBL.
  2. Output for pickled python dictionary.
## Index UniProt annotations

Script: database_scripts/index_unitprot_db.py

Positional arguments:

  1. Output pickle path

Other required arguments:

  • -f: Path for .dat file containing uniprot annotations from UniProt website.
## Translate nucleotide sequences to amino acid sequences Translate nucleotide sequences to amino acid sequences using `database_scripts/create_dna_protein_summary_file.py`. Creates a 'protein summary' file that contains the amino acid sequences used for running ANCHOR. The 'protein summary' file also contains information regarding gene names, reading frames and exon numbers for quality control and sanity check purposes.

Option flags:

  • `-c --constitutive_exons: use this if your nucleotide sequences come from constitutive exons rather than cassette exons. --help for more information.

Positional arguments:

  1. Ensembl dictionary in pkl format (created from alternative_splicing_scripts/database_scripts/index_exon_info_to_pkl.py]
  2. Fasta files (nucleotides)
  3. 1 | 2 | 3. 2 extracts cassette exon. 1 extracts upstream exon. 3 extracts downstream exon.
  4. Output path
## Run ANCHOR on amino acid sequences.

Option flags for motif_scripts/run_anchor_batch.py:

  • -h, --help for more information

Positional arguments:

  1. Protein summary file from database_scripts/create_dna_protein_summary_file.py
  2. Output directory
  3. Summary output file, placed inside output directory.

Example workflow for running ANCHOR on both inclusion and exclusion exons, see: example_workflows/run_anchor.sh

## Compare ANCHOR results between two sets of exons.

Script: alternative_splicing_scripts/motif_scripts/plot_anchor_results_comparisons.py

Option flags:

  • -h, --help for details

Usage: plot_anchor_results_comparisons.py anchor_results1.txt anchor_results2.txt

## Get UniProt annotations from amino acid sequences
  1. Index .gtf file and .dat file from ENSEMBL and UniProt, respectively.
  2. Create protein summary file.
  3. Annotate protein summary file with UniProt annotations.

Script: annotate_genes_with_swissprot.py

Outputs: creates a file of amino acids for each exon (if it found a match to the .gtf file) and adds UniProt annotations. This allows one to understand the potential functions of protein segments encoded in exons.

Positional arguments:

  1. Protein summary file, created from database_scripts/create_dna_protein_summary_file.py.
  2. Output file of proteins with UniProt/SwissProt annotations.

Required flags:

## Run motif discovery and identify enriched motifs associated with motifs of RNA binding proteins

Bash script that glues a number of python scripts together: example_workflows/meme_run_pipeline.full_pipeline.sh

Outputs: MEME results from the 14 inputs (see Positional Argument 3) and TOMTOM results summarizing the 14 inputs comparing discovered motifs with experimentally derived motifs from RNA binding proteins. The results can be found in main_dir/motif_outputs/meme_run1/summary/tomtom.summary by default.

Positional arguments:

  1. Random seed - set to 0 (I found the choice of random seed no longer matters, since we do not randomly take a subset of rows, rather take the entire set of constitutive exons).
  2. Evalue - motif with a minimum E-value before MEME stops searching for additional motifs. Play around with this value (10^-5, 10^-10). Note: I found that just because MEME stops searching for additional motifs because the motif it just found was above minimum E-value, it does not mean that there does not exist any more motifs wtih E-values below the threshold. Therefore, sometimes setting E-value to something lower (e.g. 10^1) and then manually filtering out motifs that match the E-values may be less prone to false negatives.
  3. Main directory containing 3 directories: fasta_files_100bp/fasta, fasta_files_100bp/fasta_shuffled and motif_outputs.
    • fasta_files_100bp/fasta contains 14 .fasta files (i.e. inputs to MEME discovery):
      • exon_1_inclusion.fasta, exon_1_exclusion.fasta
      • exon_2_inclusion.fasta, exon_2_exclusion.fasta
      • exon_3_inclusion.fasta, exon_3_exclusion.fasta
      • intron_1_5p_inclusion.fasta, intron_1_5p_inclusion.fasta
      • intron_1_3p_inclusion.fasta, intron_1_3p_inclusion.fasta
      • intron_2_5p_inclusion.fasta, intron_2_5p_inclusion.fasta
      • intron_3_3p_inclusion.fasta, intron_3_3p_inclusion.fasta
    • fasta_files_100bp/fasta_shuffled contains the same *.fasta files as in fasta_files_100bp/fasta but shuffled using MEME tool fasta-shuffle-letters. Names are identical.
    • motif_outputs: does not need anything inside. Output from MEME will be found here in directory named meme_run_seed_$MYSEED_evalue_$MYEVALUE.
  4. Path to MEME database. Format should conform with MEME.
## Get evolutionary conservation of discovered motifs from MEME General idea:
  1. Calculate GERP scores for discovered motifs using the motif discovery pipeline. To compare GERP scores for a null distribution, use -n, --null_mode in motif_scripts/summarize_meme_results.py
  2. Plot GERP scores from discovered motifs against the null to see if there is a statistically significant enrichment.

Bash script gluing python scripts together: example_workflows/get_gerp_scores.sh

Outputs: Creates a MEME summary file that incorporates GERP conservation scores with discovered motifs. Of interest are:

  • .pkl files with GERP information (e.g. may be named summary.gerp_updated.pkl), which are used for plotting the GERP scores.
  • .summary files with GERP information, basically the .pkl file converted into a text file which can be opened in a spreadsheet or text editor.

Positional arguments:

  1. Main directory, full directory of motif outputs (e.g. if my MEME outputs are in main_dir/motif_outputs/meme_run1) then main directory is main_dir/motif_outputs)
  2. Name of directory for motif outputs (e.g., if MEME outputs are in main_dir/motif_outputs/meme_run1, then set to meme_run1)
  3. Filename containing TOMTOM summary (e.g., by default it would be the file tomtom.summary found in main_dir/motif_outputs/meme_run1/summary
  4. Directory containing GERP base-wise conservation scores (i.e. *.maf.rates for chr1, chr2, chr3... chr22, chrX, chrY). They can be downloaded here(6.3GB!)

To calculate GERP scores for a null set, set the following option flags motif_scripts/summarize_meme_results: *-n, --null_mode calculates a null distribution from discovered motifs.

Script: motif_scripts/plot_anchor_results_comparison.py

Positional arguments:

  1. GERP scores from discovered motifs: .pkl file with GERP information created from example_workflows/get_gerp_scores.sh.
  2. Null GERP scores: .pkl file with GERP information created from example_workflows/get_gerp_scores.sh with -n, --null_mode flag in motif_scripts/summarize_meme_results.py.

Script: alternative_splicing_scripts/motif_scripts

Positional argumemnts:

  1. Pickle file from alternative_splicing_scripts/motif_scripts/summarize_meme_results.py (e.g. may be named summary.gerp_updated.pkl, see here)

Outputs:

Displays a plot for MEME positions.

alternative-splicing's People

Contributors

jakeyeung avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

alternative-splicing's Issues

append_genenames_to_txt

Hi Jake,

I tried to run the append_genenames to txt file python script on my MISO output in order to get the gene symbols in my output. However it seems that it only creates a new column called gsymbol and does not match events to gene symbols.

Not that good in python so I was wondering if you could help me out?

I know optparse is deprecated and now we use argparse but Im not sure thats important for this.

What i do is simply run the script with my MISO output for SE, with the v2 gff3 files (spliced exon) and it created a txtfile with an extra (empty) column called gsymbol.

Any thoughts?

Regards,

Ittai

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.