alternative-splicing

This repository contains a number of useful scripts to analyze alternative splicing. Most of these scripts were used to process outputs of MISO. Therefore, these scripts are most useful if you have outputs from MISO to process. No guarantees using these scripts, but I hope you can glean something useful from them.

Brief highlights of this repository, with example workflows:

Identify differentially spliced events between two groups of samples using t-test.
Create bed files and fasta files from a list of MISO events.
Generate heatmaps of PSI values either in pair-wise comparison or group comparison.
Translate nucleotide sequences of cassette exons (or constitutive exons) to amino acid sequences
Wrapper function to perform ANCHOR on exons to identify exons that may mediate protein-protein interactions.
Wrapper function to perform MEME and TOMTOM analysis near genic regions of cassette exons to identify enriched motifs near alternatively spliced events and associate RNA binding proteins to these motifs.
Perform conservation analysis of enriched motifs by calculating GERP scores of sites contributing to each motif.
Visualize distribution of sites contributing to each enriched motif.

Installation

Packages and libraries required

Python

Python 2.6 or 2.7
SciPy
NumPy
Bio (for UniProt/SwissProt annotations)

R

R 3.0.0 or R 3.0.1. May also work with R 2.14 and 2.15.
ggplot2
extrafont
grid
RColorBrewer
gplots

Computational biology software

Cloning repository

cd dir_to_place_repository
git clone https://github.com/jakeyeung/alternative-splicing.git

## T-test to find differentially spliced events between two groups of samples

Procedure:

Perform t-test using t_test_miso_output.py.
Adjust p-values for multiple-test correction (Benjamini-Hochberg method) using adjust_pvalues.R
Append gene names to output file

## 1. Perform t-test on MISO outputs

Inputs for t_test_miso_output.py

--group1_file: textfile containing sample names in group 1
--group2_file: textfile containing sample names in group 2
--main_dir: directory containing sample names (should match group 1 and group 2). Inside each directory contains MISO raw outputs (MISO events catalogued by chromosomes)
--output_directory: directory for output
--output_filename: name of output file
--min_counts: minimum junction read counts in a sample to be considered into a t-test. This value depends on the depth of your RNA-Seq data. Best practices suggest min_counts of 10 for a depth in the order of 50 million mapped read pairs

Example

cd alternative_splicing_scripts/miso_scripts
python t_test_miso_output.py --group1_file /path/to/group_1_samples.txt \ 
							 --group2_file /path/to/group_2_samples.txt \
							 --main_dir /path/containing/miso/outputs \
							 --output_directory output/path/file.txt \ 
							 --min_counts 10

## 2. Adjust p-values for multiple-test correction

Positional arguments for `adjust_pvals.R`

Input file
Output file

Example

cd R
Rscript adjust_pvalues.R output_from_t_test.txt pval_adjusted_output_from_t_test.txt

## 3. Append gene names to output

Positional arguments for append_genenames.py:

Input file
Annotations path (gff3 file containing MISO IDs and gene names). One can download the gff3 annotation file from the MISO website
Output file

Example

cd alternative_splicing_scripts/miso_scripts
python append_gene_names_to_textfile.py pval_adjusted_file annotations.gff3 output

## Generate bed and fasta files from MISO outputs Currently, these scripts only work for MISO outputs for skipped exons/cassette exons.

Workflow

Scripts used to create bed files:

alternative_splicing_scripts/miso_scripts/extract_coordinates_from_miso_summary.py
alternative_splicing_scripts/miso_scripts/split_beds_into_inclusion_exclusion.py Scripts and functions used to create fasta files:
fastaFromBed
alternative_splicing_scripts/fasta_scripts/remove_chr_from_bed_file.py (if human genome fasta file does not contain "chr" prefix in front of chromosome locations, you can remove them using this script.)

Quick and dirty bash script example gluing these scripts together: example_workflows/get_beds_fastas.full_pipeline.sh

## Generate heatmaps from MISO outputs. Works for MISO outputs for any types of alternative splicing. These scripts allow visualization of MISO outputs (e.g. differentially spliced events) in a heatmap. Works for both pairwise comparison as well as groupwise comparison, but reshaping the data into a matrix format differs. Workflows for both types of comparisons are described below.

Procedure

Reshape MISO output textfile into a matrix suitable for plotting in R.
Read matrix file, plot in R.

## Reshaping MISO output depends on whether MISO output was generated from a [pairwise comparison (Bayes factor)](#reshapebf) or [groupwise comparison (t-test)](#reshapettest) ## 1a. Reshape MISO output to matrix: pairwise comparison (Bayes Factor) Inputs for alternative_splicing_scripts/miso_scripts/prepare_data_for_clustering_misobf.py:

Arguments from option flags:

--sample1_name: name of sample 1
--sample2_name: name of sample 2

Positional arguments:

MISO filtered output for pairwise comparison using Bayes Factor
Output file

### Plot heatmap using `R/plot_heatmap_psi_values.R`

Example: Rscript R/plot_heatmap_psi_values.R reshaped_miso_output.txt myfigure.eps

Input file for R/plot_heatmap_psi_values.R should be the output after reshaping.

## 1b. Reshape MISO output to matrix: groupwise comparison (t-test) Positional arguments:

MISO output file generated from groupwise comparison via t-test
Filename containing sample names for group 1
Filename containing sample names for group 2
Output file

Positional arguments for R/plot_heatmap_psi_values.R:

MISO output in matrix format (reshaped)
Output EPS file

## Run ANCHOR analysis on MISO outputs. Written for analysis of cassette exons with ANCHOR. ANCHOR predicts protein binding regions within disordered segments.

Procedure

Translate nucleotide sequences (stored as .fasta format) to amino acid sequences.
Run ANCHOR on amino acid sequences.
Plot enrichment of cassette exons compared to constitutive exons

## Index ENSEMBL annotations

Script: database_scripts/index_exon_info_to_pkl.py

Takes as input .gtf file containing gene annotations from ENSEMBL and stores data as a python dictionary (.pkl file)

Positional arguments:

.gtf file of gene annotations from ENSEMBL.
Output for pickled python dictionary.

## Index UniProt annotations

Script: database_scripts/index_unitprot_db.py

Positional arguments:

Output pickle path

Other required arguments:

-f: Path for .dat file containing uniprot annotations from UniProt website.

## Translate nucleotide sequences to amino acid sequences Translate nucleotide sequences to amino acid sequences using `database_scripts/create_dna_protein_summary_file.py`. Creates a 'protein summary' file that contains the amino acid sequences used for running ANCHOR. The 'protein summary' file also contains information regarding gene names, reading frames and exon numbers for quality control and sanity check purposes.

Option flags:

`-c --constitutive_exons: use this if your nucleotide sequences come from constitutive exons rather than cassette exons. --help for more information.

Positional arguments:

Ensembl dictionary in pkl format (created from alternative_splicing_scripts/database_scripts/index_exon_info_to_pkl.py]
Fasta files (nucleotides)
1 | 2 | 3. 2 extracts cassette exon. 1 extracts upstream exon. 3 extracts downstream exon.
Output path

## Run ANCHOR on amino acid sequences.

Option flags for motif_scripts/run_anchor_batch.py:

-h, --help for more information

Positional arguments:

Protein summary file from database_scripts/create_dna_protein_summary_file.py
Output directory
Summary output file, placed inside output directory.

Example workflow for running ANCHOR on both inclusion and exclusion exons, see: example_workflows/run_anchor.sh

## Compare ANCHOR results between two sets of exons.

Script: alternative_splicing_scripts/motif_scripts/plot_anchor_results_comparisons.py

Option flags:

-h, --help for details

Usage: plot_anchor_results_comparisons.py anchor_results1.txt anchor_results2.txt

## Get UniProt annotations from amino acid sequences

Procedure

Index .gtf file and .dat file from ENSEMBL and UniProt, respectively.
Create protein summary file.
Annotate protein summary file with UniProt annotations.

Script: annotate_genes_with_swissprot.py

Outputs: creates a file of amino acids for each exon (if it found a match to the .gtf file) and adds UniProt annotations. This allows one to understand the potential functions of protein segments encoded in exons.

Positional arguments:

Protein summary file, created from database_scripts/create_dna_protein_summary_file.py.
Output file of proteins with UniProt/SwissProt annotations.

Required flags:

-f Indexed file of UniProt database (i.e. output from database_scripts/index_unitprot_db.py).

## Run motif discovery and identify enriched motifs associated with motifs of RNA binding proteins

Bash script that glues a number of python scripts together: example_workflows/meme_run_pipeline.full_pipeline.sh

Outputs: MEME results from the 14 inputs (see Positional Argument 3) and TOMTOM results summarizing the 14 inputs comparing discovered motifs with experimentally derived motifs from RNA binding proteins. The results can be found in main_dir/motif_outputs/meme_run1/summary/tomtom.summary by default.

Positional arguments:

Random seed - set to 0 (I found the choice of random seed no longer matters, since we do not randomly take a subset of rows, rather take the entire set of constitutive exons).
Evalue - motif with a minimum E-value before MEME stops searching for additional motifs. Play around with this value (10^-5, 10^-10). Note: I found that just because MEME stops searching for additional motifs because the motif it just found was above minimum E-value, it does not mean that there does not exist any more motifs wtih E-values below the threshold. Therefore, sometimes setting E-value to something lower (e.g. 10^1) and then manually filtering out motifs that match the E-values may be less prone to false negatives.
Main directory containing 3 directories: fasta_files_100bp/fasta, fasta_files_100bp/fasta_shuffled and motif_outputs.
- fasta_files_100bp/fasta contains 14 .fasta files (i.e. inputs to MEME discovery):
  - exon_1_inclusion.fasta, exon_1_exclusion.fasta
  - exon_2_inclusion.fasta, exon_2_exclusion.fasta
  - exon_3_inclusion.fasta, exon_3_exclusion.fasta
  - intron_1_5p_inclusion.fasta, intron_1_5p_inclusion.fasta
  - intron_1_3p_inclusion.fasta, intron_1_3p_inclusion.fasta
  - intron_2_5p_inclusion.fasta, intron_2_5p_inclusion.fasta
  - intron_3_3p_inclusion.fasta, intron_3_3p_inclusion.fasta
- fasta_files_100bp/fasta_shuffled contains the same *.fasta files as in fasta_files_100bp/fasta but shuffled using MEME tool fasta-shuffle-letters. Names are identical.
- motif_outputs: does not need anything inside. Output from MEME will be found here in directory named meme_run_seed_$MYSEED_evalue_$MYEVALUE.
Path to MEME database. Format should conform with MEME.

## Get evolutionary conservation of discovered motifs from MEME General idea:

Calculate GERP scores for discovered motifs using the motif discovery pipeline. To compare GERP scores for a null distribution, use -n, --null_mode in motif_scripts/summarize_meme_results.py
Plot GERP scores from discovered motifs against the null to see if there is a statistically significant enrichment.

1. Calculate GERP scores

Bash script gluing python scripts together: example_workflows/get_gerp_scores.sh

Outputs: Creates a MEME summary file that incorporates GERP conservation scores with discovered motifs. Of interest are:

.pkl files with GERP information (e.g. may be named summary.gerp_updated.pkl), which are used for plotting the GERP scores.
.summary files with GERP information, basically the .pkl file converted into a text file which can be opened in a spreadsheet or text editor.

Positional arguments:

Main directory, full directory of motif outputs (e.g. if my MEME outputs are in main_dir/motif_outputs/meme_run1) then main directory is main_dir/motif_outputs)
Name of directory for motif outputs (e.g., if MEME outputs are in main_dir/motif_outputs/meme_run1, then set to meme_run1)
Filename containing TOMTOM summary (e.g., by default it would be the file tomtom.summary found in main_dir/motif_outputs/meme_run1/summary
Directory containing GERP base-wise conservation scores (i.e. *.maf.rates for chr1, chr2, chr3... chr22, chrX, chrY). They can be downloaded here(6.3GB!)

To calculate GERP scores for a null set, set the following option flags motif_scripts/summarize_meme_results: *-n, --null_mode calculates a null distribution from discovered motifs.

2. Plot GERP scores

Script: motif_scripts/plot_anchor_results_comparison.py

Positional arguments:

GERP scores from discovered motifs: .pkl file with GERP information created from example_workflows/get_gerp_scores.sh.
Null GERP scores: .pkl file with GERP information created from example_workflows/get_gerp_scores.sh with -n, --null_mode flag in motif_scripts/summarize_meme_results.py.

Plot motif distributions

Script: alternative_splicing_scripts/motif_scripts

Positional argumemnts:

Pickle file from alternative_splicing_scripts/motif_scripts/summarize_meme_results.py (e.g. may be named summary.gerp_updated.pkl, see here)

Outputs:

Displays a plot for MEME positions.

jakeyeung / alternative-splicing Goto Github PK