Git Product home page Git Product logo

gsea-api's Introduction

GSEA API for Pandas

Build Status codecov MIT License DOI

Pandas API for Gene Set Enrichment Analysis in Python (GSEApy, cudaGSEA, GSEA)

  • aims to provide a unified API for various GSEA implementations; uses pandas DataFrames and a hierarchy of Pythonic classes.
  • file exports (exporting input for GSEA) use low-level numpy functions and are much faster than in pandas
  • aims to allow researchers to easily compare different implementations of GSEA, and to integrate those in projects which require high-performance GSEA (e.g. massive screening for drug-repositioning)
  • provides useful utilities for work with GMT files, or gene sets and pathways in general in Python

Installation

To install the API use:

pip3 install gsea_api

See below for the instructions on installation of specific GSEA implementations.

Example usage

from pandas import read_table
from gsea_api.expression_set import ExpressionSet
from gsea_api.gsea import GSEADesktop
from gsea_api.molecular_signatures_db import GeneSets

reactome_pathways = GeneSets.from_gmt('ReactomePathways.gmt')

gsea = GSEADesktop()

design = ['Disease', 'Disease', 'Disease', 'Control', 'Control', 'Control']
matrix = read_table('expression_data.tsv', index_col='Gene')

result = gsea.run(
    # note: contrast() is not necessary in this simple case
    ExpressionSet(matrix, design).contrast('Disease', 'Control'),
    reactome_pathways,
    metric='Signal2Noise',
    permutations=1000
)

Where expression_data.tsv is in the following format:

Gene	Patient_1	Patient_2	Patient_3	Patient_4	Patient_5	Patient_6
TACC2	0.2	0.1	0.4	0.6	0.7	2.1
TP53	2.3	0.2	2.1	2.0	0.3	0.6

MSigDB integration

Molecular Signatures Database (MSigDB) can be downloaded from the Broad Institute GSEA website. It provides expert-curated gene set collections, as well as curated subset of pathway databases (Reactome, KEGG, Biocarta, Gene Ontology) trimmed to remove redundant, overlapping and and otherwise little-value terms (if needed).

You can download all the pathways collections at once (search for ZIPped MSigDB on the download page). After downloading and un-zipping (e.g., to a local directory named msigdb), you can access the gene sets from MSigDB with:

from gsea_api.molecular_signatures_db import MolecularSignaturesDatabase

msigdb = MolecularSignaturesDatabase('msigdb', version=7.1)
msigdb.gene_sets

msigdb.gene_sets returns a list of dictionaries describing auto-detected pathways:

[
    {'name': 'c1.all', 'id_type': 'symbols'},
    {'name': 'c1.all', 'id_type': 'entrez'},
    {'name': 'c2.cp.reactome', 'id_type': 'symbols'},
    {'name': 'c2.cp.reactome', 'id_type': 'entrez'}
    # etc..
]

Information about the location on disk and version are available in msigdb.path and msigdb.version.

msigdb.load loads the specific collection into a GeneSets object:

> kegg_pathways = msigdb.load('c2.cp.kegg', 'symbols')
> print(kegg_pathways)
<GeneSets 'c2.cp.kegg' with 186 gene sets>

This object can be passed to any of the supported GSEA implementations; please see below for a detailed description of the GeneSets object.

GeneSets objects

GeneSets represents a collection of sets of genes, where each set is represented as GeneSet object.

You can check the number of sets contained within a collection with:

> len(kegg_pathways)
186

The gene sets are accessible with gene_sets (tuple) and gene_sets_by_name (dict) properties:

> kegg_pathways.gene_sets[:2]
(<GeneSet 'KEGG_TIGHT_JUNCTION' with 132 genes>, <GeneSet 'KEGG_RNA_DEGRADATION' with 59 genes>)
> kegg_pathways.gene_sets_by_name
{
    'KEGG_TIGHT_JUNCTION': <GeneSet 'KEGG_TIGHT_JUNCTION' with 132 genes>,
    'KEGG_RNA_DEGRADATION': <GeneSet 'KEGG_RNA_DEGRADATION' with 59 genes>
    # etc.
 }

Subsetting collections

Sometimes only a subset of genes is measured in an experiment. You can remove gene sets which do not contain any of the measured genes from the collection:

> measured_genes = {'APOE', 'CYB5R1', 'FCER1G', 'PVR', 'HK2'}
> measured_subset = kegg_pathways.subset(measured_genes)
> print(measured_subset)
<GeneSets with 12 gene sets>

The skipped gene sets are accessible in measured_subset.empty_gene_sets for inspection.

Trimming collections

> kegg_pathways.trim(min_genes=10, max_genes=20)
<GeneSets with 21 gene sets>

Prettify names

def prettify_kegg_name(gene_set):
    return gene_set.name.replace('KEGG_', '').replace('_', ' ')

kegg_pathways_pretty = kegg_pathways.format_names(prettify_kegg_name)
kegg_pathways_pretty.gene_sets[:2]
# (<GeneSet 'TIGHT JUNCTION' with 132 genes>, <GeneSet 'RNA DEGRADATION' with 59 genes>)

For MSigDB 7.4+:

def pretty_reactome_name(gene_set):
    return gene_set.metadata['DESCRIPTION_BRIEF']

reactome_pathways_pretty = reactome_pathways.format_names(pretty_reactome_name)
reactome_pathways_pretty.gene_sets[:2]
#

Other properties

Other properties and methods offered by GeneSets include:

  • all_genes: return a set of all genes which are covered by the gene sets in the collection
  • name: the name of the collection
  • to_frame() return a pandas DataFrame describing membership of the genes (gene sets = rows, genes = columns), which can be used for UpSet visualisation (e.g. with ComplexUpset)
  • to_gmt(path: str) exports the gene set to a GMT (Gene Matrix Transposed) file

Installing GSEA implementations

Following GSEA implementations are supported:

GSEA from Broad Institute

Login/register on the official GSEA website and download the gsea_3.0.jar file (or a newer version).

Provide the location of the downloaded file to GSEADesktop() using gsea_jar_path argument, e.g.:

gsea = GSEADesktop(gsea_jar_path='downloads/gsea_3.0.jar')

GSEApy

To use gsea.py please install it with:

pip3 install gseapy

Use it with:

from gsea_api.gsea import GSEApy

gsea = GSEApy()

cudaGSEA

Please clone this fork of cudaGSEA and compile the binary version:

git clone https://github.com/krassowski/cudaGSEA
cd cudaGSEA/cudaGSEA/src/
# if on Ubuntu:
# sudo apt install nvidia-cuda-toolkit
# whereis nvcc
export CUDA_HOME=/usr
export R_INC=/usr/share/R/include
export RCPP_INC=/usr/local/lib/R/site-library/Rcpp/include
make cudaGSEA

depending on your GPU and drivers you may see Unsupported gpu architecture 'compute_20' error; simply edit Makefile removing -gencode arch=compute_20,code=compute_20 (see this askUbuntu post)

You can also try to use the original version, which does not implement FDR calculations.

Use it with:

from gsea_api.gsea import cudaGSEA

# CPU implementation can be used with use_cpu=True
gsea = cudaGSEA(fdr='full', use_cpu=False, path='cudaGSEA/cudaGSEA/src/cudaGSEA')

Citation

DOI

Please also cite the authors of the wrapped tools that you use.

References

The initial version of this code was written for a Master thesis project at Imperial College London.

gsea-api's People

Contributors

krassowski avatar zh1peng avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

jiawu zh1peng

gsea-api's Issues

GeneSets design improvements

  • rename to GeneSets (fb8e265)
  • add len() support (fb8e265)
  • GeneSets and GeneSet, should provide a useful representation (__repr__)
  • GeneSets.gene_sets is a set, which is nice and non-redundant; however, from the user perspective having it as a dict might be nicer. Maybe add GeneSets.gene_sets_by_name?
  • GeneSets.all_identifiers is not necessarily intuitive - should it be renamed to GeneSets.all_genes?

Using Msigdb gene sets on regular gseapy

Hello,

Thank you so much for sharing your code and make retrieving of datasets from MSigDB easier in Python!

I am currently trying to run gseapy on some datasets. While I am able to run successfully enrichR, when I try to provide a GMT file downloaded from MSigDB it is not been recognized. That is how I came across your code, and has been very useful.

I was wondering if you tried using your dictionaries on regular gseapy. If you have, I would request a small vignette for that.

In my case I am trying something like:
res = gseapy.prerank(rnk=my_list_genes_ranked gene_sets=geneset_msigdb ,permutation_num=100)

Where geneset_msigdb = data obtained with your code (measured_subset)

And I get an error saying that the gene_sets could not be parsed.

Thank you for your work!

A method to compare gene sets?

  • having a venn diagram, or just overlap vs differences summary would make a common task of comparing two gene sets easier.
  • many methods do not work well with redundant pathways; it would be nice to have a way to get a list of pathways with very high overlap and to filter those out using some sort of a criterion (e.g. keeping the smaller/larger pathways)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.