bioconductor / gseabase

Gene set enrichment data structures and methods

Home Page: https://bioconductor.org/packages/GSEABase

R 100.00%

core-package bioconductor-package

gseabase's Introduction

BiocManager

Overview

The BiocManager package, as the modern successor package to BiocInstaller, allows users to install and manage packages from the Bioconductor project. Bioconductor focuses on the statistical analysis and comprehension of high-throughput genomic data.

Current Bioconductor packages are available in a ‘release’ version intended for everyday use, and a ‘devel’ version where new features are continually introduced. A new release version is created every six months. Using the BiocManager package helps users install packages from the appropriate release.

available() shows all packages associated with a search pattern
install() installs and/or updates packages from either CRAN or Bioconductor
repositories() shows all package repository URL endpoints
valid() checks and returns packages that are out-of-date or too new
version() returns the current Bioconductor version number

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

Usage

Checking Bioconductor version currently installed

BiocManager::version()
#> [1] '3.15'

Installing Bioconductor packages

BiocManager::install(c("GenomicRanges", "SummarizedExperiment"))

Verifying a valid Bioconductor installation

BiocManager::valid()
#> [1] TRUE

More information

Please see the package vignette for more detailed information such as changing Bioconductor version, offline use, and other advanced usage.

Getting help

To report apparent bugs, create a minimal and reproducible example on GitHub.

gseabase's People

waldronlab vjcitn llrs kevinrue chungtseng mariodejung rogerzou0108 jorainer jwokaty villafup

gseabase's Issues

mapIdentifiers() does not work with EnsDb.* packages

A thread on the Bioconductor support site revealed that the function GSEABase::mapIdentifiers() does not play well with EnsDb.* annotation packages. Here is a minimal reproducible example (it requires the EnsDb.Hsapiens.v75 package to be installed):

library(GSEABase)
library(EnsDb.Hsapiens.v75)

# Download the KEGG gene sets (gene symbols) from MSigDB and read them in
download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.0/c2.cp.kegg.v7.0.symbols.gmt", "c2.cp.kegg.v7.0.symbols.gmt")
keggsym <- getGmt("c2.cp.kegg.v7.0.symbols.gmt", geneIdType=SymbolIdentifier())
mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
Error in eval(parse(text = pkg)) : 
  object 'EnsDb.Hsapiens.v75.db' not found
17: eval(parse(text = pkg))
16: eval(parse(text = pkg))
15: eval(parse(text = pkg))
14: getAnnMap(toupper(geneIdType(from)), annotation(from))
13: .mapIdentifiers_selectMaps(from, to)
12: .mapIdentifiers_map(ids, type[[1]], type[[2]], verbose)
11: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
10: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
9: eval(call, parent.frame())
8: eval(call, parent.frame())
7: callGeneric(what, to, from = geneIdType(what), ..., verbose = verbose)
6: FUN(X[[i]], ...)
5: FUN(X[[i]], ...)
4: lapply(what, mapIdentifiers, to, ..., verbose = verbose)
3: GeneSetCollection(lapply(what, mapIdentifiers, to, ..., verbose = verbose))
2: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
1: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))

You can get around this error with the following hack, but it leads to another error:

EnsDb.Hsapiens.v75.db <- EnsDb.Hsapiens.v75
mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
Error in .select(x = x, keys = keys, columns = columns, keytype = keytype,  : 
  Argument keytype is mandatory if keys is a character vector!
21: stop("Argument keytype is mandatory if keys is a", " character vector!")
20: .select(x = x, keys = keys, columns = columns, keytype = keytype, 
        ...)
19: select(x, keys = keys(x), columns = col)
18: select(x, keys = keys(x), columns = col)
17: withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning"))
16: suppressWarnings(tab <- select(x, keys = keys(x), columns = col))
15: AnnotationDbi:::makeFlatBimapUsingSelect(db, col = map)
14: getAnnMap(toupper(geneIdType(from)), annotation(from))
13: .mapIdentifiers_selectMaps(from, to)
12: .mapIdentifiers_map(ids, type[[1]], type[[2]], verbose)
11: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
10: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
9: eval(call, parent.frame())
8: eval(call, parent.frame())
7: callGeneric(what, to, from = geneIdType(what), ..., verbose = verbose)
6: FUN(X[[i]], ...)
5: FUN(X[[i]], ...)
4: lapply(what, mapIdentifiers, to, ..., verbose = verbose)
3: GeneSetCollection(lapply(what, mapIdentifiers, to, ..., verbose = verbose))
2: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
1: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))

This error was previously reported in a Biostars thread, in the context of using the function annotate::getSYMBOL() with an EnsDb.* package. In both cases, the problem seems to be associated with the function annotate::getAnnMap(), but maybe there is a solution within GSEABase.
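Until mapIdentifiers() handles EnsDb.* packages, one possible workaround is to do the identifier mapping outside GSEABase and rebuild the sets afterwards. The following is only a base R sketch under assumptions: map_sets() is a hypothetical helper, not part of GSEABase, and both inputs are toy data. In practice the list of sets would come from geneIds(keggsym), and a lookup table of this shape could presumably be obtained by querying the EnsDb object with select(), using keytype "SYMBOL" and column "GENEID".

```r
# Hypothetical helper (not part of GSEABase): translate each gene set through
# a two-column lookup table, dropping identifiers that have no match.
map_sets <- function(sets, lookup, from = "SYMBOL", to = "GENEID") {
  lapply(sets, function(ids) {
    hits <- lookup[[to]][lookup[[from]] %in% ids]
    unique(hits[!is.na(hits)])
  })
}

# Toy stand-ins: 'sets' mimics geneIds(keggsym); 'lookup' mimics a
# SYMBOL-to-GENEID table (assumed shape, with possible duplicate rows).
sets <- list(SET_A = c("TP53", "BRCA1"), SET_B = c("BRCA1", "NOTAGENE"))
lookup <- data.frame(
  SYMBOL = c("TP53", "BRCA1", "BRCA1"),
  GENEID = c("ENSG00000141510", "ENSG00000012048", "ENSG00000012048"),
  stringsAsFactors = FALSE
)

mapped <- map_sets(sets, lookup)
```

Each element of the result could then be wrapped back into a GeneSet with an Ensembl identifier type, and the whole list into a GeneSetCollection. Symbols without a match (NOTAGENE above) are silently dropped, which may or may not be the desired behavior.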

Faster `incidence()`

Coincidentally, I had written a faster version of incidence as part of another function that makes use of matrix multiplication. The key is to only define the positions of nonzero values, since these are less frequent, though it does add an extra Matrix import for sparse matrices. This could be rewritten to use regular matrices by starting with a matrix of zeros and using x.idx to replace specific values with 1L; however, I have not tried this, and there may be better practices for dealing with matrices.

library(Matrix)

# y is a "GeneSet" or "GeneSetCollection" object
fast_incidence <- function(y) {
  y <- geneIds(y) # named list of gene sets
  x <- stack(y, drop = TRUE)[, 2L:1L] # columns "ind" (set identifier) and "values" (genes)

  # Convert identifiers to indices (positions of values in sparseMatrix)
  mat.dimnames <- lapply(x, unique)
  x.idx <- mapply(match, x, mat.dimnames)

  mat <- sparseMatrix(i = x.idx[, 1L], j = x.idx[, 2L], x = 1L, dimnames = mat.dimnames)

  return(mat)
}

By flipping the columns of the stack results, we end up with a sparseMatrix with genes as columns and set identifiers as rows. When I was testing this with sets from msigdb (v1.10.0), fast_incidence was ~25x faster (only took about 2 seconds compared to ~50). It also only used 30.2 MB compared to the 2 GB allocated to the incidence results.
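The dense alternative mentioned above (start from a matrix of zeros and set the indexed positions to 1L) might look roughly like the following. This is an untested-at-scale sketch, not a proposed replacement for incidence(): dense_incidence() is a hypothetical name, and the input is assumed to be a named list of character vectors such as the one returned by geneIds() on a GeneSetCollection.

```r
# Dense variant of the same idea, using base R only.
# 'sets' is a named list of character vectors (assumption: unique set names).
dense_incidence <- function(sets) {
  genes <- unique(unlist(sets, use.names = FALSE))

  # Start from an all-zero integer matrix: sets as rows, genes as columns
  mat <- matrix(0L, nrow = length(sets), ncol = length(genes),
                dimnames = list(names(sets), genes))

  # Two-column (row, column) index matrix holding the nonzero positions
  idx <- cbind(rep(seq_along(sets), lengths(sets)),
               match(unlist(sets, use.names = FALSE), genes))
  mat[idx] <- 1L
  mat
}

toy <- list(set1 = c("a", "b"), set2 = c("b", "c", "d"))
inc <- dense_incidence(toy)
```

This avoids the Matrix import but allocates every cell, so for large collections it would likely give up most of the memory savings reported for the sparse version.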

Missing dependency: RUnit

RUnit is missing from the Suggests field of the DESCRIPTION file. See https://bioconductor.org/checkResults/3.17/testing-LATEST/GSEABase/nebbiolo2-checksrc.html.

goSlim() counting differs from intersect()

When using the goSlim() function, the results will include counts for GO terms in a collection that also match "parent" terms.

The intersect() function does not include GO terms in a collection which also match "parent" terms.

Here's what I've done to identify this issue.

If you run the example code for the goSlim() function, a data frame is returned.

# Save goSlim as data frame
slimdf <- goSlim(myCollection, slim, "MF")

Then, get a list of GOslim IDs and their corresponding GO IDs:

# Create list of GOslims and corresponding GO IDs
gomap.list <- as.list(GOMFOFFSPRING[rownames(slimdf)])

The next step is where things don't seem to work the way I expected (based on how the goSlim() function works):

# Identify GO IDs in myCollection which are also found in the GO slim mappings
mapped.list <- lapply(gomap.list, intersect, ids(myCollection))

# View list structure
str(mapped.list)

Output (truncated for brevity):

List of 24
 $ GO:0000166: chr(0) 
 $ GO:0003674: chr [1:5] "GO:0003677" "GO:0003841" "GO:0004345" "GO:0008265" ...
 $ GO:0003676: chr "GO:0003677"
 $ GO:0003677: chr(0)

The issue can be seen in the output above.

GO:0003677 (which is in the myIDs vector in the example code) is mapped (i.e. intersects) to the following GOslim IDs:

GO:0003674
GO:0003676

However, GO:0003677 does not get identified as an intersection with itself:

$ GO:0003677: chr(0)

I am expecting the output to look like this:

List of 24
 $ GO:0000166: chr(0) 
 $ GO:0003674: chr [1:5] "GO:0003677" "GO:0003841" "GO:0004345" "GO:0008265" ...
 $ GO:0003676: chr "GO:0003677"
 $ GO:0003677: chr "GO:0003677"

Is this the expected behavior?

It seems like a small inconsistency that the intersect() function fails to make this match, but the goSlim() function catches it.
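One possible explanation, offered here as an assumption rather than a confirmed reading of the GSEABase source, is that the GO.db offspring maps list only the descendants of a term and not the term itself. In that case intersect() can never return a slim term for its own entry, while goSlim() presumably counts the slim term along with its offspring. Under that assumption, adding each slim ID to its own offspring vector before intersecting reproduces the expected output. The sketch uses toy stand-ins for gomap.list and ids(myCollection):

```r
# Toy stand-in for gomap.list: each slim term maps to its descendants only
gomap.toy <- list(
  "GO:0003674" = c("GO:0003676", "GO:0003677", "GO:0003841"),
  "GO:0003676" = c("GO:0003677"),
  "GO:0003677" = NA_character_  # a term without offspring (assumed to be NA)
)

# Toy stand-in for ids(myCollection)
my.ids <- c("GO:0003677", "GO:0003841")

# Add each slim term to its own offspring vector, then intersect as before
gomap.self <- Map(function(term, offspring) c(term, offspring),
                  names(gomap.toy), gomap.toy)
mapped.self <- lapply(gomap.self, intersect, my.ids)
str(mapped.self)
```

With this change, the entry for GO:0003677 contains GO:0003677 itself, matching the expected output shown above.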

Vignette FIXMEs

I recently noticed that the vignette has the date hard-coded (line 41), and indeed the last commit to the vignette seems to be from Oct 19, 2013.

It also has some warnings and messages after the title (see the preliminaries under line 51):

> ## FIXME: <adjMat> adjacency matrix -- color w. +/- 1
> ## FIXME: limma topTable --> GeneColorSet
> ## w. verbose=TRUE
> library(GSEABase)
> library(hgu95av2.db)
> library(GO.db)

In total, 6 FIXMEs are displayed in the vignette.

Long GOslim descriptions are truncated

Here's an example output from the goSlim method:

           Count     Percent                                   Term
GO:0000003     2  0.16934801                           reproduction
GO:0000278     0  0.00000000                     mitotic cell cycle
GO:0000902     0  0.00000000                     cell morphogenesis
GO:0002376    29  2.45554615                  immune system process
GO:0003013     1  0.08467401             circulatory system process
GO:0005975     0  0.00000000         carbohydrate metabolic process
GO:0006091     0  0.00000000 generation of precursor metabolites...
GO:0006259     7  0.59271804                  DNA metabolic process
GO:0006397     1  0.08467401                        mRNA processing
GO:0006399     0  0.00000000                 tRNA metabolic process
GO:0006412     0  0.00000000                            translation
GO:0006457     0  0.00000000                        protein folding
GO:0006464     2  0.16934801 cellular protein modification proce...
GO:0006520     8  0.67739204 cellular amino acid metabolic proce...
GO:0006605     1  0.08467401                      protein targeting
GO:0006629     2  0.16934801                lipid metabolic process
GO:0006790     0  0.00000000      sulfur compound metabolic process
GO:0006810     5  0.42337003                              transport
GO:0006913     0  0.00000000            nucleocytoplasmic transport
GO:0006914     9  0.76206605                              autophagy
GO:0006950    31  2.62489416                     response to stress
GO:0007005     2  0.16934801             mitochondrion organization
GO:0007009     0  0.00000000           plasma membrane organization
GO:0007010     2  0.16934801              cytoskeleton organization
GO:0007034     0  0.00000000                     vacuolar transport
GO:0007049     2  0.16934801                             cell cycle
GO:0007059     0  0.00000000                 chromosome segregation
GO:0007155    37  3.13293819                          cell adhesion
GO:0007165    20  1.69348010                    signal transduction
GO:0007267     4  0.33869602                    cell-cell signaling
GO:0007568     1  0.08467401                                  aging
GO:0008150   634 53.68331922                     biological_process
GO:0008219    26  2.20152413                             cell death
GO:0008283     3  0.25402202          cell population proliferation
GO:0009056    17  1.43945809                      catabolic process
GO:0009058     4  0.33869602                   biosynthetic process
GO:0009790    15  1.27011008                     embryo development
GO:0015031     1  0.08467401                      protein transport
GO:0015979     0  0.00000000                         photosynthesis
GO:0016192     3  0.25402202             vesicle-mediated transport
GO:0019748     0  0.00000000            secondary metabolic process
GO:0021700     0  0.00000000               developmental maturation
GO:0022607     7  0.59271804            cellular component assembly
GO:0022618     1  0.08467401     ribonucleoprotein complex assembly
GO:0030154    85  7.19729043                   cell differentiation
GO:0030198     5  0.42337003      extracellular matrix organization
GO:0030705     0  0.00000000 cytoskeleton-dependent intracellula...
GO:0032196     0  0.00000000                          transposition
GO:0034330     6  0.50804403             cell junction organization
GO:0034641     9  0.76206605 cellular nitrogen compound metaboli...
GO:0034655     1  0.08467401 nucleobase-containing compound cata...
GO:0040007     1  0.08467401                                 growth
GO:0040011     9  0.76206605                             locomotion
GO:0042254     0  0.00000000                    ribosome biogenesis
GO:0042592     8  0.67739204                    homeostatic process
GO:0043473     0  0.00000000                           pigmentation
GO:0044281     8  0.67739204       small molecule metabolic process
GO:0044403     0  0.00000000                      symbiotic process
GO:0048646     4  0.33869602 anatomical structure formation invo...
GO:0048856   148 12.53175275       anatomical structure development
GO:0048870     9  0.76206605                          cell motility
GO:0050877     3  0.25402202                 nervous system process
GO:0051276     7  0.59271804                chromosome organization
GO:0051301     0  0.00000000                          cell division
GO:0051604     0  0.00000000                     protein maturation
GO:0055085     0  0.00000000                transmembrane transport
GO:0061024     0  0.00000000                  membrane organization
GO:0065003     1  0.08467401    protein-containing complex assembly
GO:0071554     0  0.00000000 cell wall organization or biogenesi...
GO:0071941     0  0.00000000       nitrogen cycle metabolic process
GO:0140014     0  0.00000000               mitotic nuclear division

Some of the GOslims are truncated (e.g. generation of precursor metabolites...). I'd like to have the full description (e.g. generation of precursor metabolites and energy).

Is there a way to prevent this from happening? I've looked for some arguments to set the number of characters, but can't find any.
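If goSlim() offers no argument for this, one workaround is to overwrite the Term column afterwards, using the row names (GO IDs) as keys into a table of full term names. The sketch below uses base R and toy data only; restore_terms() is a hypothetical helper, and in practice the full names would have to come from an annotation source such as GO.db (for instance via Term() on GOTERM, assuming that package is installed and loaded).

```r
# Hypothetical helper: replace truncated terms with full names looked up by
# GO ID. 'full_terms' is a named character vector whose names are GO IDs.
restore_terms <- function(slimdf, full_terms) {
  full <- unname(full_terms[rownames(slimdf)])
  # Keep the existing (possibly truncated) term where no full name is known
  slimdf$Term <- ifelse(is.na(full), as.character(slimdf$Term), full)
  slimdf
}

# Toy stand-in for goSlim() output
slimdf <- data.frame(
  Count = c(0L, 7L),
  Term = c("generation of precursor metabolites...", "DNA metabolic process"),
  row.names = c("GO:0006091", "GO:0006259"),
  stringsAsFactors = FALSE
)
full_terms <- c("GO:0006091" = "generation of precursor metabolites and energy")

fixed <- restore_terms(slimdf, full_terms)
```

Rows without an entry in the lookup vector keep their original term, so the helper can be applied even when only some names are available.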

increasing metadata flexibility/availability?

The vignette at https://github.com/kevinrue/Hancock/blob/master/vignettes/concepts.Rmd by @kevinrue shows how GeneSet and GeneSetCollection can play roles in the definition of cell type signatures.

It might be desirable for the longDescription slot of the GeneSet class to have class "ANY", so that an object or list could be used.

Likewise it would be good to have a metadata component for GeneSetCollection.