gseabase's Introduction

BiocManager

Overview

The BiocManager package, as the modern successor package to BiocInstaller, allows users to install and manage packages from the Bioconductor project. Bioconductor focuses on the statistical analysis and comprehension of high-throughput genomic data.

Current Bioconductor packages are available in a ‘release’ version intended for every-day use, and a ‘devel’ version where new features are continually introduced. A new release version is created every six months. Using the BiocManager package helps users accurately install packages from the appropriate release.

  • available() shows all packages associated with a search pattern
  • install() installs and/or updates packages from either CRAN or Bioconductor
  • repositories() shows all package repository URL endpoints
  • valid() checks and returns packages that are out-of-date or too new
  • version() returns the current Bioconductor version number

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

Usage

Checking Bioconductor version currently installed

BiocManager::version()
#> [1] '3.15'

Installing Bioconductor packages

BiocManager::install(c("GenomicRanges", "SummarizedExperiment"))

Verifying a valid Bioconductor installation

BiocManager::valid()
#> [1] TRUE

More information

Please see the package vignette for more detailed information such as changing Bioconductor version, offline use, and other advanced usage.

Getting help

To report apparent bugs, create a minimal and reproducible example on GitHub.

gseabase's People

Contributors

dtenenba, dvantwisk, hpages, jwokaty, lshep, mtmorgan, nturaga, sonali-bioc, villafup, vobencha

gseabase's Issues

mapIdentifiers() does not work with EnsDb.* packages

In this thread at the Bioconductor support site it emerged that the function GSEABase::mapIdentifiers() does not play well with EnsDb.* annotation packages. Here is a minimal reproducible example (it requires the EnsDb.Hsapiens.v75 package to be installed):

library(GSEABase)
library(EnsDb.Hsapiens.v75)
download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.0/c2.cp.kegg.v7.0.symbols.gmt",  "c2.cp.kegg.v7.0.symbols.gmt")
keggsym <- getGmt("c2.cp.kegg.v7.0.symbols.gmt",  geneIdType=SymbolIdentifier())
mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
Error in eval(parse(text = pkg)) : 
  object 'EnsDb.Hsapiens.v75.db' not found
17: eval(parse(text = pkg))
16: eval(parse(text = pkg))
15: eval(parse(text = pkg))
14: getAnnMap(toupper(geneIdType(from)), annotation(from))
13: .mapIdentifiers_selectMaps(from, to)
12: .mapIdentifiers_map(ids, type[[1]], type[[2]], verbose)
11: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
10: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
9: eval(call, parent.frame())
8: eval(call, parent.frame())
7: callGeneric(what, to, from = geneIdType(what), ..., verbose = verbose)
6: FUN(X[[i]], ...)
5: FUN(X[[i]], ...)
4: lapply(what, mapIdentifiers, to, ..., verbose = verbose)
3: GeneSetCollection(lapply(what, mapIdentifiers, to, ..., verbose = verbose))
2: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
1: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))

You can get around this error with the following hack, but it leads to another error:

EnsDb.Hsapiens.v75.db <- EnsDb.Hsapiens.v75
mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
Error in .select(x = x, keys = keys, columns = columns, keytype = keytype,  : 
  Argument keytype is mandatory if keys is a character vector!
21: stop("Argument keytype is mandatory if keys is a", " character vector!")
20: .select(x = x, keys = keys, columns = columns, keytype = keytype, 
        ...)
19: select(x, keys = keys(x), columns = col)
18: select(x, keys = keys(x), columns = col)
17: withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning"))
16: suppressWarnings(tab <- select(x, keys = keys(x), columns = col))
15: AnnotationDbi:::makeFlatBimapUsingSelect(db, col = map)
14: getAnnMap(toupper(geneIdType(from)), annotation(from))
13: .mapIdentifiers_selectMaps(from, to)
12: .mapIdentifiers_map(ids, type[[1]], type[[2]], verbose)
11: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
10: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
9: eval(call, parent.frame())
8: eval(call, parent.frame())
7: callGeneric(what, to, from = geneIdType(what), ..., verbose = verbose)
6: FUN(X[[i]], ...)
5: FUN(X[[i]], ...)
4: lapply(what, mapIdentifiers, to, ..., verbose = verbose)
3: GeneSetCollection(lapply(what, mapIdentifiers, to, ..., verbose = verbose))
2: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
1: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))

This error has already been reported on Biostars in this thread, in the context of using the function annotate::getSYMBOL() with an EnsDb.* package. In both cases, the problem seems to be associated with the function annotate::getAnnMap(), but maybe there is a solution within GSEABase.
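Until this is resolved, one possible workaround is to do the mapping outside mapIdentifiers(). The following is only a sketch in base R with a toy lookup vector: map_gene_sets() and symbol_to_ensembl are hypothetical names, and the IDs are placeholders. With the annotation package loaded, the lookup could presumably be built with mapIds() on the EnsDb object (keytype "SYMBOL", column "GENEID"); that is an assumption, not something confirmed in either thread.

```r
# Toy lookup table (symbol -> Ensembl-style ID). The IDs are placeholders.
# Assumption: a real table could be built with something like
#   mapIds(EnsDb.Hsapiens.v75, keys = symbols,
#          column = "GENEID", keytype = "SYMBOL")
symbol_to_ensembl <- c(TP53 = "ENSG_A", BRCA1 = "ENSG_B", EGFR = "ENSG_C")

# Gene sets as a named list of symbols, i.e. the shape returned by
# geneIds() on a GeneSetCollection.
sets <- list(
  set1 = c("TP53", "BRCA1", "UNKNOWN1"),
  set2 = c("EGFR", "TP53")
)

# Map every set through the lookup, dropping symbols without a match.
map_gene_sets <- function(sets, lookup) {
  lapply(sets, function(symbols) {
    ids <- unname(lookup[symbols])  # NA for symbols missing from the lookup
    unique(ids[!is.na(ids)])
  })
}

mapped <- map_gene_sets(sets, symbol_to_ensembl)
```

Symbols with no match are silently dropped here; whether that is acceptable depends on the analysis, so it may be worth counting them first.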

Faster `incidence()`

Coincidentally, I had written a faster version of incidence as part of another function that makes use of matrix multiplication. The key is to only define the positions of nonzero values, since these are less frequent, though it does add an extra Matrix import for sparse matrices. This could be rewritten to use regular matrices by starting with a matrix of zeros and using x.idx to replace specific values with 1L; however, I have not tried this, and there may be better practices for dealing with matrices.

library(Matrix)

# y is a "GeneSet" or "GeneSetCollection" object
fast_incidence <- function(y) {
  y <- geneIds(y) # named list of gene sets
  x <- stack(y, drop = TRUE)[, 2L:1L] # columns "ind" (set identifier) and "values" (genes)

  # Convert identifiers to indices (positions of values in sparseMatrix)
  mat.dimnames <- lapply(x, unique)
  x.idx <- mapply(match, x, mat.dimnames)

  mat <- sparseMatrix(i = x.idx[, 1L], j = x.idx[, 2L], x = 1L, dimnames = mat.dimnames)

  return(mat)
}

By flipping the columns of the stack results, we end up with a sparseMatrix with genes as columns and set identifiers as rows. When I was testing this with sets from msigdb (v1.10.0), fast_incidence was ~25x faster (only took about 2 seconds compared to ~50). It also only used 30.2 MB compared to the 2 GB allocated to the incidence results.
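For reference, the regular-matrix variant mentioned above (start from a matrix of zeros and set the indexed positions to 1L) might look roughly like this. It is an untested-at-scale sketch, not a replacement for fast_incidence: dense_incidence is a hypothetical name, it takes the named list returned by geneIds() rather than a GeneSetCollection, and it gives up the memory savings of the sparse version.

```r
# sets: named list of character vectors (gene identifiers per set)
dense_incidence <- function(sets) {
  long <- stack(sets)  # data frame with columns "values" (gene) and "ind" (set)
  set_ids  <- as.character(long$ind)
  gene_ids <- as.character(long$values)

  set_names  <- unique(set_ids)
  gene_names <- unique(gene_ids)

  # Two-column index matrix: row position (set), column position (gene)
  idx <- cbind(match(set_ids, set_names), match(gene_ids, gene_names))

  # Start from all zeros and flip only the positions that are present
  mat <- matrix(0L, nrow = length(set_names), ncol = length(gene_names),
                dimnames = list(set_names, gene_names))
  mat[idx] <- 1L
  mat
}

sets <- list(setA = c("g1", "g2"), setB = c("g2", "g3"))
inc <- dense_incidence(sets)
```

As in the sparse version, rows are set identifiers and columns are genes.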

goSlim() counting differs from intersect()

When using the goSlim() function, the results will include counts for GO terms in a collection that also match "parent" terms.

The intersect() function does not include GO terms in a collection which also match "parent" terms.

Here's what I've done to identify this issue.

If you run the example code for the goSlim() function, a data frame is returned.

# Save goSlim as data frame
slimdf <- goSlim(myCollection, slim, "MF")

Then, get a list of GOslim IDs and their corresponding GO IDs:

# Create list of GOslims and corresponding GO IDs
gomap.list <- as.list(GOMFOFFSPRING[rownames(slimdf)])

The next step is where things don't seem to work the way I expected (based on how the goSlim() function works):

# Identify GO IDs in myCollection which are also found in the GO slim mappings
mapped.list <- lapply(gomap.list, intersect, ids(myCollection))

# View list structure
str(mapped.list)

Output (truncated for brevity):

List of 24
 $ GO:0000166: chr(0) 
 $ GO:0003674: chr [1:5] "GO:0003677" "GO:0003841" "GO:0004345" "GO:0008265" ...
 $ GO:0003676: chr "GO:0003677"
 $ GO:0003677: chr(0)

The issue can be seen in the output above.

GO:0003677 (which is in the myIDs vector in the example code) is mapped (i.e. intersects) to the following GOslim IDs:

  • GO:0003674
  • GO:0003676

However, GO:0003677 does not get identified as an intersection with itself:

$ GO:0003677: chr(0)

I am expecting the output to look like this:

List of 24
 $ GO:0000166: chr(0) 
 $ GO:0003674: chr [1:5] "GO:0003677" "GO:0003841" "GO:0004345" "GO:0008265" ...
 $ GO:0003676: chr "GO:0003677"
 $ GO:0003677: chr "GO:0003677"

Is this the expected behavior?

It seems like a small inconsistency that the intersect() function fails to make this match, but the goSlim() function catches it.
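A likely explanation is that the offspring map lists only the descendants of each term and never the term itself, so intersect() has nothing to match GO:0003677 against in its own entry. If that is the cause, prepending each slim ID to its own offspring before intersecting should reproduce the goSlim() behaviour. The sketch below uses toy data in place of GOMFOFFSPRING and myCollection; the offspring and my_ids objects are made up for illustration and heavily truncated.

```r
# Toy stand-in for as.list(GOMFOFFSPRING[rownames(slimdf)])
offspring <- list(
  "GO:0003674" = c("GO:0003676", "GO:0003677", "GO:0003841"),
  "GO:0003676" = "GO:0003677",
  "GO:0003677" = character(0)
)

# Toy stand-in for ids(myCollection)
my_ids <- c("GO:0003677", "GO:0003841")

# Original approach: a slim term never matches itself
without_self <- lapply(offspring, intersect, my_ids)

# Adjusted approach: include the slim term alongside its offspring
with_self <- Map(
  function(slim_id, kids) intersect(c(slim_id, kids), my_ids),
  names(offspring), offspring
)

str(with_self)
```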

Vignette FIXMEs

I recently noticed that the vignette has a hard-coded date (line 41); indeed, the last commit to the vignette seems to be from Oct 19, 2013.

It also has some warnings and messages after the title (see the preliminaries under line 51):

> ## FIXME: <adjMat> adjacency matrix -- color w. +/- 1
> ## FIXME: limma topTable --> GeneColorSet
> ## w. verbose=TRUE
> library(GSEABase)
> library(hgu95av2.db)
> library(GO.db)

In total, 6 FIXMEs are displayed in the vignette.

Long GOslim descriptions are truncated

Here's an example output from the goSlim method:

           Count     Percent                                   Term
GO:0000003     2  0.16934801                           reproduction
GO:0000278     0  0.00000000                     mitotic cell cycle
GO:0000902     0  0.00000000                     cell morphogenesis
GO:0002376    29  2.45554615                  immune system process
GO:0003013     1  0.08467401             circulatory system process
GO:0005975     0  0.00000000         carbohydrate metabolic process
GO:0006091     0  0.00000000 generation of precursor metabolites...
GO:0006259     7  0.59271804                  DNA metabolic process
GO:0006397     1  0.08467401                        mRNA processing
GO:0006399     0  0.00000000                 tRNA metabolic process
GO:0006412     0  0.00000000                            translation
GO:0006457     0  0.00000000                        protein folding
GO:0006464     2  0.16934801 cellular protein modification proce...
GO:0006520     8  0.67739204 cellular amino acid metabolic proce...
GO:0006605     1  0.08467401                      protein targeting
GO:0006629     2  0.16934801                lipid metabolic process
GO:0006790     0  0.00000000      sulfur compound metabolic process
GO:0006810     5  0.42337003                              transport
GO:0006913     0  0.00000000            nucleocytoplasmic transport
GO:0006914     9  0.76206605                              autophagy
GO:0006950    31  2.62489416                     response to stress
GO:0007005     2  0.16934801             mitochondrion organization
GO:0007009     0  0.00000000           plasma membrane organization
GO:0007010     2  0.16934801              cytoskeleton organization
GO:0007034     0  0.00000000                     vacuolar transport
GO:0007049     2  0.16934801                             cell cycle
GO:0007059     0  0.00000000                 chromosome segregation
GO:0007155    37  3.13293819                          cell adhesion
GO:0007165    20  1.69348010                    signal transduction
GO:0007267     4  0.33869602                    cell-cell signaling
GO:0007568     1  0.08467401                                  aging
GO:0008150   634 53.68331922                     biological_process
GO:0008219    26  2.20152413                             cell death
GO:0008283     3  0.25402202          cell population proliferation
GO:0009056    17  1.43945809                      catabolic process
GO:0009058     4  0.33869602                   biosynthetic process
GO:0009790    15  1.27011008                     embryo development
GO:0015031     1  0.08467401                      protein transport
GO:0015979     0  0.00000000                         photosynthesis
GO:0016192     3  0.25402202             vesicle-mediated transport
GO:0019748     0  0.00000000            secondary metabolic process
GO:0021700     0  0.00000000               developmental maturation
GO:0022607     7  0.59271804            cellular component assembly
GO:0022618     1  0.08467401     ribonucleoprotein complex assembly
GO:0030154    85  7.19729043                   cell differentiation
GO:0030198     5  0.42337003      extracellular matrix organization
GO:0030705     0  0.00000000 cytoskeleton-dependent intracellula...
GO:0032196     0  0.00000000                          transposition
GO:0034330     6  0.50804403             cell junction organization
GO:0034641     9  0.76206605 cellular nitrogen compound metaboli...
GO:0034655     1  0.08467401 nucleobase-containing compound cata...
GO:0040007     1  0.08467401                                 growth
GO:0040011     9  0.76206605                             locomotion
GO:0042254     0  0.00000000                    ribosome biogenesis
GO:0042592     8  0.67739204                    homeostatic process
GO:0043473     0  0.00000000                           pigmentation
GO:0044281     8  0.67739204       small molecule metabolic process
GO:0044403     0  0.00000000                      symbiotic process
GO:0048646     4  0.33869602 anatomical structure formation invo...
GO:0048856   148 12.53175275       anatomical structure development
GO:0048870     9  0.76206605                          cell motility
GO:0050877     3  0.25402202                 nervous system process
GO:0051276     7  0.59271804                chromosome organization
GO:0051301     0  0.00000000                          cell division
GO:0051604     0  0.00000000                     protein maturation
GO:0055085     0  0.00000000                transmembrane transport
GO:0061024     0  0.00000000                  membrane organization
GO:0065003     1  0.08467401    protein-containing complex assembly
GO:0071554     0  0.00000000 cell wall organization or biogenesi...
GO:0071941     0  0.00000000       nitrogen cycle metabolic process
GO:0140014     0  0.00000000               mitotic nuclear division

Some of the GOslims are truncated (e.g. generation of precursor metabolites...). I'd like to have the full description (e.g. generation of precursor metabolites and energy).

Is there a way to prevent this from happening? I've looked for some arguments to set the number of characters, but can't find any.
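In the absence of such an argument, one workaround is to overwrite the Term column after the fact, using the row names (the GO IDs) as keys. This is only a sketch: the small slimdf below copies two rows from the output above, and full_terms is a hypothetical stand-in for a lookup of untruncated names. With GO.db installed, that lookup could presumably come from Term(GOTERM[rownames(slimdf)]), which is an assumption rather than something taken from the goSlim documentation.

```r
# Two rows copied from the goSlim output, with the truncated description
slimdf <- data.frame(
  Count = c(0L, 2L),
  Term = c("generation of precursor metabolites...", "reproduction"),
  row.names = c("GO:0006091", "GO:0000003"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of full term names, keyed by GO ID.
# Assumption: with GO.db this could be Term(GOTERM[rownames(slimdf)]).
full_terms <- c(
  "GO:0006091" = "generation of precursor metabolites and energy",
  "GO:0000003" = "reproduction"
)

# Replace the truncated descriptions, matching on the GO ID row names
slimdf$Term <- unname(full_terms[rownames(slimdf)])
```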
