igordot / msigdbr

MSigDB gene sets for multiple organisms in a tidy data format

Home Page: https://igordot.github.io/msigdbr

License: Other

R 100.00%

genomics msigdb gene-sets pathways gsea pathway-analysis enrichment-analysis

msigdbr's Introduction

msigdbr: MSigDB Gene Sets for Multiple Organisms in a Tidy Data Format

Overview

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:

in an R-friendly "tidy" format with one gene pair per row
for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
as gene symbols as well as NCBI Entrez and Ensembl IDs
without accessing external resources or requiring an active internet connection

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

The package data can be accessed using the msigdbr() function, which returns a data frame of gene sets and their member genes. For example, you can retrieve mouse genes from the C2 (curated) CGP (chemical and genetic perturbations) gene sets.

library(msigdbr)
genesets = msigdbr(species = "mouse", category = "C2", subcategory = "CGP")

Check the documentation website for more information.
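
As a rough illustration of the "tidy" layout described above, the sketch below builds a small stand-in data frame by hand and groups it into a named list with base R. The column names gs_name and gene_symbol are the ones used elsewhere on this page; the set names and rows are made up, and genesets_toy is a hypothetical object rather than real msigdbr() output.

```r
# Toy stand-in for a few rows of msigdbr() output (set names and rows are made up).
genesets_toy <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("Abca1", "Acaa2", "Abca1", "Acly", "Aco2"),
  stringsAsFactors = FALSE
)

# Each row is one gene set/gene pair, so split() groups the symbols by set name.
genesets_list <- split(x = genesets_toy$gene_symbol, f = genesets_toy$gs_name)

# Number of genes per set.
lengths(genesets_list)
```

A named list of this shape is what several enrichment tools expect as input, which is why the same split() idiom appears in some of the issues below.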

msigdbr's Issues

C2:CGP pathways being labelled as C2:CP pathways

Hi!

thanks for the great software.

I'm using the latest version (v7.5.1). When I load the gene sets with sets <- as.data.frame(msigdbr(species = "Homo sapiens")), I see that some pathways are labelled with gs_subcat == CP even though they are actually CGP gene sets. One example is the set of pathways from the NABA study: NABA_BASEMENT_MEMBRANES, NABA_MATRISOME, etc. These are listed as CGP pathways on the MSigDB website (e.g.: https://www.gsea-msigdb.org/gsea/msigdb/cards/NABA_MATRISOME).

What do you think could have happened for these CGP pathways to end up labelled as CP within msigdbr, and can you provide a solution for this?

thanks a lot!

number of gene sets

Hi
I am using msigdbr 7.5.1 and wonder why the number of gene sets is smaller than that listed on the website. E.g., msigdbr_collections() says that there are 1615 Reactome pathways, but the website https://www.gsea-msigdb.org/gsea/msigdb/human/genesets.jsp?collection=CP:REACTOME says there are 1635. Thanks!

Adding the "EXACT_SOURCE" column to the MSigDB C5 entries

Thanks for the very useful package.
Would it be possible to include the EXACT_SOURCE attribute of the GENESET records for the MSigDB C5 gene sets? It would make it much easier to convert MSigDB accession numbers into GO IDs. Thanks!

Ensembl Gene IDs

Are Ensembl gene sets supported?

I have just started using msigdbr and I cannot find any in the gene sets I have seen so far

Thanks!

Run KEGG in Seurat object

@igordot @smped @vreuter @actions-user

Hello msigdbr team,

I am running GSEA on 10x spatial and scRNA-seq data and I would like to use the KEGG gene sets.
Which function/category should I use?
For Hallmark, I run m_df <- msigdbr(species = "Homo sapiens", category = "H")

but category = "KEGG" does not work. I would greatly appreciate your advice.

Thank you.

getting error

Hello and thank you for your work,

I have this piece of code

library(msigdbr)

all_gene_sets <- msigdbr(species = "Mus musculus")
head(all_gene_sets)

but I am having the following error:

Error in parse(text = elt): <text>:1:5: unexpected symbol
1: Use of
        ^
Traceback:

1. msigdbr(species = "Mus musculus")
2. orthologs(genes = genesets_subset$human_ensembl_gene, species = species) %>% 
 .     select(-any_of(c("human_symbol", "human_entrez"))) %>% rename(human_ensembl_gene = .data$human_ensembl, 
 .     gene_symbol = .data$symbol, entrez_gene = .data$entrez, ensembl_gene = .data$ensembl, 
 .     ortholog_sources = .data$support, num_ortholog_sources = .data$support_n)
3. rename(., human_ensembl_gene = .data$human_ensembl, gene_symbol = .data$symbol, 
 .     entrez_gene = .data$entrez, ensembl_gene = .data$ensembl, 
 .     ortholog_sources = .data$support, num_ortholog_sources = .data$support_n)
4. rename.data.frame(., human_ensembl_gene = .data$human_ensembl, 
 .     gene_symbol = .data$symbol, entrez_gene = .data$entrez, ensembl_gene = .data$ensembl, 
 .     ortholog_sources = .data$support, num_ortholog_sources = .data$support_n)
5. tidyselect::eval_rename(expr(c(...)), .data)
6. rename_impl(data, names(data), as_quosure(expr, env), strict = strict, 
 .     name_spec = name_spec, allow_predicates = allow_predicates, 
 .     error_call = error_call)
7. eval_select_impl(x, names, {
 .     {
 .         sel
 .     }
 . }, strict = strict, name_spec = name_spec, type = "rename", allow_predicates = allow_predicates, 
 .     error_call = error_call)
8. with_subscript_errors(out <- vars_select_eval(vars, expr, strict = strict, 
 .     data = x, name_spec = name_spec, uniquely_named = uniquely_named, 
 .     allow_rename = allow_rename, allow_empty = allow_empty, allow_predicates = allow_predicates, 
 .     type = type, error_call = error_call), type = type)
9. try_fetch(expr, vctrs_error_subscript = function(cnd) {
 .     cnd$subscript_action <- subscript_action(type)
 .     cnd$subscript_elt <- "column"
 .     cnd_signal(cnd)
 . })
10. withCallingHandlers(expr, vctrs_error_subscript = function(cnd) {
  .     {
  .         .__handler_frame__. <- TRUE
  .         .__setup_frame__. <- frame
  .     }
  .     out <- handlers[[1L]](cnd)
  .     if (!inherits(out, "rlang_zap")) 
  .         throw(out)
  . })
11. vars_select_eval(vars, expr, strict = strict, data = x, name_spec = name_spec, 
  .     uniquely_named = uniquely_named, allow_rename = allow_rename, 
  .     allow_empty = allow_empty, allow_predicates = allow_predicates, 
  .     type = type, error_call = error_call)
12. walk_data_tree(expr, data_mask, context_mask)
13. eval_c(expr, data_mask, context_mask)
14. reduce_sels(node, data_mask, context_mask, init = init)
15. walk_data_tree(new, data_mask, context_mask)
16. expr_kind(expr, context_mask, error_call)
17. call_kind(expr, context_mask, error_call)
18. lifecycle::deprecate_soft("1.2.0", what, details = cli::format_inline("Please use {.code {str}} instead of `.data${var}`"), 
  .     user_env = env)
19. signal_stage("deprecated", what)
20. spec(what, env = env)
21. spec_what(spec, "spec", signaller)
22. parse_expr(what)
23. parse_exprs(x)
24. chr_parse_exprs(x)
25. map(x, function(elt) as.list(parse(text = elt)))
26. lapply(.x, .f, ...)
27. FUN(X[[i]], ...)
28. as.list(parse(text = elt))
29. parse(text = elt)

Could you provide help to solve this issue?
Thank you in advance

msigdbr package, category C2, subcategory CP

Hello,
I'm currently running a GSEA using the msigdbr package.
I've noticed that subcategory CP of category C2 contains only 29 gene sets, as displayed by msigdbr_collections(), whereas this subcategory should include all of its dependent gene sets (KEGG, Reactome, WikiPathways, ...) and originally contains 2982 gene sets, as detailed on the original website: http://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=CP

Any recommendations for running all of the gene sets under the CP subcategory?

Thank you!

Problem with dplyr dependency (I think)

I am getting this error when trying to use msigdbr:

`> msigdbr(species = "Homo sapiens")
Error in `select()`:
! <text>:1:5: unexpected symbol
1: Use of
        ^
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<simpleError in select(., .data$human_ensembl_gene, gene_symbol = .data$human_gene_symbol,     entrez_gene = .data$human_entrez_gene): <text>:1:5: unexpected symbol
1: Use of
        ^>`

session info:

`> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] EnrichmentBrowser_2.26.0    graph_1.74.0               
 [3] SummarizedExperiment_1.26.1 Biobase_2.56.0             
 [5] GenomicRanges_1.48.0        GenomeInfoDb_1.32.4        
 [7] IRanges_2.30.1              S4Vectors_0.34.0           
 [9] BiocGenerics_0.42.0         MatrixGenerics_1.8.1       
[11] matrixStats_0.63.0          msigdbr_7.5.1              
[13] fgsea_1.22.0                biomaRt_2.52.0             
[15] dplyr_1.0.10                clusterProfiler_4.4.4      `

Any ideas...?

Accessing Mouse MSigDB Collections

Is there any possibility for the package to support collections that don't correspond to the human collections H, C1, ..., C8? For example accessing MH, M1, ..., M8 listed at the link below?

https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp

Update to MSIGDB

Hello!

I was wondering if there are plans to synchronize msigdbr with the latest release of MSigDB (Aug 2019)? The new MSigDB has added and removed hundreds of gene sets, so I've been finding that the information pages for most of my top GSEA hits using msigdbr annotations no longer exist.

Thank you for your time!
Best,
Henry

Inconsistent gene set contents with MSigDB

First, thanks for the great package! It's really convenient to be able to pull in these gene sets from MSigDB. I've been using it to pull gene sets for about a year now, and only recently noticed that some of the gene sets are different than what's on MSigDB (e.g., GOBP_Keratinization from msigdbr includes 279 genes, but on MSigDB it only has 83 genes).

I thought it might be a difference of versions (as msigdbr pulls MSigDB 7.5.1), but GOBP_Keratinization actually contains fewer genes in this version (n = 59): https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5.1/c5.go.bp.v7.5.1.symbols.gmt

I used this line to pull all GO BP sets:

m_df_BP = msigdbr(species = "Homo sapiens",subcategory=c("BP"))

here is my session info:

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices
[6] datasets utils methods base

other attached packages:
[1] scales_1.1.1 msigdbr_7.4.1
[3] biomartr_0.9.2 data.table_1.14.0
[5] GSEABase_1.54.0 graph_1.70.0
[7] annotate_1.70.0 XML_3.99-0.6
[9] reactome.db_1.76.0 GO.db_3.13.0
[11] fgsea_1.18.0 dplyr_1.0.7
[13] EnhancedVolcano_1.10.0 ggrepel_0.9.1
[15] rlist_0.4.6.1 pheatmap_1.0.12
[17] org.Hs.eg.db_3.13.0 AnnotationDbi_1.54.1
[19] readxl_1.3.1 ggplot2_3.3.5
[21] ashr_2.2-47 DESeq2_1.32.0
[23] SummarizedExperiment_1.22.0 Biobase_2.52.0
[25] MatrixGenerics_1.4.0 matrixStats_0.59.0
[27] GenomicRanges_1.44.0 GenomeInfoDb_1.28.1
[29] IRanges_2.26.0 S4Vectors_0.30.0
[31] BiocGenerics_0.38.0 rmarkdown_2.14
[33] here_1.0.1

loaded via a namespace (and not attached):
[1] snow_0.4-3 circlize_0.4.14
[3] fastmatch_1.1-3 BiocFileCache_2.0.0
[5] splines_4.1.0 BiocParallel_1.26.1
[7] digest_0.6.27 invgamma_1.1
[9] foreach_1.5.2 htmltools_0.5.2
[11] SQUAREM_2021.1 fansi_0.5.0
[13] magrittr_2.0.1 memoise_2.0.0
[15] cluster_2.1.2 doParallel_1.0.17
[17] ComplexHeatmap_2.8.0 Biostrings_2.60.1
[19] extrafont_0.17 extrafontdb_1.0
[21] prettyunits_1.1.1 colorspace_2.0-2
[23] rappdirs_0.3.3 blob_1.2.2
[25] xfun_0.30 crayon_1.4.1
[27] RCurl_1.98-1.3 genefilter_1.74.0
[29] survival_3.3-1 iterators_1.0.14
[31] glue_1.6.2 gtable_0.3.0
[33] zlibbioc_1.38.0 XVector_0.32.0
[35] GetoptLong_1.0.5 DelayedArray_0.18.0
[37] proj4_1.0-10.1 Rttf2pt1_1.3.9
[39] shape_1.4.6 maps_3.3.0
[41] DBI_1.1.1 Rcpp_1.0.7
[43] progress_1.2.2 xtable_1.8-4
[45] clue_0.3-60 bit_4.0.4
[47] truncnorm_1.0-8 httr_1.4.2
[49] RColorBrewer_1.1-2 ellipsis_0.3.2
[51] pkgconfig_2.0.3 farver_2.1.0
[53] dbplyr_2.1.1 locfit_1.5-9.4
[55] utf8_1.2.1 tidyselect_1.1.1
[57] labeling_0.4.2 rlang_0.4.11
[59] munsell_0.5.0 cellranger_1.1.0
[61] tools_4.1.0 cachem_1.0.5
[63] cli_3.3.0 generics_0.1.0
[65] RSQLite_2.2.7 evaluate_0.14
[67] stringr_1.4.0 fastmap_1.1.0
[69] yaml_2.2.1 babelgene_21.4
[71] knitr_1.33 bit64_4.0.5
[73] purrr_0.3.4 KEGGREST_1.32.0
[75] ash_1.0-15 ggrastr_0.2.3
[77] xml2_1.3.2 biomaRt_2.48.2
[79] compiler_4.1.0 rstudioapi_0.13
[81] filelock_1.0.2 curl_4.3.2
[83] beeswarm_0.4.0 png_0.1-8
[85] tibble_3.1.3 geneplotter_1.70.0
[87] stringi_1.7.3 highr_0.10
[89] ggalt_0.4.0 lattice_0.20-45
[91] Matrix_1.3-4 vctrs_0.3.8
[93] pillar_1.6.1 lifecycle_1.0.0
[95] BiocManager_1.30.16 GlobalOptions_0.1.2
[97] bitops_1.0-7 irlba_2.3.3
[99] R6_2.5.0 renv_0.15.4
[101] KernSmooth_2.23-20 gridExtra_2.3
[103] vipor_0.4.5 codetools_0.2-19
[105] MASS_7.3-55 assertthat_0.2.1
[107] rprojroot_2.0.2 rjson_0.2.21
[109] withr_2.4.2 GenomeInfoDbData_1.2.6
[111] hms_1.1.0 grid_4.1.0
[113] Cairo_1.5-12.2 mixsqp_0.3-43
[115] tinytex_0.37 ggbeeswarm_0.6.0

Problem with loading several categories

In our work we often want to test our gene lists against several categories of gene sets at once.
Until now we would load the gene sets like this:

msigdb.genes.sets <-msigdbr(species="Homo sapiens", category=c("H","C2"))

We noticed that in doing so, the gene sets are truncated, and the number of genes remaining in a gene set varies with the number of categories requested and their order.
After looking at the R code, it seems the problem is that the categories are filtered with "==" and not "%in%", which means we cannot pass a vector in our command. But no warning or error is thrown and everything downstream works, although the background ratio values are obviously wrong.

Would it be possible to correct this or to forbid requesting more than one category in the command?
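
The truncation described above is consistent with how R recycles a vector on the right-hand side of ==. The sketch below reproduces the effect on a made-up table in base R; sets_toy and wanted are hypothetical names, and this illustrates the reported behaviour only, not the package's actual filtering code.

```r
# Toy table: 4 rows in category H, 4 rows in category C2 (values are made up).
sets_toy <- data.frame(
  gs_cat = c("H", "H", "H", "H", "C2", "C2", "C2", "C2"),
  gene_symbol = paste0("GENE", 1:8),
  stringsAsFactors = FALSE
)
wanted <- c("H", "C2")

# `==` recycles `wanted` along the column: row 1 is compared with "H",
# row 2 with "C2", row 3 with "H", and so on, so rows are dropped silently.
kept_eq <- sets_toy[sets_toy$gs_cat == wanted, ]

# `%in%` tests each row against the whole vector and keeps every match.
kept_in <- sets_toy[sets_toy$gs_cat %in% wanted, ]

nrow(kept_eq)
nrow(kept_in)
```

Because the column length here is a multiple of the length of wanted, R raises no recycling warning, which matches the silent failure reported in this issue.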

Skip 7.3 CRAN release and go straight to 7.4(?)

It looks like the MSigDB v7.4 signature collections were released before an msigdbr version for the v7.3 signatures was pushed to CRAN. Maybe skip a v7.3 msigdbr release and go straight to the v7.4 signatures for the next CRAN release?

Thanks!

Some orthologs are missing

Hi,

I am trying to use msigdbr for a GSEA analysis with the gene set HSF1_01 from MSigDB.

In MSigDB this gene set contains the gene SHFM3, but it is missing from your list of orthologs for the same gene set.

I did a search for this gene - https://uswest.ensembl.org/Multi/Search/Results?q=SHFM3;site=ensembl

And found out that this gene has an alias/synonym - FBXW4 (as shown here - > https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000107829;r=10:101610664-101695295 )

This particular alias (FBXW4) does have ortholog information for Mus musculus (Fbxw4), as shown at https://uswest.ensembl.org/Homo_sapiens/Gene/Compara_Ortholog?db=core;g=ENSG00000107829;r=10:101610664-101695295

There are many such cases, and I was wondering whether that is intentional or could be fixed in future releases?

Much appreciated!

Ashu

No gene sets from KEGG, REACTOME or BIOCARTA

It looks like it's no longer possible to get gene sets from KEGG, REACTOME or BIOCARTA:

c2_reactome <- msigdbr(category = "C2", subcategory = "REACTOME") %>%
  split(x = .$gene_symbol, f = .$gs_name)
> length(c2_reactome)
[1] 0

Can these be restored? Thank you.
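
One possible explanation, offered here as an assumption rather than a confirmed cause, is that the subcategory labels carry a "CP:" prefix, as in the MSigDB collection name CP:REACTOME and the subcategory = "CP:KEGG" usage quoted elsewhere on this page. The sketch below uses a made-up table (c2_toy, with invented set names) to show how an exact match on an unprefixed label returns nothing, and how listing the labels present in the data reveals the right one.

```r
# Toy rows; the "CP:..." labels follow the collection names seen in MSigDB URLs,
# while the set names and memberships are made up.
c2_toy <- data.frame(
  gs_cat = c("C2", "C2", "C2", "C2"),
  gs_subcat = c("CGP", "CP:KEGG", "CP:REACTOME", "CP:REACTOME"),
  gs_name = c("SET_1", "SET_2", "SET_3", "SET_3"),
  gene_symbol = c("TP53", "AKT1", "EGFR", "KRAS"),
  stringsAsFactors = FALSE
)

# List the labels that are actually present before filtering.
unique(c2_toy$gs_subcat)

# An exact match on a label that is not present selects nothing...
n_unprefixed <- nrow(c2_toy[c2_toy$gs_subcat == "REACTOME", ])

# ...while the prefixed label selects the expected rows.
reactome <- c2_toy[c2_toy$gs_subcat == "CP:REACTOME", ]
reactome_list <- split(reactome$gene_symbol, reactome$gs_name)

n_unprefixed
length(reactome_list)
```

With the real package, msigdbr_collections() (mentioned in another issue on this page) should list the available category and subcategory labels.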

2023 update?

Thank you for developing this useful tool. Do you have any plans to update it based on the 2023 release of MSigDB?

Add shorter GO descriptions?

The entries in the gs_description column for GO terms are rather long and not ideal for use as human-readable identifiers when plotting ORA or GSEA results. Would it be possible to add a gs_brief_description column that uses the names from the appropriate GO database release? I have been getting the data using the code below and then left-joining it to ORA and GSEA results tables made with fgsea. For other databases, I just use the entries in gs_description.

# install.packages(c("ontologyIndex", "dplyr"))
library(ontologyIndex)
library(dplyr)

# Brief GO term descriptions (use same data from MSigDB release notes)
file <- "http://release.geneontology.org/2021-12-15/ontology/go-basic.obo"
go_basic_list <- get_OBO(file,
                         propagate_relationships = "is_a",
                         extract_tags = "minimal")

# Convert to data.frame with fewer columns
go_basic_df <- as.data.frame(go_basic_list) %>%
  filter(!obsolete) %>%
  select(pathway = id, name)

First release of Mouse MSigDB (v2022.1.Mm)

Thanks for making a great package.
I was wondering whether you have any plans to add the newly released Mouse MSigDB?

Function to query MSigDB database version used by msigdbr

It would be great to have a function to query the MSigDB database version used by msigdbr.

Methodology details, and `write.gmt` helper functions?

Hi, I came across your package, which could potentially save me a lot of work, so thank you.

Could you publish the details of your method for converting from human to other species? I need this information in order to cite you in my research.

Also, would you consider adding helper functions to convert the data frame into a form that can easily be written as a .gmt pathway file?
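
A minimal sketch of such a helper is shown below. It is not part of msigdbr: write_gmt_sketch is a hypothetical function, and it assumes the input has gs_name, gs_description, and gene_symbol columns. It relies on the GMT layout of one tab-separated line per gene set (name, description, then the member genes).

```r
# Hypothetical helper (not part of msigdbr): write a tidy gene set table as GMT.
# Assumes columns gs_name, gs_description, and gene_symbol.
write_gmt_sketch <- function(df, path) {
  genes_by_set <- split(df$gene_symbol, df$gs_name)
  set_names <- names(genes_by_set)
  # First description seen for each set name.
  set_desc <- df$gs_description[match(set_names, df$gs_name)]
  lines <- vapply(seq_along(set_names), function(i) {
    paste(c(set_names[i], set_desc[i], unique(genes_by_set[[i]])), collapse = "\t")
  }, character(1))
  writeLines(lines, path)
  invisible(lines)
}

# Made-up input with two small sets.
toy <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B"),
  gs_description = c("first toy set", "first toy set", "second toy set"),
  gene_symbol = c("ABCA1", "ACLY", "ACO2"),
  stringsAsFactors = FALSE
)
gmt_file <- tempfile(fileext = ".gmt")
write_gmt_sketch(toy, gmt_file)
readLines(gmt_file)
```

The same function could take entrez_gene or ensembl_gene instead of gene_symbol if the downstream tool expects those identifiers.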

MSigDB 7.4

It seems MSigDB 7.4 has been released. Are there plans to update msigdbr soon?

Retrieve all C2 canonical pathways using option subcategory = "CP"?

Dear Igordot,

Thanks for this wonderful tool! I understand it can be used to retrieve subcategory pathways by setting subcategory = "CP:KEGG". But I was wondering if I can extract all canonical pathways as follows:

library(msigdbr)
m_df = msigdbr(species = "Homo sapiens", category = 'C2', subcategory = 'CP')
length(unique(m_df$gs_name))
[1] 29

Looking forward to your comments!

Best,
Lei
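
The count of 29 is what an exact match on "CP" would give if the other canonical pathway sets are labelled with prefixed subcategories such as "CP:KEGG". Under that assumption, one way to collect all of them is to load the whole category and match on the prefix. The sketch below shows the idea on a made-up table; c2_toy and its set names are hypothetical.

```r
# Toy table of subcategory labels (set names are made up).
c2_toy <- data.frame(
  gs_subcat = c("CGP", "CP", "CP:BIOCARTA", "CP:KEGG", "CP:REACTOME"),
  gs_name = c("SET_1", "SET_2", "SET_3", "SET_4", "SET_5"),
  stringsAsFactors = FALSE
)

# Exact match keeps only sets labelled plain "CP".
cp_exact <- c2_toy[c2_toy$gs_subcat == "CP", ]

# Prefix match keeps "CP" and every "CP:..." subcategory, but not "CGP".
cp_all <- c2_toy[grepl("^CP(:|$)", c2_toy$gs_subcat), ]

cp_exact$gs_name
cp_all$gs_name
```

Applied to real data, the same grepl() filter would go on the gs_subcat column of a data frame loaded with category = "C2" and no subcategory.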

`unused argument (.data$species_name == species)` error

Hi,
I've just got an unused argument (.data$species_name == species) error, and I don't know how to proceed. Is it a bug or am I doing something wrong?

> library(msigdbr)
> msigdbr(species = "Homo sapiens")
Error in filter.tbl(msigdbr_orthologs, .data$species_name == species) : 
  unused argument (.data$species_name == species)
> msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
Error in filter.tbl(msigdbr_orthologs, .data$species_name == species) : 
  unused argument (.data$species_name == species)
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/R-4.0.2/lib64/R/lib/libRblas.so
LAPACK: /opt/R-4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] msigdbr_7.2.1               DESeq2_1.28.1              
 [3] SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
 [5] matrixStats_0.57.0          Biobase_2.48.0             
 [7] rtracklayer_1.48.0          genomation_1.20.0          
 [9] gProfileR_0.7.0             ChIPpeakAnno_3.22.4        
[11] Biostrings_2.56.0           XVector_0.28.0             
[13] VennDiagram_1.6.20          futile.logger_1.4.3        
[15] rGREAT_1.20.0               methylKit_1.14.2           
[17] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
[19] IRanges_2.22.2              S4Vectors_0.26.1           
[21] BiocGenerics_0.34.0         gprofiler2_0.2.0           
[23] reshape2_1.4.4              ggplot2_3.3.2              
[25] gridExtra_2.3               data.table_1.13.0          
[27] biomaRt_2.44.4              igraph_1.2.6               
[29] STRINGdb_2.0.2             

loaded via a namespace (and not attached):
  [1] circlize_0.4.10          BiocFileCache_1.12.1     plyr_1.8.6              
  [4] lazyeval_0.2.2           splines_4.0.2            BiocParallel_1.22.0     
  [7] gridBase_0.4-7           digest_0.6.25            ensembldb_2.12.1        
 [10] htmltools_0.5.0          GO.db_3.11.4             magrittr_1.5            
 [13] memoise_1.1.0            BSgenome_1.56.0          limma_3.44.3            
 [16] annotate_1.66.0          readr_1.4.0              R.utils_2.10.1          
 [19] askpass_1.1              bdsmatrix_1.3-4          prettyunits_1.1.1       
 [22] colorspace_1.4-1         blob_1.2.1               rappdirs_0.3.1          
 [25] xfun_0.18                dplyr_1.0.2              crayon_1.3.4            
 [28] RCurl_1.98-1.2           jsonlite_1.7.1           graph_1.66.0            
 [31] genefilter_1.70.0        impute_1.62.0            survival_3.1-12         
 [34] glue_1.4.2               hash_2.2.6.1             gtable_0.3.0            
 [37] zlibbioc_1.34.0          seqinr_4.2-4             GetoptLong_1.0.3        
 [40] shape_1.4.5              scales_1.1.1             futile.options_1.0.1    
 [43] mvtnorm_1.1-1            DBI_1.1.0                Rcpp_1.0.5              
 [46] plotrix_3.7-8            xtable_1.8-4             viridisLite_0.3.0       
 [49] progress_1.2.2           emdbook_1.3.12           bit_4.0.4               
 [52] mclust_5.4.6             sqldf_0.4-11             htmlwidgets_1.5.2       
 [55] httr_1.4.2               gplots_3.1.0             RColorBrewer_1.1-2      
 [58] ellipsis_0.3.1           pkgconfig_2.0.3          XML_3.99-0.5            
 [61] R.methodsS3_1.8.1        farver_2.0.3             dbplyr_1.4.4            
 [64] locfit_1.5-9.4           tidyselect_1.1.0         labeling_0.3            
 [67] rlang_0.4.7              AnnotationDbi_1.50.3     munsell_0.5.0           
 [70] tools_4.0.2              gsubfn_0.7               generics_0.0.2          
 [73] RSQLite_2.2.1            ade4_1.7-15              fastseg_1.34.0          
 [76] evaluate_0.14            stringr_1.4.0            yaml_2.2.1              
 [79] knitr_1.30               bit64_4.0.5              caTools_1.18.0          
 [82] purrr_0.3.4              AnnotationFilter_1.12.0  RBGL_1.64.0             
 [85] formatR_1.7              R.oo_1.24.0              xml2_1.3.2              
 [88] compiler_4.0.2           rstudioapi_0.11          plotly_4.9.2.1          
 [91] curl_4.3                 png_0.1-7                geneplotter_1.66.0      
 [94] tibble_3.0.3             idr_1.2                  stringi_1.5.3           
 [97] GenomicFeatures_1.40.1   lattice_0.20-41          ProtGenerics_1.20.0     
[100] Matrix_1.2-18            multtest_2.44.0          vctrs_0.3.4             
[103] pillar_1.4.6             lifecycle_0.2.0          BiocManager_1.30.10     
[106] GlobalOptions_0.1.2      bitops_1.0-6             qvalue_2.20.0           
[109] R6_2.4.1                 KernSmooth_2.23-17       lambda.r_1.2.4          
[112] MASS_7.3-51.6            gtools_3.8.2             assertthat_0.2.1        
[115] chron_2.3-56             proto_1.0.0              openssl_1.4.3           
[118] rjson_0.2.20             withr_2.3.0              regioneR_1.20.1         
[121] GenomicAlignments_1.24.0 Rsamtools_2.4.0          GenomeInfoDbData_1.2.3  
[124] hms_0.5.3                tidyr_1.1.2              coda_0.19-4             
[127] rmarkdown_2.4            seqPattern_1.20.0        bbmle_1.0.23.1          
[130] numDeriv_2016.8-1.1      tinytex_0.26

Best,
Kasia

enricher result is different from MSigDB web "Investigate Gene Sets"

Hi,

Many thanks for the msigdbr package.
Can I ask a question about the result of enricher please?

msigdbr_t2g = msigdbr_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)

I am using the code above, but the enriched MSigDB signatures I get differ from those reported by "Investigate Gene Sets" on the MSigDB website. I thought the result was based on the number of genes shared between the user's gene list and the genes in each gene set. But the overlap count reported by enricher seems smaller than the real overlap (i.e. the count I get when I use intersect to see how many of my genes are in the MSigDB gene set). Did I misunderstand what enricher does here? And if possible, how can I get the same results as the MSigDB website?

Thanks in advance!

Best,
Wei

Save the 'entrez_gene' columns in character mode

First, thanks for this great package! In particular, it directly outputs three different gene ID types, which saves a lot of time when switching between them.

I have a small suggestion. In the output table, the columns related to "entrez_gene" are stored as integers. I would suggest changing them to characters, as other Bioconductor annotation packages do (e.g. org.Hs.eg.db).

gene_sets
# A tibble: 8,209 × 15
   gs_cat gs_su…¹ gs_name gene_…² entre…³ ensem…⁴ human…⁵ human…⁶ human…⁷ gs_id gs_pmid gs_ge…⁸
   <chr>  <chr>   <chr>   <chr>     <int> <chr>   <chr>     <int> <chr>   <chr> <chr>   <chr>  
 1 H      ""      HALLMA… ABCA1        19 ENSG00… ABCA1        19 ENSG00… M5905 267710… ""     
 2 H      ""      HALLMA… ABCB8     11194 ENSG00… ABCB8     11194 ENSG00… M5905 267710… ""     
 3 H      ""      HALLMA… ACAA2     10449 ENSG00… ACAA2     10449 ENSG00… M5905 267710… ""     
 4 H      ""      HALLMA… ACADL        33 ENSG00… ACADL        33 ENSG00… M5905 267710… ""     
 5 H      ""      HALLMA… ACADM        34 ENSG00… ACADM        34 ENSG00… M5905 267710… ""     
 6 H      ""      HALLMA… ACADS        35 ENSG00… ACADS        35 ENSG00… M5905 267710… ""     
 7 H      ""      HALLMA… ACLY         47 ENSG00… ACLY         47 ENSG00… M5905 267710… ""     
 8 H      ""      HALLMA… ACO2         50 ENSG00… ACO2         50 ENSG00… M5905 267710… ""     
 9 H      ""      HALLMA… ACOX1        51 ENSG00… ACOX1        51 ENSG00… M5905 267710… ""     
10 H      ""      HALLMA… ADCY6       112 ENSG00… ADCY6       112 ENSG00… M5905 267710… ""     
# … with 8,199 more rows, 3 more variables: gs_exact_source <chr>, gs_url <chr>,
#   gs_description <chr>, and abbreviated variable names ¹gs_subcat, ²gene_symbol,
#   ³entrez_gene, ⁴ensembl_gene, ⁵human_gene_symbol, ⁶human_entrez_gene, ⁷human_ensembl_gene,
#   ⁸gs_geoid
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Imagine we want to convert Entrez IDs to Refseq IDs, and we have a mapping vector (map) where Entrez IDs are the names and Refseq IDs are the values. Then naturally, to convert, we can do:

map[gene_sets$entrez_gene]

This causes a problem because gene_sets$entrez_gene contains integers, which are treated as numeric positions in the map vector rather than being matched against the names of map.

To do it correctly, we need to explicitly convert gene_sets$entrez_gene to characters:

map[as.character(gene_sets$entrez_gene)]

A more severe consequence: if the maximum value in gene_sets$entrez_gene is smaller than the length of map, executing map[gene_sets$entrez_gene] will not generate any warning or error message, and it will silently return wrong results.
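
The difference between the two subscript types can be checked on a tiny named vector. In the sketch below the Entrez IDs and the RefSeq-style labels are made up, and map is deliberately short so that both failure modes (a silent wrong hit and an NA) show up.

```r
# Hypothetical mapping: names are Entrez IDs, values are made-up RefSeq-style labels.
map <- c("19" = "NM_A", "33" = "NM_B", "2" = "NM_C")

entrez_int <- c(2L, 19L)

# Integer subscripts are positions: 2L picks the second element,
# and 19L is past the end of the vector, which yields NA.
by_position <- unname(map[entrez_int])

# Character subscripts are matched against names(map).
by_name <- unname(map[as.character(entrez_int)])

by_position
by_name
```

Here the lookup for Entrez ID 2 returns the label stored under ID 33 without any warning, which is the silent error described above.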

Archived genesets

Hello!

Is there any way msigdbr could be used to access archived genesets? i.e. those belonging to the "ARCHIVED" collection such as PENG_GLUTAMINE_DEPRIVATION_DN

Best,
Henry

CP:WIKIPATHWAY

How can I retrieve CP:WIKIPATHWAY? (https://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=CP:WIKIPATHWAYS)

SCSig collection

Dear @igordot

Thanks for the nice package!

MSigDB now provides the SCSig collection: Signatures of Single Cell Identities
http://software.broadinstitute.org/gsea/msigdb/supplementary_genesets.jsp#SCSig
so I would appreciate it if you could extend this package to the SCSig gene sets.

Regards,

Koki

msigdbr for yeast

Hi, I am using the package for yeast GSEA, and I see some enrichments that do not seem to be related to yeast, such as:
HP_ADDICTIVE_BEHAVIOR or HP_ACUTE_MYELOID_LEUKEMIA. I am a beginner; could you please tell me whether the error is on my end and I am doing something wrong?

I used

#get all collections/signatures with yeast
yeast_gsea <- msigdbr(species = "Saccharomyces cerevisiae")
yeast_gsea %>%   dplyr::distinct(gs_cat, gs_subcat) %>%   dplyr::arrange(gs_cat, gs_subcat)
#choose a specific msigdb collection/subcollection
yeast_gsea_c5 <- msigdbr(species = "Saccharomyces cerevisiae", category = "C5") %>% dplyr::select(gs_name, gene_symbol)

Help with checking a pathway in msigdb.

Hi igordot,

m_t2g <- msigdbr(species = "Homo sapiens", category = "C2") %>%
dplyr::select(gs_name, entrez_gene)

Would you please have a look at why this pathway, WP_EXTRAFOLLICULAR_AND_FOLLICULAR_B_CELL_ACTIVATION_BY_SARS_COV_2, is not included in m_t2g? This tool is great. Thank you so much!
https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WP_EXTRAFOLLICULAR_AND_FOLLICULAR_B_CELL_ACTIVATION_BY_SARS_COV_2.html