msigdbr's Introduction

msigdbr: MSigDB Gene Sets for Multiple Organisms in a Tidy Data Format

Overview

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:

  • in an R-friendly "tidy" format with one gene pair per row
  • for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
  • as gene symbols as well as NCBI Entrez and Ensembl IDs
  • without accessing external resources or requiring an active internet connection

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

The package data can be accessed using the msigdbr() function, which returns a data frame of gene sets and their member genes. For example, you can retrieve mouse genes from the C2 (curated) CGP (chemical and genetic perturbations) gene sets.

library(msigdbr)
genesets = msigdbr(species = "mouse", category = "C2", subcategory = "CGP")
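Many downstream tools expect a named list of gene vectors rather than a long data frame. The sketch below shows one way to make that conversion with base R. It uses a small toy data frame (hypothetical values) in place of real msigdbr() output, and assumes only the gs_name and gene_symbol columns.

```r
# Toy stand-in for msigdbr() output; the real data frame has more columns.
genesets_toy <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("Trp53", "Myc", "Myc", "Egfr", "Kras"),
  stringsAsFactors = FALSE
)

# One character vector of gene symbols per gene set, named by gene set.
genesets_list <- split(x = genesets_toy$gene_symbol, f = genesets_toy$gs_name)

# Number of genes in each set.
lengths(genesets_list)
```

The same split() call should work on the genesets object above, since it relies only on those two columns.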

Check the documentation website for more information.

msigdbr's People

Contributors

igordot, smped, vreuter

msigdbr's Issues

C2:CGP pathways being labelled as C2:CP pathways

Hi!

thanks for the great software.

I'm using the latest version (v7.5.1). When I load the gene sets with sets <- as.data.frame(msigdbr(species = "Homo sapiens")), some pathways are labelled gs_subcat == CP although they actually belong to gs_subcat == CGP. One example is the set of pathways from the NABA study: NABA_BASEMENT_MEMBRANES, NABA_MATRISOME, etc. These are listed as CGP pathways on the MSigDB website (e.g. https://www.gsea-msigdb.org/gsea/msigdb/cards/NABA_MATRISOME).

What do you think could have happened for these CGP pathways to end up labelled as CP within msigdbr and do you think you can provide a solution for this?

thanks a lot!

Adding the "EXACT_SOURCE" column to the MsigDB C5 entries

Thanks for the very useful package,
would it be possible to add the EXACT_SOURCE attribute to GENESET record attributes for msigdb C5 gene sets? It would make it much easier to convert msigdb accession numbers into GO IDs. Thanks!

Ensembl Gene IDs

Are Ensembl gene sets supported?

I have just started using msigdbr and I cannot find any in the gene sets I have seen so far

Thanks!

Run KEGG in Seurat object

@igordot @smped @vreuter @actions-user

Hello msigdbr team,

I am running GSEA analysis in 10X spatial and scRNA-seq data and I would like to use KEGG dataset.
Which function/category should I run?
For Hallmark, I run m_df<- msigdbr(species = "Homo sapiens", category = "H")

but category = "KEGG" does not work. I would greatly appreciate your advice.

Thank you.

getting error

Hello and thank you for your work,

I have this piece of code

library(msigdbr)

all_gene_sets <- msigdbr(species = "Mus musculus")
head(all_gene_sets)

but I am having the following error:

Error in parse(text = elt): <text>:1:5: unexpected symbol
1: Use of
        ^
Traceback:

1. msigdbr(species = "Mus musculus")
2. orthologs(genes = genesets_subset$human_ensembl_gene, species = species) %>% 
 .     select(-any_of(c("human_symbol", "human_entrez"))) %>% rename(human_ensembl_gene = .data$human_ensembl, 
 .     gene_symbol = .data$symbol, entrez_gene = .data$entrez, ensembl_gene = .data$ensembl, 
 .     ortholog_sources = .data$support, num_ortholog_sources = .data$support_n)
3. rename(., human_ensembl_gene = .data$human_ensembl, gene_symbol = .data$symbol, 
 .     entrez_gene = .data$entrez, ensembl_gene = .data$ensembl, 
 .     ortholog_sources = .data$support, num_ortholog_sources = .data$support_n)
4. rename.data.frame(., human_ensembl_gene = .data$human_ensembl, 
 .     gene_symbol = .data$symbol, entrez_gene = .data$entrez, ensembl_gene = .data$ensembl, 
 .     ortholog_sources = .data$support, num_ortholog_sources = .data$support_n)
5. tidyselect::eval_rename(expr(c(...)), .data)
6. rename_impl(data, names(data), as_quosure(expr, env), strict = strict, 
 .     name_spec = name_spec, allow_predicates = allow_predicates, 
 .     error_call = error_call)
7. eval_select_impl(x, names, {
 .     {
 .         sel
 .     }
 . }, strict = strict, name_spec = name_spec, type = "rename", allow_predicates = allow_predicates, 
 .     error_call = error_call)
8. with_subscript_errors(out <- vars_select_eval(vars, expr, strict = strict, 
 .     data = x, name_spec = name_spec, uniquely_named = uniquely_named, 
 .     allow_rename = allow_rename, allow_empty = allow_empty, allow_predicates = allow_predicates, 
 .     type = type, error_call = error_call), type = type)
9. try_fetch(expr, vctrs_error_subscript = function(cnd) {
 .     cnd$subscript_action <- subscript_action(type)
 .     cnd$subscript_elt <- "column"
 .     cnd_signal(cnd)
 . })
10. withCallingHandlers(expr, vctrs_error_subscript = function(cnd) {
  .     {
  .         .__handler_frame__. <- TRUE
  .         .__setup_frame__. <- frame
  .     }
  .     out <- handlers[[1L]](cnd)
  .     if (!inherits(out, "rlang_zap")) 
  .         throw(out)
  . })
11. vars_select_eval(vars, expr, strict = strict, data = x, name_spec = name_spec, 
  .     uniquely_named = uniquely_named, allow_rename = allow_rename, 
  .     allow_empty = allow_empty, allow_predicates = allow_predicates, 
  .     type = type, error_call = error_call)
12. walk_data_tree(expr, data_mask, context_mask)
13. eval_c(expr, data_mask, context_mask)
14. reduce_sels(node, data_mask, context_mask, init = init)
15. walk_data_tree(new, data_mask, context_mask)
16. expr_kind(expr, context_mask, error_call)
17. call_kind(expr, context_mask, error_call)
18. lifecycle::deprecate_soft("1.2.0", what, details = cli::format_inline("Please use {.code {str}} instead of `.data${var}`"), 
  .     user_env = env)
19. signal_stage("deprecated", what)
20. spec(what, env = env)
21. spec_what(spec, "spec", signaller)
22. parse_expr(what)
23. parse_exprs(x)
24. chr_parse_exprs(x)
25. map(x, function(elt) as.list(parse(text = elt)))
26. lapply(.x, .f, ...)
27. FUN(X[[i]], ...)
28. as.list(parse(text = elt))
29. parse(text = elt)

Could you provide help to solve this issue?
Thank you in advance

msigdbr package, category C2, subcategory CP

Hello,
I'm currently running a gsea using msigdbr package.
I've noticed that subcategory CP of category C2 contains only 29 gene sets, as displayed by msigdbr_collections(), whereas this subcategory should include all of its dependent gene sets (KEGG, Reactome, WikiPathways, ...) and originally contains 2982 gene sets, as detailed on the original website: http://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=CP

Any recommendations for running all of the gene sets under the CP subcategory?

Thank you!

 

Problem with dplyr dependency (I think)

I am getting this error when trying to use msigdbr:

> msigdbr(species = "Homo sapiens")
Error in `select()`:
! <text>:1:5: unexpected symbol
1: Use of
        ^
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<simpleError in select(., .data$human_ensembl_gene, gene_symbol = .data$human_gene_symbol,     entrez_gene = .data$human_entrez_gene): <text>:1:5: unexpected symbol
1: Use of
        ^>

session info:

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] EnrichmentBrowser_2.26.0    graph_1.74.0               
 [3] SummarizedExperiment_1.26.1 Biobase_2.56.0             
 [5] GenomicRanges_1.48.0        GenomeInfoDb_1.32.4        
 [7] IRanges_2.30.1              S4Vectors_0.34.0           
 [9] BiocGenerics_0.42.0         MatrixGenerics_1.8.1       
[11] matrixStats_0.63.0          msigdbr_7.5.1              
[13] fgsea_1.22.0                biomaRt_2.52.0             
[15] dplyr_1.0.10                clusterProfiler_4.4.4      

Any ideas...?

Update to MSIGDB

Hello!

I was wondering if there were plans to synchronize msigdbr with the latest release of MSigDB (Aug 2019)? The new MSigDB has added and removed hundreds of gene sets, so I've been finding that the information pages for most of my top GSEA hits using msigdbr annotations no longer exist.

Thank you for your time!
Best,
Henry

Inconsistent gene set contents with MSigDB

First, thanks for the great package! It's really convenient to be able to pull in these gene sets from MSigDB. I've been using it to pull gene sets for about a year now, and only recently noticed that some of the gene sets are different than what's on MSigDB (e.g., GOBP_Keratinization from msigdbr includes 279 genes, but on MSigDB it only has 83 genes).

I thought it might be a difference of versions (as msigdbr pulls MSigDB 7.5.1), but GOBP_Keratinization actually contains fewer genes in this version (n = 59): https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5.1/c5.go.bp.v7.5.1.symbols.gmt

I used this line to pull all GO BP sets:

m_df_BP = msigdbr(species = "Homo sapiens",subcategory=c("BP"))

here is my session info:

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices
[6] datasets utils methods base

other attached packages:
[1] scales_1.1.1 msigdbr_7.4.1
[3] biomartr_0.9.2 data.table_1.14.0
[5] GSEABase_1.54.0 graph_1.70.0
[7] annotate_1.70.0 XML_3.99-0.6
[9] reactome.db_1.76.0 GO.db_3.13.0
[11] fgsea_1.18.0 dplyr_1.0.7
[13] EnhancedVolcano_1.10.0 ggrepel_0.9.1
[15] rlist_0.4.6.1 pheatmap_1.0.12
[17] org.Hs.eg.db_3.13.0 AnnotationDbi_1.54.1
[19] readxl_1.3.1 ggplot2_3.3.5
[21] ashr_2.2-47 DESeq2_1.32.0
[23] SummarizedExperiment_1.22.0 Biobase_2.52.0
[25] MatrixGenerics_1.4.0 matrixStats_0.59.0
[27] GenomicRanges_1.44.0 GenomeInfoDb_1.28.1
[29] IRanges_2.26.0 S4Vectors_0.30.0
[31] BiocGenerics_0.38.0 rmarkdown_2.14
[33] here_1.0.1

loaded via a namespace (and not attached):
[1] snow_0.4-3 circlize_0.4.14
[3] fastmatch_1.1-3 BiocFileCache_2.0.0
[5] splines_4.1.0 BiocParallel_1.26.1
[7] digest_0.6.27 invgamma_1.1
[9] foreach_1.5.2 htmltools_0.5.2
[11] SQUAREM_2021.1 fansi_0.5.0
[13] magrittr_2.0.1 memoise_2.0.0
[15] cluster_2.1.2 doParallel_1.0.17
[17] ComplexHeatmap_2.8.0 Biostrings_2.60.1
[19] extrafont_0.17 extrafontdb_1.0
[21] prettyunits_1.1.1 colorspace_2.0-2
[23] rappdirs_0.3.3 blob_1.2.2
[25] xfun_0.30 crayon_1.4.1
[27] RCurl_1.98-1.3 genefilter_1.74.0
[29] survival_3.3-1 iterators_1.0.14
[31] glue_1.6.2 gtable_0.3.0
[33] zlibbioc_1.38.0 XVector_0.32.0
[35] GetoptLong_1.0.5 DelayedArray_0.18.0
[37] proj4_1.0-10.1 Rttf2pt1_1.3.9
[39] shape_1.4.6 maps_3.3.0
[41] DBI_1.1.1 Rcpp_1.0.7
[43] progress_1.2.2 xtable_1.8-4
[45] clue_0.3-60 bit_4.0.4
[47] truncnorm_1.0-8 httr_1.4.2
[49] RColorBrewer_1.1-2 ellipsis_0.3.2
[51] pkgconfig_2.0.3 farver_2.1.0
[53] dbplyr_2.1.1 locfit_1.5-9.4
[55] utf8_1.2.1 tidyselect_1.1.1
[57] labeling_0.4.2 rlang_0.4.11
[59] munsell_0.5.0 cellranger_1.1.0
[61] tools_4.1.0 cachem_1.0.5
[63] cli_3.3.0 generics_0.1.0
[65] RSQLite_2.2.7 evaluate_0.14
[67] stringr_1.4.0 fastmap_1.1.0
[69] yaml_2.2.1 babelgene_21.4
[71] knitr_1.33 bit64_4.0.5
[73] purrr_0.3.4 KEGGREST_1.32.0
[75] ash_1.0-15 ggrastr_0.2.3
[77] xml2_1.3.2 biomaRt_2.48.2
[79] compiler_4.1.0 rstudioapi_0.13
[81] filelock_1.0.2 curl_4.3.2
[83] beeswarm_0.4.0 png_0.1-8
[85] tibble_3.1.3 geneplotter_1.70.0
[87] stringi_1.7.3 highr_0.10
[89] ggalt_0.4.0 lattice_0.20-45
[91] Matrix_1.3-4 vctrs_0.3.8
[93] pillar_1.6.1 lifecycle_1.0.0
[95] BiocManager_1.30.16 GlobalOptions_0.1.2
[97] bitops_1.0-7 irlba_2.3.3
[99] R6_2.5.0 renv_0.15.4
[101] KernSmooth_2.23-20 gridExtra_2.3
[103] vipor_0.4.5 codetools_0.2-19
[105] MASS_7.3-55 assertthat_0.2.1
[107] rprojroot_2.0.2 rjson_0.2.21
[109] withr_2.4.2 GenomeInfoDbData_1.2.6
[111] hms_1.1.0 grid_4.1.0
[113] Cairo_1.5-12.2 mixsqp_0.3-43
[115] tinytex_0.37 ggbeeswarm_0.6.0

Problem with loading several categories

In our work we often want to test our gene lists against several categories of gene sets at once.
Until now we would load the gene sets like this:

msigdb.genes.sets <-msigdbr(species="Homo sapiens", category=c("H","C2"))

We noticed that in doing so, the gene sets are truncated, with a remaining number of genes in a gene set varying with the number of categories or their order.
After looking at the R code, it seems the problem is that the categories are filtered with "==" and not "%in%", which means we cannot pass a vector of categories in our command. But no warning or error is thrown and everything downstream works, so the background ratio values are silently wrong.

Would it be possible to correct this or to forbid requesting more than one category in the command?
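The truncation described above is what R's vector recycling would produce if the filter were an element-wise == comparison, as the report suggests. The snippet below is a minimal base R illustration on a toy category vector (hypothetical values); it is not the package's actual code.

```r
# Toy category column, one entry per row of a gene set table.
gs_cat <- c("H", "H", "C2", "C2", "C5", "H")
requested <- c("H", "C2")

# `==` recycles `requested` along `gs_cat`, so each row is compared with
# only one of the requested values (H, C2, H, C2, H, C2).
keep_eq <- gs_cat == requested

# `%in%` tests each row against the whole set of requested values.
keep_in <- gs_cat %in% requested

sum(keep_eq)  # rows kept by the recycled comparison
sum(keep_in)  # rows kept by set membership
```

Because the length of gs_cat is a multiple of the length of requested here, R recycles without any warning, which would be consistent with the silent behaviour reported.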

Some orthologs are missing

Hi,

I am trying to use msigdbr for a GSEA analysis for the GENESET - HSF1_01 in MSigDB.

Now this geneset contains a gene SHFM3 in MSigDB but it is missing in your list of orthologs for the same geneset.

I did a search for this gene - https://uswest.ensembl.org/Multi/Search/Results?q=SHFM3;site=ensembl

And found out that this gene has an alias/synonym, FBXW4 (as shown here: https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000107829;r=10:101610664-101695295 )

This particular alias (FBXW4) does have ORTHOLOG information for mus musculus (Fbxw4) as shown at - https://uswest.ensembl.org/Homo_sapiens/Gene/Compara_Ortholog?db=core;g=ENSG00000107829;r=10:101610664-101695295

There are many such cases and I was wondering if that is intentional or could be fixed in the future releases?

Much appreciated!

Ashu

No gene sets from KEGG, REACTOME or BIOCARTA

It looks like it's no longer possible to get gene sets from KEGG, REACTOME or BIOCARTA:

c2_reactome <- msigdbr(category = "C2", subcategory = "REACTOME") %>%
  split(x = .$gene_symbol, f = .$gs_name)
> length(c2_reactome)
[1] 0

Can these be restored? Thank you.

2023 update?

Thank you for developing this useful tool. Do you have any plans to update it based on the 2023 release of MSigDB?

Add shorter GO descriptions?

The entries in the gs_description column for GO terms are rather long and not ideal for use as human-readable identifiers when plotting ORA or GSEA results. Would it be possible to add a gs_brief_description column that uses the names from the appropriate GO database release? I have been getting the data using the code below and then left-joining it to ORA and GSEA results tables made with fgsea. For other databases, I just use the entries in gs_description.

# install.packages(c("ontologyIndex", "dplyr"))
library(ontologyIndex)
library(dplyr)

# Brief GO term descriptions (use same data from MSigDB release notes)
file <- "http://release.geneontology.org/2021-12-15/ontology/go-basic.obo"
go_basic_list <- get_OBO(file,
                         propagate_relationships = "is_a",
                         extract_tags = "minimal")

# Convert to data.frame with fewer columns
go_basic_df <- as.data.frame(go_basic_list) %>%
  filter(!obsolete) %>%
  select(pathway = id, name)

Methodology details, and `write.gmt` helper functions?

Hi, I came across your package, which could potentially save me a lot of work, so thank you.

Could you publish the details of your method for converting human genes to other species? I need this information to be able to cite you in my research.

Also will you consider adding helper functions to convert from the data.frame types to a type which can be easily written as a .gmt pathway file?
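Until such a helper exists, something along these lines may work. This is a minimal base R sketch and not part of msigdbr: the function name write_gmt_sketch is hypothetical, the input is assumed to have gs_name and gene_symbol columns, and the description field is filled with the placeholder "NA". Each GMT line holds the set name, a description, and then the genes, all tab-separated.

```r
# Hypothetical helper: write a long gene set data frame as a GMT file.
write_gmt_sketch <- function(df, path) {
  # Collapse to one vector of unique gene symbols per gene set.
  sets <- lapply(split(df$gene_symbol, df$gs_name), unique)
  # Build one tab-separated line per set: name, description, genes.
  lines <- vapply(
    names(sets),
    function(nm) paste(c(nm, "NA", sets[[nm]]), collapse = "\t"),
    character(1)
  )
  writeLines(lines, path)
  invisible(lines)
}

# Toy input standing in for msigdbr() output.
toy <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B"),
  gene_symbol = c("TP53", "MYC", "EGFR"),
  stringsAsFactors = FALSE
)
gmt_file <- tempfile(fileext = ".gmt")
write_gmt_sketch(toy, gmt_file)
readLines(gmt_file)
```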

Retrieve all C2 canonical pathways using option subcategory = "CP"?

Dear Igordot,

Thanks for this wonderful tool! I understand it can be used to retrieve subcategory pathways by setting subcategory = "CP:KEGG". But I was wondering if I can extract all canonical pathways as follows:

library(msigdbr)
m_df = msigdbr(species = "Homo sapiens", category = 'C2', subcategory = 'CP')
length(unique(m_df$gs_name))
[1] 29

Looking forward to your comments!

Best,
Lei
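One possible workaround is to load the whole C2 category and then keep every row whose gs_subcat is "CP" or starts with "CP:". The sketch below shows only the filtering step on a toy data frame. The set names are made up, and the labels "CP:BIOCARTA" and "CP:REACTOME" are assumed by analogy with the "CP:KEGG" label mentioned above.

```r
# Toy stand-in for msigdbr(category = "C2") output.
toy_c2 <- data.frame(
  gs_name = c("TOY_BIOCARTA", "TOY_KEGG", "TOY_REACTOME", "TOY_CGP", "TOY_CP"),
  gs_subcat = c("CP:BIOCARTA", "CP:KEGG", "CP:REACTOME", "CGP", "CP"),
  stringsAsFactors = FALSE
)

# Keep the bare "CP" label plus every "CP:<source>" label.
# "CGP" is excluded because it neither equals "CP" nor starts with "CP:".
is_cp <- toy_c2$gs_subcat == "CP" | startsWith(toy_c2$gs_subcat, "CP:")
cp_sets <- toy_c2[is_cp, ]

length(unique(cp_sets$gs_name))
```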

`unused argument (.data$species_name == species)` error

Hi,
I've just got an unused argument (.data$species_name == species) error, and I don't know how to proceed. Is it a bug or am I doing something wrong?

> library(msigdbr)
> msigdbr(species = "Homo sapiens")
Error in filter.tbl(msigdbr_orthologs, .data$species_name == species) : 
  unused argument (.data$species_name == species)
> msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
Error in filter.tbl(msigdbr_orthologs, .data$species_name == species) : 
  unused argument (.data$species_name == species)
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/R-4.0.2/lib64/R/lib/libRblas.so
LAPACK: /opt/R-4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] msigdbr_7.2.1               DESeq2_1.28.1              
 [3] SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
 [5] matrixStats_0.57.0          Biobase_2.48.0             
 [7] rtracklayer_1.48.0          genomation_1.20.0          
 [9] gProfileR_0.7.0             ChIPpeakAnno_3.22.4        
[11] Biostrings_2.56.0           XVector_0.28.0             
[13] VennDiagram_1.6.20          futile.logger_1.4.3        
[15] rGREAT_1.20.0               methylKit_1.14.2           
[17] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
[19] IRanges_2.22.2              S4Vectors_0.26.1           
[21] BiocGenerics_0.34.0         gprofiler2_0.2.0           
[23] reshape2_1.4.4              ggplot2_3.3.2              
[25] gridExtra_2.3               data.table_1.13.0          
[27] biomaRt_2.44.4              igraph_1.2.6               
[29] STRINGdb_2.0.2             

loaded via a namespace (and not attached):
  [1] circlize_0.4.10          BiocFileCache_1.12.1     plyr_1.8.6              
  [4] lazyeval_0.2.2           splines_4.0.2            BiocParallel_1.22.0     
  [7] gridBase_0.4-7           digest_0.6.25            ensembldb_2.12.1        
 [10] htmltools_0.5.0          GO.db_3.11.4             magrittr_1.5            
 [13] memoise_1.1.0            BSgenome_1.56.0          limma_3.44.3            
 [16] annotate_1.66.0          readr_1.4.0              R.utils_2.10.1          
 [19] askpass_1.1              bdsmatrix_1.3-4          prettyunits_1.1.1       
 [22] colorspace_1.4-1         blob_1.2.1               rappdirs_0.3.1          
 [25] xfun_0.18                dplyr_1.0.2              crayon_1.3.4            
 [28] RCurl_1.98-1.2           jsonlite_1.7.1           graph_1.66.0            
 [31] genefilter_1.70.0        impute_1.62.0            survival_3.1-12         
 [34] glue_1.4.2               hash_2.2.6.1             gtable_0.3.0            
 [37] zlibbioc_1.34.0          seqinr_4.2-4             GetoptLong_1.0.3        
 [40] shape_1.4.5              scales_1.1.1             futile.options_1.0.1    
 [43] mvtnorm_1.1-1            DBI_1.1.0                Rcpp_1.0.5              
 [46] plotrix_3.7-8            xtable_1.8-4             viridisLite_0.3.0       
 [49] progress_1.2.2           emdbook_1.3.12           bit_4.0.4               
 [52] mclust_5.4.6             sqldf_0.4-11             htmlwidgets_1.5.2       
 [55] httr_1.4.2               gplots_3.1.0             RColorBrewer_1.1-2      
 [58] ellipsis_0.3.1           pkgconfig_2.0.3          XML_3.99-0.5            
 [61] R.methodsS3_1.8.1        farver_2.0.3             dbplyr_1.4.4            
 [64] locfit_1.5-9.4           tidyselect_1.1.0         labeling_0.3            
 [67] rlang_0.4.7              AnnotationDbi_1.50.3     munsell_0.5.0           
 [70] tools_4.0.2              gsubfn_0.7               generics_0.0.2          
 [73] RSQLite_2.2.1            ade4_1.7-15              fastseg_1.34.0          
 [76] evaluate_0.14            stringr_1.4.0            yaml_2.2.1              
 [79] knitr_1.30               bit64_4.0.5              caTools_1.18.0          
 [82] purrr_0.3.4              AnnotationFilter_1.12.0  RBGL_1.64.0             
 [85] formatR_1.7              R.oo_1.24.0              xml2_1.3.2              
 [88] compiler_4.0.2           rstudioapi_0.11          plotly_4.9.2.1          
 [91] curl_4.3                 png_0.1-7                geneplotter_1.66.0      
 [94] tibble_3.0.3             idr_1.2                  stringi_1.5.3           
 [97] GenomicFeatures_1.40.1   lattice_0.20-41          ProtGenerics_1.20.0     
[100] Matrix_1.2-18            multtest_2.44.0          vctrs_0.3.4             
[103] pillar_1.4.6             lifecycle_0.2.0          BiocManager_1.30.10     
[106] GlobalOptions_0.1.2      bitops_1.0-6             qvalue_2.20.0           
[109] R6_2.4.1                 KernSmooth_2.23-17       lambda.r_1.2.4          
[112] MASS_7.3-51.6            gtools_3.8.2             assertthat_0.2.1        
[115] chron_2.3-56             proto_1.0.0              openssl_1.4.3           
[118] rjson_0.2.20             withr_2.3.0              regioneR_1.20.1         
[121] GenomicAlignments_1.24.0 Rsamtools_2.4.0          GenomeInfoDbData_1.2.3  
[124] hms_0.5.3                tidyr_1.1.2              coda_0.19-4             
[127] rmarkdown_2.4            seqPattern_1.20.0        bbmle_1.0.23.1          
[130] numDeriv_2016.8-1.1      tinytex_0.26

Best,
Kasia

enricher result is different from msigDB web "investigate Gene Sets"

Hi,

Many thanks for the msigdbr package.
Can I ask a question about the result of enricher please?

msigdbr_t2g = msigdbr_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)

I am using the code above, but I've found that the enriched MSigDB signatures differ from the "Investigate Gene Sets" results on the MSigDB website. I thought enrichment was based on the number of genes shared between the user's gene list and the gene set. But the overlap count from enricher seems smaller than the real overlap count (i.e. the count I get if I use intersect to see how many genes overlap between my list and the MSigDB gene set). Did I misunderstand the function of enricher here? And if possible, how can I get the same results as the MSigDB website?

Thanks in advance!

Best,
Wei

Save the 'entrez_gene' columns in character mode

First thanks for this great package! Especially it directly outputs three different gene ID types, which saves a lot of time when switching between different gene ID types.

I have a small suggestion. In the output table, columns related to "entrez_gene" are stored as integers. I would suggest changing them to characters, as other Bioconductor annotation packages do (e.g. org.Hs.eg.db).

gene_sets
# A tibble: 8,209 × 15
   gs_cat gs_su…¹ gs_name gene_…² entre…³ ensem…⁴ human…⁵ human…⁶ human…⁷ gs_id gs_pmid gs_ge…⁸
   <chr>  <chr>   <chr>   <chr>     <int> <chr>   <chr>     <int> <chr>   <chr> <chr>   <chr>  
 1 H      ""      HALLMA… ABCA1        19 ENSG00… ABCA1        19 ENSG00… M5905 267710… ""     
 2 H      ""      HALLMA… ABCB8     11194 ENSG00… ABCB8     11194 ENSG00… M5905 267710… ""     
 3 H      ""      HALLMA… ACAA2     10449 ENSG00… ACAA2     10449 ENSG00… M5905 267710… ""     
 4 H      ""      HALLMA… ACADL        33 ENSG00… ACADL        33 ENSG00… M5905 267710… ""     
 5 H      ""      HALLMA… ACADM        34 ENSG00… ACADM        34 ENSG00… M5905 267710… ""     
 6 H      ""      HALLMA… ACADS        35 ENSG00… ACADS        35 ENSG00… M5905 267710… ""     
 7 H      ""      HALLMA… ACLY         47 ENSG00… ACLY         47 ENSG00… M5905 267710… ""     
 8 H      ""      HALLMA… ACO2         50 ENSG00… ACO2         50 ENSG00… M5905 267710… ""     
 9 H      ""      HALLMA… ACOX1        51 ENSG00… ACOX1        51 ENSG00… M5905 267710… ""     
10 H      ""      HALLMA… ADCY6       112 ENSG00… ADCY6       112 ENSG00… M5905 267710… ""     
# … with 8,199 more rows, 3 more variables: gs_exact_source <chr>, gs_url <chr>,
#   gs_description <chr>, and abbreviated variable names ¹​gs_subcat, ²​gene_symbol,
#   ³​entrez_gene, ⁴​ensembl_gene, ⁵​human_gene_symbol, ⁶​human_entrez_gene, ⁷​human_ensembl_gene,
#   ⁸​gs_geoid
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Imagine we want to convert Entrez IDs to Refseq IDs, and we have a mapping vector (map) where Entrez IDs are the names and Refseq IDs are the values. Then naturally, to convert, we can do:

map[gene_sets$entrez_gene]

This causes a problem because gene_sets$entrez_gene is an integer vector, so its values are treated as numeric positions in the map vector instead of being matched against the names of map.

To do it correctly, we need to explicitly convert gene_sets$entrez_gene to characters:

map[as.character(gene_sets$entrez_gene)]

The more severe consequence is, if the maximal numeric value in gene_sets$entrez_gene is smaller than the length of map, executing map[gene_sets$entrez_gene] actually will not generate any warning or error message. And it would generate wrong results silently.
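The pitfall can be reproduced with a few lines of base R. The mapping vector below is a toy example: the values are placeholders, not real RefSeq IDs.

```r
# Toy mapping: names are Entrez IDs (as strings), values are placeholder IDs.
map <- c("19" = "NM_A", "33" = "NM_B", "2" = "NM_C")

# Entrez IDs stored as integers, as in the entrez_gene column.
entrez_int <- c(2L, 33L)

# Integer subscripts select by position: element 2, then element 33.
by_position <- unname(map[entrez_int])

# Character subscripts select by name: the entries named "2" and "33".
by_name <- unname(map[as.character(entrez_int)])

by_position
by_name
```

Here position 2 happens to exist, so the integer lookup returns the entry named "33" rather than the one named "2", without any warning. Position 33 is out of range and yields NA.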

msigdbr for yeast

Hi, I am using the package for yeast GSEA, and I see some enrichments that seem not to be related to yeast, such as:
HP_ADDICTIVE_BEHAVIOR or HP_ACUTE_MYELOID_LEUKEMIA. I am a beginner; could you please tell me whether the error is on my end and I am doing something wrong?

I used

#get all collections/signatures with yeast
yeast_gsea <- msigdbr(species = "Saccharomyces cerevisiae")
yeast_gsea %>%   dplyr::distinct(gs_cat, gs_subcat) %>%   dplyr::arrange(gs_cat, gs_subcat)
#choose a specific msigdb collection/subcollection
yeast_gsea_c5 <- msigdbr(species = "Saccharomyces cerevisiae", category = "C5") %>% dplyr::select(gs_name, gene_symbol)
