
clusterprofiler's Introduction

clusterProfiler

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

  • clusterProfiler supports exploring functional characteristics of both coding and non-coding genomics data for thousands of species with up-to-date gene annotation.
  • It provides a universal interface for gene functional annotation from a variety of sources and thus can be applied in diverse scenarios.
  • It provides a tidy interface to access, manipulate, and visualize enrichment results to help users achieve efficient data interpretation.
  • Datasets obtained from multiple treatments and time points can be analyzed and compared in a single run, easily revealing functional consensus and differences among distinct conditions.

For details, please visit https://yulab-smu.top/biomedical-knowledge-mining-book/.

✍️ Authors

Guangchuang YU https://yulab-smu.top

School of Basic Medical Sciences, Southern Medical University


If you use clusterProfiler in published research, please cite the most appropriate paper(s) from this list:

  1. T Wu#, E Hu#, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo*, G Yu*. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141. doi: 10.1016/j.xinn.2021.100141
  2. G Yu*. Gene Ontology Semantic Similarity Analysis Using GOSemSim. In: Kidder B. (eds) Stem Cell Transcriptional Networks. Methods in Molecular Biology. 2020, 2117:207-215. Humana, New York, NY. doi: 10.1007/978-1-0716-0301-7_11
  3. G Yu*. Using meshes for MeSH term enrichment and semantic analyses. Bioinformatics. 2018, 34(21):3766–3767. doi: 10.1093/bioinformatics/bty410
  4. G Yu, QY He*. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems. 2016, 12(2):477-479. doi: 10.1039/C5MB00663E
  5. G Yu*, LG Wang, and QY He*. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics. 2015, 31(14):2382-2383. doi: 10.1093/bioinformatics/btv145
  6. G Yu*, LG Wang, GR Yan, QY He*. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics. 2015, 31(4):608-609. doi: 10.1093/bioinformatics/btu684
  7. G Yu, LG Wang, Y Han and QY He*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287. doi: 10.1089/omi.2011.0118
  8. G Yu, F Li, Y Qin, X Bo*, Y Wu, S Wang*. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010, 26(7):976-978. doi: 10.1093/bioinformatics/btq064

clusterprofiler's People

Contributors

778055611, alexanderpico, altairwei, amcdavid, clearmind777, dalloliogm, dtenenba, gaospecial, gregkoytiger, guangchuangyu, gwangjinkim, hpages, huerqiang, jigyasa-g, jwokaty, kevinrue, mjchen1996, nelson-gon, nturaga, pshannon-bioc, ryuzheng, sonali-bioc, vobencha, xiangpin

clusterprofiler's Issues

My "geneCluster" works fine with "compareCluster" when using fun = "enrichGO" but not with fun = "enrichKEGG"

My geneCluster = genes works fine with the function compareCluster when I use fun = "enrichGO" to create an object, but if I use fun = "enrichKEGG" instead, it throws an error. Please see below:
compGO <- compareCluster(geneCluster = genes, fun = "enrichGO", pvalueCutoff = 0.05, pAdjustMethod = "BH")
gives the following error:

Error in `row.names<-.data.frame`(tmp, value = c(NA_character_, NA_character_, : duplicate 'row.names' are not allowed In addition: There were 50 or more warnings (use warnings() to see the first 50)
Could you please help on this issue?
Thanks in advance,
Ajeet

newbie question on gseGO input gene list ranking

Hi there

I am trying to run gseGO with a list of genes and P-values. The manual says the input is an "order ranked geneList". I thought the ranking should have the most significant gene at the top, but this gives me the following error:

Error in gseAnalyzer(geneList = geneList, setType = ont, organism = organism, :
geneList should be a decreasing sorted vector..

Is the idea to have the genes ranked by decreasing P-value (least significant at the top)? The geneList example from DOSE has expression values rather than P-values.

Please advise, and thanks!
Vicky
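
The error message above asks for a numeric vector sorted in decreasing order. As a minimal base-R sketch of preparing such an input, assume a hypothetical data frame `res` with columns `entrez` and `log2FC` (neither name comes from this thread). Ranking on a signed statistic mirrors the DOSE example mentioned above, which uses expression values.

```r
# Hypothetical differential-expression table; the column names are assumptions.
res <- data.frame(
  entrez = c("4312", "8318", "10874", "55143"),
  log2FC = c(-1.2, 4.5, 0.3, 2.1),
  stringsAsFactors = FALSE
)

# Named numeric vector: values are the ranking statistic, names are gene IDs.
geneList <- res$log2FC
names(geneList) <- res$entrez

# Sort so that the largest value comes first, as the error message requires.
geneList <- sort(geneList, decreasing = TRUE)
print(geneList)
```

Which statistic to rank on is an analysis choice; the sketch only shows the shape of the vector.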

facet compareCluster result

This is an idea from @dalloliogm

It would be better if compareCluster returned a data frame with multiple columns instead of merging them into a single column called Cluster. This would make it possible to plot the results using facets or something fancier.

see http://bioinfoblog.it/2015/02/a-formula-interface-for-geneontology-analysis/.

data(geneList, package="DOSE")
mydf <- data.frame(Entrez=names(geneList), FC=geneList)
mydf <- mydf[abs(mydf$FC) > 1,]
mydf$group <- "upregulated"
mydf$group[mydf$FC < 0] <- "downregulated"
mydf$othergroup <- "A"
mydf$othergroup[abs(mydf$FC) > 2] <- "B"

require(clusterProfiler)
xx <- compareCluster(Entrez~group+othergroup, data=mydf, fun="enrichGO")
require(ggplot2)
## plot(xx), since the parameter x was already taken as input object, 
## we can't use `x` to specify x variable.
## now we prefer using `dotplot` function to produce the same figure.
dotplot(xx)

## we can specify other x variable instead of the default `Cluster` variable.
dotplot(xx,x=~group) + facet_grid(~othergroup)

'enricher' generic enrichment function

This function, for calculating enrichment of customized term/gene sets, is mentioned in the clusterProfiler documentation manual. I do not find it in the version installed from Bioconductor, but I do find it after installing your latest version from GitHub.

install_github(c("GuangchuangYu/DOSE", "GuangchuangYu/clusterProfiler"))
library(clusterProfiler)
help(enricher)

My problem is with the custom genesets. I built a dataframe of 17 genesets with the pathway name in the first column and the gene symbol in the 2nd column:

str(Term2Gene)
'data.frame': 2485 obs. of 2 variables:
$ Pathway : chr "Bmp" "Bmp" "Bmp" "Bmp" ...
$ Gene.symbol: chr "Acvr1" "Acvr2a" "Acvr2b" "Acvrl1" ...
head(Term2Gene)
Pathway Gene.symbol
1 Bmp Acvr1
2 Bmp Acvr2a
3 Bmp Acvr2b
4 Bmp Acvrl1

table(Term2Gene$Pathway)
AdhGPCR Axon Bmp Cams clefting Cytokine ECM2FAC Eicos Fgf
78 198 141 166 568 200 231 42 78
GPCR Hh Hippo Notch RA RTK stretch Wnt
271 71 48 58 24 91 54 166

When I run enricher on a test vector of 65 gene symbols, it finds one pathway enriched. But only 13 of my 17 genesets are in the result. When I run another test vector, it finds 16 of the 17 genesets. In both cases, the largest geneset ('clefting') is missing. Does this indicate that the missing genesets are not tested for enrichment? Or does it indicate some other problem? I'm attaching the Term2Gene dataframe FYI.
Term2Gene.txt

Enblack <-enricher(genes, pvalueCutoff = 0.05, pAdjustMethod = "BH", qvalueCutoff = 1,
universe=unique(unlist(Term2Gene$Gene.symbol)), TERM2GENE=Term2Gene )
names(Enblack@geneSets)
[1] "AdhGPCR" "Axon" "Bmp" "Cams" "Cytokine" "ECM2FAC" "Fgf"
[8] "GPCR" "Hh" "Hippo" "RTK" "stretch" "Wnt"
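
One way to investigate which gene sets could contribute to the result is to tabulate, per set, its size and how many query genes fall into it. The base-R sketch below uses a made-up stand-in for Term2Gene (only the column names follow the str() output above) and a hypothetical query vector. It does not reproduce whatever filtering enricher applies internally.

```r
# Toy stand-in for the Term2Gene table shown above (gene values are made up).
Term2Gene <- data.frame(
  Pathway     = c("Bmp", "Bmp", "Bmp", "Wnt", "Wnt", "Notch"),
  Gene.symbol = c("Acvr1", "Acvr2a", "Acvr2b", "Wnt1", "Wnt3a", "Notch1"),
  stringsAsFactors = FALSE
)
genes <- c("Acvr1", "Wnt1", "Wnt3a")  # hypothetical query vector

# Size of every gene set.
set_size <- table(Term2Gene$Pathway)

# Number of query genes found in each gene set (0 for sets without a hit).
hits <- table(factor(Term2Gene$Pathway[Term2Gene$Gene.symbol %in% genes],
                     levels = names(set_size)))

overview <- data.frame(
  Pathway = names(set_size),
  size    = as.integer(set_size),
  overlap = as.integer(hits),
  stringsAsFactors = FALSE
)
print(overview)
```

Sets with zero overlap, or with sizes outside any limits the installed version enforces, would be the first candidates to check against the names missing from the result.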

plotting compareCluster should have an option to show the pvalues for all comparisons

Plotting a compareCluster result makes it possible to quickly compare the top categories enriched in each group:

> mydf <- data.frame(Entrez=c('1', '100', '1000', '100101467',
                            '100127206', '100128071'),
                  group = c('A', 'A', 'A', 'B', 'B', 'B'))
> enrichment = compareCluster(Entrez~group, data=mydf, fun='groupGO')
> plot(enrichment, showCategory=2)

[Figure: plotting compareCluster results]

This representation can be misleading in certain situations. For example, one may think that group A doesn't contain any gene associated with membrane or cell, and that these terms are not enriched in this group. However, if we look at the enrichment data frame, we see that these two terms actually have a higher GeneRatio in group A:

> enrichment %>% summary
   Cluster         ID          Description Count GeneRatio     geneID
1        A GO:0016020             membrane     2       2/5   100/1000
2        A GO:0005576 extracellular region     3       3/5 1/100/1000
4        A GO:0005623                 cell     2       2/5   100/1000
12       A GO:0043226            organelle     3       3/5 1/100/1000
23       B GO:0016020             membrane     1       1/3  100127206
24       B GO:0005576 extracellular region     0       0/3           
26       B GO:0005623                 cell     1       1/3  100101467
34       B GO:0043226            organelle     1       1/3  100101467

If we plot this, the situation becomes even clearer. Group A actually has a higher ratio of genes for cell and membrane than group B, but since these are not in the top 2 terms, they were not included in the previous plot:

[Figure: full compareCluster results]

In a few minutes I will post a modified version of compareCluster, including a new plotting parameter that produces the second type of plot.

enrichKEGG returning NA

It appears that enrichKEGG is returning NA. This can be repeated using:

data(gcSample)
yy = enrichKEGG(gcSample[[5]], pvalueCutoff=0.01)
print(yy)

compareCluster: List all the enrichKEGG results?

Dear Guangchuang,

Thanks for your great work on this helpful package.

I tried the "compareCluster" command with fun='enrichKEGG'.
I supplied 2 samples but got 1 result (the compared result, which makes sense).

cp = list(a.gene=a[,1], b.gene=b[,1]) 
xx <- compareCluster(cp, fun="enrichKEGG", organism="syz", pvalueCutoff=1)
plot(xx)

I wonder if I could get 2 results and then plot them in one figure with dotplot(xx).
I mean, I can run enrichKEGG() twice (once for a[,1], once for b[,1]) and plot the 2 results as dot plots in 2 figures. Could I instead plot the 2 results in one figure, as in your example:

xx.formula <- compareCluster(Entrez~group, data=mydf, fun='enrichKEGG')
plot(xx.formula)

I hope you can understand what I mean despite my poor English. Thanks again!
Best wishes,
Peng

clusterprofiler seems to not work for groupGO for organism="ecsakai"!!

Hi,

I have been using clusterprofiler with good effect for a while now. Thanks.

I have used similar commands for a set of genes in human and then for a different set of genes in "ecsakai", for the "Functional Profile of a gene set" analysis.
With "human" everything works fine.
When I run the command below, I get no output.

Command
ggo <- groupGO(gene=AllGenes_EntrezIds, organism="ecsakai", ont ="BP", level=4, readable = FALSE)

In analysis for "ecsakai", I have confirmed that the variable "AllGenes_EntrezIds" contains the EntrezIds in the required format (same format as in humans).

Output

head(summary(ggo))
ID Description Count GeneRatio geneID
GO:0000747 GO:0000747 conjugation with cellular fusion 0 0/5195
GO:0000909 GO:0000909 sporocarp development involved in sexual reproduction 0 0/5195
GO:0007276 GO:0007276 gamete generation 0 0/5195
GO:0007618 GO:0007618 mating 0 0/5195
GO:0009566 GO:0009566 fertilization 0 0/5195
GO:0034293 GO:0034293 sexual sporulation 0 0/5195

Appreciate your assistance,

Nandan

problem about geneList file

hi YGC,
I have read your clusterProfiler.R script. The geneList file that you used in the script contains information such as:

head(geneList)
4312 8318 10874 55143 55388 991
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857
Fold change is brought up later in the script. How can I get this value for each gene?

best
salviadr

GSEKEGG

Hi, Guangchuang,
Thanks for your help last time.
Here is a new error when I use the gseKEGG function.

kk2 <- gseKEGG(geneList = geneList,
               organism     = 'hsa',
               keyType      = "kegg",
               nPerm        = 1000,
               minGSSize    = 10,
               maxGSSize    = 500,
               pvalueCutoff = 1,
               verbose      = FALSE,
               seed         = TRUE)

Error in data.frame(ID = as.character(gs.name), Description = Description, :
row names contain missing values

enrichKEGG and enrichGO work very well, actually.

Thanks

Chun

Input format for GSE analyses

Dear Guangchuang,

I carefully read the vignette that comes with clusterProfiler, but didn't find how to format the geneList input object for the gseKEGG and gseGO functions.

Any help would be greatly appreciated.

Thanks a lot !

Pierre-François

No gene set have size > 10

Using bioc3.3

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.utf8        LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] KEGG.db_3.2.3         AnnotationDbi_1.34.3  IRanges_2.6.1        
[4] S4Vectors_0.10.1      Biobase_2.32.0        BiocGenerics_0.18.0  
[7] clusterProfiler_3.0.4 DOSE_2.10.3          

loaded via a namespace (and not attached):
 [1] graph_1.50.0       igraph_1.0.1       Rcpp_0.12.5        magrittr_1.5      
 [5] splines_3.3.1      munsell_0.4.3      lattice_0.20-33    xtable_1.8-2      
 [9] colorspace_1.2-6   stringr_1.0.0      plyr_1.8.4         tools_3.3.1       
[13] grid_3.3.1         gtable_0.2.0       DBI_0.4-1          matrixStats_0.50.2
[17] assertthat_0.1     tibble_1.0         GSEABase_1.34.0    GOSemSim_1.30.2   
[21] tidyr_0.5.1        DO.db_2.9          reshape2_1.4.1     ggplot2_2.1.0     
[25] qvalue_2.4.2       RSQLite_1.0.0      stringi_1.1.1      GO.db_3.3.0       
[29] scales_0.4.0       XML_3.98-1.4       SparseM_1.7        annotate_1.50.0   
[33] topGO_2.24.0 

The following example does not work


geneIds=structure(c(0.0173923238661593, 0.0143980919792023, 0.0143014753479478,
0.0141333294544166, 0.0139817936807474, 0.0138515590239863, 0.0133046300321645,
0.013199411621746, 0.0131538876919883, 0.0129413235474243, 0.0129404860673982,
0.0126872284406404, 0.0125505909232479, 0.0124183413868258, 0.0120056405688979,
0.0119079944195171, 0.0117673672237872, 0.0114767878521646, 0.0110154153502363,
0.010962467133225, 0.0107579352744376, 0.0106730532960365, 0.0106671690856796,
0.0106320434414624, 0.0105305751411446, 0.0103664110663275, 0.0103600890386993,
0.0102755438541349, 0.0102356297066653, 0.0101946788668722, 0.00997845041597925,
0.00995757065408425, 0.009662913411882, 0.00961775242988928,
0.00955211188998572, 0.00921268981722911, 0.00917063864909869,
0.00904387975341085, 0.00903311169583011, 0.00893000047504928,
0.00890876773255115, 0.00885629656304629, 0.00870443842831365,
0.00869724683575499, 0.00867406713420053, 0.00865943861254504,
0.00862115333556084, 0.00841744507887297, 0.00840157514129999,
0.00838607138704924, 0.00838360550611291, 0.00836231299226293,
0.00829316093534756, 0.00825250315108262, 0.00824987441230929,
0.00823524052926456, 0.00820494945234031, 0.00804862024838777,
0.00801582043022039, 0.00791148112257307, 0.00785578056999656,
0.0078409245867989, 0.00777636287369049, 0.00773808118660346,
0.00770599409084573, 0.0075633716141738, 0.00752725142306771,
0.00736226333900363, 0.00735543487767409, 0.00732717114623565,
0.00724876596636551, 0.00722753710320134, 0.00719976952480489,
0.00708564165580783, 0.0070767184609525, 0.00706854787093993,
0.00693769860320728, 0.00688172228037059, 0.00677958365255367,
0.00677424912936278, 0.0067155601071985, 0.00665302470158571,
0.00664183931679701, 0.00650743084578016, 0.00642268029159123,
0.00638120220089193, 0.0063164526515491, 0.00626146839403315,
0.00606126814393558, 0.00594652357272233, 0.00591188660798504,
0.00590814117167224, 0.00588818499767644, 0.00588647778094827,
0.00585113176628442, 0.00583771554041842, 0.00579307636797131,
0.0055846621881879, 0.00553417945827071, 0.00549470576027388,
0.0054855933700562, 0.00544129605356585, 0.00535724070475638,
0.00531105950580039, 0.0053110032724002, 0.0053026947161063,
0.00523673990846534, 0.00521010695178545, 0.00517599827176783,
0.00514853166546027, 0.00513998338193496, 0.00506601423822119,
0.00505210209404965, 0.00502583890189226, 0.00495145263074892,
0.00490516536276781, 0.00487841237664218, 0.00487109727880267,
0.00482576886953764, 0.00471517957704596, 0.00469713799942159,
0.00464654036037223, 0.00460305140692751, 0.00444341269119886,
0.00443874990992157, 0.00440667247341112, 0.00432692936466679,
0.00432535844099266, 0.00430757189049738, 0.00417257788164193,
0.00411592647077357, 0.0040890733410466, 0.00403905987973448,
0.00398916414013136, 0.00395994737341667, 0.00395225749924206,
0.00395225749924206, 0.00372376930387344, 0.00366394687094291,
0.00359647901575507, 0.00356988261557318, 0.00354751161961419,
0.00348484354476047, 0.00335716952982369, 0.0033380365509023,
0.0033084868013819, 0.0032402067289251, 0.00321296430232524,
0.00311513965658878, 0.00310434981778507, 0.00308181667938787,
0.00307839855368606, 0.00304257005915105, 0.00300501496935367,
0.00297621111942268, 0.00297522222648755, 0.00292585504128358,
0.00289996232145217, 0.0028941270634306, 0.00287497951798827,
0.00283664546302303, 0.00283101290801741, 0.00282614714359438,
0.00278623830448935, 0.00261925606218997, 0.00252391820547274,
0.00249341712237714, 0.00248739983886826, 0.00241311543528369,
0.00232065751947078, 0.00221746427817126, 0.00199948024352192,
0.00184083972232587, 0.00154025190112489, 0.00135464006211967,
0.000997402093429148, 0.000971760009484102, 0.000942799780648732
), .Names = c("15467", "213522", "110855", "13175", "15937",
"20446", "380916", "17754", "52670", "22439", "67266", "665095",
"71968", "242667", "11792", "234593", "105653", "74213", "232975",
"235180", "212514", "71492", "67544", "73296", "22153", "108699",
"625098", "18823", "80297", "245666", "104886", "67144", "105298",
"14681", "227632", "16512", "235379", "12972", "67473", "50876",
"15441", "12227", "19242", "14809", "22152", "12355", "19883",
"382245", "105844", "16502", "117167", "244198", "102580", "67792",
"78286", "20254", "68304", "80906", "241727", "18197", "27357",
"69923", "68203", "73710", "67937", "229615", "18074", "18803",
"110446", "26383", "56542", "72568", "239157", "20856", "381058",
"12933", "18604", "12307", "66235", "20394", "73827", "230822",
"19395", "246293", "50883", "235281", "268859", "381694", "15894",
"68718", "76071", "107831", "381112", "56075", "75646", "14623",
"18676", "14432", "76484", "16485", "57340", "271639", "19283",
"195727", "208111", "67855", "53378", "56471", "380614", "217143",
"215707", "208869", "69962", "94090", "69539", "17751", "269643",
"170790", "217201", "76161", "97112", "240067", "58222", "215090",
"242773", "76893", "94191", "18217", "245670", "69726", "210933",
"71591", "216558", "330361", "56744", "18729", "100041146", "72054",
"70426", "71691", "71990", "329207", "225583", "66353", "76376",
"212448", "72041", "320111", "218440", "13643", "333605", "67979",
"231086", "26382", "75610", "69823", "245532", "73708", "225655",
"19762", "22360", "14586", "16536", "67272", "24115", "70530",
"70362", "381628", "108073", "100043915", "386750", "114604",
"242570", "211612", "15416", "442829", "243866", "18429"))

require(clusterProfiler)
require(KEGG.db)

keggResults <- gseKEGG(gene = geneIds, organism = "mmu", use_internal_data=T)

It just reports that no gene set has size > 10, which is not true, since most KEGG pathways include more than 10 genes and the provided gene list also contains many more than 10 genes.

Also, the latest development version from GitHub (clusterProfiler_3.1.3 + DOSE_2.11.5) fails to process the example, but with a different error: Error in max(pathwaysSizes) : invalid 'type' (list) of argument

TAIR ID not working for Arabidopsis

I tried TAIR ID for enrichGO and got the following message.

Error in .testForValidKeys(x, keys, keytype) :
None of the keys entered are valid keys for 'ENTREZID'. Please use the keys method to see a listing of valid arguments.

I also tried the code you posted.

c= head(keys(org.At.tair.db))
[1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060"
and the result is the same.
c
[1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050"
[6] "AT1G01060"
ego <- enrichGO(gene = c,
organism = "arabidopsis",
ont = "MF",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)

Error in .testForValidKeys(x, keys, keytype) :
None of the keys entered are valid keys for 'ENTREZID'. Please use the keys method to see a listing of valid arguments.

That's why I tried to use Entrez IDs instead, and got NA.
Is there anything wrong with arabidopsis as a parameter?

too many GO enrichment result compared with amigo2

Hi, Dr Yu:

Thanks a lot for your very useful software, clusterProfiler.

I compared the GO enrichment results from clusterProfiler and AmiGO 2. All of the AmiGO 2 results (221) also appear in the clusterProfiler output, but clusterProfiler reports more results (522).
Why does clusterProfiler return so many more GO enrichment results?

Can't install latest version of clusterProfiler from github

Thanks for your quick response to my feature request for KEGG module enrichment! I'm eager to try it out, but I keep getting a weird error when trying to install from github, and the install fails. Any ideas?
Error : object ‘aes_’ is not exported by 'namespace:ggplot2'
ERROR: lazy loading failed for package ‘clusterProfiler’

* removing ‘/Library/Frameworks/R.framework/Versions/3.2/Resources/library/clusterProfiler’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.2/Resources/library/clusterProfiler’
Error: Command failed (1)

full output:

install_github(c("GuangchuangYu/DOSE", "GuangchuangYu/clusterProfiler"))
Downloading GitHub repo GuangchuangYu/DOSE@master
Installing DOSE
Skipping 2 packages not available: DO.db, GO.db
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save
--no-restore CMD INSTALL
'/private/var/folders/g4/nw8h4p2556j6cxd22t9sb4b40000gn/T/RtmpqefDxc/devtools103906771e42f/GuangchuangYu-DOSE-9214bcc'
--library='/Library/Frameworks/R.framework/Versions/3.2/Resources/library'
--install-tests

* installing source package ‘DOSE’ ...
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
Warning: replacing previous import by ‘graph::.__C__dist’ when loading ‘topGO’
** installing vignettes
** testing if installed package can be loaded
* DONE (DOSE)
Downloading GitHub repo GuangchuangYu/clusterProfiler@master
Installing clusterProfiler
Skipping 2 packages not available: DO.db, GO.db
Skipping 1 packages ahead of CRAN: DOSE
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save
--no-restore CMD INSTALL
'/private/var/folders/g4/nw8h4p2556j6cxd22t9sb4b40000gn/T/RtmpqefDxc/devtools103906ccfd4d0/GuangchuangYu-clusterProfiler-b5abcdc'
--library='/Library/Frameworks/R.framework/Versions/3.2/Resources/library'
--install-tests

* installing source package ‘clusterProfiler’ ...
** R
** data
** inst
** preparing package for lazy loading
Error : object ‘aes_’ is not exported by 'namespace:ggplot2'
ERROR: lazy loading failed for package ‘clusterProfiler’
* removing ‘/Library/Frameworks/R.framework/Versions/3.2/Resources/library/clusterProfiler’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.2/Resources/library/clusterProfiler’
Error: Command failed (1)

Session Info:

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BiocInstaller_1.20.1 ggplot2_1.0.1 devtools_1.9.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.2 digest_0.6.8 MASS_7.3-45 grid_3.2.2 R6_2.1.1
[6] plyr_1.8.3 gtable_0.1.2 magrittr_1.5 scales_0.3.0 httr_1.0.0
[11] stringi_1.0-1 reshape2_1.4.1 curl_0.9.4 rstudioapi_0.3.1 proto_0.3-10
[16] tools_3.2.2 stringr_1.0.0 munsell_0.4.2 colorspace_1.2-6 memoise_0.2.1
[21] knitr_1.11

The total of genes in GeneRatio can be higher than the initial number of genes

More than an issue, this is just a question.

Let's calculate the enrichment of three genes: 1, 100 and 1000

> summary(groupGO(as.character(c(1,100,1000))))
                  ID                               Description Count GeneRatio     geneID
GO:0016020 GO:0016020                                  membrane     2       2/5   100/1000
GO:0005576 GO:0005576                      extracellular region     3       3/5 1/100/1000
GO:0005581 GO:0005581                           collagen trimer     0       0/5           
GO:0005623 GO:0005623                                      cell     2       2/5   100/1000
GO:0009295 GO:0009295                                  nucleoid     0       0/5           
GO:0019012 GO:0019012                                    virion     0       0/5           
GO:0030054 GO:0030054                             cell junction     2       2/5   100/1000
GO:0031012 GO:0031012                      extracellular matrix     0       0/5           

In the GeneRatio column, the output shows that there are 2 out of 5 genes matching the term "membrane". However, the input vector contained only three Entrez Ids (1, 100 and 1000). Shouldn't the total be 3 genes instead of 5?

I imagine that this can be due to one-to-many relationships between Entrez Ids and GeneOntology. For example, Entrez 1 may correspond to more than one gene in GO. However, how can I check this?

Matching EntrezID with go term

Hi Guangchuang,

not really an issue, but more of a question (quite possibly related to DOSE). I am interested in creating a data.frame with my original data, which included entrezIDs, and add associated GO terms. Is there an easy/simple way of matching entrezIDs with the GO terms returned by clusterProfiler?

For instance, when using groupGO the result (summary(head(ggo))) is a table with the GO_ID, description, (..), and geneID. However, as far as I can see, the geneID is the gene symbol and not the input for groupGO (entrezID). The information currently returned by groupGO is very informative, but this makes it somewhat difficult to merge with the original input.

The information seems to be in ggo@geneInCategory, and I am just wondering if there is a way of getting in a nice format. For instance (made up IDs):

> Entrez   GO
> 0065009 GO:0044085 regulation of transcription
> 0012890 GO:0019953 sexual reproduction
> (..)

This would be really helpful.

Cheers.
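
The groupGO output shown elsewhere on this page joins the member genes of each term with "/" in the geneID column. Assuming that column holds the input identifiers (for example when readable = FALSE is used), a base-R sketch of expanding it into one row per gene/term pair could look like this. The data frame `ggo_df` is a hand-built stand-in, not real package output.

```r
# Stand-in for a result table; geneID joins the member genes with "/".
ggo_df <- data.frame(
  ID          = c("GO:0016020", "GO:0005576"),
  Description = c("membrane", "extracellular region"),
  geneID      = c("100/1000", "1/100/1000"),
  stringsAsFactors = FALSE
)

# Split each geneID string into its individual gene identifiers.
genes_per_term <- strsplit(ggo_df$geneID, "/", fixed = TRUE)

# Repeat each term once per gene to build a long gene-to-term table.
n <- lengths(genes_per_term)
gene2go <- data.frame(
  Entrez      = unlist(genes_per_term),
  ID          = rep(ggo_df$ID, n),
  Description = rep(ggo_df$Description, n),
  stringsAsFactors = FALSE
)
print(gene2go)
```

The long table can then be joined to the original data with base R's merge(), for example merge(original, gene2go, by = "Entrez"), where `original` is a hypothetical data frame holding the input IDs in a column named Entrez.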

enrichKEGG

Hi,

I'd like to use enrichKEGG to look at enrichment of KEGG pathways in Staphylococcus aureus which is a species that is not previously supported per your example.

when I tried in Rstudio by modifying the example (shown here):

gene = HG003_DE_genes
head(gene)
Locus
1 SAOUHSC_00027
2 SAOUHSC_00045
3 SAOUHSC_00046
4 SAOUHSC_00047
5 SAOUHSC_00071
6 SAOUHSC_00087

hg003KEGG = enrichKEGG(gene, organism="sao")

I got the following error:
Error: could not find function "enrichKEGG"

I then tried to install enrichKEGG and got the following error:

install.packages("enrichKEGG")
Warning in install.packages :
package ‘enrichKEGG’ is not available (for R version 3.2.2)

Do you have advice to help resolve these issues?

Thank you!

Carolyn Ibberson

Comparison clusterProfiler::gseGO with GSEA-P

Dear Guangchuang.

I've been running some side-by-side comparisons of GSEA-P and gseGO from clusterProfiler, using the same ranked-by-logFC list of genes.

I am linking to two corresponding analysis files for the ontology "cellular component".

Two things struck me, maybe you can provide some insight.

  1. If I compare terms (positive phenotype correlation) that are enriched in both methods, overlap is very poor. While there are some overlapping terms, their enrichment scores/p-values are very different.
  2. P-values/adjusted p-values/q-values from gseGO are generally smaller and don't show a lot of variation. Corresponding values from GSEA-P usually show a larger dynamic range (from the threshold all the way to 1.0).

GSEA analysis using clusterProfiler::gseGO
GSEA analysis using GSEA-P (positive phenotype correlation only)

Any help in clarifying this situation would be greatly appreciated.

Best,
Maurits

PS.
Parameter sets for both methods should be consistent, see below.

I used gseGO with the following parameters:

res.GSEA.GO<-gseGO(geneList = geneList,
                   organism = "human",
                   exponent = 1,
                   ont = ont[i],
                   nPerm = 1000,
                   minGSSize = 15,
                   pvalueCutoff = 0.25,
                   verbose = TRUE);
 t<-summary(res.GSEA.GO);
# Exclude terms that are larger than 500
# (default in GSEA-P)
 t<-t[which(t$setSize<500),];

GSEA-P was run with the following parameters:

producer_class  xtools.gsea.GseaPreranked
producer_timestamp  1445920145899
param   collapse    true
param   plot_top_x  20
param   rnk /home/maurits/Desktop/clusterProfiler/list.rnk
param   norm    meandiv
param   scoring_scheme  weighted
param   make_sets   true
param   mode    Max_probe
param   gmx gseaftp.broadinstitute.org://pub/gsea/gene_sets/c5.cc.v5.0.symbols.gmt
param   gui false
param   chip    gseaftp.broadinstitute.org://pub/gsea/annotations/GENE_SYMBOL.chip
param   rpt_label   my_analysis
param   help    false
param   out /home/maurits/gsea_home/output/oct27
param   include_only_symbols    true
param   set_min 15
param   nperm   1000
param   rnd_seed    timestamp
param   zip_report  false
param   set_max 500

extend the formula interface to gene set enrichment analysis

Hi Guangchuang,
I was wondering if the formula interface of compareCluster could be modified to include also gene set enrichment analysis like gseGO.

If I understand correctly, compareCluster can be used to apply gseGO and others to list of lists of genes:

> mylist = list('a'=geneList[1:200], 'b'=geneList[201:400])
> compareCluster(mylist, fun='gseGO')

I wonder how this could be extended to the formula interface. The major difficulty is to define the column containing expression values.

> mydf = data.frame('entrez'=names(geneList)[1:400], 'expression'=geneList[1:400], group=c(rep('a', 200), rep('b', 200)))

option 1: using a new parameter
> compareCluster(data=mydf, entrez~group, scores=expression, fun='gseGO')

option 2: including it in the formula
> compareCluster(data=mydf, expression | entrez~group, scores=expression, fun='gseGO')
Which option is nicer?
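
Whichever option is chosen, the formula interface would have to turn the data frame into one named, ranked score vector per group before calling a gse* function. A minimal base-R sketch of that step, assuming the column names used in mydf above; split_ranked is a hypothetical helper, not part of clusterProfiler:

```r
# Hypothetical helper (not a clusterProfiler function): turn a long-format
# data frame into one named, decreasingly sorted score vector per group.
split_ranked <- function(data, id, score, group) {
  scores <- setNames(data[[score]], as.character(data[[id]]))
  lapply(split(scores, data[[group]]), sort, decreasing = TRUE)
}

# Toy data standing in for mydf (values are made up)
toy <- data.frame(entrez = c("1", "2", "3", "4"),
                  expression = c(0.5, 2.0, -1.0, 3.0),
                  group = c("a", "a", "b", "b"),
                  stringsAsFactors = FALSE)

ranked <- split_ranked(toy, id = "entrez", score = "expression", group = "group")
```

Each element of ranked has the same shape as the example geneList (IDs as names, scores as values), so it could be passed to the list interface shown earlier.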

compareCluster (GroupsID, fun= enrichPathway)

Hi,

genesGroups <- list(
brca_tcga=c("CARTPT",  "ATP1A4" , "IL22RA2", "CXCR2"  , "PARN")
,gbm_tcga =c("ATP1A2",  "SLC4A4" , "ATP1A3" , "NQO1"  ,  "ATP8B1" , "SLC4A3" , "IL2RB"  , "NPPA"  ,  "ATP2B3",  "SLC15A1", "SLC5A1" , "PNPLA8"  ,"CCKBR"  , "C3AR1" ,  "PLCD1"  , "STAB1" ,  "SLC12A7", "PLCB2" ,  "SRD5A1" , "SLC22A4" ,"IL3RA" ,  "BDKRB2" , "ATP1A1" , "ASPH")
,lihc_tcga =c("AADAC"  ,  "AIFM1"  ,  "SARDH"  ,  "NQO2"  ,   "CYP1A2" ,  "IL6R"   ,  "ABCB11" ,  "LOXL2" ,   "SLC22A11" ,"SLC6A2" ,  "PNPLA3" ,  "EDNRA"  ,  "SLC15A2" , "SLC9A3" ,  "REN"   ,   "SLC9A5"  , "CHRM1"   , "IL12A" ,   "GNB3"   ,  "NCF1"   ,  "ACE"   ,   "NDUFS2" , "PLCE1" ,   "TIA1"   ,  "AKR7A3" ,  "ATP2B4")
,lusc_tcga= c("ATP11B", "CXCL1"  , "QDPR"  ,  "AGT"   ,  "MAPT" ,   "AGTR2" ,  "CASR" ,   "IL5RA"  , "ATP4A"  , "CHRM3"  , "IL2RA" ,  "ABCD3" ,  "VLDLR"  , "CDH13" ,  "NCF2"  ,  "SLC20A2")
)

GroupsID <- lapply(genesGroups,function(x) unname(unlist(translate(x, org.Hs.egSYMBOL2EG))))

dp <- compareCluster(GroupsID, fun="enrichPathway")

Error in eval(expr, envir, enclos) : object 'enrichPathway' not found

Thanks

loading data

Hi Guangchuang,
First, thanks for the great tools. I am determined to use them correctly and frequently!
I'm learning R and am currently having trouble getting my own data into clusterProfiler. I can get all functions to work beautifully using the geneList example data, but I cannot manage to get my data correctly formatted for this analysis. I have a list of gene symbols and their fold-change expression values in decreasing order, which I convert to Entrez IDs using bitr:
all_gene = scan("all_gene.txt", what="", sep="\n")
all_gene_tx = bitr(all_gene, fromType="SYMBOL", toType="ENTREZID", annoDb="org.Mm.eg.db")
write.csv(all_gene_tx, file="all_gene_tx.csv")

I then use VLOOKUP in Excel to match the remapped Entrez IDs to the gene symbols and make a translated.txt file (col1 = Entrez ID, col2 = log2 fold change):

tester = read.table("translated.txt", sep="\t", header=T, row.names=1)
head(tester)
x
107574 -7.689993
26380 -6.752766
107575 -6.186174
100126433 -5.955670
12323 -5.888340
386750 -5.771180

This is different than the geneList:
head(geneList)
4312 8318 10874 55143 55388 991
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857

So I tried transposing the dataset in Excel (renamed to DEG.txt), but the column names get an X prepended:
dF4_gene <- as.matrix(read.table("DEG.txt", header=TRUE, sep = "\t",
row.names = 1,
as.is=TRUE))
head (dF4_gene)
X26380 X107575 X100126433 X12323 X386750 X13390
-7.689993246 -6.752766 -6.186174 -5.95567 -5.88834 -5.77118 -5.749012
X68070 X67937 X677296 X268663 X12931 X212124
-7.689993246 -5.576549 -5.531749 -5.34585 -5.340453 -5.301863 -5.232462
X224833 X18585 X18667 X104174 X328795 X12145 X19713
-7.689993246 -5.13658 -5.064886 -5.010068 -4.97236 -4.63558 -4.618544 -4.603771

Basically, how do I go from a two-column text file with Entrez IDs and expression changes to running enrichment tests etc. in clusterProfiler?

Thanks,
Seth
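
For reference, the example geneList is a named numeric vector (Entrez IDs as names, scores as values, sorted in decreasing order), so no transposing is needed. A minimal base-R sketch of building the same shape from a two-column table; the column names entrezid and log2fc are assumptions, and the three rows are copied from the post:

```r
# Two-column input: Entrez ID and log2 fold change (rows taken from the post).
# The header names are hypothetical.
txt <- "entrezid\tlog2fc
107574\t-7.689993
26380\t-6.752766
107575\t-6.186174"
tab <- read.table(text = txt, sep = "\t", header = TRUE,
                  colClasses = c("character", "numeric"))

# Named numeric vector: names are the IDs, values are the fold changes.
my_gene_list <- setNames(tab$log2fc, tab$entrezid)

# Sort in decreasing order, as in the example geneList.
my_gene_list <- sort(my_gene_list, decreasing = TRUE)
```

With a real file, read.table(text = txt, ...) would be replaced by a call that reads translated.txt directly, without row.names = 1.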

GSEA function (clusterProfiler R package)

Hi Guangchuang Yu,

I would like to perform enrichment analysis on single cell data using GSEA function with 4 classes.
gsea.out = GSEA(geneList = data[,1],TERM2GENE=term2gene,pvalueCutoff =1,minGSSize=5,exponent= 1)

where:
data[,1] is ranked according to the expression values:
CD3D CD3G CD27 SIT1 CEP68 CD27-AS1
34.69430 32.17648 31.58424 26.38844 17.55915 17.53385

and term2gene looks as follows:
term name
1 ILC1 CCL5
2 ILC1 CD28
3 ILC1 CD3E
4 ILC1 GRAMD3
5 ILC1 CD7
6 ILC1 SIDT1
...

The problem is that I get the error below and I don't understand why. I also tried to find the full source code of GSEA to learn more about the error, but I could not find it. Could you help me?

preparing geneSet collections...
[1] "calculating observed enrichment scores..."
[1] "calculating permutation scores..."
| | 0%
[1] "calculating p values..."
Error in pi0est(p, ...) :
ERROR: The estimated pi0 <= 0. Check that you have valid p-values or use a different range of lambda.

Thank you very much in advance
Madeleine

Error in compareCluster - enrichDAVID

When using enrichDAVID in compareCluster, if none of the gene sets retrieves significant enrichment from the tested category, there is a fairly obscure error:

Error in if (pi0 <= 0) { : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In min(p) : no non-missing arguments to min; returning Inf
2: In max(p) : no non-missing arguments to max; returning -Inf

I ran a few tests before figuring out that this is the problem. The output of x <- getFunctionalAnnotationChart(david, threshold=1, count=minGSSize) will be:

DAVID Result object
Result type: FunctionalAnnotationChart
data frame with 0 columns and 0 rows
For all elements of gene. I am posting it here so that other users will know what the issue is.

missing functions

Hi Guangchuang,

This may be something simple I'm not getting, but I loaded clusterProfiler from bioconductor and it seems to be missing functions included in the manual and vignette. For instance, dropGO(), dotplot(), and plotGOgraph() aren't recognized. Other functions, like enrichGO() or compareCluster(), work fine. Is there something I'm missing?

Thanks,

Amanda

change the text orientation along the x-axis

Hi,
I want to plot multiple clusters in the same enrichment map.
I wonder if it's possible to change the orientation of the cluster labels which are drawn along the x-axis.
Thank you very much.
Kindly regards,
Erik

setting GO level in function compareCluster

Hi Guangchuang Yu,

It would be useful to set the level of the GO hierarchy in the compareCluster function, as in the DAVID GO analysis tool ("BP4", for example).

Otherwise, the less specific gene sets sometimes involve terms that are too general.

Best Regards,
Attila Horvath

error in gse function

When I run gse functions, such as
gsekegg <- gseKEGG(geneList = gse_mb_5,
organism = "human",
nPerm = 100,
minGSSize = 120,
pvalueCutoff = 0.01,
verbose = TRUE)
I always get this error:
Error in if (abs(max.ES) > abs(min.ES)) { :
missing value where TRUE/FALSE needed

    Any suggestions?

Problems with many functions in version 2.2.5

Hi,
I am having a number of issues with some functions in version 2.2.5. For example, only the formula interface of compareCluster works; barplot, plot, and enrichMap on the results of compareCluster throw errors, etc.
How do I solve this issue? Is there a big change between the development version and version 2.2.5?
Do I have to install the development version of R/Bioconductor and then clusterProfiler?

I'd be glad for a timely reply.

Gabriel Teku
Lund University

Compilation problem

Hi,
I am trying to compile clusterProfiler on a Linux server (Ubuntu), but it crashes with the following errors. Could you please tell me what the problem could be? Thank you very much.
Ferdinand

{
    .Deprecated("tidy_source", package = "formatR")
    tidy_source(...)
}' is deprecated.
Use 'tidy_source' instead.
See help("Deprecated") and help("formatR-deprecated").
Warning in tidy_source(...) :
  The argument 'keep.blank.line' is deprecated; please use 'blank'
Quitting from lines 309-310 (clusterProfiler.Rnw) 
Error: processing vignette 'clusterProfiler.Rnw' failed with diagnostics:
argument "geneSetID" is missing, with no default
Execution halted
Error: Command failed (1)

enrichKEGG, readable=TRUE is not working?

Hi Guangchuang, I am using your clusterProfiler right now. It is great.
Here is an error I cannot solve:

kk <- enrichKEGG(gene = gene,
                 organism = "hsa",
                 keyType = "kegg",
                 pvalueCutoff = 1,
                 pAdjustMethod = "BH",
                 qvalueCutoff = 1,
                 #universe=res.genelist,
                 #use_internal_data = T,
                 readable = TRUE)

Error in enrichKEGG(gene = gene, organism = "hsa", keyType = "kegg", pvalueCutoff = 1, :
unused argument (readable = TRUE)

Any suggestions?

gseGO: p-values and gseaplot

Dear Guangchuang.

I have come across two issues, maybe you can clarify.

I perform a GSEA analysis within clusterProfiler using

res.GSEA.GO<-gseGO(geneList = geneList, organism = "human", exponent = 1, ont = "BP", nPerm = 1000, minGSSize = 15, pvalueCutoff = 0.01, verbose = TRUE);

1.) The resulting table of GO terms all seem to have the same p-values, adjusted p-values, and q-value. For example notice the entries in the last column (qvalues = 0.00470813780684201) :

ID Description setSize enrichmentScore pvalue p.adjust qvalues
GO:0000070 GO:0000070 mitotic sister chromatid segregation 104 0.183705167498055 0.000999000999000999 0.00691405858579111 0.00470813780684201
GO:0000075 GO:0000075 cell cycle checkpoint 244 0.1685954733772 0.000999000999000999 0.00691405858579111 0.00470813780684201
GO:0000077 GO:0000077 DNA damage checkpoint 152 0.189584910513727 0.000999000999000999 0.00691405858579111 0.00470813780684201
GO:0000082 GO:0000082 G1/S transition of mitotic cell cycle 242 0.148857665664357 0.000999000999000999 0.00691405858579111 0.00470813780684201

The results are similar for other ontologies.

2.) All GSEA plots seem to have a discontinuity in the "phenotype" curves. See e.g. here http://imgur.com/AcH49Bh .

Any help in resolving these issues would be greatly appreciated.

Best,
Maurits
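
A note on point 1: the repeated p-value 0.000999000999000999 equals 1/(nPerm + 1) for nPerm = 1000. That is the smallest value a permutation test can report if the p-value is estimated as (count + 1)/(nPerm + 1); that this is the estimator used by gseGO is an assumption here, not something confirmed in this thread. A small sketch of the arithmetic:

```r
n_perm <- 1000

# Smallest p-value a permutation test can report when no permuted score
# beats the observed one, assuming a (count + 1) / (n_perm + 1) estimate.
min_p <- 1 / (n_perm + 1)
print(min_p)

# Value copied from the table in the post.
reported <- 0.000999000999000999
same_as_reported <- isTRUE(all.equal(min_p, reported))

# Tied p-values stay tied after multiple-testing adjustment.
adjusted <- p.adjust(rep(min_p, 5), method = "BH")
```

Under that assumption, every term that reaches the floor shares the same p-value and therefore the same adjusted values; a larger nPerm lowers the floor.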

GO enrichment at specific level

I don't think this is useful. I recommend using simplify instead of testing at a specific level.

But as several users have asked for it, I will introduce a gofilter function to filter the output from enrichGO and compareCluster by a user-specified GO level.

simplify function

Hello,

when using the simplify function I get the following error message:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘simplify’ for signature ‘"enrichResult"’

What am I missing?

Thanks

Oliver

GO term not found error, Yeast data, gseGO test

Hi,

I am trying to use gseGO enrichment test on my data, which contains yeast ORF identifiers (Table$IDs). If I try (for example):
ego2 <- gseGO(geneList = Table$IDs, organism = "yeast", ont = "MF",
nPerm = 100, minGSSize = 120, pvalueCutoff = 0.01, verbose = FALSE)

I get the following error:

Error in .checkKeys(value, Lkeys(x), x@ifnotfound) :
value for "GO:0010834" not found

This is also true if I test for GO CC or GO BP. This GO term seems to be deprecated now. How can I get around this error?

Best Regards,

Dean

question about the "%<>%"?

Hello Guangchuang,

I found the character "%<>%" in the groupGO function, could you explain something about it? How could I use it by myself?

Thank you very much.

Best Regards,
Shisheng
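
For context, %<>% is the compound assignment pipe from the magrittr package: x %<>% f pipes x into f and assigns the result back to x. The sketch below is a simplified stand-in written in base R so that it runs without magrittr; it only handles a bare function on the right-hand side, whereas the real operator also accepts full calls.

```r
# Simplified stand-in for magrittr's %<>% (illustration only, not the
# package's implementation).
`%<>%` <- function(lhs, rhs) {
  target <- substitute(lhs)   # the variable name on the left-hand side
  value <- rhs(lhs)           # apply the function to the current value
  eval(call("<-", target, value), parent.frame())  # assign back in the caller
  invisible(value)
}

x <- c(16, 4, 9)
x %<>% sort   # same effect as x <- sort(x)
x %<>% sqrt   # same effect as x <- sqrt(x)
```

In real code, library(magrittr) provides the operator and the definition above is not needed.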

Error in TERM2NAME.Reactome(qTermID, organism, ...) : function "mapIds" not found

Hi, after updating the package now I am not able to use the enrichPathway function.

cr_r=compareCluster(geneCluster=l, fun="enrichPathway", organism="mouse", universe=entrez, pvalueCutoff=.2, minGSSize = 5)
Error in TERM2NAME.Reactome(qTermID, organism, ...) :
function "mapIds" not found

str(entrez)
chr [1:14121] "66836" "381629" "70316" "71667" "66835" "76251" "58520" "68327" "545192" ...

str(l)
List of 4
$ anti210.healthy-scr.healthy : num [1:10] 11764 14264 11767 24117 12643 ...
$ anti210.diabetic-scr.diabetic : num [1:34] 11764 231887 11767 17882 16978 ...
$ scr.diabetic-scr.healthy : chr [1:1423] "67041" "20378" "12667" "56047" ...
$ anti210.diabetic-anti210.healthy: num [1:97] 16545 20378 22092 12643 106763 ...

Any solution?

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=it_IT.UTF-8 LC_NUMERIC=C LC_TIME=it_IT.UTF-8 LC_COLLATE=it_IT.UTF-8
[5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=it_IT.UTF-8 LC_PAPER=it_IT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] DOSE_2.6.4 GSEABase_1.24.0 graph_1.40.1 annotate_1.40.1 org.Mm.eg.db_2.10.1
[6] beadarray_2.12.0 ggplot2_1.0.1 clusterProfiler_2.0.1 ReactomePA_1.12.2 RSQLite_1.0.0
[11] DBI_0.3.1 AnnotationDbi_1.28.2 GenomeInfoDb_1.4.1 Biobase_2.22.0 XVector_0.8.0
[16] IRanges_2.2.5 S4Vectors_0.6.1 BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] BeadDataPackR_1.14.0 colorspace_1.2-6 digest_0.6.8 DO.db_2.7 GO.db_2.10.1
[6] GOSemSim_1.20.3 graphite_1.8.1 grid_3.1.2 gtable_0.1.2 htmltools_0.2.6
[11] igraph_1.0.1 KEGG.db_2.10.1 limma_3.18.13 magrittr_1.5 MASS_7.3-42
[16] munsell_0.4.2 org.Hs.eg.db_2.10.1 plyr_1.8.3 proto_0.3-10 qvalue_1.36.0
[21] Rcpp_0.11.6 reactome.db_1.46.1 reshape2_1.4.1 rmarkdown_0.7 scales_0.2.5
[26] stringi_0.5-5 stringr_1.0.0 tcltk_3.1.2 tools_3.1.2 XML_3.98-1.3
[31] xtable_1.7-4 yaml_2.1.13

Error in is(OrgDb, "character")

Hi Dr. Yu,
Please find a reproducible example below.

genesGroups <- list(brca_tcga= "FANCF", gbm_tcga="MLH1", lihc_tcga=c("BRCA2","MSH2","ATR"), lusc_tcga="CHEK2")
GroupsID <- lapply(genesGroups,function(x) unname(unlist(translate(x, org.Hs.egSYMBOL2EG))))
cgo <- compareCluster(GroupsID, fun="enrichGO")

Error in is(OrgDb, "character") : 
argument "OrgDb" is missing, with no default
 cgo <-compareCluster(GroupsID, fun="enrichKEGG")

Error in (function (cl, name, valueClass)  : 
 ‘ontology’ is not a slot in class “NULL”

This worked before with the same example. Any suggestions?
Thanks,
Karim

enrichGO not working for arabidopsis

ego <- enrichGO(gene = gene, organism = "arabidopsis", ont = "MF", pAdjustMethod = "BH", pvalueCutoff = 0.01, qvalueCutoff = 0.05, readable = TRUE)

ego
[1] NA

head(gene)
[1] "820275" "828060" "827487" "819651" "838761" "837379"

No matter how I tried I always got a NA. I tried human genes from your sample data and it works fine.
Is there any issues with arabidopsis as a parameter?

enrichKEGG in Arabidopsis - problem

Hi!

I like your tool - I've used it before for human gene analysis without any trouble. Now I have some data from Arabidopsis studies that I need to analyse urgently, and it has turned out not to be an easy task. I have some problems using the enrichKEGG() function and I get the error again and again. So, from the beginning:

I load the libraries:

library("clusterProfiler")
library("biomaRt")
library("org.At.tair.db")
library("DOSE")

I prepare the test and universe sets from TAIR locus IDs, so they are character vectors - please see below:

targets_id
[1] "AT1G02860" "AT1G02860" "AT1G44900" "AT2G03820" "AT2G16500" "AT2G45160" "AT3G60630" "AT4G00150" "AT5G14550" "AT5G23480"

background_id
[1] "AT1G02860" "AT1G02860" "AT1G06180" "AT1G06180" "AT1G06580" "AT1G06580" "AT1G10120" "AT1G11810" "AT1G11990" "AT1G12290"
[11] "AT1G12820" "AT1G15780" "AT1G16320" "AT1G17590" "AT1G17590" "AT1G17590" "AT1G17590" "AT1G17590" "AT1G17590" "AT1G17590"
[21] "AT1G17590" "AT1G17590" "AT1G17590" "AT1G17590" "AT1G17590" "AT1G20710" "AT1G21160" "AT1G22000" "AT1G26890" "AT1G27340"
[31] "AT1G27360" "AT1G27360" "AT1G27360" "AT1G27360" "AT1G27360" "AT1G27360" "AT1G27360" "AT1G27370" "AT1G27370" "AT1G27370"
[41] "AT1G27370" "AT1G27370" "AT1G27370" "AT1G27370" "AT1G30210" "AT1G30330" "AT1G30490" "AT1G30490" "AT1G31280" "AT1G32340"
[51] "AT1G38790" "AT1G44900" "AT1G49910" "AT1G50820" "AT1G50930" "AT1G51480" "AT1G51760" "AT1G52050" "AT1G52070" "AT1G52120"
[61] "AT1G52130" "AT1G52150" "AT1G52770" "AT1G53160" "AT1G53160" "AT1G53160" "AT1G53160" "AT1G53160" "AT1G53160" "AT1G53230"
[71] "AT1G53700" "AT1G54160" "AT1G54160" "AT1G54160" "AT1G54710" "AT1G56010" "AT1G57570" "AT1G57570" "AT1G60095" "AT1G60110"
[81] "AT1G60130" "AT1G61230" "AT1G62260" "AT1G62590" "AT1G62590" "AT1G62670" "AT1G62720" "AT1G62910" "AT1G62930" "AT1G63080"
[91] "AT1G63130" "AT1G63150" "AT1G63230" "AT1G63240" "AT1G63330" "AT1G63630" "AT1G64580" "AT1G64583" "AT1G66230" "AT1G66230"
[101] "AT1G66690" "AT1G66700" "AT1G66720" "AT1G66880" "AT1G67230" "AT1G69170" "AT1G69170" "AT1G69170" "AT1G69170" "AT1G69170"
[111] "AT1G69170" "AT1G69170" "AT1G69770" "AT1G70700" "AT1G70700" "AT1G71400" "AT1G71910" "AT1G72830" "AT1G72830" "AT1G72830"
[121] "AT1G73440" "AT1G77850" "AT1G78500" "AT1G79950" "AT1G79990" "AT1G80060" "AT1G80740" "AT2G02850" "AT2G03820" "AT2G06095"
[131] "AT2G13900" "AT2G16500" "AT2G21800" "AT2G21840" "AT2G22810" "AT2G22840" "AT2G23550" "AT2G23550" "AT2G23560" "AT2G23560"
[141] "AT2G23570" "AT2G23570" "AT2G23580" "AT2G23580" "AT2G23580" "AT2G23580" "AT2G23610" "AT2G23610" "AT2G25430" "AT2G28350"
[151] "AT2G28550" "AT2G28550" "AT2G28780" "AT2G28780" "AT2G29130" "AT2G29130" "AT2G29730" "AT2G31070" "AT2G31070" "AT2G33350"
[161] "AT2G33350" "AT2G33770" "AT2G33770" "AT2G33770" "AT2G33770" "AT2G33810" "AT2G33810" "AT2G34585" "AT2G34710" "AT2G34710"
[171] "AT2G36026" "AT2G36400" "AT2G38810" "AT2G41050" "AT2G42200" "AT2G42200" "AT2G42200" "AT2G42200" "AT2G42200" "AT2G42200"
[181] "AT2G42200" "AT2G42430" "AT2G44190" "AT2G45160" "AT2G45160" "AT2G45160" "AT2G47020" "AT2G47020" "AT2G47020" "AT2G47020"
[191] "AT2G47460" "AT3G03580" "AT3G03580" "AT3G04820" "AT3G05690" "AT3G08500" "AT3G09220" "AT3G10800" "AT3G12977" "AT3G13830"

And then I would like to perform a KEGG enrichment analysis:

kk <- enrichKEGG(gene = targets_id, universe = background_id, organism = "arabidopsis", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, readable = TRUE, use_internal_data = FALSE)
Error in data.frame(numWdrawn = k - 1, numW = M, numB = N - M, numDrawn = n) :
arguments imply differing number of rows: 1, 0

When I remove the universe argument, so that my genes are compared to the "entire" universe, I get

kk <- enrichKEGG(gene = targets_id, organism = "arabidopsis", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, readable = TRUE, use_internal_data = FALSE)
'select()' returned many:many mapping between keys and columns

I know there are some duplicated IDs in my gene set, so I decided to check the unique set:

kk <- enrichKEGG(gene = unique(targets_id), organism = "arabidopsis", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, readable = TRUE, use_internal_data = FALSE)
'select()' returned 1:many mapping between keys and columns

I have no idea what I'm doing wrong or what else I must do to make it work. I would be very grateful for help - this is very important and urgent for me.

Thank you in advance!!
Anna
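
On the duplicated IDs mentioned above: a minimal base-R sketch of deduplicating both vectors and checking that every target is also in the background. The IDs are a small subset copied from the post. This only covers the duplicate question; it does not establish whether TAIR locus IDs are the identifier type enrichKEGG expects.

```r
# A few IDs copied from the post, including repeated entries.
targets_id <- c("AT1G02860", "AT1G02860", "AT1G44900", "AT2G03820")
background_id <- c("AT1G02860", "AT1G02860", "AT1G06180", "AT1G44900", "AT2G03820")

# Count the repeats, then drop them.
n_dup_targets <- sum(duplicated(targets_id))
targets_unique <- unique(targets_id)
background_unique <- unique(background_id)

# Every target should also be present in the background (universe).
missing_from_background <- setdiff(targets_unique, background_unique)
```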

reduce redundancy of enriched GO terms

To simplify the enriched result, we can use a slim version of GO and analyze it with the enricher() function.

Another strategy is to use GOSemSim to calculate the similarity of GO terms and remove highly similar terms, keeping one representative term.

The criterion for selecting the representative term can be:

  • most informative term (need pre-calculated IC data, only available for those internally supported organisms in GOSemSim; can be extended to un-supported organisms).
  • most significant term (as in REVIGO)

I prefer the second criterion because it is more intuitive and easier to implement for organisms not internally supported by GOSemSim.

I propose defining a function that simplifies the output from enrichGO by removing redundant GO terms.

simplify <- function(enrichResult, cutoff=0.7, by="p.adjust", select_fun=min) {
     ## GO terms whose semantic similarity is higher than `cutoff` are treated as redundant.
     ## Select one representative term by applying `select_fun` to the feature specified by `by`.
     ## Users can define their own `select_fun` function.

     ## Return an updated `enrichResult` object.
}
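
To make the proposal concrete, here is a rough base-R sketch of the selection logic on plain inputs. It assumes a data frame of terms and a precomputed similarity matrix instead of an enrichResult object, and it uses a simple greedy grouping; simplify_sketch and the toy values are hypothetical, and this is not the package implementation.

```r
# Hypothetical illustration of the proposal above.
# terms: data frame with an ID column and a column named by `by`
# sim:   square similarity matrix with term IDs as row and column names
simplify_sketch <- function(terms, sim, cutoff = 0.7, by = "p.adjust",
                            select_fun = min) {
  remaining <- terms$ID
  keep <- character(0)
  while (length(remaining) > 0) {
    seed <- remaining[1]
    # Terms more similar to the seed than `cutoff` count as redundant with it.
    redundant <- remaining[sim[seed, remaining] > cutoff]
    group <- union(seed, redundant)
    # Keep the group member selected by `select_fun` on the `by` column.
    scores <- terms[[by]][match(group, terms$ID)]
    keep <- c(keep, group[which(scores == select_fun(scores))[1]])
    remaining <- setdiff(remaining, group)
  }
  terms[terms$ID %in% keep, , drop = FALSE]
}

# Toy input: A and B are highly similar, C is distinct.
terms <- data.frame(ID = c("A", "B", "C"),
                    p.adjust = c(0.01, 0.001, 0.02),
                    stringsAsFactors = FALSE)
sim <- matrix(c(1.0, 0.9, 0.1,
                0.9, 1.0, 0.2,
                0.1, 0.2, 1.0),
              nrow = 3, dimnames = list(terms$ID, terms$ID))

simplified <- simplify_sketch(terms, sim)
```

With the defaults, B is kept as the representative of the A/B group because it has the smaller adjusted p-value, and C is kept on its own.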

Any comment/suggestion is welcome.

maxGSSize in enrichGO?

Dear Guangchuang.

I've got another question/request, this time concerning the return output of enrichGO. Is there a simple way to filter entries based on an upper (max) bound on the number of genes associated with a particular term? Something like an upper bound equivalent to minGSSize?

The reason being that I always end up with e.g. a (trivial) 100% overlap with genes from term "biological process" if I run a BP GO enrichment analysis. I can filter out these events if I do a summary(...), but it would be nice to do this directly using the enrichResult return object, so I can use the filtered results for visualisation.

Thanks,
Maurits

Using lists vs. vectors for enrichKEGG()

Hello,

I've found that enrichKEGG() seems to work with a list of Entrez gene IDs but not a vector of Entrez gene IDs, contrary to the documentation given with the package on Bioconductor right now:

Here is an R script that reproduces this behavior:

common_gene_ids <- readRDS('sample_genes.rds')

library(clusterProfiler)
library(org.Hs.eg.db)

print(common_gene_ids[1:10])
# [1] "DDX11L1"      "WASH7P"       "OR4F5"        "LOC729737"    "LOC100133331" "LOC100288069" "LINC00115"    "LOC643837"    "FAM41C"      
# [10] "LOC100130417"

#########################################
# Creating a mapping to Entrez gene IDs #
#########################################

# Convert the object to a list
common_to_entrez <- as.list(org.Hs.egALIAS2EG)
# Remove aliases that do not map to any Entrez gene ID
common_to_entrez <- common_to_entrez[!is.na(common_to_entrez)]

#################################
# Filtering of common -> Entrez #
#################################

entrez_ids <- common_to_entrez[names(common_to_entrez) %in% common_gene_ids]

class(entrez_ids)
# [1] "list"

#####################################
# Running the enrichKEGG() function #
#####################################

# This gives a result, but for a type of "list"
kegg_results <- enrichKEGG(entrez_ids, pvalueCutoff=0.05)

# On the other hand, if I try to apply this on a vector of "character..."
vector_entrez_ids <- unlist(entrez_ids)
kegg_results_vector <- enrichKEGG(vector_entrez_ids, pvalueCutoff=0.05)
# This gives me an error.

And here is the related RDS file, holding gene IDs:
http://s000.tinyupload.com/index.php?file_id=50396891962239121264

Please excuse me if I've made a mistake.
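
One detail of the script above that may be worth checking: unlist() on a named list returns a named vector, and an alias that maps to several IDs gets numbered names. Whether this is what triggers the error is not established in the post. A small base-R sketch with made-up aliases and IDs:

```r
# Toy alias -> Entrez mapping; the names and IDs are made up for illustration.
entrez_ids <- list(GENEA = "101", GENEB = c("202", "303"), GENEC = "101")

flat <- unlist(entrez_ids)
names(flat)   # "GENEA" "GENEB1" "GENEB2" "GENEC"

# Plain, unnamed, duplicate-free character vector of IDs.
vector_entrez_ids <- unique(unname(flat))
```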

Arabidopsis gene mapping problem.

Dear Yu!

Hello!

I am trying to run a gene ontology analysis for Arabidopsis, but there is a problem with the ID mapping.

Here is what I did.

source("http://bioconductor.org/biocLite.R")
biocLite("AnnotationHub")

biocLite("org.At.tair.db")


install.packages("devtools")
devtools::install_github(c("GuangchuangYu/DOSE", "GuangchuangYu/clusterProfiler"))


library(AnnotationHub)
library(clusterProfiler)
library("org.At.tair.db")

sample_gene <- read.table("~/Downloads/ATH_test", header=F)


 head(sample_gene)
         V1
1 AT4G19500
2 AT3G14210
3 AT2G32160
4 AT4G19530
5 AT1G24880
6 AT3G22231

 de <- enrichGO(gene = sample_gene, OrgDb="org.At.tair.db", keytype="TAIR", universe = sample_gene, pAdjustMethod = "BH", pvalueCutoff=0.05, qvalueCutoff=0.05)
No gene can be mapped....

--> return NULL...

Can you please give me a comment for this?

Thank you.

Won
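
One thing worth ruling out in the code above: read.table returns a data frame, so sample_gene is a one-column table, not a character vector of IDs. A minimal sketch of extracting the column, using three IDs copied from the post; that a plain character vector is the expected input for gene and universe is an assumption here.

```r
# Three IDs copied from the post, read the same way as in the question.
txt <- "AT4G19500
AT3G14210
AT2G32160"
sample_gene <- read.table(text = txt, header = FALSE, stringsAsFactors = FALSE)

class(sample_gene)   # "data.frame"

# Pull out the single column as a character vector of IDs.
gene_ids <- as.character(sample_gene$V1)
```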

ERROR: The estimated pi0 <= 0...

Hi,

I started playing with your clusterProfiler (btw, very nice idea for visualization) but for some reason, as I've seen after some googling, related to q-value, I'm getting this error for my clusters:

ERROR: The estimated pi0 <= 0. Check that you have valid p-values or use another lambda method.

It happens when I run fun="enrichPathway" or "enrichKEGG" but not for "enrichGO". I guess I can't change anything in the settings of these two functions (can I?), so I tried changing the number of genes per cluster (>5, >15..). Still the same problem. Here is what the data looks like (for clusters with more than 5 genes):

$V1
[1] "6531" "9463" "6804" "1759" "9369" "54413" "5621" "348" "1804" "6616" "9378" "127833" "30010" "6857" "9751" "1270" "6529" "6844" "6855" "10814" "10815"
[22] "6532" "4340" "5071" "2899" "2897" "3211" "4211" "2917" "6530" "1271" "3098" "492" "2103"

$V2
[1] "1813" "1815" "3760" "3062" "3060" "135" "1268" "50632" "1812" "93664" "11255" "100" "3061" "23284" "8379"

$V3
[1] "7337" "7249" "1454" "8864" "1644" "367" "408" "160851" "53353" "7534" "60468" "11329" "4131" "23358" "4985" "150" "51422" "26986" "7248" "154" "156"
[22] "3569" "5744" "5020" "5021" "6285" "409" "153" "7054" "6622" "2911" "5607" "51256" "2932" "2308" "7166" "155" "151" "2119" "4986" "5187" "3360"
[43] "54814" "6744" "8543" "6921" "51339" "6157" "55898" "23122" "5606" "9770" "4733" "1128" "15"

$V4
[1] "64130" "10458" "2904" "3603" "2903" "25970" "9856" "5793" "8825" "5795" "3643" "55327" "815" "1956" "152189" "4582" "57554" "818" "3358" "5587" "238"
[22] "23040" "23317" "3964" "9647" "1496" "2069" "203859"

$V5
[1] "6257" "7376" "6256" "6258" "816" "64324" "5914" "5916" "22954" "5294" "9575" "7068" "406" "23090" "2908" "4929" "5915" "3557" "4306" "5799" "7915"
[22] "6447" "255488" "51574"

$V7
[1] "181" "4160" "4159" "51738" "3953" "3952" "2255" "5443" "4889" "4852" "4887" "4886" "2693" "5697" "5739"

$V8
[1] "56160" "4804" "310" "5582" "10782" "4909" "4915" "5079" "3398" "7091" "4803" "4316" "3356" "4916" "627" "7088" "55502" "3280" "4908" "22986" "1104" "2004" "2915" "26230"
[25] "27037" "5592" "9555" "11165" "7905"

$V9
[1] "2242" "4188" "2775" "152" "4924" "2781" "10963" "29761" "114781" "4604" "1816" "5702" "3354" "6334" "4692" "10922" "60496" "22822" "7514" "10049" "527"
[22] "85358" "7979" "55174" "875" "57118" "1717" "3992"

$V10
[1] "4988" "3692" "157" "28964" "4122" "52" "2272" "217" "259266"

$V11
[1] "3746" "170850" "8328" "288" "355" "3912" "836" "23017" "8997" "9577" "6885" "2353" "7124" "51559" "842" "317" "284058" "27185" "7204" "9467" "9360"
[22] "84062" "119392"

$V12
[1] "1936" "3672" "4146" "5045" "2318" "11061" "114815" "148" "147" "146" "596" "4842" "9863" "2895" "6647" "154810" "84623" "2950" "50859" "1012"

$V13
[1] "6311" "5730" "6310" "54715" "658" "57522" "55120" "58473" "5218" "91752" "84970" "1509" "3482" "51567" "1139" "79047"

I'm wondering what is that I am doing wrong.
Thank you for any suggestion.

All the best!
Emilia
