gfellerlab / epic

Repository for the R package EPIC, to Estimate the Proportion of Immune and Cancer cells from bulk gene expression data.

Home Page: https://gfellerlab.shinyapps.io/EPIC_1-1/

License: Other

R 100.00%

cancer-cells gene-expression bulk-data rna-seq cell-type

epic's Introduction

EPIC package

Description

Package implementing EPIC method to estimate the proportion of immune, stromal, endothelial and cancer or other cells from bulk gene expression data. It is based on reference gene expression profiles for the main non-malignant cell types and it predicts the proportion of these cells and of the remaining “other cells” (that are mostly cancer cells) for which no reference profile is given.

This method is described in the publication from Racle et al., 2017 available at https://elifesciences.org/articles/26476.

EPIC is also available as a web application: http://epic.gfellerlab.org.

Usage

The main function in this package is EPIC. It takes as input a matrix of TPM (or RPKM) gene expression values from the samples for which to estimate cell proportions. One can also define the reference cells to use:

# library(EPIC) ## If the package isn't loaded (or use EPIC::EPIC and so on).
out <- EPIC(bulk = bulkSamplesMatrix)
out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList)

out is a list containing the mRNA proportions and cell fractions in each sample, as well as a data.frame describing the goodness of fit.
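As a minimal sketch of how the input matrix could be prepared, the example below turns a small expression table into a numeric matrix with gene symbols as row names. The table, gene symbols and values are invented for illustration, and the layout of your own file may differ. The EPIC call itself is left commented out, since such a tiny toy matrix would not contain enough signature genes.

```r
# Hypothetical expression table, as it might look after read.delim():
# one column of gene symbols followed by one column per sample (TPM values).
exprTable <- data.frame(
  gene    = c("CD3E", "CD19", "CD14"),
  sampleA = c(12.1, 0.0, 30.5),
  sampleB = c(3.4, 8.9, 15.2),
  stringsAsFactors = FALSE
)

# Move the gene symbols to the row names and keep only the numeric columns.
bulkSamplesMatrix <- as.matrix(exprTable[, -1, drop = FALSE])
rownames(bulkSamplesMatrix) <- exprTable$gene

# Basic sanity checks before handing the matrix to EPIC.
stopifnot(is.matrix(bulkSamplesMatrix), is.numeric(bulkSamplesMatrix))

# With a real, genome-wide matrix the call would then be (not run here):
# out <- EPIC::EPIC(bulk = bulkSamplesMatrix)
# out$mRNAProportions; out$cellFractions; out$fit.gof
```

The element names mRNAProportions, cellFractions and fit.gof are the ones mentioned in the FAQ further down.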

Values of mRNA per cell and signature genes to use can also be changed:

out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList, mRNA_cell = mRNA_cell_vector, sigGenes = sigGenes_vector)
out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList, mRNA_cell_sub = mRNA_cell_sub_vector)

Various other options are available and are well documented in the help pages from EPIC:

?EPIC::EPIC
?EPIC::EPIC.package

Installation

install.packages("devtools")
devtools::install_github("GfellerLab/EPIC", build_vignettes=TRUE)

Python wrapper

A Python wrapper has been written by Stephen C. Van Nostrand from MIT and is available at https://github.com/scvannost/epicpy.

License

EPIC can be used freely by academic groups for non-commercial purposes. The product is provided free of charge, and, therefore, on an “as is” basis, without warranty of any kind. Please read the file “LICENSE” for details.

If you plan to use EPIC (version 1.1) in any for-profit application, you are required to obtain a separate license. To do so, please contact Nadette Bulgin ([email protected]) at the Ludwig Institute for Cancer Research Ltd.

Contact information

Julien Racle ([email protected]), and David Gfeller ([email protected]).

FAQ

Which proportions returned by EPIC should I use?

EPIC returns two sets of proportions: mRNAProportions and cellFractions. The second represents the true proportion of cells coming from the different cell types, once differences in mRNA expression between cell types are taken into account. So in principle, it is best to use cellFractions.

However, please note that when the goal is to benchmark EPIC predictions and the ‘bulk samples’ are in fact in silico samples, reconstructed for example from single-cell RNA-seq data, it is usually better to compare the ‘true’ proportions against the mRNAProportions from EPIC. Indeed, when building such in silico samples, the fact that different cell types express different amounts of mRNA is usually not taken into account. On the other hand, if you are working with true bulk samples, you should compare the true cell proportions (measured e.g. by FACS) against the cellFractions.

What do the “other cells” represent?

EPIC predicts the proportions of the various cell types for which we have gene expression reference profiles (and corresponding gene signatures). Depending on the bulk sample, however, some other cell types may be present for which we don’t have any reference profile. EPIC returns the proportion of these remaining cells under the name “other cells”. In tumor samples, most of these other cells would certainly correspond to the cancer cells, but there could also be some stromal or epithelial cells, for example.

I receive an error message “attempt to set ‘colnames’ on an object with less than two dimensions”. What can I do?

This almost certainly means that some of your data is a vector instead of a matrix. Please make sure that your bulk data is a matrix (and likewise your reference gene expression profiles, if you use custom ones).
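One common way to end up with a vector by accident in R is to select a single sample from a matrix. The sketch below uses an invented toy matrix to show this and two ways to keep or restore the matrix shape. It only illustrates the base R behaviour and is not specific to EPIC.

```r
# Toy matrix: 3 genes x 2 samples (values are made up).
bulk <- matrix(c(5, 0, 2, 7, 1, 3), nrow = 3,
               dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Selecting a single sample silently drops to a named vector ...
oneSampleVector <- bulk[, "s1"]
is.matrix(oneSampleVector)   # FALSE

# ... unless drop = FALSE is given, which keeps the 3 x 1 matrix shape.
oneSampleMatrix <- bulk[, "s1", drop = FALSE]
is.matrix(oneSampleMatrix)   # TRUE

# An existing named vector can be turned back into a one-column matrix,
# keeping the gene names as row names.
restored <- matrix(oneSampleVector, ncol = 1,
                   dimnames = list(names(oneSampleVector), "s1"))
dim(restored)
```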

Is there some caution to consider about the cellFractions and mRNA_cell values?

As described in our manuscript, EPIC first estimates the proportion of mRNA per cell type in the bulk. It then uses the fact that some cell types have more mRNA copies per cell than others to normalize these values and obtain an estimate of the proportion of cells instead of mRNA (the EPIC function returns both, in case you need one or the other). For this normalization we either measured the amount of mRNA per cell or found it in the literature (fig. 1 – fig. supplement 2 of our paper). However, we currently don’t have such values for endothelial cells and CAFs. For these two cell types we therefore use an average value, which might not reflect their true value, and this could slightly bias the predictions, especially for these cell types. If you have values for these mRNA/cell abundances, you can pass them to EPIC with the parameter “mRNA_cell” or “mRNA_cell_sub” (and it would be great if you shared these values).

If the mRNA proportions of these cell types are low, then not correcting the results with their true mRNA/cell abundances should have little impact. On the other hand, if there are many of these cells in your bulk sample, the results might be slightly biased, but the effect should be similar for all samples and thus of limited importance: you might not be fully able to tell whether there are more CAFs than Tcells in a sample, for example, but you should still have a good estimate of which samples have more CAFs (or Tcells) than others.
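To make the normalization idea more concrete, here is a minimal sketch of how mRNA proportions can be converted to cell fractions. It is not EPIC's actual code: it assumes the conversion amounts to dividing each mRNA proportion by the mRNA content per cell and rescaling to a sum of 1, and all numbers are invented rather than EPIC's built-in values.

```r
# Invented mRNA proportions for one sample (they sum to 1).
mRNAProportions <- c(Tcells = 0.20, Macrophages = 0.30, otherCells = 0.50)

# Invented relative mRNA content per cell; NOT EPIC's built-in values.
mRNA_cell <- c(Tcells = 0.4, Macrophages = 1.4, otherCells = 0.4)

# Divide each mRNA proportion by the mRNA content per cell of that type,
# then rescale so that the cell fractions sum to 1 again.
unscaled <- mRNAProportions / mRNA_cell[names(mRNAProportions)]
cellFractions <- unscaled / sum(unscaled)

round(cellFractions, 3)
```

In this toy example the macrophages are assumed to carry more mRNA per cell than the other types, so their cell fraction comes out lower than their mRNA proportion. A wrong mRNA/cell value for one cell type shifts the fractions of all cell types through the final rescaling, which is why the warning mentions a possible bias for all of them.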

I receive a warning message that “the optimization didn’t fully converge for some samples”. What does it mean?

When estimating the cell proportions, EPIC performs a least-squares regression between the observed expression of the signature genes and the expression of these genes predicted from the estimated proportions and the gene expression reference profiles of the various cell types.
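The following toy example sketches what such a constrained least-squares fit can look like with constrOptim from the R stats package. It is a simplified illustration under stated assumptions, not EPIC's implementation: the reference profiles and bulk values are invented, the cells without a profile are assumed not to express the signature genes, and any further details of EPIC's real objective function are not reproduced.

```r
# Toy reference profiles: 4 signature genes x 2 cell types (invented values).
refProfiles <- matrix(c(10, 0, 5, 1,
                         0, 8, 5, 2), nrow = 4,
                      dimnames = list(paste0("gene", 1:4), c("typeA", "typeB")))

# Simulated bulk sample: 30% typeA, 50% typeB, 20% cells without a profile.
trueProp <- c(typeA = 0.3, typeB = 0.5)
bulk <- as.vector(refProfiles %*% trueProp)

# Sum of squared differences between observed and predicted expression.
sse <- function(p) sum((bulk - refProfiles %*% p)^2)

# Constraints in the form ui %*% p - ci >= 0:
# p_A >= 0, p_B >= 0 and p_A + p_B <= 1.
ui <- rbind(c(1, 0), c(0, 1), c(-1, -1))
ci <- c(0, 0, -1)

# The starting point must lie strictly inside the feasible region.
fit <- constrOptim(theta = c(0.1, 0.1), f = sse, grad = NULL, ui = ui, ci = ci)

estimated <- setNames(fit$par, colnames(refProfiles))
otherCells <- 1 - sum(estimated)   # remainder assigned to "other cells"
fit$convergence                    # 0 indicates successful convergence
```

Here the proportion of cells without a reference profile is simply taken as what remains after the fitted proportions are subtracted from 1.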

When this warning appears, it means that the optimization didn’t manage to fully converge for this regression in some of the samples. You can then check “fit.gof$convergeCode” (and possibly also “fit.gof$convergeMessage”), which EPIC outputs alongside the cell proportions. This tells you which samples had convergence issues: a value of 0 means the optimization converged, while other values are errors or warnings. Their meaning can be found in the help of the “optim” (or “constrOptim”) function from the R “stats” package, which is used during the optimization; EPIC simply forwards the message it returns.

The code that usually appears is “1”, which means that the maximum number of iterations was reached during the optimization. This could indicate an issue with the bulk gene expression data, which may not completely follow the assumption of equation (1) from our manuscript. In our experience, the proportions were nevertheless predicted well even when this warning appeared. It may be that the optimization is just trying to be too precise, or that a few of the signature genes didn’t match well while the remaining signature genes were enough to give a good estimate of the proportions.

If some samples seem to have strange results, it can however be useful to check whether these samples failed to converge. To be more conservative, you could also remove all the samples that didn’t converge well, as these may be outliers, provided they are only a small fraction of your original samples. Another possibility would be to change the parameters of the optim/constrOptim function to allow more iterations or a looser convergence tolerance, but for this you would need to modify the code of EPIC directly, as I didn’t implement such an option in EPIC.
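The check and the conservative filtering described above could look roughly as follows. Because the sketch has to run without real data, it uses a hand-made mock of the EPIC output. It assumes that the rows of cellFractions and of fit.gof are named after the samples, and the convergeMessage text is invented.

```r
# Mock of the relevant part of EPIC's output, for illustration only:
# a list with cellFractions (samples x cell types) and a fit.gof data.frame.
out <- list(
  cellFractions = matrix(
    c(0.2, 0.3, 0.1, 0.8, 0.7, 0.9), nrow = 3,
    dimnames = list(c("s1", "s2", "s3"), c("Tcells", "otherCells"))),
  fit.gof = data.frame(
    convergeCode = c(0, 1, 0),
    convergeMessage = c("", "iteration limit reached", ""),
    row.names = c("s1", "s2", "s3"), stringsAsFactors = FALSE)
)

# Samples whose optimization did not return code 0.
notConverged <- rownames(out$fit.gof)[out$fit.gof$convergeCode != 0]
notConverged

# Conservative option: keep only the samples that converged.
keptFractions <- out$cellFractions[
  !(rownames(out$cellFractions) %in% notConverged), , drop = FALSE]
keptFractions
```

With real results, replace the mock list by the object returned by EPIC and check first how many samples would be removed.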

Who should I contact in case of a technical or other issue?

Julien Racle ([email protected]). Please provide as many details as possible and ideally also send an example input file (and/or reference profiles) that causes the issue.

epic's Issues

mRNA_cell value unknown for some cell types: CAFs, Endothelial...

Hi - I get a warning message when running EPIC (input dataset is TCGA RNAseq FPKM). Please advise on a resolution.

Warning message:
In EPIC(bulk = bulkSamplesMatrix) :
mRNA_cell value unknown for some cell types: CAFs, Endothelial - using the default value of 0.4 for these but this might bias the true cell proportions from all cell types.

Custom Signature genes

Hi,
I was wondering if I can provide a custom signature gene file to EPIC.
If so, can you tell me how to do that?
Thank you in advance.

ATAC-seq deconvolution

Dear,

Is it possible to use EPIC for cell type deconvolution using ATAC-seq datasets?
It gives the following error when I provide bulk ATAC-seq matrix (rows=peaks, columns=sample/subject) as input.

Error in EPIC::EPIC(bulk = bulk.mtx, sigGenes = rownames(Signature)) :
There are only 0 signature genes matching common genes between bulk and reference profiles, but there should be more signature genes than reference cells

Thank you in advance.

Kind regards,
Seoyeon

Confusion about correlation between EPIC and MCPcounter

Hi, David
I have recently been running EPIC on BRCA and OV RNA-seq data from TCGA. I am confused because, when I also run MCPcounter (another method, https://github.com/ebecht/MCPcounter), the CAF/fibroblast results of the two methods are highly correlated in Pearson/Spearman analyses. I’m wondering whether there are some settings I should be careful with.
I attached main code and screenshots:
`results <- MCPcounter.estimate(log2(1+dat_input),featuresType='HUGO_symbols',probesets=probesets1,genes=genes1)
result <- EPIC(dat_input, reference="TRef")
Warning messages:
1: In EPIC(dat_input, reference = "TRef") :
The optimization didn't fully converge for some samples:
......
2: In EPIC(dat_input, reference = "TRef") :
mRNA_cell value unknown for some cell types: CAFs, Endothelial - using the default value of 0.4 for these but this might bias the true cell proportions from all cell types.

ggplot(cells, aes(EPIC, MCP))
#(Other parameters are default)`

(Results from previous study: Risk Signature of Cancer-Associated Fibroblast–Secreted Cytokines Associates With Clinical Outcomes of Breast Cancer)
I’m eagerly waiting for your reply, thanks!

This version is not compatible with R >=4.0.0

Hi there,
It seems that this version of EPIC could only work on R<=3.6.3. Therefore, I hope there will be a new version which could work on R >=4.0.0.
Many thanks!
yhren

PBMC extracted data (not working)

Hello!

I am trying to run EPIC on some synthetic mixtures extracted from PBMC data and another stromal cell type. We have 100 randomly selected synthetic mixes for each of the three files we want to run. I have tried both the online tool and the tool installed on a cluster. With the online tool, as soon as I upload the bulk mix, the tool says "Disconnected from server" and prompts me to reload the page. On the cluster, I am getting the error "Attempt to set colnames on an object with less than two dimensions".

The tool works on some other sets we have tried. Gene names and any file formatting errors have been checked for the PBMC sets as well, but to no avail.

Any help would be greatly appreciated! Thank you.

Best,
Benjamin Shou

Adding TRef and BRef references

Hello,

Thank you for developing this useful tool to identify cell fractions in tumor cells. I have already used the tool using reference datasets and I do get some useful results.
However, an investigator I work with is interested in looking at all immune cell and tumor-infiltrating cell fractions in our bulk dataset. Is it possible to combine the TRef and BRef reference datasets that come with the package? If yes, how should I go about doing this?

If I have to create a custom reference, how would you suggest I go about generating it?

Thank you,
Krutika

Can we use normalised counts as input?

Hi.

I'm keen to use EPIC and read in the instructions that it only accepts TPMs or FPKMs. Is it possible to use normalised counts instead?

Thanks.

Approach for extending TRef profile

Hi there,

I was thinking about the following approach and was wondering about your opinion:

To further refine the identification of cell fractions/ tissue fractions in my bulk RNAseq samples I thought about extending the TRef sample profiles with an additional profile for normal tissue (based on bulk RNAseq of the respective normal tissue) to thus get an even better estimation of tumor purity (= otherCells).
Based on this idea, my questions are as follows:

Do you think such an approach is feasible?
The only thing I have to do is to add another column representing the normal tissue to TRef$refProfiles and add the respective gene markers to TRef$sigGenes ?
The marker genes for the normal tissue I would determine based on:

Importantly, we do not require our signature genes to be expressed in exactly one cell type, but only to show very low expression in cancer cells

and the extended "Cell marker gene identification" in the methods section?

I guess since the input matrix is in normalized TPM, this is also used for TRef$refProfiles ?

I'm very curious about your feedback!

Thank you!

Error: processing vignette 'EPIC.Rmd' failed with diagnostics:

 devtools::install_github("GfellerLab/EPIC", build_vignettes=TRUE)
Downloading GitHub repo GfellerLab/EPIC@HEAD
✓  checking for file ‘/private/var/folders/h1/wph5l7kn6sq020007z1bq2d00000gn/T/RtmpIhaNLF/remotese2c15a901700/GfellerLab-EPIC-dcdcbc5/DESCRIPTION’ (376ms)
─  preparing ‘EPIC’:
✓  checking DESCRIPTION meta-information ...
─  installing the package to build vignettes
E  creating vignettes (20.6s)
   --- re-building ‘EPIC.Rmd’ using rmarkdown
   dyld: lazy symbol binding failed: Symbol not found: ____chkstk_darwin
     Referenced from: /usr/local/bin/pandoc (which was built for Mac OS X 10.15)
     Expected in: /usr/lib/libSystem.B.dylib
   
   dyld: Symbol not found: ____chkstk_darwin
     Referenced from: /usr/local/bin/pandoc (which was built for Mac OS X 10.15)
     Expected in: /usr/lib/libSystem.B.dylib
   
   Error: processing vignette 'EPIC.Rmd' failed with diagnostics:
   pandoc document conversion failed with error 6
   --- failed re-building ‘EPIC.Rmd’
   
   SUMMARY: processing the following file failed:
     ‘EPIC.Rmd’
   
   Error: Vignette re-building failed.
   Execution halted
Error: Failed to install 'EPIC' from GitHub:
  System command 'R' failed, exit status: 1, stdout + stderr (last 10 lines):
E> 
E> Error: processing vignette 'EPIC.Rmd' failed with diagnostics:
E> pandoc document conversion failed with error 6
E> --- failed re-building ‘EPIC.Rmd’
E> 
E> SUMMARY: processing the following file failed:
E>   ‘EPIC.Rmd’
E> 
E> Error: Vignette re-building failed.
E> Execution halted

Can EPIC accept CPM or microarray data?

Hi there,

It looks like EPIC is designed for RNA-Seq TPM data. Is RNA-Seq CPM data ok?
How about using microarray data with EPIC? If possible, is any normalization (like quantile) required?

Many thanks,
Lindsay

other cells in case of PBMC data

Hi
Thank you for this fantastic work. I had several questions while using the package.

I'm trying to deconvolve PBMC-derived bulk RNA-seq data (14000 genes after ID mapping, removal of 0's, etc.) over different time points and would like to understand the population dynamics. I used "BRef" as my reference profile. I arrived at other cells being around 60-70% of the total cell population. I expect it to be much lower than that since my data is from PBMC. What do you think about this? Should I use the option to ignore other cell populations in the EPIC wrapper?
What is the importance of sigGenes and how are they useful? I'm sorry if this question is trivial, but I'm not able to understand why they exist only as a list of names in the reference profile.
My dataset has many missing genes (I am using only 14000 genes, while the reference profile has about 49000 genes). How are these missing values accounted for in the algorithm?

Thanks in advance for the answers.

other cells

Hi,

The software is great, but I am still confused about which cells are included in "other cells". Can normal cells also be mixed in there with cancer cells, or is it assumed that the "other cells" portion is all cancer cells?

mRNA_cell warning

Thanks for providing such great software!
Since the paper states that "The renormalization by mRNA content appeared to be important for predicting actual cell fractions", and judging from the code, the mRNA_cell argument looks quite important for cell fraction estimation. If I understand correctly, these mRNA_cell values should be obtained from wet-lab experiments (FA analysis for example).
In this setting, what if I can't get such values for my customized reference profile? Even for TRef, I still got warnings about "mRNA_cell value unknown for some cell types and for these but this might bias the true cell proportions from all cell types." I am wondering whether there is any way to handle this situation.