zhouzilu / dendro


Genetic Heterogeneity Profiling by Single Cell RNA Sequencing

License: GNU General Public License v3.0

R 61.07% Shell 38.93%

single-cell tumor-heterogeneity statistics computational-biology bioinformatics

dendro's Introduction

Zilu Zhou

I’m currently a Senior Data Scientist at Google. My work focuses on statistical modeling and inference, in the context of model A/B testing and causal inference. Previously, I received my Ph.D. in Genomics and Computational Biology and my M.A. in Statistics from the University of Pennsylvania. My Ph.D. thesis is Statistical Methods For Multi-Omics Inference From Single Cell Transcriptome.

My Email: zhouzilu "at" pennmedicine "dot" upenn "dot" edu

dendro's People

Forkers

wangdi2014 huaichao2018 bit-vs-it lizamathews 2019surbhi tqh003

dendro's Issues

Reference Genome

Hi Zhou,
Your DENDRO is amazing. I'm wondering whether there are any specific requirements for the reference genome. Is the Ensembl reference genome the one you used in the sample test?
Thanks.

Some questions about DENDRO_demo

Hi Zilu,
Recently I tried the demo for your wonderful DENDRO by following the vignette (https://github.com/zhouzilu/DENDRO/blob/master/vignette/DENDRO_vignette.Rmd). It works well and generates some nice plots, but when I checked the count matrices in the 'demo' object, I found some things that are hard to explain. First, some values in the N matrix (total read depth) are less than the corresponding values in the X matrix (allele read depth). Second, some positions have 0 in the N matrix but 1 in the Z matrix (mutation or not). These conflict with my understanding. Could you please upload the gvcf file of the demo so that people can check it? My DENDRO version is 0.2.2 and my R version is 3.6.0.
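The two inconsistencies described above can be located programmatically. The following is a minimal sketch, not part of DENDRO: `check_xnz` is a hypothetical helper, and the small matrices are made-up toy values rather than the demo data.

```r
# Hypothetical helper (not a DENDRO function): list the entries where the
# three input matrices disagree with each other.
check_xnz <- function(X, N, Z) {
  stopifnot(identical(dim(X), dim(N)), identical(dim(X), dim(Z)))
  list(
    # variant-allele depth larger than total depth
    x_gt_n = which(X > N, arr.ind = TRUE),
    # mutation flagged at a position with no reads at all
    z_without_reads = which(Z == 1 & N == 0, arr.ind = TRUE)
  )
}

# Toy 2 x 2 matrices (rows = sites, columns = cells), filled column by column.
X <- matrix(c(2, 0, 5, 1), nrow = 2)
N <- matrix(c(1, 0, 9, 4), nrow = 2)
Z <- matrix(c(1, 1, 1, 0), nrow = 2)

issues <- check_xnz(X, N, Z)
print(issues)
```

On the demo object this could presumably be called as `check_xnz(demo$X, demo$N, demo$Z)`, assuming the matrices are stored under those names.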

The upper dimensional limit of X, N, Z matrix

Hi Zilu,
When I run DENDRO to build a parsimony tree for a liver cancer dataset, the distance calculation step (demo_qc$dist = DENDRO.dist(demo_qc$X,demo_qc$N,demo_qc$Z,show.progress=FALSE)) has been running for about 3 days and is still not finished. After the FilterCellMutation step, the filtered X, N, Z matrices have dimension 2705*4533 (2705 mutation sites and 4533 cells). So I would like to know the upper dimensional limit of the X, N, Z matrices, and whether my dataset can run to completion and produce a result.

I look forward to your reply.
Best wishes.
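To get a feel for the size of this job, here is a rough sketch. It assumes, without confirmation from the DENDRO source, that the distance step visits every pair of cells and that its cost grows with the number of pairs; `extrapolate_runtime` is a hypothetical helper for scaling a timing measured on a subsample of cells.

```r
n_cells <- 4533
n_sites <- 2705

# Number of distinct cell pairs in a pairwise distance matrix.
n_pairs <- choose(n_cells, 2)

# Upper bound on per-site comparisons if every site is visited for every pair.
site_comparisons <- n_pairs * n_sites

# Hypothetical helper: scale a runtime measured on n_sub cells up to n_full
# cells, ASSUMING the cost is proportional to the number of cell pairs.
extrapolate_runtime <- function(t_sub, n_sub, n_full) {
  t_sub * choose(n_full, 2) / choose(n_sub, 2)
}

cat("cell pairs:", n_pairs, "\n")
cat("pair-by-site comparisons:", site_comparisons, "\n")
```

With 4533 cells there are about 10.3 million cell pairs, so timing `DENDRO.dist` on a few hundred randomly chosen cells and passing that time to `extrapolate_runtime` would give a first estimate of whether the full run is feasible.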

Is mutation detection on matched whole exome sequencing data a better solution?

Hi Zilu Zhou:

Thank you for developing the software.

I would like to ask whether detecting SNAs from matched WES data (that is, I have scRNA data and WES data from the same bulk tissue) gives a better VCF input?

Many Thanks

Ziwei Wang

STAR and GATK example

Hi Zilu,
This tool looks very useful. Would you be able to provide example settings used for STAR and GATK?
Thanks

DENDRO.recalculate

When I run the DENDRO.recalculate function, the resulting mutation indicator matrix Z does not contain only 0 and 1; it also contains the values 2 and 3.

Extracting AD (read depth for each allele) in vcf_to_DENDROinput.R might not cover all mutated alleles

Hi Zilu,

Excellent tool for subclone clustering! I'm trying to integrate in my project and just checked the vcf_to_DENDROinput.R. In the following function:
exinfo_x <- function(info){
  return(sapply(strsplit(info,':'),function(x){ifelse(x[GT_pos]=='./.',NA,as.numeric(strsplit(x[AD_pos],',')[[1]][2]))},simplify=T))
}

It only extracts the read depth of the second allele listed in the .vcf file, but in my case there is a sample with 1/2:0,5,8:13:99:666,299,267,218,0,187.

It has two different alternate alleles, and both show a certain level of expression. With the script as written, the depth of the second alternate allele is not counted.

Maybe we could add all except allele 0?

Looking forward to discussing more!
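A minimal sketch of the suggestion above, summing the depth of every allele except allele 0. The name `exinfo_x_all_alt` is hypothetical and this is not the script's own code; unlike the original function, `GT_pos` and `AD_pos` are passed as arguments rather than read from the enclosing environment.

```r
# Hypothetical variant of exinfo_x: add up the AD entries of all
# non-reference alleles instead of taking only the second entry.
exinfo_x_all_alt <- function(info, GT_pos, AD_pos) {
  vapply(strsplit(info, ":", fixed = TRUE), function(x) {
    # Missing genotype: no usable depth.
    if (x[GT_pos] %in% c("./.", ".")) return(NA_real_)
    ad <- as.numeric(strsplit(x[AD_pos], ",", fixed = TRUE)[[1]])
    # Drop the first entry (reference allele) and sum the rest.
    sum(ad[-1])
  }, numeric(1))
}

# The sample from this issue: AD is 0,5,8, so the alternate depths are 5 and 8.
res <- exinfo_x_all_alt("1/2:0,5,8:13:99:666,299,267,218,0,187",
                        GT_pos = 1, AD_pos = 2)
print(res)  # 13
```

Whether summing is appropriate depends on whether the downstream model should treat distinct alternate alleles at one site as the same mutation, which is a modeling decision this sketch does not settle.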

Error Installing DENDRO

Thank you for making such a package.
I am really interested in using DENDRO in my project. But I am facing an error while installing the package. I also installed Biobase but the installation always says that Biobase is not available. I installed TailRank when it appeared in the error but R keeps giving me the same error. I would appreciate any help.
R version 4.0.4.

Downloading GitHub repo zhouzilu/DENDRO@HEAD
Skipping 1 packages not available: Biobase
√  checking for file 'C:\Users\Moutaz Hwlal\AppData\Local\Temp\RtmpkZyulo\remotes2dd433242e94\zhouzilu-DENDRO-a9134a8/DESCRIPTION' ...
-  preparing 'DENDRO': (1000ms)
√  checking DESCRIPTION meta-information ...
-  checking for LF line-endings in source and make files and shell scripts
-  checking for empty or unneeded directories
-  building 'DENDRO_0.2.2.tar.gz'
   
Installing package into ‘C:/Users/Moutaz Hwlal/OneDrive/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
* installing *source* package 'DENDRO' ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** byte-compile and prepare package for lazy loading
Error: (converted from warning) package 'TailRank' was built under R version 4.0.4
Execution halted
ERROR: lazy loading failed for package 'DENDRO'
* removing 'C:/Users/Moutaz Hwlal/OneDrive/Documents/R/win-library/4.0/DENDRO'
Error: Failed to install 'DENDRO' from GitHub:
  (converted from warning) installation of package ‘C:/Users/MOUTAZ~1/AppData/Local/Temp/RtmpkZyulo/file2dd469c36fca/DENDRO_0.2.2.tar.gz’ had non-zero exit status
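The log shows two separate things: Biobase is skipped as "not available", and a warning about TailRank is converted into an error. A possible workaround, sketched under the assumption that the remotes package is doing the installation (as the log suggests), is to install Biobase from Bioconductor and to set the remotes environment variable that stops warnings from being promoted to errors. `install_dendro` is a hypothetical wrapper and is only defined here, not called, because it needs network access.

```r
# Ask remotes not to convert warnings into errors during installation.
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = "true")

# Hypothetical wrapper: install the Bioconductor dependency first, then DENDRO.
install_dendro <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Biobase is distributed through Bioconductor, not CRAN.
  BiocManager::install("Biobase")
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github("zhouzilu/DENDRO")
}
```

Running `install_dendro()` in a fresh R session would then attempt the installation again with both changes in place.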

extract the variant information

Hi Zilu,
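DENDRO is a wonderful tool, and it may be very useful to me. But I have run into a big problem while building the pipeline.
Problem:

In the example script, you call indels per sample, right? You then get a GVCF file containing the indels called from sample.bam (like SRR5023621.bam). So how do you extract the indel information for each cell, and how do you get the three matrices from the GVCF produced by the calling step?
I am not sure whether I have expressed this clearly, so let me ask again in more detail:
First, I can follow each step of the example script you provided. My question is this: you call indels per sample, and each sample contains many cells and is a mixed sample on which no cell typing has been done (I am not sure whether I understand this correctly). The resulting GVCF therefore holds the indel information for one sample, so how do you extract the indel information for each individual cell?
In other words, how are the three input matrices of the R package (X, N, Z) obtained?
Thanks.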

10x scRNA-seq

Hello,

Thanks for this resource.
I have a scRNA-seq dataset with close to 10,000 cells from droplet based sequencing (10x chromium).
How does the pipeline handle a large number of cells from droplet-based sequencing?

Thanks!
Brian

Reading GVCF into R

Hi Zilu, how can one read the GVCF results? In particular, how is the mutation profile matrix Z defined from the GVCF results? Many thanks in advance.
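As a purely illustrative sketch of how a 0/1 indicator matrix could be derived from genotype calls: the convention below (missing call becomes NA, homozygous reference becomes 0, anything else becomes 1) is an assumption and is not confirmed as DENDRO's own definition. `gt_to_z` and the toy genotypes are hypothetical, and the layout assumes a joint-called VCF with one sample column per cell.

```r
# Toy genotype (GT) strings: rows are variant sites, columns are cells.
gt <- matrix(c("0/0", "0/1", "./.",
               "1/1", "0/0", "0/1"),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("site1", "site2"),
                             c("cellA", "cellB", "cellC")))

# Hypothetical conversion under the ASSUMED convention described above.
gt_to_z <- function(gt) {
  z <- matrix(1L, nrow = nrow(gt), ncol = ncol(gt), dimnames = dimnames(gt))
  z[gt %in% c("0/0", "0|0")] <- 0L              # homozygous reference
  z[gt %in% c("./.", ".|.", ".")] <- NA_integer_  # no call
  z
}

Z <- gt_to_z(gt)
print(Z)
```

The repository's vcf_to_DENDROinput.R script is the place to check the rule the package actually applies.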

BAM file from CellRanger

Hello,

In the process of generating the VCF file, can I use the 'possorted_genome_bam.bam' file that is returned by the CellRanger Count software? Or is it recommended to redo the mapping step?

Does the pipeline accept VCF files generated in a different way than using PicardTools and GATK, e.g. strelka2, freebayes, vartrix, or simply samtools mpileup?

X matrix is not two dimensional

Hi,

I used the script from your GitHub repository that extracts the X, N, Z matrices from a VCF. Upon checking the objects created, I realized that the X object is a vector, whereas the DENDRO input is expected to be a 2D matrix.

Could you please tell me what I am doing incorrectly?

Format of my vcf file:

FORMAT	unknown
GT:DP:AD:RO:QR:AO:QA:GL	0/0:25:23,2:23:878:2:78:0,-1.37399,-70.1749
GT:DP:AD:RO:QR:AO:QA:GL	1/1:447:1,445:1:40:445:17230:-1418,-131.177,0
GT:DP:AD:RO:QR:AO:QA:GL	1/1:2013:2,2010:2:80:2010:78744:-6991.24,-598.981,0
GT:DP:AD:RO:QR:AO:QA:GL	1/1:128:5,123:5:196:123:4798:-381.42,-21.895,0
GT:DP:AD:RO:QR:AO:QA:GL	1/1:26:0,26:0:0:26:1012:-89.5765,-7.82678,0
GT:DP:AD:RO:QR:AO:QA:GL	1/1:185:1,184:1:27:184:6978:-621.148,-52.9908,0
GT:DP:AD:RO:QR:AO:QA:GL	1/1:7610:2,7608:2:78:7608:295446:-26024,-2283.57,0
GT:DP:AD:RO:QR:AO:QA:GL	0/2:25:6,2,17:6:228:2,17:78,606:-45.3378,-41.3619,-57.9117,0,-11.3651,-12.0292

Any help/suggestion would be appreciated

Thanks
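One possible cause, not confirmed here, is that this VCF has a single sample column (named "unknown"), so an sapply-style extraction returns a plain vector instead of a sites-by-cells matrix. The sketch below uses three of the rows shown above; `field_pos` and `first_alt_depth` are hypothetical helpers, not functions from the script. It looks up AD by name in the FORMAT string, since AD is the third field in this file, and keeps the result as a matrix even with one column.

```r
fmt <- "GT:DP:AD:RO:QR:AO:QA:GL"
calls <- c("0/0:25:23,2:23:878:2:78:0,-1.37399,-70.1749",
           "1/1:447:1,445:1:40:445:17230:-1418,-131.177,0",
           "0/2:25:6,2,17:6:228:2,17:78,606:-45.3378,-41.3619,-57.9117,0,-11.3651,-12.0292")

# Hypothetical helper: position of a key (e.g. "AD") inside the FORMAT string.
field_pos <- function(fmt, key) {
  match(key, strsplit(fmt, ":", fixed = TRUE)[[1]])
}

# Hypothetical helper: depth of the first alternate allele for each call.
first_alt_depth <- function(info, ad_pos) {
  vapply(strsplit(info, ":", fixed = TRUE), function(x) {
    as.numeric(strsplit(x[ad_pos], ",", fixed = TRUE)[[1]][2])
  }, numeric(1))
}

AD_pos <- field_pos(fmt, "AD")

# Sites in rows, cells in columns; here there is only one sample column.
samples <- matrix(calls, ncol = 1, dimnames = list(NULL, "unknown"))

# Rebuild the matrix shape explicitly so a single column does not collapse.
X <- matrix(first_alt_depth(as.vector(samples), AD_pos),
            nrow = nrow(samples), dimnames = dimnames(samples))
print(dim(X))
```

If the input is expected to hold one column per cell, a one-column matrix like this would still describe only a single sample, so the VCF itself may need to be generated per cell.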

Is DENDRO limited by the type of sequencing protocol?

Hello, thank you for developing this package.

I would like to use DENDRO in my project, but I have a question: can DENDRO be used on data generated with a 3'-tag sequencing protocol such as Chromium from 10x Genomics? Are there any differences in the results between data generated with a full-length RNA sequencing protocol and data from a 3'-tag sequencing protocol?

I would appreciate any help

VCF file preparation for plate-based scRNA-seq

Hi,
I am very interested in using DENDRO for my scRNA-seq dataset, which was generated using the plate-based CEL-Seq2 method.
I have 4 different conditions, each with 3 plates of 384 cells. How does cell identification work within the pipeline? Do I need to make a separate VCF file per cell?
Thank you very much for your answer.

Cheers,
Dyah