shendurelab / cfdna

Analysis of epigenetic signals captured by fragmentation patterns of cell-free DNA

License: MIT License

Languages: Python 84.11%, R 15.89%

cfdna's Issues

Missing line in cfDNA/data/Ensembl_v75/README

Hi,
It seems that line 4 is missing from the README file, so a file needed for subsequent processing is never created.
My guess is that it would be something like:

    ./extractGeneBody.py | bgzip -c > ensemblv75_canonicalTranscriptIDs.protein_coding.lst.gz

Another issue: cfDNA/WPS_overlays/Maurano_et_al_TFclusters/README seems to refer to data/Maurano_et_al_func_var/hg19.taipale.recluster.counts in line 18. How do I get this file?

Thanks,
Jan

Unable to run plots.R

I am having some difficulty getting plots.R to run on the FFT summary data output by convert_files.py (everything up to this point worked as expected).

I am using the Protein Atlas file from: http://v13.proteinatlas.org/download/rna.csv.zip. (The link provided in the Human Protein Atlas section of the README is no longer functioning.) It appears that the Protein Atlas file actually used was post-processed, to convert it from comma- to tab-delimited and likely also to rename some columns, based upon the plots.R script.

The main issue I am currently encountering has to do with the first computation of a correlation, on line 36. The issue is that logndata2 has all NAs, since its assignment on line 34 does not work. This appears to be due to the initial assignment of row names (rownames(proteinAtlas) <- proteinAtlas$GeneID; line 6) not working, since there is no GeneID column in my Protein Atlas file.

However, if I change this assignment to instead use the Gene column, it does not work due to non-unique row names since there are multiple FPKM values per gene ID (one per tissue type or cell line).

If I instead re-write line 34 as follows:

    labelledLogndata <- cbind(proteinAtlas[,1], logndata)
    logndata2 <- labelledLogndata[labelledLogndata[,1] %in% fdata[,1],]

so that it does not depend on row names, I end up with a logndata2 dimension of Z x 5, while fdata has a dimension of Y x 64; the number of genes is equal, but Z and Y are not. The correlation on line 36 currently uses logndata2 directly, but this results in a non-numeric error. If I instead use logndata2[,3], to obtain the FPKM column, I then get an incompatible dimension error. Finally, on line 39, logndata2[,"NB.4"] is used, but I do not have any column with an NB.4 label.

Is there another version of the script that pre-processes the data to resolve this, or am I perhaps using an incorrect Protein Atlas file? If you have any suggestions on how to resolve these issues, please let me know.
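
To illustrate the reshaping that the row-name problem above seems to call for, here is a minimal sketch that pivots a long-format Protein Atlas table (one row per gene and sample) into a tab-delimited table with one row per gene and one column per sample. The column names `Gene`, `Sample` and `Value`, the output header `GeneID`, and the example values are assumptions made for illustration; this is not confirmed to be the pre-processing the authors applied.

```python
import csv
import io
from collections import defaultdict


def long_to_wide(csv_text, gene_col="Gene", sample_col="Sample", value_col="Value"):
    """Pivot long-format rows (one per gene/sample pair) into one row per gene."""
    table = defaultdict(dict)   # gene -> {sample: value}
    samples = []                # sample names in order of first appearance
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample = row[sample_col]
        if sample not in samples:
            samples.append(sample)
        table[row[gene_col]][sample] = row[value_col]

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # "GeneID" is an assumed header, chosen to match the column plots.R looks up.
    writer.writerow(["GeneID"] + samples)
    for gene, values in table.items():
        # Genes without a value for a sample get "NA", which R reads as missing.
        writer.writerow([gene] + [values.get(s, "NA") for s in samples])
    return out.getvalue()


# Made-up example rows; the column layout is an assumption about rna.csv.
example = """Gene,Gene name,Sample,Value,Unit
ENSG00000000003,TSPAN6,A-431,27.8,FPKM
ENSG00000000003,TSPAN6,liver,12.1,FPKM
ENSG00000000005,TNMD,A-431,0,FPKM
"""

print(long_to_wide(example))
```

With one row per gene, the gene identifiers are unique and can serve as row names, so a lookup by row name would no longer hit the duplicate-name error. Whether the resulting sample columns match the labels plots.R expects (such as NB.4) is a separate question.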

Unable to find CH01.bam

Hi,

I am currently working with your dataset, in particular the BAM files available at http://krishna.gs.washington.edu/download/cfDNA-Nucleosomes/BAMs/, but I can't find CH01.bam on your website.
I am also unable to find the SRA file that corresponds to this sample in the NCBI GEO project (GSE71378).
I was only able to get GSE71378_CH01.bb.

Thank you

Request for the source of modified samtools

Hi,

The modified samtools binary provided here cannot be used on a CentOS 5.5 cluster. Could you please upload the samtools source code so that we can recompile it on our cluster?

Thanks

extremely large FFT-WPS values

Hi,

I followed the instructions in this GitHub repository to run the scripts that extract WPS values from the FFT. My code is below:

    sample='sample'
    count_dir=/wps/body/${sample}/counts/
    fft_dir=/wps/body/${sample}/fft/

    mkdir -p ${count_dir}
    mkdir -p ${fft_dir}

    /expression/extractReadStartsFromBAM_Region_WPS.py --minInsert=120 --maxInsert=180 -i /wps/test.tsv -o ${count_dir}/block_%s.tsv.gz sample.cram

    cd ${count_dir}

    ls block_*.tsv.gz | xargs -n 500 Rscript /expression/fft_path.R ${count_dir} ${fft_dir}

    /expression/convert_files.py -a /wps/gene_body.tsv -t /wps/ -r  /wps/ -p body -i sample

My plasma samples have ~20x coverage. Looking at the results, the numbers seem a bit strange to me.

For example, in the block_*.tsv.gz files, all WPS values appear to be negative, which seems unlikely. My understanding is that the 5th column = 3rd column - 4th column, followed by smoothing. Is that right?

1       982004  13      1       -27
1       982003  14      1       -27
1       982002  14      0       -27
1       982001  14      0       -27
1       982000  14      0       -27
1       981999  14      0       -27
1       981998  14      0       -27
1       981997  15      2       -27
1       981996  15      1       -27
1       981995  15      0       -27
1       981994  16      1       -27
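
For reference, the windowed protection score as defined in the publication behind this repository is not coverage minus read starts. At each position it is the number of fragments that completely span a window centred there, minus the number of fragments with an endpoint inside that window, so it can be negative even where coverage is positive. The sketch below illustrates that definition only; the function names, the inclusive coordinate convention, the 120 bp window and the example fragments are assumptions, and it is not the code of extractReadStartsFromBAM_Region_WPS.py.

```python
def coverage(fragments, pos):
    """Number of fragments, given as inclusive (start, end) pairs, covering pos."""
    return sum(1 for start, end in fragments if start <= pos <= end)


def wps(fragments, pos, window=120):
    """Windowed protection score at pos (illustrative sketch, assumed conventions)."""
    half = window // 2
    w_start, w_end = pos - half, pos + half
    score = 0
    for start, end in fragments:
        if start <= w_start and end >= w_end:
            score += 1   # fragment completely spans the window
        elif w_start <= start <= w_end or w_start <= end <= w_end:
            score -= 1   # fragment has an endpoint inside the window
    return score


# Hypothetical fragments around position 200.
fragments = [(100, 300), (190, 260), (50, 210)]
print(coverage(fragments, 200))  # 3
print(wps(fragments, 200))       # -1
```

In this toy case all three fragments cover position 200, but only one spans the whole window 140..260 while the other two end or start inside it, giving a score of 1 - 2 = -1.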

After running the FFT, the final WPS values are extremely large. Is this normal? Do these WPS values need to be normalized in some way, or can I use them directly for downstream analysis? What are the Cov and Starts columns here?

    Freq  Cov               Starts              WPS
    279   10219.5254719195  0.0238497884478577  8598.75313616417
    273   5982.58718950388  0.0140443430887118  7952.29769624378
    268   7466.81576752813  0.0144896445869867  11721.1384595369
    262   8846.48370222107  0.022895061625473   15191.918936982
    257   7442.184665902    0.0313783453071702  16884.9712819869
    252   5797.68161245892  0.0306839303032613  17429.5203876919
    248   14007.9825155327  0.0561427608426257  46347.379064728
    243   27006.8319166288  0.0855563232912757  84710.8334345234
    239   22558.520706278   0.075053714590048   69932.0207777483
    234   13085.7261256437  0.0747390689683382  46922.1465931742
    230   10086.7106152956  0.0758667625959573  39388.6401764214
    226   5792.06498397896  0.068366103385063   24709.960170311
    222   4104.03613727378  0.087971946632555   27390.4829599399

Thanks!

Do you want to consider upgrading your python scripts?

Hi,

I find it very hard to execute your scripts.
Are you working on any packages that allow us to do cf-nucleosome analysis?
Thank you.

Canonical Transcript IDs selection

Hello,

I've been trying to reproduce the canonical transcript IDs that you uploaded in cfDNA/data/Ensembl_v75/

However, I do not arrive at the same list when using the above lists to select the canonical transcript IDs:

Nor do I arrive at the same list when selecting the longest transcript per gene.

Would you mind telling me how you settled on those particular transcripts?
Did you use particular attributes obtained with biomaRt to perform your selection?

Best regards,
MushuW.
