hotspots's Introduction

# Identifying recurrent mutations in cancer

Software and dataset

Description:

This is a method to identify population-scale recurrent mutations in cancer based on a binomial statistical model that incorporates underlying mutational processes, including nucleotide context mutability, gene-specific mutation rates, and major expected patterns of hotspot mutation emergence.

Dependencies:

Requires R version 3.0.2 or higher. Install the dependent packages (data.table, IRanges, BSgenome.Hsapiens.UCSC.hg19) as follows:

install.packages("data.table")
source("http://bioconductor.org/biocLite.R")
# biocLite() takes the package names as a single character vector
biocLite(c("IRanges", "BSgenome.Hsapiens.UCSC.hg19"))

#### Usage:

./hotspot_algo.R
    --input-maf=[REQUIRED: mutation file]
    --rdata=[REQUIRED: Rdata object with necessary files for algorithm]
    --output-file=[REQUIRED: output file to print statistically significant hotspots]
    --gene-query=[OPTIONAL (default=all genes in mutation file): List of Hugo Symbol in which to query for hotspots]
    --homopolymer=[OPTIONAL (default=TRUE): TRUE|FALSE filter hotspot mutations in homopolymer regions]
    --filter-centerbias=[OPTIONAL (default=FALSE): TRUE|FALSE to identify false positive filtering based on mutation calling center bias]
    --align100mer=[OPTIONAL: BED file of hg19 UCSC alignability track for 100-mer length sequences for false positive filtering]
    --align24mer=[OPTIONAL: BED file of hg19 UCSC alignability track for 24-mer length sequences for false positive filtering]

Command to run hotspot algorithm on genes listed in file genes_of_interest.txt:

./hotspot_algo.R \
	--input-maf=pancancer_unfiltered.maf \
	--rdata=hotspot_algo.Rdata \
	--gene-query=genes_of_interest.txt \
	--output-file=sig_hotspots.txt

#### Contents:

[ Required ] hotspot_algo.R - R script to execute hotspot detection algorithm

[ Required ] hotspot_algo.Rdata - Rdata object with necessary files for algorithm (mutability, expression filters, etc)

[ Required ] funcs.R - R script of functions necessary for proper execution of hotspot_algo.R

genes_of_interest.txt - Sample list of genes for hotspot detection

minimalist_test_maf.txt - Minimalist MAF, as needed from maf2maf (mskcc/maf2maf)

#### Notes:

--align100mer and --align24mer are optional filters based on how uniquely k-mer sequences align to a region of the hg19 genome. Both filters were used as part of this analysis. See more information at ENCODE Mapability.

Using these filters requires downloading the 100-mer and 24-mer alignability tracks from UCSC, which are not included here:

ENCODE CRG Alignability 100-mer
ENCODE CRG Alignability 24-mer

Convert the downloaded bigWig files to bedGraph format, following the instructions here: UCSC BigWig
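
The conversion step can be sketched as follows, assuming the UCSC `bigWigToBedGraph` command-line utility is installed and on the PATH. The helper names and the file names are hypothetical placeholders, not names taken from this repository.

```python
import shutil
import subprocess


def bigwig_to_bedgraph_command(bigwig_path, bedgraph_path):
    """Build the argument list for UCSC's bigWigToBedGraph (input, then output)."""
    return ["bigWigToBedGraph", bigwig_path, bedgraph_path]


def convert_bigwig(bigwig_path, bedgraph_path):
    """Run the conversion; raises if the tool is missing or exits non-zero."""
    if shutil.which("bigWigToBedGraph") is None:
        raise FileNotFoundError("bigWigToBedGraph not found on PATH")
    subprocess.run(bigwig_to_bedgraph_command(bigwig_path, bedgraph_path), check=True)


# Print the commands that would be run for the two tracks.
# File names are assumptions; substitute the files you downloaded.
for track in ("align100mer", "align24mer"):
    print(" ".join(bigwig_to_bedgraph_command(track + ".bigWig", track + ".bedGraph")))
```

The resulting files could then be passed to --align100mer and --align24mer.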

hotspots's People

Contributors

changmt, lordzappo, ckandoth, alexpenson, kpjonsson


hotspots's Issues

Error in creating ___tmp.tsv

subprocess.call("bedtools getfasta -tab -fi /ifs/depot/assemblies/H.sapiens/GRCh37/gr37.fasta -bed ___tmp.bed -fo ___tmp.tsv".split(" "))

When I run the code, it fails at this call. No ___tmp.tsv file is created in the directory, and no such file existed beforehand. Can you please help me?

Thank you!

unable to run script make_trinuc_maf.py

I attempted to run the following command from the README:

./hotspot_algo.R --input-maf=minimalist_test_maf.txt --rdata=hotspot_algo.Rdata --gene-query=genes_of_interest.txt --output-file=testrun_sig_hotspots.txt

and I got this:

Reading in MAF...
Prepping MAF for analysis ...
... Ignoring non-SNP mutations
... Making bed file
... Getting regions
Error: The requested fasta database file (/ifs/depot/assemblies/H.sapiens/GRCh37/gr37.fasta) could not be opened. Exiting!
... Adding trinucs (normalized to start from C or T)
... Writing to ___temp_maf-tri.tm
... Cleaning up

It seems that the script requires a single large gr37.fasta file instead of multiple files. How can I fix this?
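
One possible workaround, offered as a sketch rather than a fix confirmed by the maintainers: a multi-sequence FASTA is just the concatenation of single-sequence FASTA files, so per-chromosome files can be merged into one. The function name and the file names in the usage comment are hypothetical. The hard-coded path shown in the error message would still need to point at the combined file, and the sequence names in the FASTA headers must match the chromosome names in the BED file passed to bedtools getfasta.

```python
def concatenate_fasta(parts, combined_path):
    """Concatenate FASTA files, in the given order, into one multi-FASTA file."""
    with open(combined_path, "wb") as out:
        for part in parts:
            last_byte = b"\n"
            with open(part, "rb") as src:
                # Copy in 1 MiB chunks so whole chromosomes never sit in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            # Make sure the next header line starts on a fresh line.
            if last_byte != b"\n":
                out.write(b"\n")


# Usage (file names are assumptions):
# concatenate_fasta(["1.fa", "2.fa", "X.fa", "Y.fa"], "combined.fasta")
```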

filter-centerbias=TRUE gives an error

Hi,
I got an error when trying to use minimalist_test_maf.txt with filter-centerbias=TRUE:

Annotating mutation calling center bias...
data frame with 0 columns and 0 rows
Error in names(x) <- value :
'names' attribute [7] must be the same length as the vector [0]
Calls: cbind ... eval -> annotate.center.bias -> colnames<- -> colnames<-
Execution halted

Does an empty output file mean an error or no hotspots?

Hello,
Thank you very much for providing such useful software. However, when I run it on my own data, the output file contains nothing but the column names. Does an empty result indicate that there are no hotspots, or that an error occurred during the run?

MAF file format

We tried to run this tool for our lung genome analysis. We used a standard MAF with 32 columns, and it is not working. Can you please let us know which columns in the MAF are necessary to run this analysis?
We would appreciate your help!
Thanks,
Ashiq

Code makes assumption not valid for official TCGA MAF

Hi,

I'm getting an error originating from the amino acid length being NA.

From looking at the internals of the code, it appears that you assume the "Protein_position" column has the form "position/length", where "position" is the amino acid position of the mutation and "length" is the total length of the protein. Although a MAF file from TCGA contains a "Protein_position" column, it holds only the "position" part and nothing related to the protein length.

Collin
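
The mismatch described in this issue can be checked before running the algorithm. Below is a minimal sketch, assuming the "position/length" form described above; the function name is hypothetical and this is not the parsing used by hotspot_algo.R. Values that are not plain integers (for example ranges) are reported as None here.

```python
def parse_protein_position(value):
    """Split a 'position/length' string into (position, length).

    Either element is None when it is missing or not a plain integer.
    """
    text = str(value).strip()
    position_part, separator, length_part = text.partition("/")

    def to_int(piece):
        piece = piece.strip()
        return int(piece) if piece.isdigit() else None

    length = to_int(length_part) if separator else None
    return to_int(position_part), length


# A value with both parts, and a TCGA-style value with the position only.
for example in ("61/189", "61"):
    position, length = parse_protein_position(example)
    if length is None:
        print(example, "-> protein length missing")
    else:
        print(example, "-> position", position, "of", length)
```

Rows for which the length comes back as None are the ones that would lead to an NA amino acid length.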

Error handling chromosomes on provided minimal maf file

I wanted to look at the full list of mutations in the output for the provided minimal maf file. However, when I did this by relaxing the q-value threshold it resulted in errors in handling mutations with chromosome "MT" or anything starting with "GL" (these are in the MAF file). I have included an error message below:

Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, : sequence chrGL000205.1 not found Calls: cbind ... lapply -> FUN -> .getOneSeqFromBSgenomeMultipleSequences)

Collin
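
One way to sidestep this error, as a sketch and not a fix confirmed by the maintainers, is to drop mutations on non-canonical contigs before running the script. The helper names are hypothetical; "Chromosome" is the standard MAF column name. Treating only chromosomes 1-22, X and Y as canonical (and so dropping MT) is an assumption made for this example.

```python
CANONICAL_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def is_canonical_chromosome(chromosome):
    """True for 1-22, X and Y, with or without a leading 'chr'."""
    name = str(chromosome).strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    return name.upper() in CANONICAL_CHROMOSOMES


def keep_canonical_rows(rows):
    """Filter MAF rows (dicts keyed by column name) to canonical chromosomes."""
    return [row for row in rows if is_canonical_chromosome(row["Chromosome"])]


rows = [
    {"Chromosome": "7"},
    {"Chromosome": "MT"},
    {"Chromosome": "GL000205.1"},
    {"Chromosome": "chrX"},
]
print([row["Chromosome"] for row in keep_canonical_rows(rows)])
```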

Error in binom.test_snp function

Hello,

Thank you for developing this useful tool. I was able to run the example MAF file successfully after downloading the GRCh37 FASTA file. However, when I run the hotspot script on my own data:
./hotspot_algo.R --input-maf=data_ext_mut.maf --rdata=hotspot_algo.Rdata --output-file=data_hotspot.maf

I get the following log and error:

No q-value entered. Using default q-value cut-off, 0.01
Using: HOMOPOLYMER_FILTER

Reading in MAF...
Prepping MAF for analysis ...
... Reducing MAF to protein-coding substitutions
... Re-annotating substitutions
... Removing putative germline SNPs based on ExAC r0.2
... Removing unexpressed genes
... Ignoring non-SNP mutations
... Making bed file
... Getting regions
WARNING. chromosome (GL000219.1) was not found in the FASTA file. Skipping.
WARNING. chromosome (GL000219.1) was not found in the FASTA file. Skipping.
WARNING. chromosome (GL000193.1) was not found in the FASTA file. Skipping.
... Adding trinucs (normalized to start from C or T)
... Writing to ___temp_maf-tri.tm
... Cleaning up

Finished prepping MAF!
Running algorithm...
Error in rep(1, aa.length - length(pval)) : invalid 'times' argument
Calls: lapply -> lapply -> FUN
Execution halted

Further investigation of the code seems to indicate that the error comes from the binom.test_snp function, but I am not sure why I'm getting it. How do I go about identifying the part that breaks and resolving this issue?

Thanks,
Krutika

MAF file unavailable

Hello,

I tried to get the MAF file using the Google Drive link, but I get a Not Found error (404).

Thanks

Problem with download the full mutational data

Access to the full mutational data can be obtained here: https://drive.google.com/open?id=0B1tQDSL9FmNLTmo1dl9SRF9USUE
I tried to open the link using my Google account, but I encountered an error. Could you please grant me permission to access the file?

How to get indel hotspots with this software

Hi,
The paper "Accelerating discovery of functional mutant alleles in cancer" reports many indel hotspots, as quoted below:
#================
We identified 1,165 statistically significant hotspot mutations of which 80% arose in 1 in 1,000 or fewer patients. Of 55 recurrent in-frame indels, we validated that novel AKT1 duplications induced pathway hyperactivation and conferred AKT inhibitor sensitivity.
#=================
However, the hotspot script discards insertions and deletions. How can I get indel hotspots from my own data?
Thanks

Question about the publication VCFs and BED
