karissawhiting / oncokbr
Annotate mutation, copy number alteration and structural variant data in R using oncoKB Annotation API
Home Page: http://www.karissawhiting.com/oncokbR/
License: Other
When you run the functions with and without the tumor type parameter, you get different columns of data back. This is expected, but we should double-check that the data are consistent between the two and document what is returned in each case.
We may want to add info or warning messages accordingly.
library(oncokbR)
library(dplyr)
# Annotate with a tumor type column
blca_mutation <- oncokbR::blca_mutation %>%
  mutate(tumor_type = "BLCA")
annotated_with_type <- annotate_mutations(mutations = blca_mutation[1:50, ])
# Annotate without a tumor type column
blca_mutation <- oncokbR::blca_mutation
annotated_no_type <- annotate_mutations(mutations = blca_mutation[1:50, ])
# Tabulate the oncogenic calls from the run without tumor type
annotated_no_type %>%
  select(oncogenic) %>%
  table()
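A minimal sketch of how the two results could be compared column by column. The toy data frames stand in for annotated_with_type and annotated_no_type; their values, the column names other than oncogenic, and the helper compare_columns() are all hypothetical and not part of oncokbR.

```r
# Toy stand-ins for the two annotation results. All values are placeholders,
# and column names other than `oncogenic` are illustrative assumptions.
annotated_with_type <- data.frame(
  hugo_symbol = c("GENE_A", "GENE_B"),
  oncogenic = c("Likely Oncogenic", "Oncogenic"),
  tumor_type = "BLCA",
  highest_sensitive_level = c(NA, "LEVEL_1"),
  stringsAsFactors = FALSE
)
annotated_no_type <- data.frame(
  hugo_symbol = c("GENE_A", "GENE_B"),
  oncogenic = c("Likely Oncogenic", "Oncogenic"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: report columns unique to each result and shared
# columns whose values differ (assumes both results are in the same row order).
compare_columns <- function(x, y) {
  shared <- intersect(names(x), names(y))
  differs <- vapply(
    shared,
    function(col) !identical(x[[col]], y[[col]]),
    logical(1)
  )
  list(
    only_in_x = setdiff(names(x), names(y)),
    only_in_y = setdiff(names(y), names(x)),
    shared_but_different = shared[differs]
  )
}

col_diff <- compare_columns(annotated_with_type, annotated_no_type)
print(col_diff)
```

Columns listed under only_in_x or only_in_y are candidates for documentation; anything under shared_but_different would need a closer look.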
Here is how I think we can approach this:
Select any set of samples from MSK IMPACT and use {cbioportalR} to get their raw mutation, cna, and fusion data (see https://www.karissawhiting.com/cbioportalR/ for how to do this). Perhaps we can start with ~500 IMPACT samples (of any type). Use study_id = "mskimpact" to pull this data, but you may want to select sample IDs first (available_samples("mskimpact") may be of help here) because pulling the entire MSK IMPACT study at once will take a long time (over 100,000 samples).
Run the mutation, cna and fusion data for the selected samples through {oncokbR} to get their resulting oncokb annotations (see oncokbR documentation on how to do this). Please document any issues, difficulties or confusing things you encounter while doing this so we can use them to improve the package.
Run the same select data through the Python annotator (instructions on how to do this on their Github: https://github.com/oncokb/oncokb-annotator) - (We can do this later on)
Pull entire oncoKB annotated data from CDSI (https://github.mskcc.org/cdsi/oncokb-annotated-msk-impact) and filter to our selected samples
NOTE: Results from the {oncokbR} and python annotator will change if you pass a Tumor Type column. This is expected as it is able to annotate more specifically if you give cancer type info. I'm assuming CDSI is, by default, annotated with tumor type info included. We should test results both ways (passing tumor type info, and not).
Compare resulting oncoKB annotation data between all three sources above.
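The comparison step could be sketched as a keyed join that flags rows where two sources disagree. This is only an illustration in base R: the data frames, the key columns, and every value below are placeholders, not real annotation output from {oncokbR} or CDSI.

```r
# Placeholder annotation results from two sources; all values are made up.
from_oncokbr <- data.frame(
  sample_id = c("S1", "S1", "S2"),
  hugo_symbol = c("GENE_A", "GENE_B", "GENE_C"),
  alteration = c("ALT_1", "ALT_2", "ALT_3"),
  oncogenic = c("Oncogenic", "Oncogenic", "Oncogenic"),
  stringsAsFactors = FALSE
)
from_cdsi <- data.frame(
  sample_id = c("S1", "S1", "S2"),
  hugo_symbol = c("GENE_A", "GENE_B", "GENE_C"),
  alteration = c("ALT_1", "ALT_2", "ALT_3"),
  oncogenic = c("Oncogenic", "Likely Oncogenic", "Oncogenic"),
  stringsAsFactors = FALSE
)

# Assumed key columns that identify one alteration in one sample.
keys <- c("sample_id", "hugo_symbol", "alteration")

# Full outer join so rows missing from either source are kept.
joined <- merge(
  from_oncokbr, from_cdsi,
  by = keys, all = TRUE,
  suffixes = c("_oncokbr", "_cdsi")
)

# A row is discordant when the calls differ or one source lacks the row.
joined$discordant <- is.na(joined$oncogenic_oncokbr) |
  is.na(joined$oncogenic_cdsi) |
  joined$oncogenic_oncokbr != joined$oncogenic_cdsi

discordant_rows <- joined[joined$discordant, ]
print(discordant_rows)
```

The same join can be repeated for the runs with and without tumor type, and extended to a third source by merging again on the same keys.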
Some Background Info
oncokbR::names_df holds alternate names for columns that sometimes come up in MAF and other raw files. For example, the chromosome number is sometimes labeled Chromosome in a MAF and sometimes chr.
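As an illustration of how an alternate-name lookup can be applied, here is a small base R sketch. The lookup table alt_names and the helper standardize_names() are hypothetical; the real oncokbR::names_df may be structured differently.

```r
# Hypothetical lookup table mapping alternate column names to one standard name.
# This is NOT the actual layout of oncokbR::names_df.
alt_names <- data.frame(
  standard = c("chromosome", "chromosome", "hugo_symbol"),
  alternate = c("Chromosome", "chr", "Hugo_Symbol"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename any column whose name appears in `alternate`.
standardize_names <- function(df, lookup) {
  idx <- match(names(df), lookup$alternate)
  found <- !is.na(idx)
  names(df)[found] <- lookup$standard[idx[found]]
  # Guard against two input columns collapsing onto the same standard name.
  if (anyDuplicated(names(df)) > 0) {
    stop("More than one input column maps to the same standard name.")
  }
  df
}

# Placeholder MAF-like input using two alternate names.
maf <- data.frame(Hugo_Symbol = "GENE_A", chr = "7", stringsAsFactors = FALSE)
maf_clean <- standardize_names(maf, alt_names)
print(names(maf_clean))
```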
sv <- sv %>%
  mutate(is_functional = "true")
[1] "P-0008754-T01-IM5" "P-0010067-T01-IM5" "P-0016407-T01-IM6" "P-0019724-T01-IM6"
[5] "P-0022023-T01-IM6" "P-0023965-T01-IM6" "P-0036116-T01-IM6"
I think a warning is thrown for SV but not the others. Add tests for this.
x <- annotate_sv(sv = sv)
Error in rename():
! Can't rename columns that don't exist.
✖ Column hugo_symbol doesn't exist.
The columns are called site1HugoSymbol and site2HugoSymbol:
oncokbR::blca_sv %>% names()
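One possible way to surface this earlier is to check for required columns before any renaming and report what was actually found. The helper check_required_columns() below is a hypothetical sketch, not an oncokbR function, and the toy data frame only mimics the column names mentioned above.

```r
# Hypothetical helper: fail early with a message listing the missing
# columns and the columns that are actually present.
check_required_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(
      "Missing required column(s): ", paste(missing_cols, collapse = ", "),
      ". Columns found: ", paste(names(df), collapse = ", "),
      call. = FALSE
    )
  }
  invisible(df)
}

# Placeholder SV data with the site1/site2 naming instead of hugo_symbol.
sv_toy <- data.frame(
  site1HugoSymbol = "GENE_A",
  site2HugoSymbol = "GENE_B",
  stringsAsFactors = FALSE
)

# Capture the error message instead of stopping, for demonstration.
result <- tryCatch(
  check_required_columns(sv_toy, "hugo_symbol"),
  error = function(e) conditionMessage(e)
)
print(result)
```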
It looks like oncoKB has releases fairly frequently: https://www.oncokb.org/news
Could be good to think about incorporating a parameter to access specific versions to improve reproducibility.
In Dec, I had run oncoKB twice in a few days and discovered that the discrepancy across the runs was due to a different data version returned by the API call.
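Until a version parameter exists, one low-tech option is to record a data-version label alongside each result so that runs can be compared later. The sketch below is hypothetical throughout: the helpers are not part of oncokbR, and the version label is supplied by the caller rather than retrieved from the API.

```r
# Hypothetical helpers. `data_version` is a label supplied by the caller;
# this sketch does not query the OncoKB API for it.
stamp_data_version <- function(annotated, data_version) {
  attr(annotated, "oncokb_data_version") <- data_version
  annotated
}

# TRUE when both results carry the same version label.
# Note: two unstamped results also compare as equal (both labels are NULL).
same_data_version <- function(x, y) {
  identical(attr(x, "oncokb_data_version"), attr(y, "oncokb_data_version"))
}

# Placeholder results from two runs, stamped with placeholder labels.
run_1 <- stamp_data_version(data.frame(oncogenic = "Oncogenic"), "version-A")
run_2 <- stamp_data_version(data.frame(oncogenic = "Oncogenic"), "version-B")

if (!same_data_version(run_1, run_2)) {
  message(
    "Runs used different data versions: ",
    attr(run_1, "oncokb_data_version"), " vs ",
    attr(run_2, "oncokb_data_version")
  )
}
```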
Run withr::with_envvar(new = c("NOT_CRAN" = "true"), covr::report()). This will show you the test coverage of each file and of the package overall. One of our goals is to increase this coverage percentage.
Did this column get removed from the original input data frame in annotate_mutations()?
Compare with how python annotator recognizes and deals with these:
library(cbioportalR)
library(dplyr)
set_cbioportal_db("public")
sv <- get_structural_variants_by_study("blca_msk_tcga_2020")
# Keep structural variants where both sites fall in the same gene
int <- sv %>% filter(site1HugoSymbol == site2HugoSymbol)
Also check 3 gene fusions (see above example)
Do you need consequence coding for HGVSg?
You need a token to access the oncoKB API (https://www.oncokb.org/api-access). Currently oncokbR assumes the user has a variable called ONCOKB_TOKEN saved in their R environment file. This token is read and used within all annotate functions as follows:
token <- Sys.getenv('ONCOKB_TOKEN')
This is rigid and makes assumptions about the user's setup. Instead, the functions should allow users to pass a token of their choosing to each function with a token argument. This argument can be set to token = get_oncokb_token() by default. get_oncokb_token() will be a function that searches for an environment variable by default (see .get_cbioportal_token() here as an example: https://github.com/karissawhiting/cbioportalR/blob/main/R/authenticate.R).
Additionally, we can allow users to set any token they want for an entire session even if they don't save it in their .Renviron file. The best way to do this is probably to store it in a new environment that persists for that session. See lines 1-41 of https://github.com/karissawhiting/cbioportalR/blob/main/R/authenticate.R for how this can be done.
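A minimal sketch of the proposal, assuming a session-level environment and a fallback to the ONCOKB_TOKEN environment variable. get_oncokb_token() is the name proposed above; set_oncokb_token(), oncokb_env, and annotate_example() are hypothetical names used only for illustration, and this is not the cbioportalR implementation.

```r
# Session-level environment that holds a token set during the current session.
oncokb_env <- new.env(parent = emptyenv())

# Hypothetical setter: store a token for this session only
# (nothing is written to .Renviron).
set_oncokb_token <- function(token) {
  stopifnot(is.character(token), length(token) == 1, nzchar(token))
  assign("token", token, envir = oncokb_env)
  invisible(TRUE)
}

# Look for a session token first, then fall back to the ONCOKB_TOKEN
# environment variable; error if neither is available.
get_oncokb_token <- function() {
  if (exists("token", envir = oncokb_env, inherits = FALSE)) {
    return(get("token", envir = oncokb_env))
  }
  token <- Sys.getenv("ONCOKB_TOKEN")
  if (!nzchar(token)) {
    stop(
      "No OncoKB token found. Set ONCOKB_TOKEN in .Renviron or call ",
      "set_oncokb_token().",
      call. = FALSE
    )
  }
  token
}

# Hypothetical annotate function showing the default argument. Because R
# evaluates defaults lazily, the lookup happens only when `token` is used.
annotate_example <- function(data, token = get_oncokb_token()) {
  # A real implementation would pass `token` along with the API request.
  invisible(token)
}

# Usage: set a placeholder token for the session and read it back.
set_oncokb_token("example-token")
print(get_oncokb_token())
```

With this layout a user can override the token per call (annotate_example(data, token = "...")), per session (set_oncokb_token()), or per machine (.Renviron), in that order of precedence.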
Lastly, all info about token and authentication should be updated in the oncokbR documentation. See https://github.com/oncokb/oncokb-annotator and https://www.karissawhiting.com/cbioportalR/ for examples of good authentication documentation.
Get some sample data from cbioportalR and run it through the oncokbR annotator as well as the oncokb-annotator file annotator (https://github.com/oncokb/oncokb-annotator ). Check that there are no discrepancies between the two.
We can pick a few project IDs in cbioportalR to test this out.
# Clean Variant Class -----------------------------------------------------
# Values of structural_variant_type that appear in the data
levels_in_data <- names(table(sv$structural_variant_type))
allowed_chr_levels <- c("DELETION",
                        "TRANSLOCATION",
                        "DUPLICATION",
                        "INSERTION",
                        "INVERSION",
                        "FUSION",
                        "UNKNOWN")
# allowed_chr_levels is unnamed, so names() contributes nothing here
all_allowed <- c(allowed_chr_levels, names(allowed_chr_levels))
not_allowed <- levels_in_data[!levels_in_data %in% all_allowed]
# Stop with an informative error if any unexpected value is present
if (length(not_allowed) > 0) {
  cli::cli_abort(c("Unknown values in {.field structural_variant_type} field: {.val {not_allowed}}",
                   "Must be one of the following: {.val {all_allowed}}"))
}
Differences in file formats between the cbioportal API and the CDSI GitHub data cause some variants that are missing protein start and stop positions to be skipped when annotating CDSI data.
We need better messaging, and we need to work out the optimal annotator settings for this data format.