karissawhiting / oncokbr

Annotate mutation, copy number alteration and structural variant data in R using oncoKB Annotation API

Home Page: http://www.karissawhiting.com/oncokbR/

License: Other

R 99.85% Rez 0.15%

r-package

oncokbr's People

Forkers

slb2240 inodb jflynn264 harryjerryzhu

oncokbr's Issues

All examples using annotate_mutations(oncokbR::blca_mutations[1:10, ]) produce an error because the tumor_type variable is missing

Check results returned with and without tumor type parameter and document differences

When you run the functions with and without the tumor type parameter you get different columns of data back. This is to be expected but we should double check data is consistent between them and also check and document what is being returned in each.

We may want to add some info/warning messages accordingly.

library(oncokbR)
library(dplyr)

blca_mutation <- oncokbR::blca_mutation %>%
  mutate(tumor_type = "BLCA")

annotated_with_type <- annotate_mutations(mutations = blca_mutation[1:50,])

blca_mutation <- oncokbR::blca_mutation

annotated_no_type <- annotate_mutations(mutations = blca_mutation[1:50,])

annotated_no_type %>%
  select(oncogenic) %>% 
  table()
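
One way to document the differences is to compare the two results column by column. The following is a minimal sketch on toy data frames that stand in for the two annotate_mutations() results; the column names and values are illustrative assumptions, not the package's actual output, and the row-wise comparison assumes both results keep the same row order.

```r
# Toy stand-ins for the two results; columns and values are assumptions
annotated_with_type <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  oncogenic = c("Oncogenic", "Unknown", "Likely Oncogenic"),
  highest_sensitive_level = c("LEVEL_1", NA, NA),
  stringsAsFactors = FALSE
)
annotated_no_type <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  oncogenic = c("Oncogenic", "Unknown", "Oncogenic"),
  stringsAsFactors = FALSE
)

# Columns present in only one of the two results
only_with_type <- setdiff(names(annotated_with_type), names(annotated_no_type))
only_no_type <- setdiff(names(annotated_no_type), names(annotated_with_type))

# For shared columns, count rows whose values disagree (same row order assumed)
shared <- intersect(names(annotated_with_type), names(annotated_no_type))
mismatch <- vapply(shared, function(col) {
  sum(!mapply(identical, annotated_with_type[[col]], annotated_no_type[[col]]))
}, integer(1))

only_with_type
only_no_type
mismatch
```

Any shared column with a non-zero mismatch count is a candidate for a note in the documentation or an info/warning message.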

Make package site

Compare Annotated Results from CDSI, Python Annotator and {oncokbR}

Here is how I think we can approach this:

Select any set of samples from MSK IMPACT and use {cbioportalR} to get their raw mutation, CNA, and fusion data (see https://www.karissawhiting.com/cbioportalR/ for how to do this). Perhaps we can start with ~500 IMPACT samples (of any type). Use study_id = "mskimpact" to pull this data, but you may want to select sample IDs first (available_samples("mskimpact") may help here), because pulling the entire MSK IMPACT study at once will take a long time (over 100,000 samples).
Run the mutation, cna and fusion data for the selected samples through {oncokbR} to get their resulting oncokb annotations (see oncokbR documentation on how to do this). Please document any issues, difficulties or confusing things you encounter while doing this so we can use them to improve the package.
Run the same select data through the Python annotator (instructions on how to do this on their Github: https://github.com/oncokb/oncokb-annotator) - (We can do this later on)
Pull entire oncoKB annotated data from CDSI (https://github.mskcc.org/cdsi/oncokb-annotated-msk-impact) and filter to our selected samples
NOTE: Results from the {oncokbR} and python annotator will change if you pass a Tumor Type column. This is expected as it is able to annotate more specifically if you give cancer type info. I'm assuming CDSI is, by default, annotated with tumor type info included. We should test results both ways (passing tumor type info, and not).

Compare resulting oncoKB annotation data between all three sources above.

Some Background Info

This cBioPortal Documentation Page has more info on what specific fields mean in the Mutation data (MAF files), CNA data (Discrete Copy Number Data) and Structural Variant Data (aka fusions): https://docs.cbioportal.org/file-formats/
The oncokbR::names_df data frame has alternate names for columns that sometimes come up in MAF and other raw files. For example, chromosome number is sometimes indicated in MAF files as Chromosome and sometimes as chr.
Some notes I gathered on early data validation attempts that may or may not be relevant: https://github.com/karissawhiting/oncokbR/wiki
This genie vignette has a good overview of the whole data processing pipeline (minus oncokb): https://mskcc-epi-bio.github.io/gnomeR/articles/genie-bpc-vignette.html

Check SV processing

Assume all structural variants are functional (this mirrors behavior in https://github.com/oncokb/oncokb-annotator/blob/47e4a158ee843ead75445982532eb149db7f3106/AnnotatorCore.py#L1506):

sv <- sv %>%
  mutate(is_functional = "true")

Check Intragenic Loss as Fusion Type

[1] "P-0008754-T01-IM5" "P-0010067-T01-IM5" "P-0016407-T01-IM6" "P-0019724-T01-IM6"
[5] "P-0022023-T01-IM6" "P-0023965-T01-IM6" "P-0036116-T01-IM6"

Make sure tumor type messaging is clear

I think a warning is thrown for SV but not the others. Add tests for this.

Checks failed, won't run examples

x <- annotate_sv(sv = sv)
Error in rename():
! Can't rename columns that don't exist.
✖ Column hugo_symbol doesn't exist.

The columns are called site1HugoSymbol and site2HugoSymbol:
oncokbR::blca_sv %>% names()
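
A guard that runs before the rename step could turn this into a clearer message. The sketch below uses a hypothetical helper, missing_columns(), on a toy data frame; neither the helper nor the toy gene pairs come from the package.

```r
# Hypothetical helper: report which expected columns are absent
missing_columns <- function(data, required) {
  setdiff(required, names(data))
}

# Toy SV data using the cBioPortal-style column names mentioned above
sv <- data.frame(
  site1HugoSymbol = c("TMPRSS2", "FGFR3"),
  site2HugoSymbol = c("ERG", "TACC3"),
  stringsAsFactors = FALSE
)

# "hugo_symbol" is reported as missing; the site1/site2 columns are found
missing_columns(sv, "hugo_symbol")
missing_columns(sv, c("site1HugoSymbol", "site2HugoSymbol"))
```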

Add tumor type functionality

OncoKB Versions

It looks like oncoKB has releases fairly frequently: https://www.oncokb.org/news
Could be good to think about incorporating a parameter to access specific versions to improve reproducibility.

In Dec, I had run oncoKB twice in a few days and discovered that the discrepancy across the runs was due to a different data version returned by the API call.

Add unit tests and check functionality & docs of functions

Here are some steps to help you get started contributing:

If you've never written unit tests, here is a great resource to skim in order to get an introduction to testing https://r-pkgs.org/tests.html
Fork, clone and branch the repository (see github trainings)
Once you've done the above to get your own copy, check that you can install the package and run checks ("Build" tab in RStudio). Try checking the package test coverage with withr::with_envvar(new = c("NOT_CRAN" = "true"), covr::report()). This will show you the test coverage of each file and the package overall. One of our goals is to increase this coverage percentage.
Read the documentation for the argument you are testing (it might also need updating).
Start writing tests with correctly specified args passed and also incorrect args passed. Check the following, as well as any common user behavior you can think of:
- When you specify the argument correctly (according to the documentation), does it return what you would expect?
- When you don’t specify the argument at all, does it return what you expect?
- When you incorrectly specify the arg, does it give you a useful warning or message?
Please add tests to the test R files (in the tests folder). For tests you can use internal package data (cbioportalR::blca_mutations, etc.), a small sample of real data, or your own fake data.
Additionally, if the function is working as expected but the function documentation is not clear, please edit the documentation (R folder in package). If any functions don’t work as expected, please make or suggest a change to the documentation or the function itself. If you can fix it yourself, please submit a PR with a change. For example, if you think the error/warning messages are unclear, please add warnings to the function. If you can’t fix it yourself, no problem, but please file an issue for your suggestion so we can make sure to address it!
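
The three checks listed above (argument specified correctly, argument omitted, argument specified incorrectly) might look like the following in {testthat}. This is only a sketch: toy_annotate() is a hypothetical stand-in defined here so the example runs on its own, and real tests would call the package's annotate functions instead.

```r
library(testthat)

# Hypothetical stand-in for a package function with one argument to test
toy_annotate <- function(data, tumor_type = NULL) {
  if (!is.data.frame(data)) stop("`data` must be a data frame")
  if (!is.null(tumor_type) && !is.character(tumor_type)) {
    stop("`tumor_type` must be a character string")
  }
  data$tumor_type <- if (is.null(tumor_type)) NA_character_ else tumor_type
  data
}

toy_data <- data.frame(hugo_symbol = c("TP53", "FGFR3"))

test_that("argument specified correctly returns expected output", {
  res <- toy_annotate(toy_data, tumor_type = "BLCA")
  expect_equal(res$tumor_type, c("BLCA", "BLCA"))
})

test_that("argument omitted falls back to the default", {
  res <- toy_annotate(toy_data)
  expect_true(all(is.na(res$tumor_type)))
})

test_that("incorrect argument gives a useful error", {
  expect_error(toy_annotate(toy_data, tumor_type = 1), "character")
})
```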

Return hugo gene symbol and gene variant in dataframe returned by annotate_mutations()

variant_classification column not found

Did this column get removed from the original input data frame in annotate_mutations()?

Check handling of intragenic fusions

Compare with how python annotator recognizes and deals with these:

library(cbioportalR)
set_cbioportal_db("public")
sv <- get_structural_variants_by_study("blca_msk_tcga_2020")
int <- sv %>% filter(site1HugoSymbol == site2HugoSymbol)

Also check 3 gene fusions (see above example)

Questions For cBioPortal

do you need consequence coding for HGVSg?

Update Token Authentication

You need a token to access the oncoKB API (https://www.oncokb.org/api-access). Currently oncokbR assumes the user has an object called ONCOKB_TOKEN saved in their R environment file. This token object is sourced and used within all annotate functions as follows:

token <- Sys.getenv('ONCOKB_TOKEN')

This is rigid and makes assumptions about the user's setup. Instead, the functions should allow users to pass a token of their choosing to each function through a token argument. This argument can default to token = get_oncokb_token(), where get_oncokb_token() is a function that searches for an environment variable by default (see .get_cbioportal_token() here as an example: https://github.com/karissawhiting/cbioportalR/blob/main/R/authenticate.R).

Additionally, we can allow users to set any random token they want for an entire session even if they don't save it in their .Renviron file. The best way to do this is probably setting it in a new environment which persists for that session. See lines 1-41 of https://github.com/karissawhiting/cbioportalR/blob/main/R/authenticate.R for how this can be done.
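
A minimal sketch of this design, assuming a session environment plus a fallback to the ONCOKB_TOKEN environment variable. The names oncokb_env, set_oncokb_token(), and annotate_example() are hypothetical, and the body of get_oncokb_token() is a guess at how the proposed function could work, not the package's implementation.

```r
# Environment holding a token for the current session (hypothetical name)
oncokb_env <- new.env(parent = emptyenv())

# Hypothetical setter: store a token for the current R session only
set_oncokb_token <- function(token) {
  stopifnot(is.character(token), length(token) == 1, nzchar(token))
  assign("token", token, envir = oncokb_env)
  invisible(token)
}

# Proposed getter: prefer the session token, then fall back to ONCOKB_TOKEN
get_oncokb_token <- function() {
  if (exists("token", envir = oncokb_env, inherits = FALSE)) {
    return(get("token", envir = oncokb_env))
  }
  token <- Sys.getenv("ONCOKB_TOKEN")
  if (!nzchar(token)) {
    stop("No OncoKB token found. Set ONCOKB_TOKEN or call set_oncokb_token().")
  }
  token
}

# An annotate function could then default to the lookup
annotate_example <- function(data, token = get_oncokb_token()) {
  # ... the token would be sent with the API request here ...
  invisible(token)
}
```

Because the default is evaluated lazily, a user who passes token = "..." explicitly never triggers the lookup, and a session token set with set_oncokb_token() takes precedence over the .Renviron value.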

Lastly, all info about token and authentication should be updated in the oncokbR documentation. See https://github.com/oncokb/oncokb-annotator and https://www.karissawhiting.com/cbioportalR/ for examples of good authentication documentation.

Check oncokbR results against oncokb-annotator file annotation

Get some sample data from cbioportalR and run it through the oncokbR annotator as well as the oncokb-annotator file annotator (https://github.com/oncokb/oncokb-annotator). Check that there are no discrepancies between the two.

We can pick a few project IDs in cbioportalR to test this out.

Make utility functions to simplify repeated code

Check reference genome
Perhaps for checking variant class allowed levels?

  # Clean Variant Class -----------------------------------------------------

  levels_in_data <- names(table(sv$structural_variant_type))

  allowed_chr_levels <- c("DELETION",
                          "TRANSLOCATION",
                          "DUPLICATION",
                          "INSERTION",
                          "INVERSION",
                          "FUSION",
                          "UNKNOWN")

  all_allowed <- c(allowed_chr_levels, names(allowed_chr_levels))
  not_allowed <- levels_in_data[!levels_in_data %in% all_allowed]

  if(length(not_allowed) > 0) {
    cli::cli_abort(c("Unknown values in {.field variant_class} field: {.val {not_allowed}}",
                     "Must be one of the following: {.val {all_allowed}}"))
  }
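
The check above could be pulled into a shared utility along these lines. This is a sketch only: check_allowed_values() is a hypothetical name, and it uses base stop() rather than cli::cli_abort() so that it runs without extra packages. Like names(table(...)) in the snippet above, it ignores NA values.

```r
# Hypothetical utility: stop if `values` contains anything outside `allowed`
check_allowed_values <- function(values, allowed, field) {
  not_allowed <- setdiff(unique(values[!is.na(values)]), allowed)
  if (length(not_allowed) > 0) {
    stop(
      "Unknown values in `", field, "`: ",
      paste(not_allowed, collapse = ", "),
      ". Must be one of: ", paste(allowed, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

allowed_sv_types <- c("DELETION", "TRANSLOCATION", "DUPLICATION",
                      "INSERTION", "INVERSION", "FUSION", "UNKNOWN")

# Passes silently: every value is in the allowed set
sv_types <- c("DELETION", "FUSION", "INVERSION", NA)
check_allowed_values(sv_types, allowed_sv_types, "structural_variant_type")
```

The same helper could be reused for other fields with a fixed set of levels by passing a different allowed vector and field name.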

consequence map function/checking should be separated out

Annotator returns NAs when processing data from CDSI

Differences in file formats between the cBioPortal API and the CDSI GitHub data cause some variants that are missing protein start and stop positions to be skipped when annotating CDSI data.

Better messaging and figuring out optimal annotator settings for this format of data are needed.
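
As a starting point for better messaging, the input could be screened for missing positions before annotation. The sketch below runs on toy data; the column names protein_start and protein_end and the helper flag_missing_positions() are assumptions made for illustration, not the actual CDSI or package names.

```r
# Toy mutation data; `protein_start` / `protein_end` are assumed column names
mutations <- data.frame(
  hugo_symbol = c("TP53", "FGFR3", "KMT2D"),
  protein_start = c(273L, NA, 5030L),
  protein_end = c(273L, 249L, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: warn about rows lacking either position
flag_missing_positions <- function(data) {
  incomplete <- is.na(data$protein_start) | is.na(data$protein_end)
  if (any(incomplete)) {
    warning(sum(incomplete), " row(s) are missing protein start or end ",
            "and may not be annotated.", call. = FALSE)
  }
  incomplete
}

# Warning suppressed here only to keep the example output quiet
incomplete <- suppressWarnings(flag_missing_positions(mutations))
mutations[incomplete, "hugo_symbol"]
```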