
oncokbr's People

Contributors

inodb, jflynn264, karissawhiting, slb2240

oncokbr's Issues

Check results returned with and without tumor type parameter and document differences

When you run the functions with and without the tumor type parameter, you get different columns of data back. This is expected, but we should double-check that the data are consistent between the two calls, and check and document what each one returns.

We may want to add some info/warning messages accordingly.

library(oncokbR)
library(dplyr)

# With tumor type: add a tumor_type column before annotating
blca_with_type <- oncokbR::blca_mutation %>%
  mutate(tumor_type = "BLCA")

annotated_with_type <- annotate_mutations(mutations = blca_with_type[1:50, ])

# Without tumor type: use the data as shipped
blca_no_type <- oncokbR::blca_mutation

annotated_no_type <- annotate_mutations(mutations = blca_no_type[1:50, ])

# Tabulate the oncogenic calls from the run without tumor type
annotated_no_type %>%
  select(oncogenic) %>%
  table()
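
One way to document the differences is to compare the column sets of the two results directly. The following is a minimal sketch in base R that uses small made-up data frames in place of real annotate_mutations() output; the column names other than oncogenic are assumptions for illustration only.

```r
# Toy stand-ins for the two annotated results; real output will have more columns.
annotated_with_type <- data.frame(
  hugo_symbol = c("TP53", "FGFR3"),
  oncogenic = c("Likely Oncogenic", "Oncogenic"),
  highest_sensitive_level = c(NA, "LEVEL_1"),  # hypothetical column name
  stringsAsFactors = FALSE
)
annotated_no_type <- data.frame(
  hugo_symbol = c("TP53", "FGFR3"),
  oncogenic = c("Likely Oncogenic", "Oncogenic"),
  stringsAsFactors = FALSE
)

# Columns returned only when a tumor type is supplied, and vice versa
only_with_type <- setdiff(names(annotated_with_type), names(annotated_no_type))
only_no_type <- setdiff(names(annotated_no_type), names(annotated_with_type))

# For shared columns, check whether the values agree exactly
shared <- intersect(names(annotated_with_type), names(annotated_no_type))
same_values <- vapply(
  shared,
  function(col) identical(annotated_with_type[[col]], annotated_no_type[[col]]),
  logical(1)
)
```

Any column where same_values is FALSE would be a candidate for an info or warning message.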

Compare Annotated Results from CDSI, Python Annotator and {oncokbR}

Here is how I think we can approach this:

  1. Select any set of samples from MSK IMPACT and use {cbioportalR} to get their raw mutation, CNA, and fusion data (see https://www.karissawhiting.com/cbioportalR/ for how to do this). Perhaps we can start with ~500 IMPACT samples (of any type). Use study_id = "mskimpact" to pull this data, but select sample IDs first (available_samples("mskimpact") may help here), because pulling the entire MSK IMPACT study at once (over 100,000 samples) will take a long time.

  2. Run the mutation, cna and fusion data for the selected samples through {oncokbR} to get their resulting oncokb annotations (see oncokbR documentation on how to do this). Please document any issues, difficulties or confusing things you encounter while doing this so we can use them to improve the package.

  3. Run the same selected data through the Python annotator (instructions on how to do this are on their GitHub: https://github.com/oncokb/oncokb-annotator). We can do this later on.

  4. Pull entire oncoKB annotated data from CDSI (https://github.mskcc.org/cdsi/oncokb-annotated-msk-impact) and filter to our selected samples

  5. NOTE: Results from the {oncokbR} and python annotator will change if you pass a Tumor Type column. This is expected as it is able to annotate more specifically if you give cancer type info. I'm assuming CDSI is, by default, annotated with tumor type info included. We should test results both ways (passing tumor type info, and not).

Compare resulting oncoKB annotation data between all three sources above.
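
The comparison step could be sketched as a join on a shared alteration key followed by a discordance flag. This is only an illustration on toy data: the table names, the column names, and the choice of key are all assumptions, and a real comparison would likely need a finer key (for example, including the protein change).

```r
# Toy annotation tables from two sources; all column names are assumptions.
from_oncokbr <- data.frame(
  sample_id = c("S1", "S1", "S2"),
  hugo_symbol = c("TP53", "KRAS", "FGFR3"),
  oncogenic = c("Oncogenic", "Oncogenic", "Likely Oncogenic"),
  stringsAsFactors = FALSE
)
from_cdsi <- data.frame(
  sample_id = c("S1", "S1", "S2"),
  hugo_symbol = c("TP53", "KRAS", "FGFR3"),
  oncogenic = c("Oncogenic", "Likely Oncogenic", "Likely Oncogenic"),
  stringsAsFactors = FALSE
)

# Join on the alteration key; suffixes mark which source each value came from
compared <- merge(
  from_oncokbr, from_cdsi,
  by = c("sample_id", "hugo_symbol"),
  suffixes = c("_oncokbr", "_cdsi"),
  all = TRUE
)

# Flag rows where the sources disagree
# (a value present in only one source counts as a disagreement)
compared$discordant <- !mapply(
  identical, compared$oncogenic_oncokbr, compared$oncogenic_cdsi
)
discordant_rows <- compared[compared$discordant, ]
```

The same pattern extends to a third source by merging again and comparing each pair of columns.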

Some Background Info

Check Intragenic Loss as Fusion Type

[1] "P-0008754-T01-IM5" "P-0010067-T01-IM5" "P-0016407-T01-IM6" "P-0019724-T01-IM6"
[5] "P-0022023-T01-IM6" "P-0023965-T01-IM6" "P-0036116-T01-IM6"

Checks failed, won't run examples

x <- annotate_sv(sv = sv)
Error in rename():
! Can't rename columns that don't exist.
✖ Column hugo_symbol doesn't exist.

The columns are called site1HugoSymbol and site2HugoSymbol:
oncokbR::blca_sv %>% names()

OncoKB Versions

It looks like OncoKB has releases fairly frequently: https://www.oncokb.org/news
It could be good to incorporate a parameter for accessing specific versions, to improve reproducibility.

In December, I ran OncoKB twice within a few days and found that the results differed because the API calls returned different data versions.

Add unit tests and check functionality & docs of functions

Here are some steps to help you get started contributing to this

  1. If you've never written unit tests, here is a great resource to skim in order to get an introduction to testing https://r-pkgs.org/tests.html
  2. Fork, clone and branch the repository (see GitHub trainings)
  3. Once you've done the above to get your own copy, check that you can install the package and run checks ("Build" tab in RStudio). Try checking the package test coverage with withr::with_envvar(new = c("NOT_CRAN" = "true"), covr::report()). This will show you the test coverage of each file and the package overall. One of our goals is to increase this coverage percentage.
  4. Read the documentation for the argument you are testing (it might also need updating).
  5. Start writing tests with correctly specified args passed and also incorrect args passed. Check the following, as well as any common user behavior you can think of:
    • When you specify the argument correctly (according to the documentation), does it return what you would expect?
    • When you don’t specify the argument at all, does it return what you expect?
    • When you incorrectly specify the arg, does it give you a useful warning or message?
  6. Please add tests to the test R files (in the tests folder). For tests you can use internal package data (oncokbR::blca_mutation, etc.), a small sample of real data, or your own fake data.
  7. Additionally, if the function is working as expected but the function documentation is not clear, please edit the documentation (R folder in package). If any functions don’t work as expected, please make or suggest a change to the documentation or the function itself. If you can fix it yourself, please submit a PR with a change. For example, if you think the error/warning messages are unclear, please add warnings to the function. If you can’t fix it yourself, no problem, but please file an issue for your suggestion so we can make sure to address it!
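
As a starting point, the three checks in step 5 might look like the following {testthat} sketch. To keep it self-contained, the tests run against check_tumor_type(), a hypothetical helper defined inside the example; it is not part of oncokbR, and real tests would call the package's own functions instead.

```r
library(testthat)

# Hypothetical helper, defined here only so the tests are self-contained.
# It is not part of oncokbR.
check_tumor_type <- function(tumor_type = NULL) {
  if (is.null(tumor_type)) {
    return(NULL)
  }
  if (!is.character(tumor_type) || length(tumor_type) != 1) {
    stop("`tumor_type` must be a single character string, e.g. \"BLCA\"")
  }
  toupper(tumor_type)
}

# Argument specified correctly
test_that("argument works when specified correctly", {
  expect_equal(check_tumor_type("blca"), "BLCA")
})

# Argument not specified at all
test_that("argument can be omitted", {
  expect_null(check_tumor_type())
})

# Argument specified incorrectly
test_that("incorrect argument gives a useful error", {
  expect_error(check_tumor_type(1), "single character string")
})
```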

Check handling of intragenic fusions

Compare with how python annotator recognizes and deals with these:

library(cbioportalR)
library(dplyr) # needed for %>% and filter()
set_cbioportal_db("public")
sv <- get_structural_variants_by_study("blca_msk_tcga_2020")
int <- sv %>% filter(site1HugoSymbol == site2HugoSymbol)

Also check 3 gene fusions (see above example)

Update Token Authentication

You need a token to access the OncoKB API (https://www.oncokb.org/api-access). Currently oncokbR assumes the user has an object called ONCOKB_TOKEN saved in their R environment file. This token is read and used within all annotate functions as follows:

token <- Sys.getenv('ONCOKB_TOKEN')

This is rigid and makes assumptions about the user's setup. Instead, the functions should allow users to pass a token of their choosing through a token argument. This argument can default to token = get_oncokb_token(). get_oncokb_token() will be a function that searches for an environment variable by default (see .get_cbioportal_token() here as an example: https://github.com/karissawhiting/cbioportalR/blob/main/R/authenticate.R)

Additionally, we can allow users to set any random token they want for an entire session even if they don't save it in their .Renviron file. The best way to do this is probably setting it in a new environment which persists for that session. See lines 1-41 of https://github.com/karissawhiting/cbioportalR/blob/main/R/authenticate.R for how this can be done.
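
A rough sketch of how this could fit together is shown here. Everything in it is an assumption for discussion: oncokb_env, set_oncokb_token(), and annotate_example() are hypothetical names, the lookup order (session token first, then ONCOKB_TOKEN) is just one option, and the design is only loosely modeled on the cbioportalR approach linked above.

```r
# Environment that holds a token for the current session only.
# All names below are hypothetical.
oncokb_env <- new.env(parent = emptyenv())

# Store a token for this session without touching .Renviron
set_oncokb_token <- function(token) {
  stopifnot(is.character(token), length(token) == 1, nzchar(token))
  assign("token", token, envir = oncokb_env)
  invisible(token)
}

# Look for a session token first, then fall back to the environment variable
get_oncokb_token <- function() {
  if (exists("token", envir = oncokb_env, inherits = FALSE)) {
    return(get("token", envir = oncokb_env))
  }
  token <- Sys.getenv("ONCOKB_TOKEN")
  if (!nzchar(token)) {
    stop(
      "No OncoKB token found. ",
      "Set ONCOKB_TOKEN in .Renviron or call set_oncokb_token()."
    )
  }
  token
}

# An annotate function could then take the token as an argument
annotate_example <- function(data, token = get_oncokb_token()) {
  # ... `token` would be used in the API request here ...
  invisible(token)
}
```

With this shape, users who already have ONCOKB_TOKEN in .Renviron need no changes, while others can call set_oncokb_token() once per session or pass token = directly to a single call.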

Lastly, all info about token and authentication should be updated in the oncokbR documentation. See https://github.com/oncokb/oncokb-annotator and https://www.karissawhiting.com/cbioportalR/ for examples of good authentication documentation.

Make utility functions to simplify repeated code

  • Check reference genome
  • Perhaps for checking variant class allowed levels?
  # Clean Variant Class -----------------------------------------------------

  levels_in_data <- names(table(sv$structural_variant_type))

  allowed_chr_levels <- c("DELETION",
                          "TRANSLOCATION",
                          "DUPLICATION",
                          "INSERTION",
                          "INVERSION",
                          "FUSION",
                          "UNKNOWN")

  all_allowed <- c(allowed_chr_levels, names(allowed_chr_levels))
  not_allowed <- levels_in_data[!levels_in_data %in% all_allowed]

  if(length(not_allowed) > 0) {
    cli::cli_abort(c("Unknown values in {.field variant_class} field: {.val {not_allowed}}",
                     "Must be one of the following: {.val {all_allowed}}"))
  }
  • consequence map function/checking should be separated out
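
For the allowed-levels check, a shared helper might look something like the sketch below. The function name and signature are hypothetical, and it uses base stop() rather than cli::cli_abort() only to keep the example free of dependencies; the package version would presumably keep the cli messaging.

```r
# Hypothetical helper: abort if `x` contains values outside `allowed`.
# `field` is only used to build the error message.
check_allowed_levels <- function(x, allowed, field) {
  levels_in_data <- unique(as.character(x[!is.na(x)]))
  not_allowed <- setdiff(levels_in_data, allowed)
  if (length(not_allowed) > 0) {
    stop(
      "Unknown values in `", field, "` field: ",
      paste(not_allowed, collapse = ", "),
      ". Must be one of the following: ",
      paste(allowed, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(x)
}

allowed_sv_types <- c("DELETION", "TRANSLOCATION", "DUPLICATION",
                      "INSERTION", "INVERSION", "FUSION", "UNKNOWN")

# Passes silently because both values are allowed
check_allowed_levels(
  c("DELETION", "FUSION"), allowed_sv_types, "structural_variant_type"
)

# Would raise an error naming the offending value:
# check_allowed_levels(c("DELETION", "LOSS"), allowed_sv_types, "structural_variant_type")
```

The same helper could then be reused for other fields, such as reference genome, by passing a different allowed vector.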

Annotator returns NAs when processing data from CDSI

Differences in file formats between the cBioPortal API and the CDSI GitHub data cause some annotations to be skipped when annotating CDSI data, specifically those missing protein start and stop positions.

We need better messaging, and we need to work out the optimal annotator settings for data in this format.
