Git Product home page Git Product logo

googlegenomics's People

Contributors

cassiedoll avatar deflaux avatar dtenenba avatar gavinsimpson avatar hpages avatar lshep avatar namn avatar shwetagopaul92 avatar siddharthab avatar sonali-bioc avatar ttriche avatar vjcitn avatar vobencha avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

googlegenomics's Issues

Might want to update to the new location...

Hi Sid,

Congratulations on the new location, but probably might want to let Travis know of the new location (Bioconductor/GoogleGenomics), since it's still looking at the old one (https://travis-ci.org/googlegenomics/api-client-r) and cannot build it [i.e. it's saying build inaccessible].

Probably also updating everything else to point to the new one too would be nice - i.e. the Readme still points to:

devtools::install_github("googlegenomics/api-client-r")

Thanks,
~p

Example code motivating seqlevelsStyle() patch

Per #15 and ensuing blahblah, here's the unvarnished code.
If it breaks, you get to keep the pieces. :-)

library(GoogleGenomics)
## 
## may need to do this interactively:
## 
authenticate('client_secrets.json')
## 
## the rest should not require user input 
## 
variant <- getVariants(datasetId = "10473108253681171589", chromosome = "22",
                       start = 16051400, end = 16051500, oneBasedCoord = FALSE,
                       fields = NULL, converter = c)
variantVR <- variantsToVRanges(variant, slStyle = 'UCSC')
## 
## Replace the middle base of a DNAStringSet with the alternate SNP allele
## (only tested for SNPs where both reference and alternate are 1bp wide!)
## There are so very many ways in which this is horrible...
##
replaceWithAlt <- function(stringSet, position, alt) {
  strings <- as.character(stringSet)
  substr(strings, position, position) <- as.character(alt)
  return(DNAStringSet(strings))
}
## 
## Use the BSgenome representation of hg19 to play with the ref/alt sequences:
## One base on each side, what happens when we go from reference to alternate?
## Major issue: how to know what reference assembly the variants come from?
##
library(BSgenome.Hsapiens.UCSC.hg19)
##
## Absent a reference assembly, the above (a guess!) could be very wrong...
## Hence issue #13 in the github repository 
##
variantVR$context <- getSeq(Hsapiens, resize(variantVR, 3, fix='center')) 
variantVR$altcontext <- replaceWithAlt(variantVR$context, 2, alt(variantVR))
show(variantVR)
## 
## what is the consequence (what has typically been "lost" to the reference?)
## just in terms of simple doublets, nothing fancy here 
## 
refCpG <- length( grep('CG', variantVR$context) ) / length( variantVR ) 
altCpG <- length( grep('CG', variantVR$altcontext) ) / length( variantVR )
##
## So half of a randomly selected set of 4 variants are "losses" of CpG sites.
## Given the sample size we obviously can't conclude anything, but purging of 
## CpG sites from the genome (via spontaneous 5mC -> T deamination) is a steady
## process that has heavily depleted the "average" genome of said doublets. 
##
print(paste0('In this sample, ', 
             (altCpG - refCpG)*100, '% of ref-to-alt changes create a CpG.'))
##
## It is interesting (NOT rigorous evidence of any sort) that a randomly picked
## set of SNPs from the 1KGP participants would happen to illustrate the theme.

getting started confusion

I just tried to setup the new R code on my local system and ran into the following error after following the README's code block:

Error in authorizer(authorize_url, is_interactive) : 
  httpuv package required to capture OAuth credentials.

We should fix the readme instructions so that this doesn't happen.

add retrying to all api methods

similar to the java client library, we should use some kind of exponential backoff for all API calls that fail with a >= 500 http error

A few things I experienced

Hi Sid,

Hope all is going well with the new version. I just tested a few things:

1) So because of how oauth-cache.R works, the .httr-oauth file gets saved in the directory where R is started from. Would it be possible to add to authenticate.R a check to see if the .httr-oauth file exists in ~/.store/genomics_r_client/ and if not, then it would initiate the browser authentication and place a copy of the file in ~/.store/genomics_r_client/? Otherwise depending on which directory I am in, then I will always have to authenticate via the browser if the file does not exist locally.

2) When I run the following getReadData query I get the following error, and it does not finish the request - the (...) are there to denote that there were many fetches of read data:

> getReadData(readsetId = "CPHG3MzoCRDRkqXzk7b6l_kB", chromosome ="1", start=100000000, end=101000000)
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: EjBDaGhEVUVoSE0wMTZiME5TUkZKcmNWaDZhemRpTm14ZmEwSVNBVEVZOUtPUE1DZ0EYAiDM0JvNCw
...
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: EjBDaGhEVUVoSE0wMTZiME5TUkZKcmNWaDZhemRpTm14ZmEwSVNBVEVZN2FlUE1DZ0EYAiDM0JvNCw
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: EjBDaGhEVUVoSE0wMTZiME5TUkZKcmNWaDZhemRpTm14ZmEwSVNBVEVZOHF1UE1DZ0EYAiDM0JvNCw
Fetching read data page
Parsing read data page
Error in elementType(x) :
  error in evaluating the argument 'x' in selecting a method for function 'elementType': Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
>

3) If I interrupt (via Ctrl-C) a getReadData request because I would like to start a different one, I can't seem to perform another request after that:

> getReadData(readsetId = "CMvnhpKTFhD04eLE-q2yxnU", chromosome ="1", start=100000000, end=101000000)
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: Ei9DaGREVFhadWFIQkxWRVpvUkRBMFpVeEZMWEV5ZVhodVZSSUJNUmlQMU5jdktBRRgCIIGy8rQI
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: Ei9DaGREVFhadWFIQkxWRVpvUkRBMFpVeEZMWEV5ZVhodVZSSUJNUmpxNU5jdktBQRgCIIGy8rQI
^C
>  getReadData(readsetId = "CMvnhpKTFhD04eLE-q2yxnU", chromosome ="1", start=10000000, end=101000000)
Fetching read data page
Error in function (type, msg, asError = TRUE)  :
  easy handled already used in multi handle

I would need to exit out of R, and resubmit the new getReadData request.

I know this is work in progress, but just wanted to give you a heads up.

Thanks,
Paul

submit this package to BioConductor

This package is still very much a work in progress, but it will be great to get it polished and ready for BioConductor! It needs to be run through a linter also to make sure the style is consistent, per #18

streamline authorization

It would be great if a user only had to think about authorization (e.g. providing a client_secrets file) the first time they install this package and from then on they are all set, regardless of their current working directory.

Document steps for new package releases

After #28 is complete, let's be sure to document the exact steps for a new release much like we have for
utils-java

For example, click on the RStudio "check" button which takes care of running the following:

roxygen2::roxygenize('.', roclets=c('rd'))
R CMD build api-client-r
R CMD check --as-cran GoogleGenomics_0.1.2.tar.gz

and then for BioConductor to pick up the release do . . .

seqlevelsStyle oddities

This works OK:

getVariants(datasetId = "10473108253681171589", chromosome = "22",
start = 16051400, end = 16051500, oneBasedCoord = FALSE,
fields = NULL, converter = c)

Fetching variant page

Variants are now available

...

but this does not:

variant <- getVariants(datasetId = "10473108253681171589", chromosome = "chr22", start = 16051400, end = 16051500, oneBasedCoord = FALSE, fields = NULL, converter = c)

Fetching variant page

Variants are now available

NULL

This is OK:

getReadData(chromosome = "chr19", start = 45411941, end = 45412079,
readsetId = "CJ_ppJ-WCxDxrtDr5fGIhBA=",
endpoint = "https://www.googleapis.com/genomics/v1beta/",
pageToken = NULL)

Fetching read data page

Parsing read data page

Read data is now available

and this is not:

getReadData(chromosome = "19", start = 45411941, end = 45412079,
readsetId = "CJ_ppJ-WCxDxrtDr5fGIhBA=",
endpoint = "https://www.googleapis.com/genomics/v1beta/",
pageToken = NULL)

This sure seems like a job for GenomeInfoDb. Am I missing something obvious?
Also, specifying the reference assembly for each reference sequence (chromosome) would be nice.

Error [...] key 60 (char '<') not in lookup table

We get this rather cryptic error message after exporting variants from a Google genomics object:

granges <- variantsToGRanges(variants)

Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), :
key 60 (char '<') not in lookup table

If I have to guess, it may be due to alternateBases = "<NON_REF>" for reference intervals, when importing variants from a gVCF into the google genomics object.

Data are access-restricted, so I can't provide more information.

Update Bioconductor converters to handle multi-allelic data and data with non-variant segments.

The variant converters currently have trouble with multi-allelic data and non-variant segments. We can work around this by filtering and reshaping the data before sending it to the converters (example below) but it would be better if we pushed correct handling for this into the package.

See also #32 for another example of how non-variant segments can also be expressed in the data.

  variants <- getVariants(datasetId="10473108253681171589", chromosome="22",
              start=50300077, end=50301500)

  # Remove non-variant segments
  only_variants <- Filter(function(v) { 1 <= length(v$alternateBases)}, variants)

  # Convert to biallelic data by truncating alternateBases
  # (this isn't how it should be fixed, its just an example)
  biallelic_variants <- lapply(only_variants, function(v) {
    if(1 < length(v$alternateBases)) {
      v$alternateBases = v$alternateBases[[1]]
    }
    v
  })

  granges <- variantsToGRanges(biallelic_variants)

improve performance

This package is in active development and there are some inefficiencies in a few places (e.g., the code that accumulates multiple pages of variants and reads). But we'll tackle improving performance a bit when the code settles down a bit more.

Problem authenticating with invokeBrowser=F

Hi, I'm trying to authenticate on a server without X11, and whenever I give the "invokeBrowser=F" option I receive the following error message:

Error in authorizer(authorize_url, is_interactive) :
unused argument (is_interactive)

Thanks for looking into this.

Add a method to reset the OAuth token.

Something like the following, but package-internal and tested to ensure

  1. getReads\Variants() return the "you need to authenticate" error until authentication is performed
  2. the oauth flow is triggered by the next call to authenticate()
resetOAuth <- function() {
  if (FALSE != getOption("google_auth_cache_httr")) {
    message(paste("Removing cached OAuth token from", getOption("google_auth_cache_httr")))
    file.remove(getOption("google_auth_cache_httr"))
  }
  message("Removing cached OAuth token from memory.")
  rm("google_token", envir=GoogleGenomics:::.authStore)
}

genome reference assembly extraction from variants/reads objects?

One of the handy defaults in Bioconductor GRanges/VRanges objects is a slot for the genome (reference assembly) of each chromosome (should all be the same for any sane object, of course). This helps prevent comparisons of (e.g.) hg18 and hg19, or hg19 and hg38, data as if on the same coordinate system (a practice which is such a terrible idea that it's essentially never worth allowing).

Unfortunately, this safeguard can't be enforced when no genome is specified.

In the course of adding a default seqlevelsStyle for *Ranges, I realized it would be nice to have the genome reference automatically specified when retrieving variants or aligned reads. This doesn't seem to be possible from the current data structure. What am I overlooking here?

Improve messaging and instructions around authentication.

Right now the follow message is output upon each package attach, regardless of whether the oauth credentials were cached in a prior session. This can be a bit confusing.

GoogleGenomics: Do not forget to authenticate.
  Use GoogleGenomics::authenticate(file="secretsFile.json").
  See method documentation on how to obtain the secretsFile.

It would be nice if a message like that were output only if needed (e.g., if there are no cached credentials)

Additionally, it would be good to output something similar when a request returns an auth error such as:
ERROR: Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup.

coding style

There are some competing style guides for R. Let's decide upon which one to follow or some combination of both and make sure the R package version of the client conforms to the style. (see also #17)

http://bioconductor.org/developers/how-to/coding-style/
https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml

Note that the identifier naming convention of avg.clicks in the Google style guide coincides with the S3 methods naming convention and sometimes causes issues/confusion.

http://stackoverflow.com/questions/6583265/what-does-s3-methods-mean-in-r

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.