bioconductor / googlegenomics Goto Github PK

View Code? Open in Web Editor NEW

42.0 40.0 26.0 388 KB

An R package for Google Genomics API queries.

License: Apache License 2.0

R 56.19% Shell 0.84% M4 5.29% C++ 36.57% C 1.11%

core-package bioconductor-packages retired-package

googlegenomics's People

Contributors

Stargazers

Watchers

googlegenomics's Issues

Might want to update to the new location...

Hi Sid,

Congratulations on the new location, but probably might want to let Travis know of the new location (Bioconductor/GoogleGenomics), since it's still looking at the old one (https://travis-ci.org/googlegenomics/api-client-r) and cannot build it [i.e. it's saying build inaccessible].

Probably also updating everything else to point to the new one too would be nice - i.e. the Readme still points to:

devtools::install_github("googlegenomics/api-client-r")

Thanks,
~p

Example code motivating seqlevelsStyle() patch

Per #15 and ensuing blahblah, here's the unvarnished code.
If it breaks, you get to keep the pieces. :-)

library(GoogleGenomics)
## 
## may need to do this interactively:
## 
authenticate('client_secrets.json')
## 
## the rest should not require user input 
## 
variant <- getVariants(datasetId = "10473108253681171589", chromosome = "22",
                       start = 16051400, end = 16051500, oneBasedCoord = FALSE,
                       fields = NULL, converter = c)
variantVR <- variantsToVRanges(variant, slStyle = 'UCSC')
## 
## Replace the middle base of a DNAStringSet with the alternate SNP allele
## (only tested for SNPs where both reference and alternate are 1bp wide!)
## There are so very many ways in which this is horrible...
##
replaceWithAlt <- function(stringSet, position, alt) {
  strings <- as.character(stringSet)
  substr(strings, position, position) <- as.character(alt)
  return(DNAStringSet(strings))
}
## 
## Use the BSgenome representation of hg19 to play with the ref/alt sequences:
## One base on each side, what happens when we go from reference to alternate?
## Major issue: how to know what reference assembly the variants come from?
##
library(BSgenome.Hsapiens.UCSC.hg19)
##
## Absent a reference assembly, the above (a guess!) could be very wrong...
## Hence issue #13 in the github repository 
##
variantVR$context <- getSeq(Hsapiens, resize(variantVR, 3, fix='center')) 
variantVR$altcontext <- replaceWithAlt(variantVR$context, 2, alt(variantVR))
show(variantVR)
## 
## what is the consequence (what has typically been "lost" to the reference?)
## just in terms of simple doublets, nothing fancy here 
## 
refCpG <- length( grep('CG', variantVR$context) ) / length( variantVR ) 
altCpG <- length( grep('CG', variantVR$altcontext) ) / length( variantVR )
##
## So half of a randomly selected set of 4 variants are "losses" of CpG sites.
## Given the sample size we obviously can't conclude anything, but purging of 
## CpG sites from the genome (via spontaneous 5mC -> T deamination) is a steady
## process that has heavily depleted the "average" genome of said doublets. 
##
print(paste0('In this sample, ', 
             (altCpG - refCpG)*100, '% of ref-to-alt changes create a CpG.'))
##
## It is interesting (NOT rigorous evidence of any sort) that a randomly picked
## set of SNPs from the 1KGP participants would happen to illustrate the theme.

getting started confusion

I just tried to setup the new R code on my local system and ran into the following error after following the README's code block:

Error in authorizer(authorize_url, is_interactive) : 
  httpuv package required to capture OAuth credentials.

We should fix the readme instructions so that this doesn't happen.

add a vignette for variant annotation with BioConductor

Include material similar to this excellent training material from the BioConductor team that uses 1,000 Genomes phase 1 variants

add retrying to all api methods

similar to the java client library, we should use some kind of exponential backoff for all API calls that fail with a >= 500 http error

A few things I experienced

Hi Sid,

Hope all is going well with the new version. I just tested a few things:

1) So because of how oauth-cache.R works, the .httr-oauth file gets saved in the directory where R is started from. Would it be possible to add to authenticate.R a check to see if the .httr-oauth file exists in ~/.store/genomics_r_client/ and if not, then it would initiate the browser authentication and place a copy of the file in ~/.store/genomics_r_client/? Otherwise depending on which directory I am in, then I will always have to authenticate via the browser if the file does not exist locally.

2) When I run the following getReadData query I get the following error, and it does not finish the request - the (...) are there to denote that there were many fetches of read data:

> getReadData(readsetId = "CPHG3MzoCRDRkqXzk7b6l_kB", chromosome ="1", start=100000000, end=101000000)
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: EjBDaGhEVUVoSE0wMTZiME5TUkZKcmNWaDZhemRpTm14ZmEwSVNBVEVZOUtPUE1DZ0EYAiDM0JvNCw
...
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: EjBDaGhEVUVoSE0wMTZiME5TUkZKcmNWaDZhemRpTm14ZmEwSVNBVEVZN2FlUE1DZ0EYAiDM0JvNCw
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: EjBDaGhEVUVoSE0wMTZiME5TUkZKcmNWaDZhemRpTm14ZmEwSVNBVEVZOHF1UE1DZ0EYAiDM0JvNCw
Fetching read data page
Parsing read data page
Error in elementType(x) :
  error in evaluating the argument 'x' in selecting a method for function 'elementType': Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
>

3) If I interrupt (via Ctrl-C) a getReadData request because I would like to start a different one, I can't seem to perform another request after that:

> getReadData(readsetId = "CMvnhpKTFhD04eLE-q2yxnU", chromosome ="1", start=100000000, end=101000000)
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: Ei9DaGREVFhadWFIQkxWRVpvUkRBMFpVeEZMWEV5ZVhodVZSSUJNUmlQMU5jdktBRRgCIIGy8rQI
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: Ei9DaGREVFhadWFIQkxWRVpvUkRBMFpVeEZMWEV5ZVhodVZSSUJNUmpxNU5jdktBQRgCIIGy8rQI
^C
>  getReadData(readsetId = "CMvnhpKTFhD04eLE-q2yxnU", chromosome ="1", start=10000000, end=101000000)
Fetching read data page
Error in function (type, msg, asError = TRUE)  :
  easy handled already used in multi handle

I would need to exit out of R, and resubmit the new getReadData request.

I know this is work in progress, but just wanted to give you a heads up.

Thanks,
Paul

submit this package to BioConductor

This package is still very much a work in progress, but it will be great to get it polished and ready for BioConductor! It needs to be run through a linter also to make sure the style is consistent, per #18

streamline authorization

It would be great if a user only had to think about authorization (e.g. providing a client_secrets file) the first time they install this package and from then on they are all set, regardless of their current working directory.

Add tests and setup travis

Update Bioconductor variant converters to tolerate missing values.

For example, variant data with no values for quality or filter cause the converters to fail.

Document steps for new package releases

After #28 is complete, let's be sure to document the exact steps for a new release much like we have for
utils-java

For example, click on the RStudio "check" button which takes care of running the following:

roxygen2::roxygenize('.', roclets=c('rd'))
R CMD build api-client-r
R CMD check --as-cran GoogleGenomics_0.1.2.tar.gz

and then for BioConductor to pick up the release do . . .

seqlevelsStyle oddities

This works OK:

getVariants(datasetId = "10473108253681171589", chromosome = "22",
start = 16051400, end = 16051500, oneBasedCoord = FALSE,
fields = NULL, converter = c)

Fetching variant page

Variants are now available

...

but this does not:

variant <- getVariants(datasetId = "10473108253681171589", chromosome = "chr22", start = 16051400, end = 16051500, oneBasedCoord = FALSE, fields = NULL, converter = c)

Fetching variant page

Variants are now available

NULL

This is OK:

getReadData(chromosome = "chr19", start = 45411941, end = 45412079,
readsetId = "CJ_ppJ-WCxDxrtDr5fGIhBA=",
endpoint = "https://www.googleapis.com/genomics/v1beta/",
pageToken = NULL)

Fetching read data page

Parsing read data page

Read data is now available

and this is not:

getReadData(chromosome = "19", start = 45411941, end = 45412079,
readsetId = "CJ_ppJ-WCxDxrtDr5fGIhBA=",
endpoint = "https://www.googleapis.com/genomics/v1beta/",
pageToken = NULL)

This sure seems like a job for GenomeInfoDb. Am I missing something obvious?
Also, specifying the reference assembly for each reference sequence (chromosome) would be nice.

Error [...] key 60 (char '<') not in lookup table

We get this rather cryptic error message after exporting variants from a Google genomics object:

granges <- variantsToGRanges(variants)

Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), :
key 60 (char '<') not in lookup table

If I have to guess, it may be due to alternateBases = "<NON_REF>" for reference intervals, when importing variants from a gVCF into the google genomics object.

Data are access-restricted, so I can't provide more information.

Update Bioconductor converters to handle multi-allelic data and data with non-variant segments.

The variant converters currently have trouble with multi-allelic data and non-variant segments. We can work around this by filtering and reshaping the data before sending it to the converters (example below) but it would be better if we pushed correct handling for this into the package.

See also #32 for another example of how non-variant segments can also be expressed in the data.

  variants <- getVariants(datasetId="10473108253681171589", chromosome="22",
              start=50300077, end=50301500)

  # Remove non-variant segments
  only_variants <- Filter(function(v) { 1 <= length(v$alternateBases)}, variants)

  # Convert to biallelic data by truncating alternateBases
  # (this isn't how it should be fixed, its just an example)
  biallelic_variants <- lapply(only_variants, function(v) {
    if(1 < length(v$alternateBases)) {
      v$alternateBases = v$alternateBases[[1]]
    }
    v
  })

  granges <- variantsToGRanges(biallelic_variants)

Port to native R support in Travis.

https://github.com/craigcitro/r-travis/wiki/Porting-to-native-R-support-in-Travis

I could not see a straight way to use the devel Bioc repo in the new native support. The code is now only compatible with the devel repo and using release Bioc packages will break the vignettes.

improve performance

This package is in active development and there are some inefficiencies in a few places (e.g., the code that accumulates multiple pages of variants and reads). But we'll tackle improving performance a bit when the code settles down a bit more.

Problem authenticating with invokeBrowser=F

Hi, I'm trying to authenticate on a server without X11, and whenever I give the "invokeBrowser=F" option I receive the following error message:

Error in authorizer(authorize_url, is_interactive) :
unused argument (is_interactive)

Thanks for looking into this.

Add a method to reset the OAuth token.

Something like the following, but package-internal and tested to ensure

getReads\Variants() return the "you need to authenticate" error until authentication is performed
the oauth flow is triggered by the next call to authenticate()

resetOAuth <- function() {
  if (FALSE != getOption("google_auth_cache_httr")) {
    message(paste("Removing cached OAuth token from", getOption("google_auth_cache_httr")))
    file.remove(getOption("google_auth_cache_httr"))
  }
  message("Removing cached OAuth token from memory.")
  rm("google_token", envir=GoogleGenomics:::.authStore)
}

variantsToVRanges loses sample/call information

VRanges can manage all the information on samples and calls, but the
current function just acquires one record per variant. Is this function/package
still in development?

genome reference assembly extraction from variants/reads objects?

One of the handy defaults in Bioconductor GRanges/VRanges objects is a slot for the genome (reference assembly) of each chromosome (should all be the same for any sane object, of course). This helps prevent comparisons of (e.g.) hg18 and hg19, or hg19 and hg38, data as if on the same coordinate system (a practice which is such a terrible idea that it's essentially never worth allowing).

Unfortunately, this safeguard can't be enforced when no genome is specified.

In the course of adding a default seqlevelsStyle for *Ranges, I realized it would be nice to have the genome reference automatically specified when retrieving variants or aligned reads. This doesn't seem to be possible from the current data structure. What am I overlooking here?

Improve messaging and instructions around authentication.

Right now the follow message is output upon each package attach, regardless of whether the oauth credentials were cached in a prior session. This can be a bit confusing.

GoogleGenomics: Do not forget to authenticate.
  Use GoogleGenomics::authenticate(file="secretsFile.json").
  See method documentation on how to obtain the secretsFile.

It would be nice if a message like that were output only if needed (e.g., if there are no cached credentials)

Additionally, it would be good to output something similar when a request returns an auth error such as:
ERROR: Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup.

Update getVariants* calls to accept a variantSetId instead of datasetId

There used to be one variantSet per dataset and the IDs were the same.
The v1beta2 and v1 variant search APIs accept a variant set ID, not a dataset ID.

See:
https://cloud.google.com/genomics/v1beta2/reference/variants/search
https://cloud.google.com/genomics/reference/rest/v1/variants/search

The getVariants* calls in R/variants.R should replace the datasetId parameter with variantSetId.

coding style

There are some competing style guides for R. Let's decide upon which one to follow or some combination of both and make sure the R package version of the client conforms to the style. (see also #17)

http://bioconductor.org/developers/how-to/coding-style/
https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml

Note that the identifier naming convention of avg.clicks in the Google style guide coincides with the S3 methods naming convention and sometimes causes issues/confusion.

http://stackoverflow.com/questions/6583265/what-does-s3-methods-mean-in-r

hosted shiny client seems to be dead

http://googlegenomics.shinyapps.io/reads returns a 404 when attempting to retrieve positions other than the default (as of 5 minutes ago)

bioconductor / googlegenomics Goto Github PK

googlegenomics's People

Contributors

Stargazers

Watchers

Forkers

googlegenomics's Issues

This works OK:

Fetching variant page

Variants are now available

...

but this does not:

Fetching variant page

Variants are now available

NULL

This is OK:

Fetching read data page

Parsing read data page

Read data is now available

and this is not:

Recommend Projects

Recommend Topics

Recommend Org