bioconductor / googlegenomics Goto Github PK
View Code? Open in Web Editor NEWAn R package for Google Genomics API queries.
License: Apache License 2.0
An R package for Google Genomics API queries.
License: Apache License 2.0
Hi Sid,
Congratulations on the new location, but probably might want to let Travis know of the new location (Bioconductor/GoogleGenomics), since it's still looking at the old one (https://travis-ci.org/googlegenomics/api-client-r) and cannot build it [i.e. it's saying build inaccessible].
Probably also updating everything else to point to the new one too would be nice - i.e. the Readme still points to:
devtools::install_github("googlegenomics/api-client-r")
Thanks,
~p
Per #15 and ensuing blahblah, here's the unvarnished code.
If it breaks, you get to keep the pieces. :-)
library(GoogleGenomics)
##
## may need to do this interactively:
##
authenticate('client_secrets.json')
##
## the rest should not require user input
##
variant <- getVariants(datasetId = "10473108253681171589", chromosome = "22",
start = 16051400, end = 16051500, oneBasedCoord = FALSE,
fields = NULL, converter = c)
variantVR <- variantsToVRanges(variant, slStyle = 'UCSC')
##
## Replace the middle base of a DNAStringSet with the alternate SNP allele
## (only tested for SNPs where both reference and alternate are 1bp wide!)
## There are so very many ways in which this is horrible...
##
replaceWithAlt <- function(stringSet, position, alt) {
strings <- as.character(stringSet)
substr(strings, position, position) <- as.character(alt)
return(DNAStringSet(strings))
}
##
## Use the BSgenome representation of hg19 to play with the ref/alt sequences:
## One base on each side, what happens when we go from reference to alternate?
## Major issue: how to know what reference assembly the variants come from?
##
library(BSgenome.Hsapiens.UCSC.hg19)
##
## Absent a reference assembly, the above (a guess!) could be very wrong...
## Hence issue #13 in the github repository
##
variantVR$context <- getSeq(Hsapiens, resize(variantVR, 3, fix='center'))
variantVR$altcontext <- replaceWithAlt(variantVR$context, 2, alt(variantVR))
show(variantVR)
##
## what is the consequence (what has typically been "lost" to the reference?)
## just in terms of simple doublets, nothing fancy here
##
refCpG <- length( grep('CG', variantVR$context) ) / length( variantVR )
altCpG <- length( grep('CG', variantVR$altcontext) ) / length( variantVR )
##
## So half of a randomly selected set of 4 variants are "losses" of CpG sites.
## Given the sample size we obviously can't conclude anything, but purging of
## CpG sites from the genome (via spontaneous 5mC -> T deamination) is a steady
## process that has heavily depleted the "average" genome of said doublets.
##
print(paste0('In this sample, ',
(altCpG - refCpG)*100, '% of ref-to-alt changes create a CpG.'))
##
## It is interesting (NOT rigorous evidence of any sort) that a randomly picked
## set of SNPs from the 1KGP participants would happen to illustrate the theme.
I just tried to setup the new R code on my local system and ran into the following error after following the README's code block:
Error in authorizer(authorize_url, is_interactive) :
httpuv package required to capture OAuth credentials.
We should fix the readme instructions so that this doesn't happen.
Include material similar to this excellent training material from the BioConductor team that uses 1,000 Genomes phase 1 variants
similar to the java client library, we should use some kind of exponential backoff for all API calls that fail with a >= 500 http error
Hi Sid,
Hope all is going well with the new version. I just tested a few things:
1) So because of how oauth-cache.R works, the .httr-oauth
file gets saved in the directory where R is started from. Would it be possible to add to authenticate.R a check to see if the .httr-oauth
file exists in ~/.store/genomics_r_client/
and if not, then it would initiate the browser authentication and place a copy of the file in ~/.store/genomics_r_client/
? Otherwise depending on which directory I am in, then I will always have to authenticate via the browser if the file does not exist locally.
2) When I run the following getReadData
query I get the following error, and it does not finish the request - the (...) are there to denote that there were many fetches of read data:
> getReadData(readsetId = "CPHG3MzoCRDRkqXzk7b6l_kB", chromosome ="1", start=100000000, end=101000000)
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: EjBDaGhEVUVoSE0wMTZiME5TUkZKcmNWaDZhemRpTm14ZmEwSVNBVEVZOUtPUE1DZ0EYAiDM0JvNCw
...
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: EjBDaGhEVUVoSE0wMTZiME5TUkZKcmNWaDZhemRpTm14ZmEwSVNBVEVZN2FlUE1DZ0EYAiDM0JvNCw
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: EjBDaGhEVUVoSE0wMTZiME5TUkZKcmNWaDZhemRpTm14ZmEwSVNBVEVZOHF1UE1DZ0EYAiDM0JvNCw
Fetching read data page
Parsing read data page
Error in elementType(x) :
error in evaluating the argument 'x' in selecting a method for function 'elementType': Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
>
3) If I interrupt (via Ctrl-C) a getReadData
request because I would like to start a different one, I can't seem to perform another request after that:
> getReadData(readsetId = "CMvnhpKTFhD04eLE-q2yxnU", chromosome ="1", start=100000000, end=101000000)
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: Ei9DaGREVFhadWFIQkxWRVpvUkRBMFpVeEZMWEV5ZVhodVZSSUJNUmlQMU5jdktBRRgCIIGy8rQI
Fetching read data page
Parsing read data page
Continuing read query with the nextPageToken: Ei9DaGREVFhadWFIQkxWRVpvUkRBMFpVeEZMWEV5ZVhodVZSSUJNUmpxNU5jdktBQRgCIIGy8rQI
^C
> getReadData(readsetId = "CMvnhpKTFhD04eLE-q2yxnU", chromosome ="1", start=10000000, end=101000000)
Fetching read data page
Error in function (type, msg, asError = TRUE) :
easy handled already used in multi handle
I would need to exit out of R, and resubmit the new getReadData
request.
I know this is work in progress, but just wanted to give you a heads up.
Thanks,
Paul
This package is still very much a work in progress, but it will be great to get it polished and ready for BioConductor! It needs to be run through a linter also to make sure the style is consistent, per #18
It would be great if a user only had to think about authorization (e.g. providing a client_secrets file) the first time they install this package and from then on they are all set, regardless of their current working directory.
For example, variant data with no values for quality
or filter
cause the converters to fail.
After #28 is complete, let's be sure to document the exact steps for a new release much like we have for
utils-java
For example, click on the RStudio "check" button which takes care of running the following:
roxygen2::roxygenize('.', roclets=c('rd'))
R CMD build api-client-r
R CMD check --as-cran GoogleGenomics_0.1.2.tar.gz
and then for BioConductor to pick up the release do . . .
getVariants(datasetId = "10473108253681171589", chromosome = "22",
start = 16051400, end = 16051500, oneBasedCoord = FALSE,
fields = NULL, converter = c)
variant <- getVariants(datasetId = "10473108253681171589", chromosome = "chr22", start = 16051400, end = 16051500, oneBasedCoord = FALSE, fields = NULL, converter = c)
getReadData(chromosome = "chr19", start = 45411941, end = 45412079,
readsetId = "CJ_ppJ-WCxDxrtDr5fGIhBA=",
endpoint = "https://www.googleapis.com/genomics/v1beta/",
pageToken = NULL)
getReadData(chromosome = "19", start = 45411941, end = 45412079,
readsetId = "CJ_ppJ-WCxDxrtDr5fGIhBA=",
endpoint = "https://www.googleapis.com/genomics/v1beta/",
pageToken = NULL)
This sure seems like a job for GenomeInfoDb. Am I missing something obvious?
Also, specifying the reference assembly for each reference sequence (chromosome) would be nice.
We get this rather cryptic error message after exporting variants from a Google genomics object:
granges <- variantsToGRanges(variants)
Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), :
key 60 (char '<') not in lookup table
If I have to guess, it may be due to alternateBases = "<NON_REF>" for reference intervals, when importing variants from a gVCF into the google genomics object.
Data are access-restricted, so I can't provide more information.
The variant converters currently have trouble with multi-allelic data and non-variant segments. We can work around this by filtering and reshaping the data before sending it to the converters (example below) but it would be better if we pushed correct handling for this into the package.
See also #32 for another example of how non-variant segments can also be expressed in the data.
variants <- getVariants(datasetId="10473108253681171589", chromosome="22",
start=50300077, end=50301500)
# Remove non-variant segments
only_variants <- Filter(function(v) { 1 <= length(v$alternateBases)}, variants)
# Convert to biallelic data by truncating alternateBases
# (this isn't how it should be fixed, its just an example)
biallelic_variants <- lapply(only_variants, function(v) {
if(1 < length(v$alternateBases)) {
v$alternateBases = v$alternateBases[[1]]
}
v
})
granges <- variantsToGRanges(biallelic_variants)
https://github.com/craigcitro/r-travis/wiki/Porting-to-native-R-support-in-Travis
I could not see a straight way to use the devel Bioc repo in the new native support. The code is now only compatible with the devel repo and using release Bioc packages will break the vignettes.
This package is in active development and there are some inefficiencies in a few places (e.g., the code that accumulates multiple pages of variants and reads). But we'll tackle improving performance a bit when the code settles down a bit more.
Hi, I'm trying to authenticate on a server without X11, and whenever I give the "invokeBrowser=F" option I receive the following error message:
Error in authorizer(authorize_url, is_interactive) :
unused argument (is_interactive)
Thanks for looking into this.
Something like the following, but package-internal and tested to ensure
getReads\Variants()
return the "you need to authenticate" error until authentication is performedauthenticate()
resetOAuth <- function() {
if (FALSE != getOption("google_auth_cache_httr")) {
message(paste("Removing cached OAuth token from", getOption("google_auth_cache_httr")))
file.remove(getOption("google_auth_cache_httr"))
}
message("Removing cached OAuth token from memory.")
rm("google_token", envir=GoogleGenomics:::.authStore)
}
VRanges can manage all the information on samples and calls, but the
current function just acquires one record per variant. Is this function/package
still in development?
One of the handy defaults in Bioconductor GRanges/VRanges objects is a slot for the genome (reference assembly) of each chromosome (should all be the same for any sane object, of course). This helps prevent comparisons of (e.g.) hg18 and hg19, or hg19 and hg38, data as if on the same coordinate system (a practice which is such a terrible idea that it's essentially never worth allowing).
Unfortunately, this safeguard can't be enforced when no genome is specified.
In the course of adding a default seqlevelsStyle for *Ranges, I realized it would be nice to have the genome reference automatically specified when retrieving variants or aligned reads. This doesn't seem to be possible from the current data structure. What am I overlooking here?
Right now the follow message is output upon each package attach, regardless of whether the oauth credentials were cached in a prior session. This can be a bit confusing.
GoogleGenomics: Do not forget to authenticate.
Use GoogleGenomics::authenticate(file="secretsFile.json").
See method documentation on how to obtain the secretsFile.
It would be nice if a message like that were output only if needed (e.g., if there are no cached credentials)
Additionally, it would be good to output something similar when a request returns an auth error such as:
ERROR: Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup.
There used to be one variantSet per dataset and the IDs were the same.
The v1beta2 and v1 variant search APIs accept a variant set ID, not a dataset ID.
See:
https://cloud.google.com/genomics/v1beta2/reference/variants/search
https://cloud.google.com/genomics/reference/rest/v1/variants/search
The getVariants* calls in R/variants.R should replace the datasetId parameter with variantSetId.
There are some competing style guides for R. Let's decide upon which one to follow or some combination of both and make sure the R package version of the client conforms to the style. (see also #17)
http://bioconductor.org/developers/how-to/coding-style/
https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml
Note that the identifier naming convention of avg.clicks in the Google style guide coincides with the S3 methods naming convention and sometimes causes issues/confusion.
http://stackoverflow.com/questions/6583265/what-does-s3-methods-mean-in-r
http://googlegenomics.shinyapps.io/reads returns a 404 when attempting to retrieve positions other than the default (as of 5 minutes ago)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.