hpages / snplocsforge Goto Github PK
View Code? Open in Web Editor NEWTools for building SNPlocs data packages
Tools for building SNPlocs data packages
We have noted a substantial increase the proportion of non-biallelic SNPs across dbSNP 144 and 155 (increased from ~1% to 24%). This is explained in detail here and summarised in the code below (fuicntion written by @hpages).
Note that this is based on the dbSNP versions in Bioconductor (SNPlocs.Hsapiens.dbSNP) so we would like to confirm this aligns with dbSNP versions outside of Biocnductor but there doesn't seem to be any numbers available online about it. The only piece of work about this is quite old now and doesn't give specific numbers SNPs that come in threes. It would be great if anyone has any insight on this.
#wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144.txt.gz"
#gunzip snp144.txt.gz
snps <- limma::read.columns("~/Downloads/snp144.txt",
required.col=c("rs775809821"))$rs775809821
#missing header so add in SNP
snps <- c("rs775809821",snps)
library(BSgenome)
extractAlleles <- function(rsids, dbSNP, ref_genome="GRCh38")
{
pkgname <- paste0("SNPlocs.Hsapiens.dbSNP", dbSNP, ".", ref_genome)
pkgenv <- loadNamespace(pkgname)
snplocs <- get(pkgname, envir=pkgenv, inherits=FALSE)
snps <- snpsById(snplocs, rsids, ifnotfound="drop")
unname(setNames(IUPAC_CODE_MAP[mcols(snps)$alleles_as_ambig],
mcols(snps)$RefSNP_id)[rsids])
}
dbSNP144_alleles <- extractAlleles(snps, 144)
dbSNP155_alleles <- extractAlleles(snps, 155)
alleles <- data.frame(rsid=snps, dbSNP144_alleles, dbSNP155_alleles)
alleles <- data.table::setDT(alleles)
#remove where alleles missing for either
alleles <- alleles[!(is.na(dbSNP144_alleles) | is.na(dbSNP155_alleles))]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)==2,
match:="both biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)>2,
match:="both non-biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)==2,
match:="dbSNP 144 non-biallelic"]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)>2,
match:="dbSNP 155 non-biallelic"]
summ <- alleles[,.N,by=match]
summ[,prop:=N/sum(summ$N)]
#> summ
#match N prop
#1: both biallelic 100605258 0.7427438640
#2: dbSNP 155 non-biallelic 32266661 0.2382168183
#3: both non-biallelic 2565775 0.0189424855
#4: dbSNP 144 non-biallelic 13116 0.0000968322
And just to visually represent them:
library(ggplot2)
library(cowplot)
pastel_cols <- c("#9A8822","#F5CDB4","#F8AFA8",
"#FDDDA0","#74A089","#85D4E3",
#added extra to make 7
'#78A2CC')
ggplot(summ, aes(x = factor(match,levels=rev(c("both biallelic",
"both non-biallelic",
"dbSNP 155 non-biallelic",
"dbSNP 144 non-biallelic"))), y = N))+
geom_bar(stat = "identity", fill = pastel_cols[c(2,3,2,2)])+
geom_text(label = with(summ, paste(prettyNum(N,big.mark=",",scientific=FALSE),
paste0('\n (', round(prop*100,2), '%)'))),
hjust = -.3)+
ylim(0,max(summ$N)+20000000)+
coord_flip() +
theme_cowplot()+
theme(axis.title.y=element_blank())
Thanks for all the great work. Would it be possible to release: XtraSNPlocs.Hsapiens.dbSNP155.GRCh37
Hello, I would like to request if you could create dbSNP154.GRCh38/dbSNP154.GRCh37 or provide guidance to built dbSNP154. GWAS Catalogue uses dbSNP154 version and this could be helpful for help working on GWAS data.
Many thanks.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.