Git Product home page Git Product logo

rsidmap's Introduction

rsidmap's People

Contributors

zhanghaoyang0 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

rsidmap's Issues

input file_dbsnp form command line error

Really nice tools for doing rsid mapping, I am trying to modify the code to input dbsnp vcf on the command line, here is the code

"
import os
import gzip
import time
import argparse
import numpy as np

start = time.time()

flags parser

parser = argparse.ArgumentParser(description='rsidmap')
parser.add_argument('--build', type=str, default='hg19')
parser.add_argument('--exact_map', type=bool, default=False)
parser.add_argument('--chr_col', type=str, default='CHR')
parser.add_argument('--pos_col', type=str, default='POS')
parser.add_argument('--ref_col', type=str, default='REF')
parser.add_argument('--alt_col', type=str, default='ALT')
parser.add_argument('--file_gwas', type=str, default='./example/df_hg19.txt')
parser.add_argument('--file_out', type=str, default='./example/df_hg19_rsidmap.txt')
parser.add_argument('--file_dbsnp', type=str, default='./dbsnp/GCF_000001405.25.gz') # add this argument for file_dbsnp
args = parser.parse_args()

build = args.build
chr_col = args.chr_col; pos_col = args.pos_col; ref_col = args.ref_col; alt_col = args.alt_col
file_gwas = args.file_gwas; file_out = args.file_out
exact_map = args.exact_map
file_dbsnp = args.file_dbsnp ## Access the file_dbsnp argument

# default setting, for test

build = 'hg19'

chr_col = 'CHR'

pos_col = 'POS'

ref_col = 'A2'

alt_col = 'A1'

file_gwas = './example/df_hg19.txt.gz'

file_out = './example/df_hg19_rsidmap.txt'

exact_map = False

print('Setting:')
print('build: '+ build)
print('chr_col: '+ chr_col); print('pos_col: '+ pos_col); print('ref_col: '+ ref_col); print('alt_col: '+ alt_col)
print('file_gwas: '+ file_gwas); print('file_out: '+ file_out)
print('exact_map: '+ str(exact_map))

dicts and functions

def find_rsid(items, chr, pos, ref, alt, exact_map):
# indel must exact_map (i.e., C>CT != CT>C)
exact_map1 = True if (len(alt)>1)|(len(ref)>1) else exact_map
snp = f'{chr}:{pos}:{ref}:{alt}' # pseudo snp
if len(items)==0: return(snp)
else:
items = [x.split()[2:5] for x in items] # select useful fields
# break multi allele (e.g., A,G>T to A>G and C>T)
items1 = [] # alt combination
for item in items:
items1 += [[item[0], x, y] for x in item[1].split(',') for y in item[2].split(',')]
flags = [[x[1], x[2]]==[ref, alt] if exact_map1 else {x[1], x[2]}=={ref, alt} for x in items1]
if sum(flags)>0: snp = np.array(items1)[flags][0][0] # greedy map, use first
return(snp)

def openfile(filename):
if filename.endswith('.gz'): return gzip.open(filename, 'rt')
else: return open(filename)

d19 = {"1": "NC_000001.10", "2": "NC_000002.11", "3": "NC_000003.11", "4": "NC_000004.11","5": "NC_000005.9",
"6": "NC_000006.11", "7": "NC_000007.13", "8": "NC_000008.10","9": "NC_000009.11", "10": "NC_000010.10",
"11": "NC_000011.9", "12": "NC_000012.11","13": "NC_000013.10", "14": "NC_000014.8", "15": "NC_000015.9",
"16": "NC_000016.9","17": "NC_000017.10", "18": "NC_000018.9", "19": "NC_000019.9", "20": "NC_000020.10",
"21": "NC_000021.8", "22": "NC_000022.10", "X": "NC_000023.10", "Y": "NC_000024.9", 'M': "NC_012920.1"}
d38 = {"1": "NC_000001.11", "2": "NC_000002.12", "3": "NC_000003.12", "4": "NC_000004.12", "5": "NC_000005.10",
"6": "NC_000006.12", "7": "NC_000007.14", "8": "NC_000008.11", "9": "NC_000009.12", "10": "NC_000010.11",
"11": "NC_000011.10", "12": "NC_000012.12", "13": "NC_000013.11", "14": "NC_000014.9", "15": "NC_000015.10",
"16": "NC_000016.10", "17": "NC_000017.11", "18": "NC_000018.10", "19": "NC_000019.10", "20": "NC_000020.11",
"21": "NC_000021.9", "22": "NC_000022.11", "X": "NC_000023.11", "Y": "NC_000024.10", "M": "NC_012920.1"}
chr2ncid = {'hg19': d19, 'hg38': d38}

#file_dbsnp = {'hg19': './dbsnp/GCF_000001405.25.gz', 'hg38': './dbsnp/GCF_000001405.40.gz'} ##remove this line

cols = [chr_col, pos_col, ref_col, alt_col]
res = open(file_out, 'w')
nsnp = len(list(openfile(file_gwas)))-1; i = 1; n_map = 0

with openfile(file_gwas) as f:
for line in f:
if i % round(nsnp/20)==0: print(f'processed {i-1}/{nsnp} snp ({round(100*(i-1)/nsnp, 1)}%)')
if i == 1: # header row
idx = [line.split().index(x) for x in cols] # idx for chr, pos, a1, a2
out = line.replace('\n', '\tSNP\n')
else:
chr, pos, ref, alt = [line.split()[x] for x in idx]
ncid = chr2ncid[build][str(chr)] # chr id in dbsnp
items = os.popen(f'tabix {file_dbsnp[build]} {ncid}:{pos}-{int(pos)}').readlines()
snp = find_rsid(items, chr, pos, ref, alt, exact_map)
if 'rs' in snp: n_map += 1
out = line.replace('\n', '\t'+snp+'\n')
_ = res.write(out) # use a variable to aviod printing
i += 1

res.close()

end = time.time()
print(f'done! N. rsid mapped: {n_map}')
print (f'spend {round(end-start, 2)} sec')
"

but I got error on line 91, in
items=os.popen(f'tabix {file_dbsnp[build]} {ncid}:{pos}-{int(pos)}').readlines()
TypeErrorL string indeces must be integers

I am a beginner for python user, is there any advice to change the code?

Please update the code for downloading latest_release dbsnp. Thank you!

When I used the code to download latest_release dbsnp, I met:

wget -c https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz -P dbsnp/
--2023-03-14 14:09:08-- https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.229, 165.112.9.228, 2607:f220:41e:250::7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25405089331 (24G) [application/x-gzip]
dbsnp: File existsdbsnp/GCF_000001405.25.gz: No such file or directory

I suppose I met the situation you metioned "dbsnp is updated", so please update the code. Thanks a lot!

code bug: 'if "#" in line: continue'

There is a bug in line 104 of rsidmap.v2.py.

code bug: 'if "#" in line: continue'

This will skip all lines containing the character "#", even though it is an SNP and not a comment line.
Such as:
NC_000001.11 1213738 rs587777075 G A . . RS=587777075;dbSNPBuildID=142;SSR=0;GENEINFO=TNFRSF4:7293;VC=SNV;PUB;NSM;U5;R5;GNO;FREQ=ExAC:1,3.474e-05|GnomAD:1,2.139e-05|GnomAD_exomes:1,2.215e-05|TOMMO:1,3.539e-05|TOPMED:1,1.511e-05|dbGaP_PopFreq:0.9999,0.0001377;CLNVI=.,OMIM:600315.0001|UniProtKB:P43489**#**VAR_070942;CLNORIGIN=.,1;CLNSIG=.,5;CLNDISDB=.,MONDO:MONDO:0014268/MedGen:C3810053/Orphanet:431149/OMIM:615593;CLNDN=.,Combined_immunodeficiency_due_to_OX40_deficiency;CLNREVSTAT=.,no_criteria;CLNACC=.,RCV000082860.5;CLNHGVS=NC_000001.11:g.1213738=,NC_000001.11:g.1213738G>A

Finally, thank you for sharing the source code of rsidmap very much, which has given me a lot of help!

KeyError: '19'

Hi! Follow all the steps for cloning and download dbsnp database and everything was fine.

However, when running rsidmap_v2 got this:

python ./rsidmap/code/rsid map_v2.py --build 19 --chr_col Chromosome --pos_col Position --r ef_col AlternateAllele --alt_col EffectAllele --file_gwas ./data /GWAS_sumstats/kneehip_oa.txt --file_out ./data/GWAS_sumstats_kneehip_oa_rsids.txt Setting: build: 19 chr_col: Chromosome pos_col: Position ref_col: AlternateAllele alt_col: EffectAllele file_gwas: ./data/GWAS_sumstats/kneehip_oa.txt file_out: ./data/GWAS_sumstats_kneehip_oa_rsids.txt exact_map: False making dnsnp dict ... 23763393 unique and valid snp. Traceback (most recent call last): File "./rsidmap/code/rsidmap_v2.py", line 102, in <module> with gzip.open(file_dbsnp[build], 'rt') as f: KeyError: '19'

GWAS file can be download from: https://personal.broadinstitute.org/ryank/KP.Format.GO.FILTER.GW.KneeHipOA.FULL.09052019.txt.gz

Any idea of how can I solve this?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.