arrogantrobot / 23andme2vcf

convert your 23andme raw file to VCF | DEPRECATED, please see https://github.com/plantimals/2vcf

License: MIT License

Perl 100.00%

23andme2vcf's Introduction

Deprecated

To improve portability, reliability, and speed, I have moved this project to https://github.com/plantimals/2vcf, where it has been reimplemented. Please check it out there. If you have any questions or comments, please open an issue: https://github.com/plantimals/2vcf/issues/new

Convert your 23andme raw data into VCF format

This tool lets a 23andme user convert the raw file format into VCF, a format that is more widely supported across bioinformatics tools (see format details).

Two references are included, each limited to the sites targeted by the 23andme microarray. 23andme recently changed from build36 (hg18) to build37 (hg19), so any raw file downloaded before August 9, 2012 is based on the older build36 coordinates. Simply download the raw data again and it will be on the build37 coordinates. If there is a need for build36 support, let me know and I can add that in as an option.
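
To check which build a raw file uses before converting it, something like the following sketch may help. It is written in Python rather than the tool's Perl, and it assumes the raw file starts with `#` comment lines that mention the build as "build 36" or "build 37". That wording, the function name `detect_build`, and the sample lines are assumptions for illustration, not part of 23andme2vcf.

```python
import re


def detect_build(lines):
    """Return the build number mentioned in the leading '#' comments, or None."""
    for line in lines:
        if not line.startswith("#"):
            break  # header comments end at the first data line
        match = re.search(r"build\s*(\d+)", line, flags=re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None


# Hypothetical header lines; the real wording in a raw file may differ.
sample = [
    "# This is a made-up comment line.",
    "# We are using reference human assembly build 37.",
    "rs3131972\t1\t752721\tAG",
]
print(detect_build(sample))
```

If the function returns 36, the file predates the coordinate change and should be downloaded again.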

Indels are currently unsupported by this program. If you would like to see support for indels, please suggest a location where I can find the exact alleles, so they can be correctly represented in the resulting VCF.
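
As a rough illustration of what "unsupported" means in practice, the Python sketch below sorts genotype strings into SNP calls, indel calls (the `I` and `D` codes in the genotype column), and no-calls (`--`). The function name and category labels are hypothetical, and this is not how 23andme2vcf.pl itself is written.

```python
def classify_call(genotype):
    """Classify a raw genotype string as 'snp', 'indel', 'no-call' or 'other'."""
    letters = set(genotype)
    if not letters or letters <= {"-"}:
        return "no-call"
    if letters <= {"A", "C", "G", "T"}:
        return "snp"
    if letters <= {"I", "D"}:
        return "indel"
    return "other"


for call in ["TT", "AG", "--", "DI", "II"]:
    print(call, classify_call(call))
```

Only the calls classified as `snp` can be written to the VCF without an outside source for the exact indel alleles.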

If your sample was processed after November of 2013, you have version 4 results. If you see a suggestion to run on version 4 after your first conversion attempt, please do so, as you will get more usable results.

Usage

First, download your raw data from 23andme. Log in and click on your name in the upper right corner. Select the "browse raw data" option. Look just below your name for a link called "DOWNLOAD". You can also go directly to https://www.23andme.com/you/download/. Enter your password, answer the secret question, and select the "All DNA" data set. Hit the "download data" button. Once you have downloaded the raw data, unzip the file and note its name and location.

If you are on Windows, I won't be able to help you with specific commands, but it should be easy to figure out.

Dependencies:

git
perl
23andme raw data

These instructions will work from any bash shell, and probably plenty of other shells as well.

git clone git://github.com/arrogantrobot/23andme2vcf.git

cd 23andme2vcf

If you don't know what git is, or can't install it, you can just click on the "download zip" button near the top right of this github page. Once the download completes, go to the terminal, cd to the directory you downloaded the .zip to, and unzip it. Then cd into the 23andme2vcf-master directory. Now run the script as shown below.

perl 23andme2vcf.pl /path/to/23andme_raw.txt /path/to/output.vcf

Reference

The reference contained herein is a list of reference bases taken from NCBI build37 (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/). It is restricted to exactly those sites included in the 23andme microarray, in order to limit the file size and speed up creation of the VCF.
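
To show why the reference base is needed, here is a minimal Python sketch of how one genotype call plus its reference base can become a VCF record. It assumes SNP calls only (letters A, C, G, T) and unphased genotypes. The function name `to_vcf_record` is hypothetical, and the Perl script may handle details such as homozygous-reference sites differently.

```python
def to_vcf_record(chrom, pos, rsid, ref, genotype):
    """Build one tab-separated VCF data line from a SNP genotype call."""
    if not genotype or not set(genotype) <= set("ACGT"):
        raise ValueError(f"not a SNP call: {genotype!r}")
    # ALT lists every called allele that differs from the reference base.
    alts = []
    for allele in genotype:
        if allele != ref and allele not in alts:
            alts.append(allele)
    # GT indexes alleles: 0 is REF, 1..n follow the order of ALT.
    index = {ref: 0}
    for number, allele in enumerate(alts, start=1):
        index[allele] = number
    gt = "/".join(str(i) for i in sorted(index[a] for a in genotype))
    alt_field = ",".join(alts) if alts else "."
    fields = [chrom, str(pos), rsid, ref, alt_field, ".", ".", ".", "GT", gt]
    return "\t".join(fields)


print(to_vcf_record("chrX", 2689575, "rs311150", "G", "AA"))
```

With reference base G and call AA, the record carries ALT `A` and genotype `1/1`. A heterozygous call such as AG would give `0/1`.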

23andme2vcf's Issues

Tried to use it, but the output file doesn't contain any rows below the static headers for the .vcf file?

I followed the simple instructions to convert my 23andme downloaded .txt file to a .vcf file, receiving this output response:

gunzip: 23andme_v3_hg19_ref.txt.gz: not in gzip format

587747 sites were not included; these unmatched references can be found in sites_not_in_reference.txt.

Try running again, but specify the other reference version: ./23andme2vcf.pl {path to}mysnps.txt mysnps.vcf 4

FYI, I replaced the correct path to the .txt file with {path to} in this issue.

I tried the 23andme_v3_hg19_ref.txt.gz version as well, but received the same response telling me to try version 4. In either case, the output file (mysnps.vcf) is empty below this section:

##fileformat=VCFv4.2
##fileDate=20151116
##source=23andme2vcf.pl https://github.com/arrogantrobot/23andme2vcf
##reference=file://23andme_v4_hg19_ref.txt.gz
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GENOTYPE

Any clue on what's wrong? Please and thanks.

List of Steps to Convert 23andme.txt to VCF

Hi,

I couldn't find any other way to contact you, so I suppose it'll have to be through this. I was wondering if you could write out an exact list of steps for converting my 23andme.txt file to the VCF format using your tool. I'm asking as a person who has next to no experience with coding or computer science. If you are able to tell me exactly how to do this, I can share your tool with others like myself in a personal genomics class, where we upload VCF files to various programs and tools for analysis. Your assistance in this matter would be greatly appreciated.

Thanks.

New reference

Hi Rob,

thanks for the script. Do you mind committing the code that generates the reference file? Mine says 8140 sites were not included, even with version 4.

Many thanks.

V5 Chip

Could you please update the reference files for the new V5 chip?

Use gunzip -c rather than zcat for OS X

OS X includes a broken zcat (can't handle .gz files, only .Z files). The following patch fixes this issue.

Cheers,
Shaun

diff --git a/23andme2vcf.pl b/23andme2vcf.pl
index 2ef215f..d545a56 100755
--- a/23andme2vcf.pl
+++ b/23andme2vcf.pl
@@ -18,10 +18,10 @@ missing($raw_path) unless -s $raw_path;
 missing($ref_path) unless -s $ref_path;

 #open the raw data as a zip or text
-my $fh = ($raw_path =~ m/zip$/) ? IO::File->new("zcat $raw_path|") : IO::File->new($raw_path);
+my $fh = ($raw_path =~ m/zip$/) ? IO::File->new("gunzip -c $raw_path|") : IO::File->new($raw_path);

 #open the compressed reference file
-my $ref_fh = IO::File->new("zcat $ref_path|");
+my $ref_fh = IO::File->new("gunzip -c $ref_path|");

 my $output_fh = IO::File->new(">$output_path");
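
For comparison, the dependency on an external `zcat` or `gunzip` binary can be avoided altogether by checking the file's first two bytes and decompressing in-process. The sketch below does this in Python with the standard `gzip` module. It is an illustration only and not a proposed change to the Perl script. The helper name `open_text` is hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text(path):
    """Open a file for text reading, decompressing it if it is gzipped."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "r")


# Demonstration with temporary files standing in for the reference and raw data.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "ref.txt.gz")
    with gzip.open(gz_path, "wt") as out:
        out.write("chr1\t734462\trs12564807\tG\n")
    txt_path = os.path.join(tmp, "raw.txt")
    with open(txt_path, "w") as out:
        out.write("rs548049170\t1\t69869\tTT\n")
    with open_text(gz_path) as handle:
        gz_line = handle.readline().rstrip("\n")
    with open_text(txt_path) as handle:
        txt_line = handle.readline().rstrip("\n")

print(gz_line)
print(txt_line)
```

Checking the magic bytes rather than the file extension also gives a clear answer when a file named `.gz` is not actually compressed.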

Upgrade to VCF4.2?

The most current VCF specification is VCF 4.2 (http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-41).

In the limited utilization of the features of the VCF spec by the generated VCF file, none of the fields used change from VCF4.1 to VCF4.2. Additionally, when the version in the VCF header is changed to 4.2, the generated VCF file is still syntactically valid according to vcf-validator.

Are there any reasons not to have 23andme2vcf generate 4.2 instead of 4.1? If there are none, I'm happy to make the PR.
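
To make the scope of the change concrete, here is a Python sketch of a header generator in which the version is a single parameter. The header lines mirror the ones this tool writes, while the function name `vcf_header` and its parameters are hypothetical.

```python
from datetime import date


def vcf_header(reference, version="4.2", file_date=None, sample="GENOTYPE"):
    """Return the VCF meta-information lines plus the column header line."""
    file_date = file_date or date.today()
    columns = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL",
               "FILTER", "INFO", "FORMAT", sample]
    lines = [
        f"##fileformat=VCFv{version}",
        f"##fileDate={file_date:%Y%m%d}",
        "##source=23andme2vcf.pl https://github.com/arrogantrobot/23andme2vcf",
        f"##reference=file://{reference}",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#" + "\t".join(columns),
    ]
    return "\n".join(lines) + "\n"


header = vcf_header("23andme_v4_hg19_ref.txt.gz", file_date=date(2015, 11, 16))
print(header, end="")
```

Switching between 4.1 and 4.2 then only changes the first line, which matches the observation that none of the fields used here differ between the two versions.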

b36?

Hi,

I would like to convert v2 23andme files (build 36) to VCF. There are a lot of v2 files to convert, and I am not able to re-download the data. Could you update your program to support v2 build 36 conversion?

Broken genotypes

I keep getting just "0" or "1" for the genotype after ~560,000 correct conversions using the 23andme_v4_hg19_ref.txt.gz data, and after ~500,000 correct conversions using the 23andme_v3_hg19_ref.txt.gz data, when running "perl 23andme2vcf.pl <path to 23andme txt.zip file> genome_XYZ.vcf 4 (or 3)". The break happens in both after:

chrX 2689575 rs311150 G A . . . GT 1/1

I've seen this with two different individuals' SNP files, one generated in Feb 2015 and another generated Dec 2016. Both the above rsID and the following one are still listed in the current dbSNP, and both entries in the 23andme_v5_ht19_ref.txt.gz appear valid. I've run this on a laptop and on a Linux server so don't think it's a resource issue. Any suggestions? Thanks.

Larry

MT reference incorrect

It appears that you used UCSC's hg19 reference instead of NCBI's GRCh37 to build your own reference. Normally this is fine, but the two builds differ on chromosome MT. For example, the UCSC genome browser shows a T instead of a C at MT:150.

You can use bcftools to validate against a reference.

bcftools norm -ce -f /reference/homo.sapiens/GRCh37/Homo_sapiens_assembly19.fasta 23andme.vcf
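
The idea behind that check is simple enough to sketch without bcftools: look up each REF allele in the reference sequence at its 1-based position and report disagreements. The Python below is a toy version under stated assumptions. The function name is hypothetical and the six-base sequence is made-up data, not a real chromosome.

```python
def ref_mismatches(records, sequences):
    """Return (chrom, pos, vcf_ref, actual) for records that disagree.

    `actual` is None when the chromosome or position is not in `sequences`.
    """
    mismatches = []
    for chrom, pos, ref in records:
        sequence = sequences.get(chrom)
        if sequence is None or pos < 1 or pos > len(sequence):
            mismatches.append((chrom, pos, ref, None))
            continue
        actual = sequence[pos - 1].upper()  # VCF positions are 1-based
        if actual != ref.upper():
            mismatches.append((chrom, pos, ref, actual))
    return mismatches


# Toy data for illustration only.
sequences = {"MT": "GATCAC"}
records = [("MT", 1, "G"), ("MT", 3, "C"), ("chrM", 2, "A")]
print(ref_mismatches(records, sequences))
```

The last toy record also shows how a naming difference such as `chrM` versus `MT` surfaces as an unknown chromosome rather than as a base mismatch.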

Add In/Del support

This may take some significant development. The exact coordinates and alleles involved in the insertions and deletions signified by the I's and D's in the genotype column of the 23andme raw data may need to be retrieved from dbSNP or some other outside source. Once done, this reference can be added in to the existing SNP reference file.

23andMe SNPs have been updated

The 23andMe SNPs have been updated, so when I run the script I see this error message:

raw data file and reference file are out of sync at ./23andme2vcf.pl line 153, <GEN1> line 587611.
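
An error like this suggests the two files are being read in lockstep, which breaks as soon as either side gains or loses a site. One alternative, sketched below in Python, is to key the reference by (chromosome, position) and look each raw call up, so that unknown sites are reported instead of stopping the run. The function name, the `"chr"` prefix rule, and the AG genotype in the toy data are assumptions for illustration.

```python
def join_by_site(raw_rows, ref_rows):
    """Pair each raw call with its reference base, keyed by (chrom, pos)."""
    ref_by_site = {(chrom, pos): base for chrom, pos, _rsid, base in ref_rows}
    matched, unmatched = [], []
    for rsid, chrom, pos, genotype in raw_rows:
        key = ("chr" + chrom, pos)  # assumed naming rule: "1" -> "chr1"
        if key in ref_by_site:
            matched.append((key[0], pos, rsid, ref_by_site[key], genotype))
        else:
            unmatched.append((rsid, chrom, pos))
    return matched, unmatched


ref_rows = [
    ("chr1", 734462, "rs12564807", "G"),
    ("chr1", 752721, "rs3131972", "A"),
]
raw_rows = [
    ("rs548049170", "1", 69869, "TT"),  # not present in the reference
    ("rs3131972", "1", 752721, "AG"),   # genotype invented for the example
]
matched, unmatched = join_by_site(raw_rows, ref_rows)
print(matched)
print(unmatched)
```

The trade-off is memory: the whole reference is held in a dictionary, which is reasonable for a file limited to the microarray sites.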

hg18 reference

hi rob,

i tried using your script to convert the 23andme data from the personal genome project. it appears that data is still aligned to hg18. could you provide the corresponding reference or let me know how to compile it myself (for example, do you have a script that derives it from the ncbi fasta files?)

thanks
tim

No Licensing Data

I noticed that there's no licensing document or statement for the script. I'm interested in expanding it, but I don't want to cause any issues. What terms is this released under, please?

23andMe Changed their columns

My file looks like this:

rs548049170	1	69869	TT
rs13328684	1	74792	--
rs9283150	1	565508	--

But the reference document in the cloned repo looks like this:

chr1	734462	rs12564807	G
chr1	752721	rs3131972	A
chr1	760998	rs148828841	C
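
The two samples above have different column orders: the raw data line is rsid, chromosome, position, genotype, while the reference line is chromosome, position, rsid, reference base. A small Python sketch of parsing each layout separately is shown below. The function names are hypothetical, and treating lines that start with `#` as comments is an assumption about the raw file.

```python
def parse_raw_line(line):
    """Parse 'rsid<TAB>chrom<TAB>pos<TAB>genotype'; return None for comments."""
    if not line.strip() or line.startswith("#"):
        return None
    rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
    return rsid, chrom, int(pos), genotype


def parse_ref_line(line):
    """Parse 'chrom<TAB>pos<TAB>rsid<TAB>ref_base' from the reference file."""
    chrom, pos, rsid, base = line.rstrip("\n").split("\t")
    return chrom, int(pos), rsid, base


print(parse_raw_line("rs548049170\t1\t69869\tTT\n"))
print(parse_ref_line("chr1\t734462\trs12564807\tG\n"))
```

Note that the chromosome names also differ (`1` in the raw data, `chr1` in the reference), so any comparison between the two has to normalize them first.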

Warnings for "use of uninitialized value" lines 164-170

running the script on 23andMe raw data (downloaded 29th of Oct, 2013) prints out a TON of warnings:

Use of uninitialized value $data_line in scalar chomp at 23andme2vcf.pl line 164, <GEN1> line 960613.
Use of uninitialized value $data_line in split at 23andme2vcf.pl line 165, <GEN1> line 960613.
Use of uninitialized value $chr in string eq at 23andme2vcf.pl line 166, <GEN1> line 960613.
Use of uninitialized value $my_pos in numeric gt (>) at 23andme2vcf.pl line 170, <GEN1> line 960613.

over and over again. Also, when script starts, I get this warning for every line in the data file:

Use of uninitialized value $my_pos in numeric gt (>) at 23andme2vcf.pl line 170, <GEN1> line 274267.