timknut / geno_imputation
Documentation and code base for the Geno/Roslin imputation project
Is it OK to have 'NA' for missing values in the position column of the annotation file?
The output of the script is a file with a header line (the SNP names).
Run tail -n +2 in.ped > out.ped to remove it.
@timknut what do you think?
Paolo
@Unoqualsiasi Step 6 here: https://github.com/timknut/geno_imputation/blob/master/scripts/plink_workflow.Rmd
It is unclear how you implemented the HET filter and at what threshold you set it.
Can you please include how you made the file hetero_remove
and the HET threshold? This is commit #16.
There are quite a few duplicate positions for markers with different names in the raw file set FinalReport_54kV2_collection_ed1.ped and FinalReport_54kV2_collection_ed1.map.
Found these:
CHR POS ALLELES IDS
1 59409838 1,2 ARS-USMARC-Parent-DQ404150-rs29012530 UA-IFASA-2167
1 151349514 1,3 ARS-USMARC-Parent-DQ404151-rs29019282 Hapmap35832-SCAFFOLD197372_885
2 111155237 1,3 ARS-USMARC-Parent-DQ786757-rs29019900 Hapmap36382-SCAFFOLD210095_19074
3 58040470 1,2 ARS-USMARC-Parent-DQ435443-rs29010802 Hapmap52375-rs29010802
3 116448759 1,2 ARS-USMARC-Parent-DQ839235-rs29012691 Hapmap38870-BTA-01737
4 17200594 1,3 ARS-USMARC-Parent-DQ647186-rs29014143 Hapmap58054-rs29014143
4 94176209 1,3 ARS-USMARC-Parent-DQ485413-no-rs Hapmap33892-BES6_Contig314_677
7 18454636 1,2 ARS-USMARC-Parent-DQ786758-rs29024430 Hapmap36218-SCAFFOLD41765_2717
8 88974063 1,2 ARS-USMARC-Parent-DQ837644-rs29010468 UA-IFASA-2827
8 106174871 1,3 ARS-USMARC-Parent-DQ674265-rs29011266 Hapmap36391-SCAFFOLD165033_11046
9 45729853 1,3 ARS-USMARC-Parent-DQ846689-rs29011985 UA-IFASA-1922
9 98483346 1,2 ARS-USMARC-Parent-DQ786765-rs29009858 UA-IFASA-2515
10 55611885 1,3 ARS-USMARC-Parent-DQ984827-rs29012019 Hapmap59786-rs29012019
12 80629629 1,2 ARS-USMARC-Parent-DQ832700-rs29012872 Hapmap36566-SCAFFOLD135238_3808
13 25606469 1,4 ARS-USMARC-Parent-EF034081-rs29009668 Hapmap36096-SCAFFOLD140080_30362
14 48380429 1,3 ARS-USMARC-Parent-DQ846691-rs29019814 Hapmap35881-SCAFFOLD20653_10639
15 21207529 1,3 ARS-USMARC-Parent-EF042090-no-rs Hapmap35077-BES9_Contig405_919
15 38078775 1,3 ARS-USMARC-Parent-DQ866817-no-rs Hapmap34596-BES7_Contig444_1293
15 79187295 1,2 ARS-USMARC-Parent-DQ866818-rs29011701 UA-IFASA-5162
18 1839733 1,3 ARS-USMARC-Parent-EF028073-rs29014953 Hapmap57363-rs29014953
20 676757 1,3 ARS-USMARC-Parent-DQ984828-rs29010004 Hapmap59181-rs29010004
20 17837675 1,3 ARS-USMARC-Parent-DQ888313-no-rs Hapmap34041-BES1_Contig298_838
21 65198296 1,2 ARS-USMARC-Parent-EF026085-rs29021607 Hapmap35417-SCAFFOLD255533_15525
22 56526462 1,3 ARS-USMARC-Parent-EF034082-rs29013532 Hapmap55319-rs29013532
26 8221270 1,3 ARS-USMARC-Parent-DQ990834-rs29013727 Hapmap53362-rs29013727
26 38233337 1,3 ARS-USMARC-Parent-EF034086-no-rs Hapmap35000-BES9_Contig272_944
28 35331560 1,3 ARS-USMARC-Parent-EF026086-rs29013660 Hapmap36071-SCAFFOLD106623_11509
28 44261945 1,3 ARS-USMARC-Parent-EF042091-rs29014974 Hapmap36794-SCAFFOLD186736_5402
29 28647816 1,3 ARS-USMARC-Parent-EF034080-rs29024749 Hapmap36059-SCAFFOLD50303_4748
Have you seen these, Paolo?
https://www.cog-genomics.org/plink2/data#list_duplicate_vars can deal with them.
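A quick way to cross-check the list above without PLINK is to group the .map entries by chromosome and position. The following is only a minimal sketch: it assumes the standard four-column PLINK .map layout (chromosome, marker name, genetic distance, base-pair position), and the function name duplicate_positions and the genetic-distance value 0 in the sample lines are made up for illustration.

```python
from collections import defaultdict


def duplicate_positions(map_lines):
    """Group marker names by (chromosome, bp position); keep groups with more than one name."""
    by_pos = defaultdict(list)
    for line in map_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, name, _cm, pos = fields[:4]
        by_pos[(chrom, int(pos))].append(name)
    return {key: names for key, names in by_pos.items() if len(names) > 1}


# Sample lines built from markers mentioned in this thread (genetic distance 0 is assumed).
example = [
    "1 ARS-USMARC-Parent-DQ404150-rs29012530 0 59409838",
    "1 UA-IFASA-2167 0 59409838",
    "14 ARS-BFGL-BAC-10172 0 6371334",
]

for (chrom, pos), names in duplicate_positions(example).items():
    print(chrom, pos, " ".join(names))
```

For a real file, the lines can be passed straight from open("FinalReport_54kV2_collection_ed1.map"). Which marker of each pair to keep is a separate decision that this sketch does not make.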
Use the 25K data Tim provided in Dropbox.
Tell me when this is OK.
All the Nordic files have this issue in the metadata needed to convert the sample_id into ID (which provides the link to the pedigree for the imputation).
Example:
sample_id RDCDNKM000001010000963
ID 1.057E9
Someone messed with the column attributes in some editing step.
This could be a big problem because we don't know what program they used (we may run into a precision problem if we try to convert the values back). Any ideas?
Paolo
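To illustrate why converting such values back is risky, here is a small sketch. The two integer IDs are hypothetical and the four-significant-digit format is only a guess at what the unknown program did; the point is that distinct IDs can collapse to the same string once they are written in scientific notation.

```python
def to_scientific(value, significant_digits=4):
    """Format a number the way a spreadsheet-style export might (assumed behaviour)."""
    return f"{value:.{significant_digits - 1}E}"


# Hypothetical animal IDs that differ only in their last digits.
id_a = 1057000123
id_b = 1057000456

print(to_scientific(id_a), to_scientific(id_b))

# Parsing the value from the metadata gives only the rounded number back.
recovered = int(float("1.057E9"))
print(recovered)
```

If the export really truncated the digits like this, the original ID cannot be rebuilt from the metadata alone and would have to come from the unedited source files.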
This is what my directory tree currently looks like.
Do you agree @Unoqualsiasi ?
When the lab exports Illumina genotypes, they are usually exported in TOP-strand format.
Our pipeline expects this.
However, some of the data can be in another format.
Download the annotation with the TOP alleles from snpCHIMP and compare it against the Illumina files.
The marker names in this file have problems: someone edited them earlier.
Example:
Original marker name: ARS-USMARC-Parent-AY761135-RS29003723
Noric_54k_2012_ed1 name: ARS-USMARC-PARENT-AY761135-RS29003723
Because of this, only 34732 of 52259 markers match. I will try to fix it.
UPDATE: I was able to recover the information.
The changes I made in Noric_54k_2012_ed1 are:
Parent instead of PARENT
-no instead of -NO
-rs instead of -RS
_Contig instead of _CONTIG
Hapmap instead of HAPMAP
You can close the issue
Paolo
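The five substitutions listed above can be sketched as a small helper. This is not the script that was actually used: the name restore_case is made up, and plain substring replacement is an assumption that could over-match on other marker names, so the result should still be checked against the reference marker list.

```python
# Upper-cased fragment -> original spelling, in the order listed in the issue.
REPLACEMENTS = [
    ("PARENT", "Parent"),
    ("-NO", "-no"),
    ("-RS", "-rs"),
    ("_CONTIG", "_Contig"),
    ("HAPMAP", "Hapmap"),
]


def restore_case(name):
    """Apply each substitution in turn to one marker name."""
    for upper, original in REPLACEMENTS:
        name = name.replace(upper, original)
    return name


# Upper-cased variants of marker names that appear elsewhere in this thread.
print(restore_case("ARS-USMARC-PARENT-DQ485413-NO-RS"))
print(restore_case("HAPMAP33892-BES6_CONTIG314_677"))
```

Names that are already spelled correctly pass through unchanged, so the helper can be applied to a whole marker column.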
It appears that this file contains 73628 SNPs instead of the 54001 reported in the file header -.-
grep '2005' FinalReport_54kV1_ed1.txt | cut -f 1 > FinalReport_54kV1_ed1_markerlist.txt
wc -l FinalReport_54kV1_ed1_markerlist.txt
@Unoqualsiasi adds a script to produce a table of all raw-data samples that are not in the collection.
Tim adds a script and a table for the collection stuff.
Ref commits : a1c243e and 5a3b612
The plink files @Unoqualsiasi transferred to ftpgeno seem to be based on the "Native platform" positions, so we have different positions for markers depending on the chip (the 50Kv1 positions are not based on UMD3.1; the 50Kv2 and 777K positions are).
A simple example:
gjuvslan@login-0:~/geno/geno_imputation/ftpgeno/Plink_Input_Files/plink_files$ grep ARS-BFGL-BAC-10172 FinalReport_54kV1_ed1.map FinalReport_54kV2_ed1.map FinalReport_777k.map
FinalReport_54kV1_ed1.map:14 ARS-BFGL-BAC-10172 0 4736993
FinalReport_54kV2_ed1.map:14 ARS-BFGL-BAC-10172 0 6371334
FinalReport_777k.map:14 ARS-BFGL-BAC-10172 0 6371334
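To find every marker affected, not only this one, the same comparison can be run over all shared markers. A minimal sketch, assuming four-column PLINK .map lines; the function name position_conflicts and the short chip labels are made up for illustration, and the sample lines are the three grep hits above.

```python
def position_conflicts(maps):
    """maps: {label: iterable of .map lines}.

    Returns {marker: {label: (chrom, pos)}} for markers whose coordinates
    are not identical across all the files they appear in.
    """
    seen = {}
    for label, lines in maps.items():
        for line in lines:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            chrom, name, _cm, pos = fields[:4]
            seen.setdefault(name, {})[label] = (chrom, int(pos))
    return {
        name: coords
        for name, coords in seen.items()
        if len(set(coords.values())) > 1
    }


maps = {
    "54kV1": ["14 ARS-BFGL-BAC-10172 0 4736993"],
    "54kV2": ["14 ARS-BFGL-BAC-10172 0 6371334"],
    "777k": ["14 ARS-BFGL-BAC-10172 0 6371334"],
}

conflicts = position_conflicts(maps)
for name, coords in conflicts.items():
    print(name, coords)
```

With real data, each value in maps would be an open file handle for the corresponding .map file.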
Issue by Paolo #1
Fix by Paolo:
For matrix format, change
grep '11190319_1210' FinalReport_54kV2_feb2011_ed1.txt | cut -f 1 > illumina54k_v2_markerlist.txt
to the following (input_file MUST be in Illumina matrix format):
cut -f1 input_file | tail -n +11 > output_file