timknut / geno_imputation

Documentation and code base for the Geno/Roslin imputation project

Shell 94.66% R 5.34%

geno_imputation's Issues

seqreport_edit.py question

Is it OK to have 'NA' for missing values in the position column of the annotation file?

The output of the script is a file with a header line (the SNP names). To remove it, I run:

tail -n +2 in.ped > out.ped

@timknut what do you think?

Paolo

Missing code for heterozygosity filtering

@Unoqualsiasi Step 6 here: https://github.com/timknut/geno_imputation/blob/master/scripts/plink_workflow.Rmd

It is unclear how you implemented the HET filter and what threshold you used.
Can you please include how you made the file hetero_remove, and the HET threshold? This refers to commit #16.
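
For reference, one common way to build such a removal list is to compute per-sample heterozygosity from PLINK's `--het` output and flag samples that lie several standard deviations from the mean. The sketch below is not the code behind commit #16. The 3 SD default, the function name `het_outliers`, and the file names in the comments are assumptions for illustration only.

```shell
# Print "FID IID" for samples whose heterozygosity rate lies more than
# N_SD standard deviations from the mean (N_SD = 3 is an assumed default).
# Input: a PLINK .het file with columns FID IID O(HOM) E(HOM) N(NM) F,
# e.g. produced by: plink --file data --cow --het --out data
het_outliers() {
    het_file=$1
    n_sd=${2:-3}
    awk -v n_sd="$n_sd" '
        NR == 1 { next }                       # skip the header line
        $5 > 0 {
            rate[NR] = ($5 - $3) / $5          # (N(NM) - O(HOM)) / N(NM)
            id[NR] = $1 " " $2
            sum += rate[NR]; sumsq += rate[NR] * rate[NR]; n++
        }
        END {
            if (n < 2) exit
            mean = sum / n
            var = (sumsq - n * mean * mean) / (n - 1)
            sd = sqrt(var > 0 ? var : 0)
            for (i in rate)
                if (rate[i] < mean - n_sd * sd || rate[i] > mean + n_sd * sd)
                    print id[i]
        }
    ' "$het_file"
}

# Hypothetical usage:
#   het_outliers data.het 3 > hetero_remove
```

A list in this two-column form could then be passed to PLINK's `--remove`. Whether 3 SD is appropriate for this data set is exactly the open question in this issue.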

Duplicate markers in Illumina files

There are quite a few duplicate positions for markers with different names in the raw file set FinalReport_54kV2_collection_ed1.ped and FinalReport_54kV2_collection_ed1.map.

Found these:

CHR     POS     ALLELES IDS
1       59409838        1,2     ARS-USMARC-Parent-DQ404150-rs29012530 UA-IFASA-2167
1       151349514       1,3     ARS-USMARC-Parent-DQ404151-rs29019282 Hapmap35832-SCAFFOLD197372_885
2       111155237       1,3     ARS-USMARC-Parent-DQ786757-rs29019900 Hapmap36382-SCAFFOLD210095_19074
3       58040470        1,2     ARS-USMARC-Parent-DQ435443-rs29010802 Hapmap52375-rs29010802
3       116448759       1,2     ARS-USMARC-Parent-DQ839235-rs29012691 Hapmap38870-BTA-01737
4       17200594        1,3     ARS-USMARC-Parent-DQ647186-rs29014143 Hapmap58054-rs29014143
4       94176209        1,3     ARS-USMARC-Parent-DQ485413-no-rs Hapmap33892-BES6_Contig314_677
7       18454636        1,2     ARS-USMARC-Parent-DQ786758-rs29024430 Hapmap36218-SCAFFOLD41765_2717
8       88974063        1,2     ARS-USMARC-Parent-DQ837644-rs29010468 UA-IFASA-2827
8       106174871       1,3     ARS-USMARC-Parent-DQ674265-rs29011266 Hapmap36391-SCAFFOLD165033_11046
9       45729853        1,3     ARS-USMARC-Parent-DQ846689-rs29011985 UA-IFASA-1922
9       98483346        1,2     ARS-USMARC-Parent-DQ786765-rs29009858 UA-IFASA-2515
10      55611885        1,3     ARS-USMARC-Parent-DQ984827-rs29012019 Hapmap59786-rs29012019
12      80629629        1,2     ARS-USMARC-Parent-DQ832700-rs29012872 Hapmap36566-SCAFFOLD135238_3808
13      25606469        1,4     ARS-USMARC-Parent-EF034081-rs29009668 Hapmap36096-SCAFFOLD140080_30362
14      48380429        1,3     ARS-USMARC-Parent-DQ846691-rs29019814 Hapmap35881-SCAFFOLD20653_10639
15      21207529        1,3     ARS-USMARC-Parent-EF042090-no-rs Hapmap35077-BES9_Contig405_919
15      38078775        1,3     ARS-USMARC-Parent-DQ866817-no-rs Hapmap34596-BES7_Contig444_1293
15      79187295        1,2     ARS-USMARC-Parent-DQ866818-rs29011701 UA-IFASA-5162
18      1839733         1,3     ARS-USMARC-Parent-EF028073-rs29014953 Hapmap57363-rs29014953
20      676757          1,3     ARS-USMARC-Parent-DQ984828-rs29010004 Hapmap59181-rs29010004
20      17837675        1,3     ARS-USMARC-Parent-DQ888313-no-rs Hapmap34041-BES1_Contig298_838
21      65198296        1,2     ARS-USMARC-Parent-EF026085-rs29021607 Hapmap35417-SCAFFOLD255533_15525
22      56526462        1,3     ARS-USMARC-Parent-EF034082-rs29013532 Hapmap55319-rs29013532
26      8221270         1,3     ARS-USMARC-Parent-DQ990834-rs29013727 Hapmap53362-rs29013727
26      38233337        1,3     ARS-USMARC-Parent-EF034086-no-rs Hapmap35000-BES9_Contig272_944
28      35331560        1,3     ARS-USMARC-Parent-EF026086-rs29013660 Hapmap36071-SCAFFOLD106623_11509
28      44261945        1,3     ARS-USMARC-Parent-EF042091-rs29014974 Hapmap36794-SCAFFOLD186736_5402
29      28647816        1,3     ARS-USMARC-Parent-EF034080-rs29024749 Hapmap36059-SCAFFOLD50303_4748

Have you seen these, Paolo?
PLINK's --list-duplicate-vars (https://www.cog-genomics.org/plink2/data#list_duplicate_vars) can deal with them.

use the 25K data Tim provided in Dropbox

Let me know when this is OK.

sample_id ID issue Nordic files

All the Nordic files have this issue in the metadata needed to convert sample_id into ID (which provides the link to the pedigree for the imputation).

example:

sample_id RDCDNKM000001010000963
ID 1.057E9

Someone here messed with the column attributes in some editing step.

This could be a big problem because we don't know what program they used, and we may lose precision if we try to convert the IDs back. Any ideas?
Paolo
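
To gauge how many records are affected, the rows whose ID was rewritten in scientific notation can be listed. This is only a sketch: it assumes a whitespace-delimited metadata file with the ID in the second column, and the function name `sci_notation_ids` is made up for illustration.

```shell
# Print rows whose ID column holds a value in scientific notation.
# Usage: sci_notation_ids FILE [COLUMN]
# COLUMN defaults to 2, which is an assumption about the metadata layout.
sci_notation_ids() {
    awk -v col="${2:-2}" '$col ~ /^[0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+$/' "$1"
}

# Hypothetical usage:
#   sci_notation_ids nordic_metadata.txt 2 | wc -l
```

A value such as 1.057E9 keeps only four significant digits, so expanding it gives 1057000000 for every ID that started with those digits. The original IDs cannot be recovered from these values alone; an unedited export would be needed.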

Have a common rawdata file tree.

At the moment my directory tree looks like this.
Do you agree, @Unoqualsiasi?

Illumina TOP alleles

When the lab exports Illumina genotypes, they are usually in the TOP-strand format, which our pipeline expects.

However, some of the data may be in another format.

Check

Download the annotation with the TOP alleles from SNPchiMp and compare it against the Illumina files.

Nordic_54k_2012_ed1_markerlist

The marker names in this file have some problems: someone edited them earlier.

Example:
marker original name: ARS-USMARC-Parent-AY761135-RS29003723

Nordic_54k_2012_ed1 name: ARS-USMARC-PARENT-AY761135-RS29003723

Because of this, only 34732 of 52259 markers match by name. I will try to fix it.

UPDATE: I was able to recover the information.

The changes I made in Nordic_54k_2012_ed1 are:

Parent instead of PARENT
-no instead of -NO
-rs instead of -RS
_Contig instead of _CONTIG
Hapmap instead of HAPMAP
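
The five substitutions listed above can be sketched as a single sed filter. This is not necessarily how the fix was actually applied; the function name `fix_marker_case` and the file names in the comment are made up for illustration. The patterns are taken literally from the list and replace only the first occurrence per line, so they could still over-match on names that contain these strings for other reasons.

```shell
# Restore the mixed-case spelling of marker names that were upper-cased.
# Reads one marker name per line on stdin, writes corrected names to stdout.
fix_marker_case() {
    sed -e 's/PARENT/Parent/' \
        -e 's/-NO/-no/' \
        -e 's/-RS/-rs/' \
        -e 's/_CONTIG/_Contig/' \
        -e 's/HAPMAP/Hapmap/'
}

# Hypothetical usage:
#   fix_marker_case < edited_markerlist.txt > fixed_markerlist.txt
```

After running it, the number of names shared with the original marker list can be recounted to confirm that the remaining mismatches are gone.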

You can close the issue 🎱

Paolo

markerlist file FinalReport_54kV1_ed1.txt

It appears that this file contains 73628 SNPs instead of 54001 as reported in the header of the file -.-

grep '2005' FinalReport_54kV1_ed1.txt | cut -f 1 > FinalReport_54kV1_ed1_markerlist.txt

wc -l FinalReport_54kV1_ed1_markerlist.txt
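
One possible explanation for the excess is that some marker names are listed more than once. That is an assumption, not something confirmed here. A minimal sketch to test it compares the total and unique line counts of the marker list; the function name `marker_counts` is made up for illustration.

```shell
# Print "TOTAL UNIQUE" line counts for a marker list (one name per line).
marker_counts() {
    total=$(wc -l < "$1")
    unique=$(sort -u "$1" | wc -l)
    # Arithmetic expansion strips the padding some wc implementations add.
    echo "$((total)) $((unique))"
}

# Hypothetical usage:
#   marker_counts FinalReport_54kV1_ed1_markerlist.txt
```

If the two numbers differ, `sort FILE | uniq -d` lists the repeated names.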

Add the summarize rawdata scripts and final table to the Repo.

@Unoqualsiasi adds a script to produce a table for all raw-data samples that are not in the collection.
Tim adds a script and table for the collection data.

Approximate table format:

Different positions for same marker in different plink files

Ref commits : a1c243e and 5a3b612

The plink files @Unoqualsiasi transferred to ftpgeno seem to be based on the "Native platform" positions, so we have different positions for markers depending on the chip (the 50Kv1 positions are not based on UMD3.1; the 50Kv2 and 777K positions are).

A simple example:
gjuvslan@login-0:~/geno/geno_imputation/ftpgeno/Plink_Input_Files/plink_files$ grep ARS-BFGL-BAC-10172 FinalReport_54kV1_ed1.map FinalReport_54kV2_ed1.map FinalReport_777k.map
FinalReport_54kV1_ed1.map:14 ARS-BFGL-BAC-10172 0 4736993
FinalReport_54kV2_ed1.map:14 ARS-BFGL-BAC-10172 0 6371334
FinalReport_777k.map:14 ARS-BFGL-BAC-10172 0 6371334

Are these merged into a single position at some point before AlphaImpute?
If yes, where is this done? If not, how does AlphaImpute handle this?
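
To see how widespread the problem is, the grep above can be generalized to every marker. This is a sketch, assuming standard four-column PLINK .map files; the function name `conflicting_positions` is made up for illustration.

```shell
# Report markers whose chromosome:position differs between PLINK .map files.
# Output: marker ID followed by FILE=CHR:POS for each file that contains it.
conflicting_positions() {
    awk '
        {
            pos = $1 ":" $4
            if (!($2 in first)) first[$2] = pos
            else if (first[$2] != pos) conflict[$2] = 1
            seen[$2] = seen[$2] " " FILENAME "=" pos
        }
        END { for (m in conflict) print m seen[m] }
    ' "$@" | sort
}

# Hypothetical usage:
#   conflicting_positions FinalReport_54kV1_ed1.map FinalReport_54kV2_ed1.map FinalReport_777k.map
```

If the positions do need harmonizing before imputation, PLINK's `--update-map` option is one way to apply a single agreed set of coordinates. Whether the pipeline already does this is the question raised here.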

Code example for extracting SNP names from matrix format is lacking

Issue by Paolo #1

Fix by Paolo:

For matrix format, replace

grep '11190319_1210' FinalReport_54kV2_feb2011_ed1.txt | cut -f 1 > illumina54k_v2_markerlist.txt

with the following (input_file MUST be in Illumina matrix format):

cut -f1 input_file | tail -n +11 > output_file
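
The fixed offset of 11 lines depends on the header always having the same length. A slightly more defensive sketch keys on the `[Data]` line instead. It assumes a tab-separated matrix file in which `[Data]` is followed by one row of sample IDs and then one row per marker; the function name `matrix_marker_names` is made up for illustration.

```shell
# Print the SNP names (first column) from an Illumina matrix-format report.
# Assumed layout: header block, a "[Data]" line, one sample-ID row,
# then one tab-separated row per marker.
matrix_marker_names() {
    awk -F'\t' '
        state == 2  { print $1 }      # marker rows: first column is the SNP name
        state == 1  { state = 2 }     # row right after [Data]: sample IDs, skipped
        /^\[Data\]/ { state = 1 }     # start of the data section
    ' "$1"
}

# Hypothetical usage:
#   matrix_marker_names input_file > output_file
```

On a file with the usual ten-line preamble this should give the same list as the `cut | tail` command above.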