geno_imputation's People

Contributors

argju, timknut, unoqualsiasi

geno_imputation's Issues

seqreport_edit.py question

Is it OK to have 'NA' for missing values in the position column of the annotation file?

The output of the script is a file with a header line (the SNP names).

To remove it: tail -n +2 in.ped > out.ped

@timknut what do you think?

Paolo

Duplicate markers in Illumina files

There are quite a few duplicate positions for markers with different names in the raw file set FinalReport_54kV2_collection_ed1.ped and FinalReport_54kV2_collection_ed1.map.

Found these:

CHR     POS     ALLELES IDS
1       59409838        1,2     ARS-USMARC-Parent-DQ404150-rs29012530 UA-IFASA-2167
1       151349514       1,3     ARS-USMARC-Parent-DQ404151-rs29019282 Hapmap35832-SCAFFOLD197372_885
2       111155237       1,3     ARS-USMARC-Parent-DQ786757-rs29019900 Hapmap36382-SCAFFOLD210095_19074
3       58040470        1,2     ARS-USMARC-Parent-DQ435443-rs29010802 Hapmap52375-rs29010802
3       116448759       1,2     ARS-USMARC-Parent-DQ839235-rs29012691 Hapmap38870-BTA-01737
4       17200594        1,3     ARS-USMARC-Parent-DQ647186-rs29014143 Hapmap58054-rs29014143
4       94176209        1,3     ARS-USMARC-Parent-DQ485413-no-rs Hapmap33892-BES6_Contig314_677
7       18454636        1,2     ARS-USMARC-Parent-DQ786758-rs29024430 Hapmap36218-SCAFFOLD41765_2717
8       88974063        1,2     ARS-USMARC-Parent-DQ837644-rs29010468 UA-IFASA-2827
8       106174871       1,3     ARS-USMARC-Parent-DQ674265-rs29011266 Hapmap36391-SCAFFOLD165033_11046
9       45729853        1,3     ARS-USMARC-Parent-DQ846689-rs29011985 UA-IFASA-1922
9       98483346        1,2     ARS-USMARC-Parent-DQ786765-rs29009858 UA-IFASA-2515
10      55611885        1,3     ARS-USMARC-Parent-DQ984827-rs29012019 Hapmap59786-rs29012019
12      80629629        1,2     ARS-USMARC-Parent-DQ832700-rs29012872 Hapmap36566-SCAFFOLD135238_3808
13      25606469        1,4     ARS-USMARC-Parent-EF034081-rs29009668 Hapmap36096-SCAFFOLD140080_30362
14      48380429        1,3     ARS-USMARC-Parent-DQ846691-rs29019814 Hapmap35881-SCAFFOLD20653_10639
15      21207529        1,3     ARS-USMARC-Parent-EF042090-no-rs Hapmap35077-BES9_Contig405_919
15      38078775        1,3     ARS-USMARC-Parent-DQ866817-no-rs Hapmap34596-BES7_Contig444_1293
15      79187295        1,2     ARS-USMARC-Parent-DQ866818-rs29011701 UA-IFASA-5162
18      1839733         1,3     ARS-USMARC-Parent-EF028073-rs29014953 Hapmap57363-rs29014953
20      676757          1,3     ARS-USMARC-Parent-DQ984828-rs29010004 Hapmap59181-rs29010004
20      17837675        1,3     ARS-USMARC-Parent-DQ888313-no-rs Hapmap34041-BES1_Contig298_838
21      65198296        1,2     ARS-USMARC-Parent-EF026085-rs29021607 Hapmap35417-SCAFFOLD255533_15525
22      56526462        1,3     ARS-USMARC-Parent-EF034082-rs29013532 Hapmap55319-rs29013532
26      8221270         1,3     ARS-USMARC-Parent-DQ990834-rs29013727 Hapmap53362-rs29013727
26      38233337        1,3     ARS-USMARC-Parent-EF034086-no-rs Hapmap35000-BES9_Contig272_944
28      35331560        1,3     ARS-USMARC-Parent-EF026086-rs29013660 Hapmap36071-SCAFFOLD106623_11509
28      44261945        1,3     ARS-USMARC-Parent-EF042091-rs29014974 Hapmap36794-SCAFFOLD186736_5402
29      28647816        1,3     ARS-USMARC-Parent-EF034080-rs29024749 Hapmap36059-SCAFFOLD50303_4748

Have you seen these, Paolo?
PLINK's --list-duplicate-vars option (https://www.cog-genomics.org/plink2/data#list_duplicate_vars) can deal with them.
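For a quick check without PLINK, a minimal Python sketch along these lines could group marker names by position. It assumes the standard four-column .map layout (chromosome, marker name, genetic distance, base-pair position); the function name and the genetic distance of 0 in the example lines are illustrative assumptions, not taken from the raw files.

```python
from collections import defaultdict


def find_duplicate_positions(map_lines):
    """Group marker names by (chromosome, position); keep groups with more than one name."""
    by_pos = defaultdict(list)
    for line in map_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, name, _cm, pos = fields[:4]
        by_pos[(chrom, int(pos))].append(name)
    return {key: names for key, names in by_pos.items() if len(names) > 1}


# Example lines; the genetic distance column (0) is assumed for illustration.
example = [
    "1 ARS-USMARC-Parent-DQ404150-rs29012530 0 59409838",
    "1 UA-IFASA-2167 0 59409838",
    "14 ARS-BFGL-BAC-10172 0 6371334",
]
duplicates = find_duplicate_positions(example)
for (chrom, pos), names in duplicates.items():
    print(chrom, pos, " ".join(names))
```

To run it on a real file, pass an open file handle instead of the example list.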

sample_id ID issue Nordic files

All the Nordic files have this issue in the metadata needed to convert sample_id into ID (which provides the link to the pedigree for the imputation).

Example:

sample_id RDCDNKM000001010000963
ID 1.057E9

Someone changed the column attributes in some editing step, so the ID is now stored in scientific notation.

This could be a big problem because we don't know what program they used, and we may lose precision if we try to convert the IDs back. Any ideas?
Paolo
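To get a feel for how much information is lost, here is a rough Python sketch. It assumes the editing program rounded to the nearest value at the printed number of digits, which we cannot confirm; the function names are hypothetical.

```python
import re
from decimal import Decimal

# Matches values such as 1.057E9 or 2e+10.
SCI_NOTATION = re.compile(r"^[+-]?\d+(\.\d+)?[eE][+-]?\d+$")


def looks_like_scientific(value):
    """True if the field is written in scientific notation."""
    return bool(SCI_NOTATION.match(value.strip()))


def candidate_range(value):
    """Integer interval that would round to `value` at its printed precision.

    Assumes round-to-nearest; the real program may have behaved differently.
    """
    d = Decimal(value)
    exponent = d.as_tuple().exponent  # power of ten of the last printed digit
    half_step = Decimal(5).scaleb(exponent - 1)
    return int(d - half_step), int(d + half_step)


low, high = candidate_range("1.057E9")
print(low, high, high - low)
```

Under that assumption, an ID printed as 1.057E9 could be any of about a million integers, so the original IDs cannot be recovered from this column alone.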

Illumina TOP alleles

When the lab exports Illumina genotypes, they usually export them in TOP-strand format.
Our pipeline expects this.

However, some of the data may be in another format.

Check

Download the annotation with the TOP alleles from snpCHIMP and compare it against the Illumina files.
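The comparison could look roughly like the Python sketch below. It does not assume any particular snpCHIMP download format: both inputs are plain dictionaries from marker name to a set of alleles, and the marker names, alleles and missing-genotype codes in the example are made up for illustration.

```python
def find_non_top_markers(observed, annotation, missing=("0", "-")):
    """Return markers whose observed alleles are not all among the annotated TOP alleles."""
    mismatches = {}
    for marker, alleles in observed.items():
        expected = annotation.get(marker)
        if expected is None:
            continue  # marker absent from the annotation; handle separately
        unexpected = set(alleles) - set(expected) - set(missing)
        if unexpected:
            mismatches[marker] = sorted(unexpected)
    return mismatches


# Hypothetical data, for illustration only.
annotation = {"snp_a": {"A", "G"}, "snp_b": {"A", "C"}}
observed = {"snp_a": {"A", "G", "0"}, "snp_b": {"T", "G"}}
result = find_non_top_markers(observed, annotation)
print(result)
```

This only flags markers whose allele codes differ from the annotation; a marker in another format that happens to use the same letters would pass unnoticed.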

Nordic_54k_2012_ed1_markerlist

The marker names in this file have some problems: someone edited them earlier.

Example:
marker original name: ARS-USMARC-Parent-AY761135-RS29003723

Nordic_54k_2012_ed1 name: ARS-USMARC-PARENT-AY761135-RS29003723

Because of this, only 34732 of the 52259 markers match. I will try to fix it.

UPDATE: I was able to recover the information.

The changes I made in Nordic_54k_2012_ed1 are:

Parent instead of PARENT
-no instead of -NO
-rs instead of -RS
_Contig instead of _CONTIG
Hapmap instead of HAPMAP

You can close the issue.

Paolo
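For reference, the five substitutions listed above can be sketched in Python as follows. This is only an illustration of the renaming, not the script that was actually used; the function name is hypothetical, and plain substring replacement is assumed to be safe for these marker names.

```python
# (upper-cased form, original form), in the order listed in the issue.
REPLACEMENTS = [
    ("PARENT", "Parent"),
    ("-NO", "-no"),
    ("-RS", "-rs"),
    ("_CONTIG", "_Contig"),
    ("HAPMAP", "Hapmap"),
]


def restore_marker_name(name):
    """Undo the upper-casing of known name fragments in a marker name."""
    for upper, original in REPLACEMENTS:
        name = name.replace(upper, original)
    return name


print(restore_marker_name("ARS-USMARC-PARENT-DQ404150-RS29012530"))
print(restore_marker_name("HAPMAP33892-BES6_CONTIG314_677"))
```

It is worth re-counting the matching markers afterwards to confirm that no name was changed by accident.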

markerlist file FinalReport_54kV1_ed1.txt

It appears that this file contains 73628 SNPs instead of the 54001 reported in its header -.-

grep '2005' FinalReport_54kV1_ed1.txt | cut -f 1 > FinalReport_54kV1_ed1_markerlist.txt

wc -l FinalReport_54kV1_ed1_markerlist.txt
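As a cross-check on the grep approach, a small Python sketch could compare the declared count with the number of distinct marker names. It assumes the usual tab-separated FinalReport layout (a "Num SNPs" line in the header block, a "[Data]" line, then one column-header row with the marker name in the first column); none of this has been verified against this particular file, and the example lines are made up.

```python
def count_snps(lines, name_column=0):
    """Return (count declared in the header, number of distinct marker names in the data)."""
    declared = None
    names = set()
    in_data = False
    column_header_seen = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("[Data]"):
            in_data = True
            continue
        if not in_data:
            if line.startswith("Num SNPs"):
                declared = int(line.split("\t")[-1])
            continue
        if not column_header_seen:
            column_header_seen = True  # skip the column-header row
            continue
        if line:
            names.add(line.split("\t")[name_column])
    return declared, len(names)


# Made-up miniature report, for illustration only.
example = [
    "[Header]",
    "Num SNPs\t3",
    "[Data]",
    "SNP Name\tSample ID",
    "snp1\tA",
    "snp2\tA",
    "snp1\tB",
    "snp2\tB",
]
declared, distinct = count_snps(example)
print(declared, distinct)
```

Counting distinct names across all samples avoids depending on one sample ID, as the grep on '2005' does.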

Different positions for same marker in different plink files

Ref commits : a1c243e and 5a3b612

The plink files @Unoqualsiasi transferred to ftpgeno seem to be based on the "Native platform" positions, so we have different positions for the same marker depending on the chip (the 50Kv1 positions are not based on UMD3.1; the 50Kv2 and 777K positions are).

A simple example:
gjuvslan@login-0:~/geno/geno_imputation/ftpgeno/Plink_Input_Files/plink_files$ grep ARS-BFGL-BAC-10172 FinalReport_54kV1_ed1.map FinalReport_54kV2_ed1.map FinalReport_777k.map
FinalReport_54kV1_ed1.map:14 ARS-BFGL-BAC-10172 0 4736993
FinalReport_54kV2_ed1.map:14 ARS-BFGL-BAC-10172 0 6371334
FinalReport_777k.map:14 ARS-BFGL-BAC-10172 0 6371334

  • Are these merged into a single position at some point before AlphaImpute?
    If yes, where is this done? If no, how does AlphaImpute handle this?
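To list every marker affected, not just this one, a Python sketch like the following could compare the .map files. It assumes the standard four-column .map layout; the function names and the short labels used for the files are illustrative.

```python
def positions_by_marker(named_map_lines):
    """Map marker name -> {label: (chromosome, position)} across several .map inputs."""
    positions = {}
    for label, lines in named_map_lines.items():
        for line in lines:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            chrom, name, _cm, pos = fields[:4]
            positions.setdefault(name, {})[label] = (chrom, int(pos))
    return positions


def conflicting_markers(positions):
    """Keep only markers that have more than one distinct (chromosome, position)."""
    return {
        name: per_file
        for name, per_file in positions.items()
        if len(set(per_file.values())) > 1
    }


# Lines taken from the grep output above; the labels are shorthand for the file names.
maps = {
    "54kV1": ["14 ARS-BFGL-BAC-10172 0 4736993"],
    "54kV2": ["14 ARS-BFGL-BAC-10172 0 6371334"],
    "777k": ["14 ARS-BFGL-BAC-10172 0 6371334"],
}
conflicts = conflicting_markers(positions_by_marker(maps))
print(conflicts)
```

With real data, each dictionary value would be an open file handle for the corresponding .map file.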
