pypop's Introduction

PyPop: Python for Population Genomics

PyPop is a framework for processing genotype and allele data and running population genetic analyses, including conformity to Hardy-Weinberg expectations; tests for balancing or directional selection; estimates of haplotype frequencies and measures and tests of significance for linkage disequilibrium (LD). Full documentation is available in the PyPop User Guide.

How to cite PyPop

If you write a paper that uses PyPop in your analysis, please cite both:

  • our 2024 article in Frontiers in Immunology:

    Lancaster AK, Single RM, Mack SJ, Sochat V, Mariani MP, Webster GD. (2024) "PyPop: A mature open-source software pipeline for population genomics." Front. Immunol. 15:1378512 doi: 10.3389/fimmu.2024.1378512

  • and the Zenodo record for the software. To cite the correct version, follow these steps:

    1. First visit the DOI for the overall Zenodo record: 10.5281/zenodo.10080667. This DOI represents all versions, and will always resolve to the latest one.

    2. When you are viewing the record, look for the Versions box in the right-sidebar. Here are listed all versions (including older versions).

    3. Select and click the version-specific DOI that matches the specific version of PyPop that you used for your analysis.

    4. Once you are viewing the Zenodo record for the specific version, go to the Citation box in the right sidebar, select the citation format you wish to use, and click to copy the citation. It will contain a link to the version-specific DOI and be of the form:

      Lancaster, AK et al. (YYYY) "PyPop: Python for Population Genomics" (Version X.Y.Z) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.XXXXX

    Note that citation metadata for the current Zenodo record is also stored in CITATION.cff

Attention!

The package name for installation purposes is pypop-genomics - to avoid conflicting with an unrelated package with the name pypop already on PyPI.

Quickstart Guide

Installing pypop-genomics

If you already have Python and pip installed, install using the following:

pip install pypop-genomics

Otherwise, follow these instructions to install Python 3 and pip.

Once pypop-genomics is installed, depending on your platform, you may also need to adjust your PATH environment variable.

Upgrading pypop-genomics

pip install -U pypop-genomics

Uninstalling pypop-genomics

pip uninstall pypop-genomics

For more, including how to handle common installation issues, see the detailed installation instructions.

Once you have installed pypop-genomics, you can move on to try some example runs.

Examples

These are examples of how to check that the program is installed and some minimal use cases.

Checking version and installation

pypop --version

This simply reports the version number and other information about PyPop, and indirectly checks that the program is installed. If all is well, you should see something like:

pypop 1.0.0
[Python 3.10.9 | Linux.x86_64-x86_64 | x86_64]
Copyright (C) 2003-2006 Regents of the University of California.
Copyright (C) 2007-2023 PyPop team.
This is free software.  There is NO warranty; not even for
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

You can also run pypop --help to see a full list and explanation of all the options available.

Run a minimal dataset:

Download the test .ini and .pop files, minimal.ini and USAFEL-UchiTelle-small.pop. You can then run them with:

pypop -c  minimal.ini USAFEL-UchiTelle-small.pop

If you have already cloned the git repository and it is your working directory, you can simply run

pypop -c  tests/data/minimal.ini tests/data/USAFEL-UchiTelle-small.pop

This will generate the following two files, an XML output file and a plain text version:

USAFEL-UchiTelle-small-out.xml
USAFEL-UchiTelle-small-out.txt

Detailed installation instructions

There are three main steps:

  1. install Python and pip
  2. install the package from PyPI
  3. adjust your PATH variable after installation

Install Python 3 and pip

A full description of installing Python and pip on your system is beyond the scope of this guide. We recommend starting here:

https://wiki.python.org/moin/BeginnersGuide/Download

Here are some additional platform-specific notes that may be helpful:

  • Most Linux distributions come with Python 3 preinstalled. On most modern systems, pip and python will default to Python 3.
  • MacOS versions from 10.9 (Mavericks) up until 12.3 (Monterey) shipped with Python 2 pre-installed, but Python now has to be installed manually. See the MacOS quick-start guide in the official documentation for how to install Python 3. (Note that if Python is installed on Mac via the MacOS developer tools, the commands may carry a version 3 suffix, e.g. python3 and pip3, so modify the commands below accordingly.)
  • For Windows, see also the Windows quick-start guide in the official documentation. Running python in the Windows command terminal in Windows 11 and later will launch the installer for the Microsoft-maintained Windows package of Python 3.

Install package from PyPI

Once you have both python and pip installed, you can use pip to install pre-compiled binary "wheels" of pypop-genomics directly from PyPI.

pip install pypop-genomics

Note

If, for whatever reason, you cannot use these binaries (e.g. the pre-compiled binaries are not available for your platform), you may need to follow the developer installation instructions in the contributors guide.

Upgrade an existing PyPop installation

To update an existing installation to a newer version, use the same command as above, but add the --upgrade (short version: -U) flag, i.e.

pip install -U pypop-genomics

Installing from Test PyPI

From time to time, we may make packages available on the Test PyPI instance, rather than through the main instance. The above installation and updating instructions can be used, by appending the following:

--extra-index-url https://test.pypi.org/simple/

to the above pip commands.

Issues with installation permission

By default, pip will attempt to install the pypop-genomics package wherever the current Python installation is installed. This location may be a user-specific virtual environment (like conda, see below), or a system-wide installation. On many Unix-based systems, Python will generally already be pre-installed in a "system-wide" location (e.g. under /usr/lib) which is read-only for regular users. (This can also be true for system-installed versions of Python on Windows and MacOS.)

When pip install cannot install in a read-only system-wide location, pip will gracefully fall back to installing just for you in your home directory (typically ~/.local/lib/python<VER>, where <VER> is the version number of your current Python). In general, this is what is wanted, so the above instructions are normally sufficient.

However, you can also explicitly set installation to be in the user directory, by adding the --user command-line option to the pip install command, i.e.:

pip install pypop-genomics --user

This may be necessary in certain cases where pip install doesn't install into the expected user directory.

Installing within a conda environment

In the special case that you are installing from within an activated user-specific conda virtual environment that provides Python, you should not add the --user option, because pip will then install the package in ~/.local/lib/ rather than under the user-specific conda virtual environment in ~/.conda/envs/.
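One way to check, before adding --user, whether the current interpreter is running inside a conda environment or a venv is sketched below. The helper name in_isolated_environment is illustrative and not part of PyPop, and relying on the CONDA_PREFIX variable (normally set when a conda environment is activated) is an assumption about how the environment was activated.

```python
import os
import sys


def in_isolated_environment(environ=None, prefix=None, base_prefix=None) -> bool:
    """Return True when running inside a conda environment or a venv.

    The arguments default to the live interpreter state; they are
    parameters only so the logic can be exercised with made-up values.
    """
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # Assumption: an activated conda environment exports CONDA_PREFIX.
    # In a venv, sys.prefix differs from sys.base_prefix.
    return bool(environ.get("CONDA_PREFIX")) or prefix != base_prefix


if __name__ == "__main__":
    if in_isolated_environment():
        print("Isolated environment detected: install without --user")
    else:
        print("No conda/venv detected: --user can be added if needed")
```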

Post-install PATH adjustments

You may need to adjust the PATH settings (especially on Windows) for the pypop scripts to be visible when run from your console application, without having to supply the full path to the pypop executable file.

Warning

Pay close attention to the "WARNINGS" that are shown during the pip installation. They will often note which directories need to be added to the PATH.

  • On Linux and MacOS systems, this is normally fairly simple: edit your shell .profile (or similar) to add $HOME/.local/bin to the PATH variable, then restart the terminal.
  • On Windows, however, as noted in most online instructions, this may require help from your system administrator if your user account doesn't have the right permissions, and may also require a system reboot.
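To see whether the directory used for per-user script installs is already on your PATH, a short Python sketch like the following may help. It assumes a per-user install (the scheme used by pip install --user); the helper names user_scripts_dir and is_on_path are illustrative and not part of PyPop.

```python
import os
import shutil
import sysconfig


def user_scripts_dir() -> str:
    """Return the directory where per-user installs place scripts."""
    # The scheme is "posix_user" on Linux/MacOS and "nt_user" on Windows.
    return sysconfig.get_path("scripts", f"{os.name}_user")


def is_on_path(directory: str, path_value: str) -> bool:
    """Check whether `directory` is one of the entries of a PATH string."""
    target = os.path.normcase(os.path.normpath(directory))
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(
        os.path.normcase(os.path.normpath(entry)) == target for entry in entries
    )


if __name__ == "__main__":
    scripts = user_scripts_dir()
    print("user scripts directory:", scripts)
    print("already on PATH:", is_on_path(scripts, os.environ.get("PATH", "")))
    # shutil.which returns None when no pypop executable is found on PATH.
    print("pypop found at:", shutil.which("pypop"))
```

If the directory is reported as missing from PATH, it is the one to add in your shell profile or Windows environment settings.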

Uninstalling PyPop

To uninstall the current version of pypop-genomics:

pip uninstall pypop-genomics

Support and development

Please submit any bug reports, feature requests or questions, via our GitHub issue tracker (see our bug reporting guidelines for more details on how to file a good bug report):

https://github.com/alexlancaster/pypop/issues

Please do not report bugs via private email to developers.

The development of the code for PyPop is via our GitHub project:

https://github.com/alexlancaster/pypop

For a detailed description on bug reporting as well as how to contribute to PyPop, please consult our CONTRIBUTING.rst guide. For reporting security vulnerabilities visit SECURITY.md.

We also have additional notes and background relevant for developers in DEV_NOTES.md. Source for the website and the documentation is located in the website subdirectory.

Copyright and License

PyPop is Copyright (C) 2003-2006. The Regents of the University of California (Regents)

Copyright (C) 2007-2023 PyPop team.

PyPop is distributed under the terms of GPLv2

pypop's People

Contributors

akkornel, alexlancaster, dependabot[bot], diogomeyer, gwebster, mmariani123, odoublewen, rms03, rsingle, vsoch, ystsai

pypop's Issues

Problem with HWE calculation

We have been using PyPopLinux 0.7.0.
I have been calculating HWE p-values for different HLA loci, but the p-value is numerically unstable when calculated repeatedly on the same data (sometimes 0.0565 and sometimes 0.0003), so I am unable to decide on deviation from HWE.
What is the reason for this changing p-value? The Guo and Thompson p-value changes while chisquareHWE remains the same.
Is there a way to reduce the variation in the HWE p-values?
With regards,
Vani
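
As a generic yardstick for how much run-to-run variation a simulation-based p-value can show, the binomial standard error of an estimated proportion can be computed as below. This is a sketch under the assumption of independent Monte Carlo samples (MCMC samples are correlated, so their error is typically larger); it is not a description of PyPop's internals, and the function name is illustrative.

```python
import math


def monte_carlo_p_standard_error(p: float, n_samples: int) -> float:
    """Binomial standard error of a p-value estimated from n_samples
    independent random samples: sqrt(p * (1 - p) / n_samples)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.sqrt(p * (1.0 - p) / n_samples)


if __name__ == "__main__":
    # More samples shrink the expected spread between repeated runs.
    for n in (1_000, 100_000):
        print(n, monte_carlo_p_standard_error(0.05, n))
```

Under that assumption, differences between runs that are many standard errors apart point to something other than ordinary sampling noise.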

increase the number of individuals and loci that emhaplofreq can process

@sjmack, in the haplo-stats branch, I've increased the limits that emhaplofreq can accept to 100k individuals and 20 loci. I have processed a data file that consisted of your 5000 line synthetic data set repeated 20 times. To test:

 git checkout haplo-stats
 ./setup.py build

You need one more Python package psutils for testing, details here: https://github.com/alexlancaster/pypop/tree/haplo-stats#running-test-suite

Then run the test on its own:

 py.test -s -v tests/test_100k_Emhaplofreq.py

You'll need to make sure you have at least 6-8 GB free of memory. I'm not exactly sure how much memory is really required, but you'll probably want as much as you can.

This will estimate the haplotype frequencies for just a single pair of loci. Without knowing how representative of workshop data the synthetic data is, it's difficult to know how well this will work in general, but it provides a potential fallback.

Locus mixup in haplotype returns in .tsv outputs

The WS_BDCtrl_Test_EM.ini config file specifies LociToEstHaplo=a:drb1:drb3, and running ./bin/pypop.py --generate-tsv -c WS_BDCtrl_Test_EM.ini BIGDAWG_SynthControl_Data.pop generates the following in the *out.txt file:

Haplotype frequency est. for loci: A:DRB1:DRB3
----------------------------------------------
Number of individuals: 1002 (before-filtering)
Number of individuals: 991 (after-filtering)
Unique phenotypes: 303
Unique genotypes: 907
Number of haplotypes: 618
Loglikelihood under linkage equilibrium [ln(L_0)]: -8090.394069
Loglikelihood obtained via the EM algorithm [ln(L_1)]: -7061.268470
Number of iterations before convergence: 267

Haplotypes sorted by frequency        
haplotype                                 frequency# copies
01:01:01:01~01:01:01~00:00                0.08110  160.7   
24:02:01:01~07:01:01:01~00:00             0.07921  157.0   
32:02~03:01:02~01:01:02:01                0.05701  113.0   
03:01:03~15:01:01:01~00:00                0.05449  108.0   
31:01:02:01~03:01:01:01~01:01:02:02       0.04338  86.0    

You can tell that the order of loci in the header matches the order of alleles in the haplotype, because there is no DRB1 or DRB3 24:02:01:01 allele, and because only DRB3, DRB4 or DRB5 should be noted as "absent" using 00:00.

However, the loci in the haplotype are in a different order in the 3-locus-haplo.dat file:

:pypop sjmack$ more 3-locus-haplo.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus   allele  allele.freq     allele.count
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     01:01:01:01~01:01:01~00:00      0.08110 160.7   
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     24:02:01:01~07:01:01:01~00:00   0.07921 157.0   
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     32:02~03:01:02~01:01:02:01      0.05701 113.0   
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     03:01:03~15:01:01:01~00:00      0.05449 108.0   
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     31:01:02:01~03:01:01:01~01:01:02:02     0.04338 86.0    

Similarly, the 3-locus-summary.dat file also has mis-ordered locus names:

:pypop sjmack$ more 3-locus-summary.dat 
n.gametes       locus1  locus2  locus3
0       ****    ****    ****    ****    ****    ****    ****    ****    1982    DRB3    A       DRB1    ****    

./bin/popmeta.py --disable-ihwg BIGDAWG_SynthControl_Data-out.xml generates the same locus-order problem but without the IHWG headers (which I now realize is why I wasn't using --generate-tsv in the pypop command line).

:pypop sjmack$ more 3-locus-haplo.dat 
popname locus   allele  allele.freq     allele.count
0       DRB3:A:DRB1     01:01:01:01~01:01:01~00:00      0.08110 160.7   
0       DRB3:A:DRB1     24:02:01:01~07:01:01:01~00:00   0.07921 157.0   
0       DRB3:A:DRB1     32:02~03:01:02~01:01:02:01      0.05701 113.0   
0       DRB3:A:DRB1     03:01:03~15:01:01:01~00:00      0.05449 108.0   
0       DRB3:A:DRB1     31:01:02:01~03:01:01:01~01:01:02:02     0.04338 86.0  
:pypop sjmack$ more 3-locus-summary.dat 
n.gametes       locus1  locus2  locus3
0       1982    DRB3    A       DRB1    ****    

(Also, it looks like there are some extra fields in the 3-locus-summary.dat file. Not sure what that is about.)

The order of the loci listed in the .tsv files should match the order of the loci in the corresponding .ini file. Someone who isn't paying close attention to the results would think that "DRB3:A:DRB1 24:02:01:01~07:01:01:01~00:00" means "DRB3*24:02:01:01~A*07:01:01:01~DRB1*00:00", and any system that automatically assembles GL strings from these data would make the same mistake.
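
The mistake described above can be reproduced with a minimal sketch of how a downstream tool might pair the locus column with the allele column. The function name haplotype_to_gl_string is hypothetical, and the simplified locus*allele form (without an HLA- prefix) follows the example in this report.

```python
def haplotype_to_gl_string(loci: str, haplotype: str) -> str:
    """Pair colon-delimited locus names with ~-delimited alleles, in order."""
    locus_names = loci.split(":")
    alleles = haplotype.split("~")
    if len(locus_names) != len(alleles):
        raise ValueError("number of loci and alleles differ")
    return "~".join(
        f"{locus}*{allele}" for locus, allele in zip(locus_names, alleles)
    )


if __name__ == "__main__":
    haplotype = "24:02:01:01~07:01:01:01~00:00"
    # Locus order as configured in the .ini file
    print(haplotype_to_gl_string("A:DRB1:DRB3", haplotype))
    # Locus order as written to the .dat file, which mislabels every allele
    print(haplotype_to_gl_string("DRB3:A:DRB1", haplotype))
```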

Test Error

Hi Alex

I have installed pypop on my mac. I have updated almost everything with Catalina. I get the following error.

________________________________________________________ test_AlleleColon_Emhaplofreq ________________________________________________________

    def test_AlleleColon_Emhaplofreq():

        exit_code = base.run_pypop_process('./tests/data/Test_Allele_Colon_Emhaplofreq.ini', './tests/data/Test_Allele_Colon_Emhaplofreq.pop')
        # check exit code
        assert exit_code == 0
        # compare with md5sum of output file
>       assert hashlib.md5(open("Test_Allele_Colon_Emhaplofreq-out.txt", 'rb').read()).hexdigest() == '598954bfe301d5dea44ccd4d443905d1'
E       AssertionError: assert '0434be553e01...190107fa48035' == '598954bfe301d...ccd4d443905d1'
E         - 0434be553e01a165994190107fa48035
E         + 598954bfe301d5dea44ccd4d443905d1

tests/test_AlleleColon.py:19: AssertionError
________________________________________________________________ test_USAFEL _________________________________________________________________

    def test_USAFEL():
        exit_code = base.run_pypop_process('./tests/data/minimal.ini', './tests/data/USAFEL-UchiTelle-small.pop')
        # check exit code
        assert exit_code == 0

        out_filename = "USAFEL-UchiTelle-small-out.txt"
        gold_out_filename = os.path.join('./tests/data/output', out_filename)
>       assert filecmp.cmp(out_filename, gold_out_filename)
E       AssertionError: assert False
E        +  where False = <function cmp at 0x10c106dd0>('USAFEL-UchiTelle-small-out.txt', './tests/data/output/USAFEL-UchiTelle-small-out.txt')
E        +    where <function cmp at 0x10c106dd0> = filecmp.cmp

tests/test_USAFEL.py:15: AssertionError
========================================= 2 failed, 22 passed, 2 skipped, 1 xfailed in 62.93 seconds =========================================

Modify PopMeta parameters so that IHWG headers have to be opted-in

Currently, popmeta.py defaults to expect IHWG header block data in the out.xml files. In order to generate compact TSV files without numerous extra fields, one must supply the --disable-ihwg argument.

At this point, it seems unlikely that many users are using the specific IHWG headers, so it would probably be useful to change the --disable-ihwg argument to --enable-ihwg, so that the default is no expectation of header data.

This would also make it possible to run pypop with the --generate-tsv argument, and generate compact TSV files, without having to use popmeta separately.
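
The proposed opt-in behaviour could look roughly like the following argparse sketch. This is not popmeta.py's actual option parser; the program name, help text and the xml_files positional are placeholders chosen for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser in which IHWG headers are off unless requested."""
    parser = argparse.ArgumentParser(prog="popmeta-sketch")
    parser.add_argument(
        "--enable-ihwg",
        action="store_true",
        default=False,
        help="expect IHWG header block data in the XML input (off by default)",
    )
    parser.add_argument("xml_files", nargs="+", help="PyPop XML output files")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("IHWG headers expected:", args.enable_ihwg)
```

With a store_true flag, leaving the option out yields the compact output by default, and only users who need the IHWG fields have to pass anything.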

Problem with Emhaplofreq LD calculations

The Wn and D' values being returned by [Emhaplofreq] for ./bin/pypop.py -c minimal.ini sample.pop (using the attached files) are >1, but they should range from 0 to 1. See below.

I know that [Emhaplofreq] is deprecated, but it seems like this indicates that something is broken in it.

Pairwise LD estimates
---------------------
Locus pair            D      D'        Wn   ln(L_1)   ln(L_0)         S # permu p-value
A:C             0.06126 1.37143   1.63299   -289.09   -246.03    -86.13       - -      
A:B             0.05896 1.37143   1.63299   -293.47   -244.64    -97.66       - -      
A:DRB1          0.05266 1.24762   2.33631   -282.00   -272.35    -19.32       - -      
A:DQA1          0.05875 1.24762   2.86138   -269.57   -274.43      9.71       - -      
A:DQB1          0.04204 1.13675   3.64692   -275.58   -336.12    121.06       - -      
A:DPA1          0.12862 2.30779   2.36391   -219.78   -178.36    -82.83       - -      
A:DPB1          0.07316 1.38957   1.98626   -237.85   -225.61    -24.49       - -      
C:B             0.07052 1.24704   1.39196   -210.37   -247.53     74.32       - -      
C:DRB1          0.05513 1.27630   2.69649   -280.34   -313.37     66.06       - -      
C:DQA1          0.08671 1.24054   2.18414   -263.23   -240.39    -45.68       - -      
C:DQB1          0.08262 1.24054   1.95355   -269.55   -239.70    -59.70       - -      
C:DPA1          0.12461 1.22826   3.01602   -224.72   -232.31     15.18       - -      
C:DPB1          0.12047 1.46364   1.33749   -242.45   -175.80   -133.31       - -      
B:DRB1          0.04831 1.21417   2.98728   -286.79   -330.85     88.12       - -      
B:DQA1          0.07547 1.21852   2.42613   -271.36   -257.87    -26.98       - -      
B:DQB1          0.07175 1.21852   2.17000   -277.30   -257.18    -40.24       - -      
B:DPA1          0.10938 1.24112   3.36962   -229.76   -250.36     41.20       - -      
B:DPB1          0.10966 1.47727   1.41639   -247.84   -182.99   -129.70       - -      
DRB1:DQA1       0.06273 1.24439   3.46731   -164.06   -323.72    319.31       - -      
DRB1:DQB1       0.05851 1.24439   3.10125   -147.74   -323.03    350.58       - -      
DRB1:DPA1       0.10072 1.27327   5.31875   -202.51   -331.99    258.96       - -      
DRB1:DPB1       0.08588 1.54545   2.17307   -231.03   -255.74     49.42       - -      
DQA1:DQB1       0.08955 1.21277   2.29619   -152.52   -250.05    195.06       - -      
DQA1:DPA1       0.14302 1.23791   3.27044   -185.25   -247.18    123.87       - -      
DQA1:DPB1       0.12423 1.49091   1.84842   -210.93   -196.13    -29.60       - -      
DQB1:DPA1       0.13263 1.23791   3.27044   -193.17   -246.49    106.63       - -      
DQB1:DPB1       0.11754 1.49091   1.65328   -222.73   -195.44    -54.58       - -      
DPA1:DPB1       0.19210 1.45478   2.29742   -165.96   -172.34     12.75       - -      

minimal.ini.txt
sample.pop.txt
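
For reference, the expected 0 to 1 range can be checked against a small independent calculation. The sketch below uses the commonly cited multi-allelic definitions (D' as the frequency-weighted average of |D_ij / D_max,ij|, and Wn as the square root of the summed D_ij^2 / (p_i q_j) divided by min(k, l) - 1); it is written from those definitions and is not taken from the Emhaplofreq source. The function name and input layout are illustrative.

```python
import math


def ld_summary(hap_freqs):
    """Return (D', Wn) for a two-locus haplotype frequency table.

    hap_freqs[i][j] is the frequency of the haplotype carrying allele i
    at the first locus and allele j at the second; entries sum to 1.
    """
    p = [sum(row) for row in hap_freqs]        # allele freqs, locus 1
    q = [sum(col) for col in zip(*hap_freqs)]  # allele freqs, locus 2
    d_prime = 0.0
    chi_sum = 0.0
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            if p_i == 0.0 or q_j == 0.0:
                continue  # allele not observed, contributes nothing
            d_ij = hap_freqs[i][j] - p_i * q_j
            if d_ij < 0:
                d_max = min(p_i * q_j, (1 - p_i) * (1 - q_j))
            else:
                d_max = min(p_i * (1 - q_j), (1 - p_i) * q_j)
            if d_max > 0:
                d_prime += p_i * q_j * abs(d_ij) / d_max
            chi_sum += d_ij ** 2 / (p_i * q_j)
    k = min(sum(1 for x in p if x > 0), sum(1 for x in q if x > 0))
    w_n = math.sqrt(chi_sum / (k - 1)) if k > 1 else 0.0
    return d_prime, w_n


if __name__ == "__main__":
    print(ld_summary([[0.5, 0.0], [0.0, 0.5]]))      # complete association
    print(ld_summary([[0.25, 0.25], [0.25, 0.25]]))  # linkage equilibrium
```

For any valid frequency table both statistics stay within [0, 1] under these definitions, which is why values such as 1.37 or 3.64 suggest inconsistent inputs or a calculation error.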

PyPop Data formats vs HLA nomenclature "formats"

We should make sure that we are on the same page when talking about "format" for loading data into PyPop. Currently, PyPop 0.7.0 supports genotype data in a tab-delimited format, and only fully supports HLA data using IPD-IMGT/HLA Database version 2.x.x alleles (or lower).

Including HLA alleles from IPD-IMGT/HLA Database version 3.x.x results in the following issues.

  1. Hardy-Weinberg tests performed on these data cause PyPop to crash. These data have to have the colons removed in order to do HW tests with PyPop.

  2. Haplotype results are delimited using colons. This makes it very difficult to read haplotype results when the input data have colons in them.

Suggested solutions:

  1. Fix the HW module so that colons in the allele names don't break it.

  2. Use the GL string phase delimiter (~) for returning haplotype results.
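
The readability problem in point 2 is easy to demonstrate: once allele names themselves contain colons, a colon-joined haplotype can no longer be split back into its alleles, whereas a ~-joined one can. The helper names below are illustrative only.

```python
def join_haplotype(alleles, delimiter):
    """Join allele names into a single haplotype string."""
    return delimiter.join(alleles)


def split_haplotype(haplotype, delimiter):
    """Split a haplotype string back into allele names."""
    return haplotype.split(delimiter)


if __name__ == "__main__":
    alleles = ["01:01:01:01", "07:01:01:01"]
    for delimiter in (":", "~"):
        joined = join_haplotype(alleles, delimiter)
        recovered = split_haplotype(joined, delimiter)
        print(repr(delimiter), joined, "round-trips:", recovered == alleles)
```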

HWE output issues: need to accommodate longer colon-delimited allele names

Running ./bin/pypop.py -c WS_BDCtrl_Test_HW.ini BIGDAWG_SynthControl_Data.pop generates the following output for the Hardy-Weinberg common genotypes test:

------------------------------------------------------------------------------
Common genotypes
:01:01+01:01:01:01           7        6.88        0.00        0.9621      
:01:01:01+02:05:01          11       11.76        0.05        0.8241      
:01:01+03:01:01:01          12       12.01        0.00        0.9975      
:01:01:01+03:01:03           9        8.95        0.00        0.9856      
:01:01+24:02:01:01          13       13.00        0.00        0.9989      
:01:01:01+26:01:01           8        7.95        0.00        0.9864      
 01:01:01:01+26:08           9        9.19        0.00        0.9488      
:01:01+29:02:01:02           9        8.95        0.00        0.9856      
:01:01+31:01:02:01           8        7.95        0.00        0.9864      
 01:01:01:01+32:02          16       16.82        0.04        0.8424     

As with Issue #18, the colon-delimited allele names are longer than the hard-coded 18 characters allowed for a +-delimited allele pair.

With digit-delimited allele names, the maximum length for a pair would have been 20 characters (but in retrospect, I think that PyPop is stripping expression variant suffixes like N and L, as those are optional), so the maximum length would have been 17 characters (01010101+01010101).

As with Issue #18, the size of the Common genotypes allele pair field needs to be increased to at least 25 characters (e.g., 104:01:01:01+06:127:01:01), and may need to be made flexible to accommodate longer allele names.
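
A flexible field could be sized from the data rather than hard-coded, as in this sketch. The function name, the row layout (pair, observed, expected) and the numeric column widths are assumptions for illustration and do not mirror PyPop's actual formatting code.

```python
def format_genotype_rows(rows, min_width=18):
    """Right-align genotype pairs in a column wide enough for the longest."""
    # Use the longest pair seen, but never less than the historical width.
    width = max([min_width] + [len(row[0]) for row in rows])
    lines = []
    for pair, observed, expected in rows:
        lines.append(f"{pair:>{width}} {observed:>8d} {expected:>11.2f}")
    return lines


if __name__ == "__main__":
    example = [
        ("01:01:01:01+26:08", 9, 9.19),
        ("104:01:01:01+06:127:01:01", 1, 0.50),  # 25 characters
    ]
    for line in format_genotype_rows(example):
        print(line)
```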

Failing to run ./setup.py test

I git-cloned PyPop but still fail to run setup.py test to check whether it is correctly installed. I am generally a Python novice, so please kindly point me in the right direction.

Popmeta 3-locus-summary.dat files have missing headers/extra columns/columns out of order

./bin/pypop.py --generate-tsv -c WS_BDCtrl_Test_EM_lite.ini BIGDAWG_SynthControl_Data.pop in which [Emhaplofreq] has LociToEstHaplo=a:drb1:drb3 generates the following 3-locus-summary.dat file:

n.gametes       locus1  locus2  locus3
0       ****    ****    ****    ****    ****    ****    ****    ****    1982    DRB3    A       DRB1    ****    
:pypop sjmack$ more WS_BDCtrl_Test_EM_lite.ini

This includes 8 extra fields in the data block, and 7 spaces in the header block between the "n.gametes" and "locus1" column names. There is also an extra field after the locus3 field.

The n.gametes field should correspond to the 1982 value, so that column name is far out of place.

When I add the DQB1_1 and DQB1_2 loci to validSampleFields and set [Emhaplofreq] to LociToEstHaplo=a:drb1:drb3:dqb1 in the .ini, the 4-locus-summary.dat file looks like this:

pop     labcode method  ethnic  collect.site    region  latit   longit  complex n.gametes       locus1  locus2  locus3  locus4
0       ****    ****    ****    ****    ****    ****    ****    ****    1982    DRB3    A       DRB1    DQB1    ****    

Here, the ihwg column names appear in the header. This makes it (relatively) clear that the data for the pop field is 0 (which corresponds to Disease in the .pop file, which is identified as the popNameDesignator in the .ini). However, this identifier (pop or popname, below) is missing in the 3-locus-summary.dat file.

When I run ./bin/popmeta.py --disable-ihwg *-out.xml the 4-locus-summary.dat file looks like this:

popname n.gametes       locus1  locus2  locus3  locus4
0       1982    DRB3    A       DRB1    DQB1    ****    

Here all of the column headers have corresponding data in the data-block, but there is an extraneous "****" after the locus4 column, with no column header.

It seems like the column headers for these files need to be reorganized.

  1. The extra **** data in the last (unnamed) column needs to be removed in both the 3-locus and 4-locus summary files.

  2. The ihwg column names need to be added to the (currently) default header for the 3-locus-summary.dat file.

  3. There are a lot of extraneous spaces in the header between the names of several columns. I'm not sure if this is a bug or an attempt to space things out. Someone should look at it though.
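
A quick way to spot rows like these is to compare each data row's field count with the header's. The sketch below assumes the .dat files are tab-separated with a single header line; the function name is illustrative.

```python
def column_mismatches(lines):
    """Return (line_number, n_fields, n_header_fields) for every data row
    whose tab-separated field count differs from the header's."""
    header = lines[0].rstrip("\n").split("\t")
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(header):
            problems.append((number, len(fields), len(header)))
    return problems


if __name__ == "__main__":
    sample = [
        "popname\tn.gametes\tlocus1\tlocus2\tlocus3\tlocus4\n",
        "0\t1982\tDRB3\tA\tDRB1\tDQB1\t****\n",  # one field too many
    ]
    print(column_mismatches(sample))
```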

Test expected output might need updating

Hi

I suspect the test expected output is in need of updating based on these failures.

============================================================================================================== FAILURES ===============================================================================================================
____________________________________________________________________________________________________ test_AlleleColon_Emhaplofreq _____________________________________________________________________________________________________

    def test_AlleleColon_Emhaplofreq():
    
        exit_code = base.run_pypop_process('./tests/data/Test_Allele_Colon_Emhaplofreq.ini', './tests/data/Test_Allele_Colon_Emhaplofreq.pop')
        # check exit code
        assert exit_code == 0
        # compare with md5sum of output file
>       assert hashlib.md5(open("Test_Allele_Colon_Emhaplofreq-out.txt", 'rb').read()).hexdigest() == '598954bfe301d5dea44ccd4d443905d1'
E       AssertionError: assert '0434be553e01...190107fa48035' == '598954bfe301d...ccd4d443905d1'
E         - 0434be553e01a165994190107fa48035
E         + 598954bfe301d5dea44ccd4d443905d1

tests/test_AlleleColon.py:19: AssertionError
_____________________________________________________________________________________________________________ test_USAFEL _____________________________________________________________________________________________________________

    def test_USAFEL():
        exit_code = base.run_pypop_process('./tests/data/minimal.ini', './tests/data/USAFEL-UchiTelle-small.pop')
        # check exit code
        assert exit_code == 0
    
        out_filename = "USAFEL-UchiTelle-small-out.txt"
        gold_out_filename = os.path.join('./tests/data/output', out_filename)
>       assert filecmp.cmp(out_filename, gold_out_filename)
E       AssertionError: assert False
E        +  where False = <function cmp at 0x1049fae60>('USAFEL-UchiTelle-small-out.txt', './tests/data/output/USAFEL-UchiTelle-small-out.txt')
E        +    where <function cmp at 0x1049fae60> = filecmp.cmp

tests/test_USAFEL.py:15: AssertionError
===================================================================================== 2 failed, 22 passed, 2 skipped, 1 xfailed in 124.70 seconds =====================================================================================

When I diff the output for that last test I get:

$ diff USAFEL-UchiTelle-small-out.txt tests/data/output/USAFEL-UchiTelle-small-out.txt
218,221c218,221
< Locus pair            D      D'        Wn   ln(L_1)   ln(L_0)         S   ALD_1_2   ALD_2_1
< A:C             0.04309 1.20192   2.43975    -36.04    -50.35     28.61         *         *
< A:B                 nan     inf       inf    -26.51    -28.42      3.82         *         *
< C:B                 nan     inf       inf    -32.10    -21.53    -21.13         *         *
---
> Locus pair            D      D'        Wn   ln(L_1)   ln(L_0)         S # permu p-value
> A:C             0.04309 1.20192   2.43975    -36.04    -50.35     28.61       - -      
> A:B                 nan     inf       inf    -26.51    -28.42      3.82       - -      
> C:B                 nan     inf       inf    -32.10    -21.53    -21.13       - -      
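
Because the failing tests compare output only by checksum or with filecmp.cmp, the failure message does not show which lines changed. A sketch of producing a readable diff with the standard difflib module follows; the function name and file labels are illustrative and not part of the PyPop test suite.

```python
import difflib


def describe_difference(actual_lines, expected_lines,
                        actual_name="actual", expected_name="expected"):
    """Return unified-diff lines between expected and actual output
    (an empty list when the two are identical)."""
    return list(
        difflib.unified_diff(
            expected_lines,
            actual_lines,
            fromfile=expected_name,
            tofile=actual_name,
            lineterm="",
        )
    )


if __name__ == "__main__":
    expected = ["Locus pair  D  # permu p-value", "A:C  0.04309  - -"]
    actual = ["Locus pair  D  ALD_1_2  ALD_2_1", "A:C  0.04309  *  *"]
    print("\n".join(describe_difference(actual, expected)))
```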

Empty popmeta TSV files are being generated for analyses not called for in the config file

./bin/pypop.py --generate-tsv -c WS_BDCtrl_Test_HW_lite.ini BIGDAWG_SynthControl_Data.pop

Generates the following .TSV files:

1-locus-allele.dat		1-locus-pairwise-fnd.dat	2-locus-summary.dat		4-locus-haplo.dat
1-locus-genotype.dat		1-locus-summary.dat		3-locus-haplo.dat		4-locus-summary.dat
1-locus-hardyweinberg.dat	2-locus-haplo.dat		3-locus-summary.dat

Even though WS_BDCtrl_Test_HW_lite.ini is calling for HW tests at a single locus:

;; comment out or change as desired
;; 1 = true, 0 = false

[General]
debug=0

[ParseGenotypeFile]
untypedAllele=****

;; designates field name that holds population name
popNameDesignator=+

;; designates field name that holds allele data
alleleDesignator=*

;; valid fields for sample data block
validSampleFields=SampleID
 +Disease
 *A_1
 *A_2
; *B_1
; *B_2
; *C_1
; *C_2
; *DPA1_1
; *DPA1_2
; *DPB1_1
; *DPB1_2
; *DQA1_1
; *DQA1_2
; *DQB1_1
; *DQB1_2
; *DRB1_1
; *DRB1_2
; *DRB3_1
; *DRB3_2
; *DRB4_1
; *DRB4_2
; *DRB5_1
; *DRB5_2
; *HLA-A_1
; *HLA-A_2
; *HLA-B_1
; *HLA-B_2
; *HLA-C_1
; *HLA-C_2
; *HLA-DPA1_1
; *HLA-DPA1_2
; *HLA-DPB1_1
; *HLA-DPB1_2
; *HLA-DQA1_1
; *HLA-DQA1_2
; *HLA-DQB1_1
; *HLA-DQB1_2
; *HLA-DRB1_1
; *HLA-DRB1_2
; *HLA-DRB3_1
; *HLA-DRB3_2


[HardyWeinberg]
lumpBelow=5

[HardyWeinbergGuoThompsonMonteCarlo]
monteCarloSteps=100000

Of the files generated, only 1-locus-allele.dat, 1-locus-summary.dat, 1-locus-genotype.dat and 1-locus-hardyweinberg.dat should be generated. The rest contain only headers.

:pypop sjmack$ more 1-locus-pairwise-fnd.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus1  locus2  metaloci        f.slatkin.fnd
:pypop sjmack$ more 2-locus-haplo.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus   allele  allele.freq     allele.count    ld.d    ld.dprime       ld.chisq        obs     obs.freq        exp
:pypop sjmack$ more 2-locus-summary.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex n.gametes       locus1  locus2  metaloci        ld.dprime       ld.wn   q.chisq q.df    lrt.pval        lrt.z
:pypop sjmack$ more 3-locus-haplo.dat
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus   allele  allele.freq     allele.count
:pypop sjmack$ more 3-locus-summary.dat
n.gametes       locus1  locus2  locus3
:pypop sjmack$ more 4-locus-haplo.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus   allele  allele.freq     allele.count
:pypop sjmack$ more 4-locus-summary.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex n.gametes       locus1  locus2  locus3  locus4
:pypop sjmack$

The other files shouldn't be generated unless the corresponding analyses are actually run.
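One quick way to confirm which output files are header-only is to count their non-blank lines. The sketch below is a hypothetical helper (`header_only_files` is not part of PyPop) and assumes all the `.dat` files sit in a single directory:

```python
from pathlib import Path


def header_only_files(directory, pattern="*.dat"):
    """Return the names of matching files with at most one non-blank line."""
    result = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            non_blank = [line for line in handle if line.strip()]
        # A file holding only its header row has a single non-blank line.
        if len(non_blank) <= 1:
            result.append(path.name)
    return result


if __name__ == "__main__":
    # Lists header-only .dat files in the current working directory.
    for name in header_only_files("."):
        print(name)
```

Run from the directory holding the output, this would list the files that could be skipped when the matching analysis is not enabled.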

PyPop crashes with about 5000 samples

Hi,
I get this error when the number of samples exceeds 1000:
***Error in '/gpfs/software/tools/python2.7/bin/python': double free or corruption (!prev): 0x00000000030202d0 ***
etc.

What's your recommendation?

Thanks

ambiguous haplotypes

Hello again,
I'm wondering whether PyPop handles ambiguity represented by a slash between types. From the results it looks like it treats them as one unit, i.e. one new type. Please advise.
thanks,

Format Chen/diff statistics in columns

originally filed by @sjmack on #19

The Chen/Diff test results currently look like this:

3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error:       0
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03:02:12+03:02:12 (18/8.818363) 0.0037** 0.0037**
05:03:01:01+03:02:12 (0/18.105788) 0.0001**** 0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****
06:02:01+05:03:01:01 (0/23.499002) 0.0000***** 0.0000*****
06:02:01+06:02:01 (27/14.854291) 0.0000***** 0.0000*****
06:03:01+04:02:01 (8/0.287425) 0.0000***** 0.0000*****
06:05:01+05:03:01:01 (0/18.105788) 0.0000***** 0.0000*****
06:05:01+06:05:01 (18/8.818363) 0.0019** 0.0019**


3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03:02:12+03:02:12 (18/8.818363) 0.0017** 0.0017**
05:03:01:01+03:02:12 (0/18.105788) 0.0000***** 0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****
06:02:01+05:03:01:01 (0/23.499002) 0.0000***** 0.0000*****
06:02:01+06:02:01 (27/14.854291) 0.0007*** 0.0007***
06:03:01+04:02:01 (8/0.287425) 0.0000***** 0.0000*****
06:05:01+05:03:01:01 (0/18.105788) 0.0000**** 0.0000****
06:05:01+06:05:01 (18/8.818363) 0.0016** 0.0016**

These Chen/diff results would be easier to read if they were aligned in columns, like the other HW results:

03:02:12+03:02:12       (18/8.818363) 0.0037**    0.0037**
05:03:01:01+03:02:12    (0/18.105788) 0.0001****  0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****

However, this would only be a "nice to have".
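A minimal sketch of the requested layout, assuming each result row is available as a tuple of already-formatted strings (the function name `format_genotype_rows` is illustrative and not taken from the PyPop code):

```python
def format_genotype_rows(rows):
    """Left-justify every field to the width of the widest entry in its column."""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return [
        " ".join(field.ljust(width) for field, width in zip(row, widths)).rstrip()
        for row in rows
    ]


rows = [
    ("03:02:12+03:02:12", "(18/8.818363)", "0.0037**", "0.0037**"),
    ("05:03:01:01+03:02:12", "(0/18.105788)", "0.0001****", "0.0000*****"),
    ("05:03:01:01+05:03:01:01", "(39/9.293663)", "0.0000*****", "0.0000*****"),
]

for line in format_genotype_rows(rows):
    print(line)
```

Because the widths are computed from the data, longer colon-delimited genotype names would not break the alignment.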

Where does the install command place the pypop folder?

Describe the bug
On Ubuntu 20.04 WSL2 on Windows, the software installs fine (no warnings)
but where does/should the following command place the pypop folder?
--user specifies the home directory but what exactly does 'pypopgen' specify?

To Reproduce
Steps to reproduce the behavior:

C:\Users\mmari>miniconda3\condabin\conda activate pypop_testing_07022023

conda create -n pypop_testing_07022023 python=3

conda activate pypop_testing_07022023

pip install --user pypopgen -f https://github.com/alexlancaster/pypop/releases/expanded_assets/v1.0.0-a23

Desktop (please complete the following information):

  • WSL2 UBUNTU 20.04 on WINDOWS 11
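
For context, `pip install --user` does not create a folder named after the package in the home directory: `pypopgen` is the name of the distribution to install, and `--user` tells pip to use the per-user site-packages directory. A minimal sketch, using only the standard library, that reports where that directory is for the active interpreter (the function name `user_install_locations` is illustrative):

```python
import site


def user_install_locations():
    """Return the per-user base and site-packages directories."""
    return {
        "user_base": site.getuserbase(),
        "user_site": site.getusersitepackages(),
    }


if __name__ == "__main__":
    for label, path in user_install_locations().items():
        print(f"{label}: {path}")
```

On Linux (including WSL) the user base is normally `~/.local`; on Windows it is normally under `%APPDATA%\Python`.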

remove need for a genotype separator in HardyWeinberg module

Currently a single-character genotype separator is used internally to code genotypes, which could clash with a character in an allele string. This restriction should be removed, possibly by introducing a new data type to hold genotype data. This is an extension of the fix for #3.
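
One possible shape for such a data type, offered as an assumption rather than a description of the current HardyWeinberg code, is to key genotypes by a tuple of allele names instead of a joined string, so that no separator character is needed. The names `genotype_key` and `count_genotypes` are hypothetical:

```python
from collections import Counter


def genotype_key(allele1, allele2):
    """Return an order-independent genotype key as a tuple of allele names."""
    return tuple(sorted((allele1, allele2)))


def count_genotypes(pairs):
    """Count genotypes from an iterable of (allele1, allele2) pairs."""
    return Counter(genotype_key(a, b) for a, b in pairs)


# Allele names may contain any character, including ':' or '+'.
counts = count_genotypes([
    ("05:03:01:01", "03:02:12"),
    ("03:02:12", "05:03:01:01"),
    ("06:02:01", "06:02:01"),
])
print(counts[("03:02:12", "05:03:01:01")])
```

Joining the tuple into a display string such as `03:02:12+05:03:01:01` would then happen only at output time.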

Problem with the reporting of very low haplotype frequencies in *-out.txt files

A haplotype frequency that is sufficiently low to require that it is recorded in scientific notation in the *-out.xml file is not properly recorded in the *-out.txt file.

For example, the attached *-out.xml file includes three haplotypes with very low frequencies:

<haplotype name="05:01:01~131:01:01"><frequency>1.4450935155304534e-05</frequency></haplotype>
<haplotype name="06:02:01~835:01:01"><frequency>7.332759277231474e-09</frequency></haplotype>
<haplotype name="06:04:01~04:01:01"><frequency>9.068719188886876e-07</frequency></haplotype>

In the corresponding *-out.txt file, these frequencies are truncated after seven decimal digits and reported without the exponent, which makes them appear to be larger than 1, as below:

05:01:01~131:01:01   1.4450935    
06:02:01~835:01:01   7.3327592         
06:04:01~04:01:01    9.0687191

Weird_Controls-out.txt
Weird_Controls-out.xml.txt

I see that I reported this earlier in 2017! It is still an issue.
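
The symptom is consistent with the string form of the number being cut to a fixed width, although that is an assumption about the cause and not something confirmed from the PyPop source. The sketch below reproduces the three values quoted above that way, and shows one exponent-preserving alternative using Python's general (`g`) format; both function names are hypothetical:

```python
def truncate_like_report(freq, width=9):
    """Mimic the symptom: keep only the first `width` characters of str(freq)."""
    return str(freq)[:width]


def format_frequency(freq, sig_digits=5):
    """Format with a fixed number of significant digits, keeping any exponent."""
    return format(freq, f".{sig_digits}g")


for value in (1.4450935155304534e-05, 7.332759277231474e-09, 9.068719188886876e-07):
    print(truncate_like_report(value), format_frequency(value))
```

With the `g` format, ordinary frequencies such as 0.02222 are still printed in fixed notation, and only very small values switch to scientific notation.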

build bug

Hello @alexlancaster, when I try to build on macOS 11.1:
./setup.py build
WARNING: The wheel package is not available.
running build
running build_py
creating build/lib.macosx-11.0-x86_64-2.7
creating build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/HardyWeinberg.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/Arlequin.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/GUIApp.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/ParseFile.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/Homozygosity.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/__init__.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/Haplo.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/DataTypes.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/Utils.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/Filter.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/RandomBinning.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/Main.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
copying PyPop/Meta.py -> build/lib.macosx-11.0-x86_64-2.7/PyPop
running build_ext
building 'Emhaplofreqmodule' extension
swigging emhaplofreq/emhaplofreq_wrap.i to emhaplofreq/emhaplofreq_wrap_wrap.c
swig -python -ISWIG -o emhaplofreq/emhaplofreq_wrap_wrap.c emhaplofreq/emhaplofreq_wrap.i
creating build/temp.macosx-11.0-x86_64-2.7
creating build/temp.macosx-11.0-x86_64-2.7/emhaplofreq
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -pipe -Os -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/opt/openssl/include -D__SWIG
_=1 -DDEBUG=0 -DEXTERNAL_MODE=1 -DXML_OUTPUT=1 -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include -I/opt/local/include -I/opt/local/include -Iemhaplofreq -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c emhaplofreq/emhaplofreq_wrap_wrap.c -o build/temp.macosx-11.0-x86_64-2.7/emhaplofreq/emhaplofreq_wrap_wrap.o
emhaplofreq/emhaplofreq_wrap_wrap.c:3255:17: error: implicit declaration of function 'main_proc' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
result = (int)main_proc(arg1,(char (*)[202][22])arg2,arg3,arg4,arg5,arg6,arg7,arg8,arg9,arg10,arg11,arg12,arg13);
^
emhaplofreq/emhaplofreq_wrap_wrap.c:3196:12: warning: variable 'arg2' is uninitialized when used here [-Wuninitialized]
free(arg2);
^~~~
emhaplofreq/emhaplofreq_wrap_wrap.c:3090:25: note: initialize the variable 'arg2' to silence this warning
char (*arg2)[202][22] ;
^
= NULL
1 warning and 1 error generated.
error: command '/usr/bin/clang' failed with exit status 1

echo $CPATH
/opt/local/include:/opt/local/include:
echo $LIBRARY_PATH
/opt/local/lib/:/opt/local/lib/:
echo $PATH
/opt/local/bin:/opt/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Applications/VMware Fusion.app/Contents/Public:/opt/X11/bin:/Library/Apple/usr/bin:/opt/local/bin:/Users/masen/miniconda3/bin:/Users/masen/miniconda3/condabin:/usr/local/mysql/bin:/usr/local/mysql/bin
which python
/opt/local/bin/python
port installed
The following ports are currently installed:
bzip2 @1.0.8_0 (active)
cython_select @0.1_1 (active)
db48 @4.8.30_4 (active)
expat @2.2.10_0 (active)
fftw-3 @3.3.8_1 (active)
gettext @0.19.8.1_2 (active)
gsl @2.6_0 (active)
icu @67.1_2 (active)
libedit 20191231-3.1_0 (active)
libffi @3.3_1 (active)
libgcc @3.0_0 (active)
libgcc10 @10.2.0_3 (active)
libiconv @1.16_1 (active)
libxml2 @2.9.10_1 (active)
libxslt @1.1.34_4 (active)
ncurses @6.2_0 (active)
nosetests_select @0.1_0 (active)
OpenBLAS @0.3.12_0+gcc10+lapack (active)
openssl @1.1.1i_0 (active)
pcre @8.44_1 (active)
pip_select @0.1_2 (active)
py-libxml2 @2.9.10_0 (active)
py-setuptools @50.3.1_0 (active)
py27-cython @0.29.13_0 (active)
py27-libxml2 @2.9.10_0 (active)
py27-libxslt @1.1.34_0 (active)
py27-nose @1.3.7_1 (active)
py27-numpy @1.16.6_2+gfortran+openblas (active)
py27-pip @20.3.3_0 (active)
py27-setuptools @44.1.1_0 (active)
py38-libxml2 @2.9.10_0 (active)
py39-setuptools @50.3.1_0 (active)
python2_select @0.0_3 (active)
python3_select @0.0_2 (active)
python27 @2.7.18_2 (active)
python38 @3.8.6_0 (active)
python39 @3.9.1_0 (active)
python_select @0.3_9 (active)
sqlite3 @3.34.0_0 (active)
swig @4.0.2_1 (active)
swig-python @4.0.2_1 (active)
swig3 @3.0.12_2 (active)
swig3-python @3.0.12_2 (active)
xz @5.2.5_0 (active)
zlib @1.2.11_0 (active)

Can you help me?
error.txt

XSLT search fix

If PyPop is installed (rather than run out of a local git checkout), it can't always find the XSLT file when it is installed under a non-standard prefix, such as ~/.local.
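
A minimal sketch of one way to search several install prefixes, including the per-user base that `~/.local` corresponds to on Linux. The subdirectory `share/PyPop`, the file name `text.xsl` and the function `find_data_file` are all assumptions made for illustration, not the actual PyPop layout or API:

```python
import os
import site
import sys


def find_data_file(filename, subdir=os.path.join("share", "PyPop"), extra_dirs=()):
    """Return the first existing path for `filename`, or None if not found."""
    candidates = list(extra_dirs)
    # System or virtual-environment prefix.
    candidates.append(os.path.join(sys.prefix, subdir))
    # Per-user prefix, used by `pip install --user`.
    candidates.append(os.path.join(site.getuserbase(), subdir))
    for directory in candidates:
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(find_data_file("text.xsl"))
```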

PyPop [Filters] not working

PyPop [Filters] functions aren't working, and are crashing PyPop.

Here are some examples of what happens when I run ./bin/pypop.py -c sm_bin_minimal.ini sample.pop using the following options in the .ini file.

When I include:

[Filters]
;filtersToApply=DigitBinning
filtersToApply=CustomBinning

;[DigitBinning]
;filterType=DigitBinning
;binningDigits=2

Using a .ini file with the above sections alone (and no CustomBinning rules) returns:

LOG: Data file has no header data block
Could not parse the CustomBinning rules.

However, adding the following to the .ini:

[CustomBinning]
A=!****/2401
 !0101/0104/0105/0122
 !0201/0209/0243/0266/0275/0283/0289/0297/02G1
 !0206/9226
 !0207/0215
 !0211/0269/0298
 !0222/0223/9204
 !0296/0298
 !0281/9224
 !0301/0320/0321/0326/03G1
 !1101/1121
 !2301/2307
 !2402/2409/2411/2440/2476/2479/24G1
 !2403/2433
 !2408/2412
 !2601/2624/2626
 !3004/3005
 !3101/3114
 !3108/2416
 !3303/3302
 !6801/6811/6833
 !7401/7402

returns:

LOG: Data file has no header data block
Traceback (most recent call last):
  File "./bin/pypop.py", line 336, in <module>
    testMode=testMode)
  File "/Applications/PyPop/pypop/bin/../PyPop/Main.py", line 368, in __init__
    self._runFilters()
  File "/Applications/PyPop/pypop/bin/../PyPop/Main.py", line 541, in _runFilters
    self.matrixHistory.append(filter.doCustomBinning((self.matrixHistory[-1]).copy()))
  File "/Applications/PyPop/pypop/bin/../PyPop/Filter.py", line 945, in doCustomBinning
    if len(individ[i].split("/")) > 1:
AttributeError: 'numpy.ndarray' object has no attribute 'split'

Changing the pertinent section of the .ini to:

[Filters]
filtersToApply=DigitBinning
;filtersToApply=CustomBinning

[DigitBinning]
filterType=DigitBinning
binningDigits=2

results in:

LOG: Data file has no header data block
Traceback (most recent call last):
  File "./bin/pypop.py", line 336, in <module>
    testMode=testMode)
  File "/Applications/PyPop/pypop/bin/../PyPop/Main.py", line 368, in __init__
    self._runFilters()
  File "/Applications/PyPop/pypop/bin/../PyPop/Main.py", line 525, in _runFilters
    self.matrixHistory.append(filter.doDigitBinning((self.matrixHistory[-1]).copy()))
  File "/Applications/PyPop/pypop/bin/../PyPop/Filter.py", line 921, in doDigitBinning
    if allele[i] != self.untypedAllele and len(allele[i]) > self.binningDigits:
TypeError: len() of unsized object

I haven't checked all of the [Filters] options, but I suspect that a general repair is needed.
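
Both tracebacks fail on string operations (`split`, `len`) applied to something that is not a plain string, which suggests the filters expect allele names as strings. As a point of reference, here is a minimal sketch of the DigitBinning check from the traceback applied to plain Python strings. It assumes that the filter truncates a typed allele to its first `binningDigits` characters; the function `digit_bin` is hypothetical and not the PyPop implementation:

```python
def digit_bin(allele, binning_digits, untyped_allele="****"):
    """Truncate a typed allele name to its first `binning_digits` characters."""
    if allele != untyped_allele and len(allele) > binning_digits:
        return allele[:binning_digits]
    return allele


alleles = ["0101", "0201", "****", "02"]
print([digit_bin(allele, 2) for allele in alleles])
```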

PyPop install on CentOS 7 fails, looking for pytest-runner

Greetings,

CentOS 7 box, Python 2.7.5.
I cloned the Git repo to my PC, then transferred it to the CentOS box.
I successfully installed the dependencies as specified in README.md.
However, the PyPop install failed because it needs pytest-runner.
Is pytest-runner the only additional package PyPop needs, or will the install try to fetch other dependencies that are not listed in README.md?

[root@head pypop]# python setup.py build
Download error on https://pypi.python.org/simple/pytest-runner/: [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol -- Some packages may not be found!
Couldn't find index page for 'pytest-runner' (maybe misspelled?)
Download error on https://pypi.python.org/simple/: [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol -- Some packages may not be found!
No local packages or download links found for pytest-runner
Traceback (most recent call last):
File "setup.py", line 227, in <module>
tests_require=['pytest', 'psutil'],
File "/usr/lib64/python2.7/distutils/core.py", line 112, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/site-packages/setuptools/dist.py", line 265, in __init__
self.fetch_build_eggs(attrs.pop('setup_requires'))
File "/usr/lib/python2.7/site-packages/setuptools/dist.py", line 289, in fetch_build_eggs
parse_requirements(requires), installer=self.fetch_build_egg
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 618, in resolve
dist = best[req.key] = env.best_match(req, self, installer)
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 862, in best_match
return self.obtain(req, installer) # try and download/install
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 874, in obtain
return installer(requirement)
File "/usr/lib/python2.7/site-packages/setuptools/dist.py", line 339, in fetch_build_egg
return cmd.easy_install(req)
File "/usr/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 617, in easy_install
raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pytest-runner')
[root@head pypop]#

I then tried to install pytest-runner manually and got this error:

[root@head pytest-runner-2.11.1]# python setup.py build
Download error on https://pypi.python.org/simple/setuptools_scm/: [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol -- Some packages may not be found!
Download error on https://pypi.python.org/simple/setuptools-scm/: [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol -- Some packages may not be found!
Couldn't find index page for 'setuptools_scm' (maybe misspelled?)
Download error on https://pypi.python.org/simple/: [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol -- Some packages may not be found!
No local packages or download links found for setuptools-scm>=1.15.0
Traceback (most recent call last):
File "setup.py", line 49, in <module>
setuptools.setup(**params)
File "/usr/lib64/python2.7/distutils/core.py", line 112, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/site-packages/setuptools/dist.py", line 265, in __init__
self.fetch_build_eggs(attrs.pop('setup_requires'))
File "/usr/lib/python2.7/site-packages/setuptools/dist.py", line 289, in fetch_build_eggs
parse_requirements(requires), installer=self.fetch_build_egg
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 618, in resolve
dist = best[req.key] = env.best_match(req, self, installer)
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 862, in best_match
return self.obtain(req, installer) # try and download/install
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 874, in obtain
return installer(requirement)
File "/usr/lib/python2.7/site-packages/setuptools/dist.py", line 339, in fetch_build_egg
return cmd.easy_install(req)
File "/usr/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 617, in easy_install
raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('setuptools-scm>=1.15.0')

Well... sorry. I did a bit more research and installed my other dependencies using pip.
Issue resolved.
Signed, novice Python user.

[Haplostats] EM is not ignoring missing data

Using the same .ini and .pop files as in Issue #40, it looks like [Haplostats] is including missing data (****) in its haplotype estimation.

Here is an example of [Emhaplofreq] results:

Haplotype frequency est. for loci: A:B:DRB1
-------------------------------------------
Number of individuals: 47 (before-filtering)
Number of individuals: 45 (after-filtering)
Unique phenotypes: 45
Unique genotypes: 113
Number of haplotypes: 188
Loglikelihood under linkage equilibrium [ln(L_0)]: -359.216808
Loglikelihood obtained via the EM algorithm [ln(L_1)]: -340.336732
Number of iterations before convergence: 44

Haplotypes sorted by name             | Haplotypes sorted by frequency        
haplotype            frequency# copies| haplotype            frequency# copies
0101~1301~0402       0.02222  2.0     | 0201~1401~0402       0.04444  4.0     
0101~1301~1101       0.01111  1.0     | 03012~1401~0407      0.03333  3.0     
0101~1401~0901       0.01111  1.0     | 03012~1301~0402      0.03333  3.0     
0101~1520~1602       0.01111  1.0     | 0201~1401~0802       0.03333  3.0     
0101~18012~0407      0.01111  1.0     | 3204~1401~0802       0.02222  2.0     
0101~39021~0404      0.01111  1.0     | 0201~1401~1101       0.02222  2.0     
0101~39021~1602      0.01111  1.0     | 0101~4005~0802       0.02222  2.0     
0101~4005~0802       0.02222  2.0     | 03012~39021~0402     0.02222  2.0     
0101~67011~0802      0.01111  1.0     | 0201~1301~1602       0.02222  2.0     
0101~8101~1602       0.01111  1.0     | 0218~1401~0404       0.02222  2.0     
0201~1301~1602       0.02222  2.0     | 0210~51013~1602      0.02222  2.0     
0201~1401~0402       0.04444  4.0     | 0218~1401~0802       0.02222  2.0     
0201~1401~0404       0.01111  1.0     | 0101~1301~0402       0.02222  2.0     
0201~1401~0407       0.01111  1.0     | 2501~4005~0802       0.02222  2.0     
0201~1401~0802       0.03333  3.0     | 2501~1301~0802       0.02222  2.0     
0201~1401~0901       0.01111  1.0     | 2501~39021~0402      0.02222  2.0     
0201~1401~1101       0.02222  2.0     | 3204~39021~0802      0.02222  2.0     
0201~18012~0407      0.01111  1.0     | 03012~1520~0802      0.02222  2.0     
0201~39021~0901      0.01111  1.0     | 0201~1401~0901       0.01111  1.0     
0201~4005~0402       0.01111  1.0     | 0101~39021~1602      0.01111  1.0     
0201~51013~0802      0.01111  1.0     | 0210~1301~0407       0.01111  1.0     
0201~5703~1101       0.01111  1.0     | 0218~35091~0901      0.01111  1.0     
0210~1301~0402       0.01111  1.0     | 0210~78021~0901      0.01111  1.0     
0210~1301~0407       0.01111  1.0     | 3204~1301~0901       0.01111  1.0     
0210~1401~0404       0.01111  1.0     | 3204~51013~1101      0.01111  1.0     
0210~35091~0402      0.01111  1.0     | 2501~1401~0404       0.01111  1.0     
0210~39021~0802      0.01111  1.0     | 6814~1520~0407       0.01111  1.0     
0210~4005~0802       0.01111  1.0     | 0201~18012~0407      0.01111  1.0     
0210~51013~1602      0.02222  2.0     | 0201~4005~0402       0.01111  1.0     
0210~78021~0901      0.01111  1.0     | 0101~8101~1602       0.01111  1.0     
0218~1301~0402       0.01111  1.0     | 0210~1301~0402       0.01111  1.0     
0218~1401~0402       0.01111  1.0     | 6901~1301~0402       0.01111  1.0     
0218~1401~0404       0.02222  2.0     | 0210~4005~0802       0.01111  1.0     
0218~1401~0407       0.01111  1.0     | 03012~4005~0802      0.01111  1.0     
0218~1401~0802       0.02222  2.0     | 0101~1401~0901       0.01111  1.0     
0218~35091~0901      0.01111  1.0     | 0101~1301~1101       0.01111  1.0     
0218~4005~1101       0.01111  1.0     | 0101~1520~1602       0.01111  1.0     
03012~1301~0402      0.03333  3.0     | 0218~1301~0402       0.01111  1.0     
03012~1401~0407      0.03333  3.0     | 0201~51013~0802      0.01111  1.0     
03012~1401~0802      0.01111  1.0     | 0201~39021~0901      0.01111  1.0     
03012~1520~0802      0.02222  2.0     | 2501~1520~0404       0.01111  1.0     
03012~39021~0402     0.02222  2.0     | 6814~5703~1602       0.01111  1.0     
03012~39021~0802     0.01111  1.0     | 0201~1401~0404       0.01111  1.0     
03012~4005~0802      0.01111  1.0     | 0201~1401~0407       0.01111  1.0     
03012~51013~0901     0.01111  1.0     | 3204~1301~1602       0.01111  1.0     
2501~1301~0802       0.02222  2.0     | 0218~1401~0407       0.01111  1.0     
2501~1401~0404       0.01111  1.0     | 0218~4005~1101       0.01111  1.0     
2501~1520~0404       0.01111  1.0     | 03012~1401~0802      0.01111  1.0     
2501~39021~0402      0.02222  2.0     | 03012~39021~0802     0.01111  1.0     
2501~4005~0802       0.02222  2.0     | 0101~39021~0404      0.01111  1.0     
2501~51013~0901      0.01111  1.0     | 0210~1401~0404       0.01111  1.0     
2501~5703~0802       0.01111  1.0     | 6814~4005~0407       0.01111  1.0     
2501~8101~0802       0.01111  1.0     | 0101~18012~0407      0.01111  1.0     
3204~1301~0901       0.01111  1.0     | 6901~1520~0402       0.01111  1.0     
3204~1301~1602       0.01111  1.0     | 0210~35091~0402      0.01111  1.0     
3204~1401~0802       0.02222  2.0     | 0210~39021~0802      0.01111  1.0     
3204~1520~0802       0.01111  1.0     | 03012~51013~0901     0.01111  1.0     
3204~39021~0802      0.02222  2.0     | 6901~67011~1602      0.01111  1.0     
3204~51013~1101      0.01111  1.0     | 0218~1401~0402       0.01111  1.0     
6814~1520~0407       0.01111  1.0     | 0101~67011~0802      0.01111  1.0     
6814~4005~0407       0.01111  1.0     | 2501~8101~0802       0.01111  1.0     
6814~5703~1602       0.01111  1.0     | 7403~39021~0407      0.01111  1.0     
6901~1301~0402       0.01111  1.0     | 6901~1301~0404       0.01111  1.0     
6901~1301~0404       0.01111  1.0     | 2501~5703~0802       0.01111  1.0     
6901~1520~0402       0.01111  1.0     | 2501~51013~0901      0.01111  1.0     
6901~67011~1602      0.01111  1.0     | 0201~5703~1101       0.01111  1.0     
7403~39021~0407      0.01111  1.0     | 3204~1520~0802       0.01111  1.0     


And here is an example of [Haplostats] results:

II. Multi-locus Analyses [haplo-stats]
======================================

Haplotype / linkage disequilibrium (LD) statistics
__________________________________________________

Pairwise LD estimates
---------------------
Locus pair            D      D'        Wn   ln(L_1)   ln(L_0)         S # permu p-value
A:B                   *0.516108  0.390722   -314.17                 NaN       - -      


Haplotype frequency est. for loci: A:B
--------------------------------------
Unique genotypes: 50
Number of haplotypes: 52
Loglikelihood under linkage equilibrium [ln(L_0)]: 
Loglikelihood obtained via the EM algorithm [ln(L_1)]: -314.173
Number of iterations before convergence: 

Haplotypes sorted by name             | Haplotypes sorted by frequency        
haplotype            frequency# copies| haplotype            frequency# copies
****~1301            0.0106382        | 0201~1401            0.1271851        
****~1401            0.0106382        | 0218~1401            0.0638297        
****~18012           0.0106382        | 03012~1401           0.0319148        
****~78021           0.0106382        | 0101~1301            0.0319148        
0101~1301            0.0319148        | 0210~1301            0.0319148        
0101~1401            0.0111127        | 2501~1301            0.0319148        
0101~1520            0.0212765        | 03012~1520           0.0319148        
0101~35091           0.0106382        | 03012~51013          0.0319148        
0101~39021           0.0208021        | 2501~39021           0.0319146        
0101~51013           0.0106382        | 3204~39021           0.0212768        
0101~8101            0.0212765        | 0101~1520            0.0212765        
0201~1301            0.0212765        | 0101~8101            0.0212765        
0201~1401            0.1271851        | 0201~1301            0.0212765        
0201~18012           0.0106382        | 0201~4005            0.0212765        
0201~39021           0.0111127        | 0210~4005            0.0212765        
0201~4005            0.0212765        | 0218~4005            0.0212765        
0201~5703            0.0106382        | 2501~51013           0.0212765        
0210~1301            0.0319148        | 03012~1301           0.0212765        
0210~1401            0.0212765        | 03012~39021          0.0212765        
0210~35091           0.0106382        | 3204~1401            0.0212765        
0210~39021           0.0106382        | 6901~1301            0.0212765        
0210~4005            0.0212765        | 0210~1401            0.0212765        
0210~51013           8.1821857        | 0101~39021           0.0208021        
0218~1301            0.0106382        | 0101~1401            0.0111127        
0218~1401            0.0638297        | 0201~39021           0.0111127        
0218~4005            0.0212765        | 0101~35091           0.0106382        
03012~1301           0.0212765        | 0101~51013           0.0106382        
03012~1401           0.0319148        | 0201~5703            0.0106382        
03012~1520           0.0319148        | 0201~18012           0.0106382        
03012~39021          0.0212765        | 0210~35091           0.0106382        
03012~4005           0.0106382        | 0210~39021           0.0106382        
03012~51013          0.0319148        | 0218~1301            0.0106382        
2501~1301            0.0319148        | 2501~1401            0.0106382        
2501~1401            0.0106382        | 2501~5703            0.0106382        
2501~39021           0.0319146        | 2501~67011           0.0106382        
2501~4005            2.9233526        | 03012~4005           0.0106382        
2501~51013           0.0212765        | 3204~1301            0.0106382        
2501~5703            0.0106382        | 3204~1520            0.0106382        
2501~67011           0.0106382        | 3204~78021           0.0106382        
3204~1301            0.0106382        | 6814~1520            0.0106382        
3204~1401            0.0212765        | 6814~4005            0.0106382        
3204~1520            0.0106382        | 6814~5703            0.0106382        
3204~39021           0.0212768        | 6901~18012           0.0106382        
3204~4005            0.0106380        | 6901~67011           0.0106382        
3204~78021           0.0106382        | 7403~39021           0.0106382        
6814~1520            0.0106382        | ****~1301            0.0106382        
6814~4005            0.0106382        | ****~1401            0.0106382        
6814~5703            0.0106382        | ****~18012           0.0106382        
6901~1301            0.0212765        | ****~78021           0.0106382        
6901~18012           0.0106382        | 3204~4005            0.0106380        
6901~67011           0.0106382        | 2501~4005            2.9233526        
7403~39021           0.0106382        | 0210~51013           8.1821857        

input files format

Hi,
It's not clear to me exactly what format the software needs. I followed the format given in the examples, but I get this error when I try to run:
./bin/pypop.py -c /Users/afadda/Desktop/QG128.ini /Users/afadda/Desktop/QG128.pop
Traceback (most recent call last):
  File "./bin/pypop.py", line 316, in <module>
    config = getConfigInstance(configFilename, altpath, usage_message)
  File "/Applications/pypop/bin/../PyPop/Main.py", line 64, in getConfigInstance
    config.read(configFilename)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ConfigParser.py", line 305, in read
    self._read(fp, filename)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ConfigParser.py", line 546, in _read
    raise e
ConfigParser.ParsingError: File contains parsing errors: /Users/afadda/Desktop/QG128.ini
[line 15]: '*A_1\n'
[line 16]: '*A_2\n'
[line 17]: '*B_1\n'
[line 18]: '*B_2\n'
[line 19]: '*C_1\n'
[line 20]: '*C_2\n'
[line 21]: '*DQA1_1\n'
[line 22]: '*DQA1_2\n'
[line 23]: '*DQB1_1\n'
[line 24]: '*DQB1_2\n'
[line 25]: '*DRB1_1\n'
[line 26]: '*DRB1_2\n'
[line 27]: '*DPA1_1\n'
[line 28]: '*DPA1_2\n'
[line 29]: '*DPB1_1\n'
[line 30]: '*DPB1_2\n'
[line 31]: '*DRB3_1\n'
[line 32]: '*DRB3_2\n'
[line 33]: '*DRB4_1\n'
[line 34]: '*DRB4_2\n'

My .ini file looks like this:
;; comment out or change as desired
;; 1 = true, 0 = false
;; see config.ini in main distribution for detailed explanation of options

[General]
debug=0
outFilePrefixType=filename

[ParseGenotypeFile]
alleleDesignator=*
untypedAllele=****
;;fieldPairDesignator=_1:_2

validSampleFields=+id
*A_1
*A_2
*B_1
*B_2
*C_1
*C_2
*DQA1_1
*DQA1_2
*DQB1_1
*DQB1_2
*DRB1_1
*DRB1_2
*DPA1_1
*DPA1_2
*DPB1_1
*DPB1_2
*DRB3_1
*DRB3_2
*DRB4_1
*DRB4_2

[HardyWeinberg]
lumpBelow=5

[HardyWeinbergGuoThompson]
dememorizationSteps=2000
samplingNum=1000
samplingSize=1000

[HomozygosityEWSlatkinExact]
numReplicates=10000

[Emhaplofreq]
allPairwiseLD=1
allPairwiseLDWithPermu=0

thanks,
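
A likely cause, judging by the line numbers in the error, is that the field names following `validSampleFields=` start in the first column. In INI files read by Python's config parser, continuation lines of a multi-line value must begin with whitespace. The sketch below illustrates this with Python 3's `configparser` (the error above comes from the Python 2.7 `ConfigParser` module, which is assumed here to behave the same way for this case); the helper `parse_fields` is hypothetical:

```python
import configparser

INDENTED = """[ParseGenotypeFile]
validSampleFields=+id
 *A_1
 *A_2
"""

UNINDENTED = """[ParseGenotypeFile]
validSampleFields=+id
*A_1
*A_2
"""


def parse_fields(text):
    """Parse an INI string and return validSampleFields as a list of names."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser.get("ParseGenotypeFile", "validSampleFields").split("\n")


print(parse_fields(INDENTED))

try:
    parse_fields(UNINDENTED)
except configparser.ParsingError as error:
    print("parsing failed:", type(error).__name__)
```

Adding a single leading space before each `*A_1`-style line, as in the sample config.ini shipped with the distribution, should let the file parse.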

Need space to accommodate longer colon-delimited allele names in Single Locus Summary outputs

When colon-delimited four-field allele names are analyzed, they are being truncated in the "Allele Counts" sections of the *out.txt files.

It doesn't matter what .ini file is run, but for example, ./bin/pypop.py -c WS_BDCtrl_Test_EM.ini BIGDAWG_SynthControl_Data.pop will return this data for HLA-A:

1.1. Allele Counts [A]
----------------------
Untyped individuals: 0.0
Sample Size (n): 1002.0
Allele Count (2n): 2004
Distinct alleles (k): 22

Counts ordered by frequency   | Counts ordered by name        
Name      Frequency (Count)   | Name      Frequency (Count)   
32:02     0.10130   203       | 01:01:01:00.08283   166       
68:06     0.08433   169       | 02:01:01:00.02894   58        
01:01:01:00.08283   166       | 02:05:01  0.07086   142       
24:02:01:00.07834   157       | 03:01:01:00.07236   145       
03:01:01:00.07236   145       | 03:01:03  0.05389   108       
02:05:01  0.07086   142       | 11:01:01:00.02944   59        
26:08     0.05539   111       | 11:01:01:00.02944   59        
03:01:03  0.05389   108       | 23:01:01  0.02246   45        
29:02:01:00.05389   108       | 24:02:01:00.07834   157       
26:01:01  0.04790   96        | 24:02:04  0.02894   58        
31:01:02:00.04790   96        | 25:01:01  0.01896   38        
68:01:01  0.03992   80        | 26:01:01  0.04790   96        
11:01:01:00.02944   59        | 26:08     0.05539   111       
11:01:01:00.02944   59        | 29:01:01:00.02894   58        
02:01:01:00.02894   58        | 29:02:01:00.05389   108       
24:02:04  0.02894   58        | 31:01:02:00.04790   96        
29:01:01:00.02894   58        | 31:01:02:00.00449   9         
23:01:01  0.02246   45        | 32:01:01  0.01497   30        
25:01:01  0.01896   38        | 32:02     0.10130   203       
32:01:01  0.01497   30        | 33:01:01  0.00449   9         
31:01:02:00.00449   9         | 68:01:01  0.03992   80        
33:01:01  0.00449   9         | 68:06     0.08433   169       
Total     1.00000   2004      | Total     1.00000   2004     

It looks like 9 characters are allowed for an allele name, which was fine when allele-names were digit delimited, as the maximum allele name length was 9 characters. However, with colon-delimited names, any allele name is at least three characters longer (01010201N becomes 01:01:02:01N). In addition, colon-delimiting means that each field can have any number of characters.

As of the current IPD-IMGT/HLA Database release (3.29.0), the longest allele name is still only 12 characters long (e.g., 104:01:01:01 and 06:127:01:01), but if an expression variant version of an allele with a name like these is found, the longest allele name would be 13 characters long.

So at least 14 characters should be reserved for the allele-name column in this section, and it may be better still to size that column dynamically, based on the longest allele name reported for a given locus in a given PyPop run.
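To illustrate the dynamic-width idea, here is a minimal sketch that sizes the name column from the data instead of hard-coding 9 characters. The function name `format_allele_table` and the `min_width` parameter are hypothetical and are not taken from PyPop's code; the three rows reuse values from the table above, with the full name 01:01:01:01 taken from the haplotype listings elsewhere on this page.

```python
def format_allele_table(rows, min_width=9):
    """Format (allele, frequency, count) rows so that the name column
    is always wide enough for the longest allele name."""
    # Hypothetical helper: widen the column to fit the longest name seen.
    width = max([min_width] + [len(name) for name, _, _ in rows])
    lines = []
    for name, freq, count in rows:
        lines.append(f"{name:<{width}} {freq:.5f}   {count:<10d}")
    return lines


rows = [
    ("01:01:01:01", 0.08283, 166),
    ("68:06", 0.08433, 169),
    ("02:05:01", 0.07086, 142),
]
table = format_allele_table(rows)
print("\n".join(table))
```

With this approach the name and frequency columns no longer run together when a four-field name is printed.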

Need proper documentation for Popmeta

Popmeta is currently undocumented in the PyPop docs (as of PyPop version 0.5.1). Proper documentation for Popmeta would be a useful addition to the PyPop docs (or possibly a separate document). I can contribute to the Popmeta docs.

Error getting PyPop to run on a Mac

I've followed the install instructions, but I am getting the following error with Numeric when trying to run pypop on a Mac running MacOS 10.01.5:

:pypop sjmack$ ./pypop.py
Traceback (most recent call last):
  File "./pypop.py", line 136, in <module>
    from Main import getUserFilenameInput, checkXSLFile
  File "/Applications/PyPop/pypop/Main.py", line 41, in <module>
    from ParseFile import ParseGenotypeFile, ParseAlleleCountFile
  File "/Applications/PyPop/pypop/ParseFile.py", line 44, in <module>
    from Utils import getStreamType, StringMatrix, OrderedDict, TextOutputStream
  File "/Applications/PyPop/pypop/Utils.py", line 43, in <module>
    import Numeric
ImportError: No module named Numeric
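The traceback shows that the failure is simply that the interpreter cannot find a module called Numeric. A quick way to confirm which modules are missing before launching the script is sketched below. The helper name `missing_modules` is hypothetical, and only Numeric is known from the traceback to be required; any other names passed in would be assumptions.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "Numeric" is the only dependency named in the traceback above.
print(missing_modules(["Numeric"]))
```

An empty list would mean the import should succeed with the interpreter that ran the check, so it is worth running it with the same Python that runs pypop.py.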

clarify `pip` installation instructions and use of `--user`

Describe the bug

Should files be installing in appdata (see warnings below)?
Should the install create a directory called pypopgen in my home folder on Windows 11? It does not appear to.

To Reproduce

miniconda3\condabin\conda create -n pypop_testing_07022023 python=3

C:\Users\mmari>miniconda3\condabin\conda activate pypop_testing_07022023

pip install --user pypopgen -f https://github.com/alexlancaster/pypop/releases/expanded_assets/v1.0.0-a23

Looking in links: https://github.com/alexlancaster/pypop/releases/expanded_assets/v1.0.0-a23
Collecting pypopgen
Downloading https://github.com/alexlancaster/pypop/releases/download/v1.0.0-a23/pypopgen-1.0.0a23-cp311-cp311-win_amd64.whl (282 kB)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 282.2/282.2 kB 1.2 MB/s eta 0:00:00
Collecting numpy (from pypopgen)
Downloading numpy-1.25.0-cp311-cp311-win_amd64.whl (15.0 MB)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 15.0/15.0 MB 34.4 MB/s eta 0:00:00
Collecting lxml (from pypopgen)
Downloading lxml-4.9.2-cp311-cp311-win_amd64.whl (3.8 MB)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 3.8/3.8 MB 48.1 MB/s eta 0:00:00
Collecting psutil (from pypopgen)
Downloading psutil-5.9.5-cp36-abi3-win_amd64.whl (255 kB)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 255.1/255.1 kB ? eta 0:00:00
Installing collected packages: psutil, numpy, lxml, pypopgen
WARNING: The script f2py.exe is installed in 'C:\Users\mmari\AppData\Roaming\Python\Python311\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts popmeta.exe, pypop-interactive.exe and pypop.exe are installed in 'C:\Users\mmari\AppData\Roaming\Python\Python311\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed lxml-4.9.2 numpy-1.25.0 psutil-5.9.5 pypopgen-1.0.0a23

Expected behavior
Not sure

Desktop (please complete the following information):
Windows 11
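For context on the warnings above: with `--user`, pip installs into the per-user site directory rather than into the active environment, which is why the scripts ended up under AppData\Roaming. The sketch below prints where those locations are for the running interpreter and whether the installed scripts can be found on PATH. The function name `describe_install_locations` is hypothetical; the script names come from the pip warning above.

```python
import shutil
import site
import sys


def describe_install_locations(script_names=("pypop", "popmeta")):
    """Collect the directories relevant to a `pip install --user` and
    check whether the named scripts are reachable on PATH."""
    return {
        "environment_prefix": sys.prefix,          # the active (e.g. conda) environment
        "user_base": site.getuserbase(),           # root of the per-user install
        "user_site": site.getusersitepackages(),   # where --user puts packages
        "scripts_on_path": {name: shutil.which(name) for name in script_names},
    }


info = describe_install_locations()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `scripts_on_path` reports `None` for `pypop`, the script directory named in the warning is not on PATH. Installing without `--user` inside the activated environment is one likely way to avoid this, though that is not confirmed in this thread.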

allele drop out - question

Marcelo asked:
Do you have a standard way to represent a locus that failed to type ("dropout") in the new version of the software? Would just leaving it empty work? Every HLA typing software has a different way of reporting dropout. MiaFora software reports it as "Insufficient data".
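One possible pre-processing approach, which is not confirmed anywhere in this thread, is to map vendor-specific dropout strings onto the untyped-allele code before the data reach PyPop. The sketch below assumes the code is `****`, as in the `untypedAllele=****` setting of the example .ini quoted elsewhere on this page. The `DROPOUT_MARKERS` set and the function name are hypothetical.

```python
UNTYPED = "****"  # assumed to match untypedAllele=**** in the .ini file

# Hypothetical list of dropout spellings; extend it for each typing software.
DROPOUT_MARKERS = {"", "insufficient data"}


def normalize_allele(value, untyped=UNTYPED):
    """Replace an empty or vendor-specific dropout value with the untyped-allele code."""
    cleaned = value.strip()
    return untyped if cleaned.lower() in DROPOUT_MARKERS else cleaned


print(normalize_allele("Insufficient data"))
print(normalize_allele("01:01:01:01"))
```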

Place dat files in specified directory

Is it possible to relocate the .dat files to the folder where the .ini and .pop files are located? For instance, here is the command I ran:

./akkornel-pypop-using_docker-rev1988.img pypop --generate-tsv -c sample/sample.ini -o sample sample/sample.pop

Once the pypop run is completed, the sample-out.txt and sample-out.xml files are located in the directory 'sample' (i.e., /home/lcreary/sample), but the .dat files are located in /home/lcreary. I also tried the following command, and the .dat files still end up in /home/lcreary:

./akkornel-pypop-using_docker-rev1988.img pypop --generate-tsv -o sample -c sample/sample.ini -o sample sample/sample.pop
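Until the .dat files honour the output directory, a post-run workaround is to move them by hand. The following is a minimal sketch of that workaround and not a PyPop feature; the function name `collect_dat_files` is hypothetical, and the demo uses a temporary directory in place of /home/lcreary.

```python
import shutil
import tempfile
from pathlib import Path


def collect_dat_files(run_dir, output_dir):
    """Move every *.dat file found directly in run_dir into output_dir."""
    run_dir, output_dir = Path(run_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for dat in sorted(run_dir.glob("*.dat")):
        target = output_dir / dat.name
        shutil.move(str(dat), str(target))
        moved.append(target)
    return moved


# Demo in a throwaway directory standing in for the real working directory.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "1-locus-allele.dat").write_text("header\n")
    moved = collect_dat_files(run_dir, run_dir / "sample")
    moved_names = [path.name for path in moved]
    print(moved_names)
```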

HWP output issues: missing common genotypes and incorrect Guo & Thompson stats

Transferring from a comment on issue #4, originally by @sjmack:

However, I have constructed a test data file (the controls from the BIGDAWG synthetic datafile) that reveals several issues with the current HW implementations (vs version 0.7.0).

I'm attaching two versions of this test file:
BIGDAWG_SynthControl_Data.pop.txt
and
BIGDAWG_SynthControl_Data_dash.pop.txt
And the associated .ini file:
WS_BDCtrl_Test_HW.ini.txt
Be sure to remove the .txt suffixes.

The difference between the two datasets is that the *_dash.pop file has the colons converted to dashes. I did this so that I could compare the current developmental version of PyPop on my Mac to v0.7.0 running on my PC. I could only run the *_dash.pop file on my PC.

I have three sets of results. The git. versions were generated using this development version of PyPop, and the 070. versions with the current release version.

First of all, you will notice that there are no Common Genotypes being generated with the development version. The results below show the dash datasets, but the same happens when colons are included for the developmental version.

Compare (Git):

------------------------------------------------------------------------------
Common heterozygotes by allele
       01-01-01-01         152      152.25        0.00        0.9839      
       02-01-01-01          56       56.32        0.00        0.9658      
          02-05-01         132      131.94        0.00        0.9957      
       03-01-01-01         135      134.51        0.00        0.9662      
          03-01-03         102      102.18        0.00        0.9858      
       11-01-01-01          57       57.26        0.00        0.9723      
       11-01-01-02          57       57.26        0.00        0.9723      
          23-01-01          43       43.99        0.02        0.8814      
       24-02-01-01         145      144.70        0.00        0.9801      
          24-02-04          56       56.32        0.00        0.9658      
          25-01-01          38       37.28        0.01        0.9061      
          26-01-01          92       91.40        0.00        0.9501      
             26-08         105      104.85        0.00        0.9885      
       29-01-01-01          56       56.32        0.00        0.9658      
       29-02-01-02         102      102.18        0.00        0.9858      
       31-01-02-01          92       91.40        0.00        0.9501      
       31-01-02-02           9        8.96        0.00        0.9892      
          32-01-01          30       29.55        0.01        0.9342      
             32-02         183      182.44        0.00        0.9667      
          33-01-01           9        8.96        0.00        0.9892      
          68-01-01          76       76.81        0.01        0.9267      
             68-06         155      154.75        0.00        0.9838      

------------------------------------------------------------------------------
Common genotypes
             Total           0        0.00
------------------------------------------------------------------------------

With (v0.7.0):

------------------------------------------------------------------------------
Common heterozygotes by allele
       01-01-01-01         152      152.25        0.00        0.9839      
       02-01-01-01          56       56.32        0.00        0.9658      
          02-05-01         132      131.94        0.00        0.9957      
       03-01-01-01         135      134.51        0.00        0.9662      
          03-01-03         102      102.18        0.00        0.9858      
       11-01-01-01          57       57.26        0.00        0.9723      
       11-01-01-02          57       57.26        0.00        0.9723      
          23-01-01          43       43.99        0.02        0.8814      
       24-02-01-01         145      144.70        0.00        0.9801      
          24-02-04          56       56.32        0.00        0.9658      
          25-01-01          38       37.28        0.01        0.9061      
          26-01-01          92       91.40        0.00        0.9501      
             26-08         105      104.85        0.00        0.9885      
       29-01-01-01          56       56.32        0.00        0.9658      
       29-02-01-02         102      102.18        0.00        0.9858      
       31-01-02-01          92       91.40        0.00        0.9501      
       31-01-02-02           9        8.96        0.00        0.9892      
          32-01-01          30       29.55        0.01        0.9342      
             32-02         183      182.44        0.00        0.9667      
          33-01-01           9        8.96        0.00        0.9892      
          68-01-01          76       76.81        0.01        0.9267      
             68-06         155      154.75        0.00        0.9838      

------------------------------------------------------------------------------
Common genotypes
-01-01:01-01-01-01           7        6.88        0.00        0.9621      
-01-01-01:02-05-01          11       11.76        0.05        0.8241      
-01-01:03-01-01-01          12       12.01        0.00        0.9975      
-01-01-01:03-01-03           9        8.95        0.00        0.9856      
-01-01:24-02-01-01          13       13.00        0.00        0.9989      
-01-01-01:26-01-01           8        7.95        0.00        0.9864      
 01-01-01-01:26-08           9        9.19        0.00        0.9488      
-01-01:29-02-01-02           9        8.95        0.00        0.9856      
-01-01:31-01-02-01           8        7.95        0.00        0.9864      
 01-01-01-01:32-02          16       16.82        0.04        0.8424      
-01-01-01:68-01-01           7        6.63        0.02        0.8847      
 01-01-01-01:68-06          14       14.00        0.00        0.9998      
 02-01-01-01:32-02           6        5.88        0.00        0.9590      
 02-05-01:02-05-01           5        5.03        0.00        0.9890      
-05-01:03-01-01-01          10       10.27        0.01        0.9318      
 02-05-01:03-01-03           8        7.65        0.02        0.9001      
-05-01:24-02-01-01          11       11.12        0.00        0.9702      
 02-05-01:26-01-01           7        6.80        0.01        0.9396      
    02-05-01:26-08           8        7.87        0.00        0.9617      
-05-01:29-02-01-02           8        7.65        0.02        0.9001      
-05-01:31-01-02-01           7        6.80        0.01        0.9396      
    02-05-01:32-02          14       14.38        0.01        0.9193      
 02-05-01:68-01-01           6        5.67        0.02        0.8893      
    02-05-01:68-06          12       11.98        0.00        0.9942      
-01-01:03-01-01-01           5        5.25        0.01        0.9146      
-01-01-01:03-01-03           8        7.81        0.00        0.9471      
-01-01:24-02-01-01          12       11.36        0.04        0.8493      
-01-01-01:26-01-01           7        6.95        0.00        0.9837      
 03-01-01-01:26-08           8        8.03        0.00        0.9911      
-01-01:29-02-01-02           8        7.81        0.00        0.9471      
-01-01:31-01-02-01           7        6.95        0.00        0.9837      
 03-01-01-01:32-02          15       14.69        0.01        0.9351      
-01-01-01:68-01-01           6        5.79        0.01        0.9299      
 03-01-01-01:68-06          12       12.23        0.00        0.9480      
-01-03:24-02-01-01           8        8.46        0.03        0.8741      
 03-01-03:26-01-01           5        5.17        0.01        0.9391      
    03-01-03:26-08           6        5.98        0.00        0.9941      
-01-03:29-02-01-02           6        5.82        0.01        0.9406      
-01-03:31-01-02-01           5        5.17        0.01        0.9391      
    03-01-03:32-02          11       10.94        0.00        0.9856      
    03-01-03:68-06           9        9.11        0.00        0.9715      
 11-01-01-01:32-02           6        5.98        0.00        0.9923      
 11-01-01-02:32-02           6        5.98        0.00        0.9923      
-01-01:24-02-01-01           6        6.15        0.00        0.9518      
-02-01-01:26-01-01           8        7.52        0.03        0.8613      
 24-02-01-01:26-08           9        8.70        0.01        0.9179      
-01-01:29-02-01-02           8        8.46        0.03        0.8741      
-01-01:31-01-02-01           8        7.52        0.03        0.8613      
 24-02-01-01:32-02          16       15.90        0.00        0.9807      
-02-01-01:68-01-01           6        6.27        0.01        0.9149      
 24-02-01-01:68-06          13       13.24        0.00        0.9474      
    24-02-04:32-02           6        5.88        0.00        0.9590      
    26-01-01:26-08           5        5.32        0.02        0.8905      
-01-01:29-02-01-02           5        5.17        0.01        0.9391      
    26-01-01:32-02          10        9.72        0.01        0.9296      
    26-01-01:68-06           8        8.10        0.00        0.9731      
 26-08:29-02-01-02           6        5.98        0.00        0.9941      
 26-08:31-01-02-01           5        5.32        0.02        0.8905      
       26-08:32-02          11       11.24        0.01        0.9420      
       26-08:68-06           9        9.36        0.01        0.9061      
 29-01-01-01:32-02           6        5.88        0.00        0.9590      
-01-02:31-01-02-01           5        5.17        0.01        0.9391      
 29-02-01-02:32-02          11       10.94        0.00        0.9856      
 29-02-01-02:68-06           9        9.11        0.00        0.9715      
 31-01-02-01:32-02          10        9.72        0.01        0.9296      
 31-01-02-01:68-06           8        8.10        0.00        0.9731      
       32-02:32-02          10       10.28        0.01        0.9300      
    32-02:68-01-01           8        8.10        0.00        0.9709      
       32-02:68-06          17       17.12        0.00        0.9770      
    68-01-01:68-06           7        6.75        0.01        0.9223      
       68-06:68-06           7        7.13        0.00        0.9624      
             Total         612      612.81
------------------------------------------------------------------------------
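As a sanity check on the v0.7.0 "Common genotypes" table above, the expected counts can be reproduced from the textbook Hardy-Weinberg proportions (N·p² for homozygotes, 2·N·p·q for heterozygotes), together with a 1-degree-of-freedom chi-square p-value. This is a sketch of the standard formulas, not PyPop's implementation. The function names are hypothetical, and the allele counts (203 for 32:02, 169 for 68:06, 2004 total) are taken from the allele frequency table quoted earlier on this page, which appears to describe the same dataset.

```python
import math


def hw_expected(count_a, count_b, total_alleles, homozygote=False):
    """Expected genotype count under Hardy-Weinberg proportions."""
    n_individuals = total_alleles / 2
    p_a = count_a / total_alleles
    p_b = count_b / total_alleles
    factor = 1 if homozygote else 2
    return factor * p_a * p_b * n_individuals


def chisq_1df(observed, expected):
    """Chi-square statistic and its p-value for one degree of freedom."""
    stat = (observed - expected) ** 2 / expected
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


het = hw_expected(203, 169, 2004)                   # 32-02:68-06
hom = hw_expected(203, 203, 2004, homozygote=True)  # 32-02:32-02
stat, p_value = chisq_1df(17, het)                  # observed count from the table
print(round(het, 2), round(hom, 2), round(p_value, 4))
```

These values agree with the 32-02:68-06 and 32-02:32-02 rows of the v0.7.0 table (17.12 and 10.28 expected), which suggests that the release output is the correct reference for the development version.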

In addition, the stats being reported for the developmental version include errors; especially for the Chen and Diff tests, where obs and exp values are 0. The differences in the p-values for the mcmc results probably stem from the Markov-Chain, but I only did each one once, so I'm not certain.

Compare (Git):

3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error:       0
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12+03-02-12 (0/0.000000) 0.0007*** 0.0007***
05-03-01-01+03-02-12 (0/0.000000) 0.0000***** 0.0000*****
05-03-01-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+06-02-01 (0/0.000000) 0.0001*** 0.0001***
06-03-01+04-02-01 (0/0.000000) 0.0000***** 0.0000*****
06-05-01+05-03-01-01 (0/0.000000) 0.0000**** 0.0000*****
06-05-01+06-05-01 (0/0.000000) 0.0049** 0.0049**


3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12+03-02-12 (0/0.000000) 0.0017** 0.0017**
05-03-01-01+03-02-12 (0/0.000000) 0.0000***** 0.0000*****
05-03-01-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+06-02-01 (0/0.000000) 0.0007*** 0.0007***
06-03-01+04-02-01 (0/0.000000) 0.0000***** 0.0000*****
06-05-01+05-03-01-01 (0/0.000000) 0.0000**** 0.0000****
06-05-01+06-05-01 (0/0.000000) 0.0016** 0.0016**

to (version 0.7.0):

3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error:       0
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12:03-02-12:  (18/8.818363) 0.0017** 0.0017**
05-03-01-01:03-02-12:  (0/18.105788) 0.0000***** 0.0000*****
05-03-01-01:05-03-01-01:  (39/9.293663) 0.0000***** 0.0000*****
06-02-01:05-03-01-01:  (0/23.499002) 0.0000***** 0.0000*****
06-02-01:06-02-01:  (27/14.854291) 0.0000***** 0.0000*****
06-03-01:04-02-01:  (8/0.287425) 0.0000***** 0.0000*****
06-05-01:05-03-01-01:  (0/18.105788) 0.0000***** 0.0000*****
06-05-01:06-05-01:  (18/8.818363) 0.0011** 0.0011**


3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12:03-02-12:  (18/8.818363) 0.0017** 0.0017**
05-03-01-01:03-02-12:  (0/18.105788) 0.0000***** 0.0000*****
05-03-01-01:05-03-01-01:  (39/9.293663) 0.0000***** 0.0000*****
06-02-01:05-03-01-01:  (0/23.499002) 0.0000***** 0.0000*****
06-02-01:06-02-01:  (27/14.854291) 0.0007*** 0.0007***
06-03-01:04-02-01:  (8/0.287425) 0.0000***** 0.0000*****
06-05-01:05-03-01-01:  (0/18.105788) 0.0000**** 0.0000****
06-05-01:06-05-01:  (18/8.818363) 0.0016** 0.0016**
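To spot this regression automatically, one option is to scan the text report for genotype rows whose expected value is zero. The sketch below does that with a regular expression written against the two line formats shown above. The pattern and the function name `zero_expected_rows` are hypothetical and would need adjusting if the report format changes.

```python
import re

# Matches e.g. "03-02-12+03-02-12 (0/0.000000) ..." and "03-02-12:03-02-12:  (18/8.818363) ..."
GENOTYPE_LINE = re.compile(r"^(?P<genotype>\S+)\s+\((?P<obs>\d+)/(?P<exp>\d+(?:\.\d+)?)\)")


def zero_expected_rows(report_text):
    """Return the genotypes whose expected count is reported as exactly zero."""
    flagged = []
    for line in report_text.splitlines():
        match = GENOTYPE_LINE.match(line.strip())
        if match and float(match.group("exp")) == 0.0:
            flagged.append(match.group("genotype").rstrip(":"))
    return flagged


git_output = """\
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12+03-02-12 (0/0.000000) 0.0007*** 0.0007***
06-03-01+04-02-01 (0/0.000000) 0.0000***** 0.0000*****
"""
release_output = """\
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12:03-02-12:  (18/8.818363) 0.0017** 0.0017**
"""
flagged_git = zero_expected_rows(git_output)
flagged_release = zero_expected_rows(release_output)
print(flagged_git, flagged_release)
```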

Here are the results:

BIGDAWG_SynthControl_Data-out.git.txt
BIGDAWG_SynthControl_Data-out.git.xml.txt
BIGDAWG_SynthControl_Data_dash-out.git.txt
BIGDAWG_SynthControl_Data_dash-out.git.xml.txt
BIGDAWG_SynthControl_Data_dash-out.070.txt
BIGDAWG_SynthControl_Data_dash-out.070.xml.txt

Again, remove the .txt from the .XML filenames.

Popmeta: 2-locus haplotype results are not being reported in .TSV files

Popmeta does not seem to be generating 2-locus haplotype results files after [Emhaplofreq] has generated data.

I am using a stripped-down version of the WS_BDCtrl_Test_EM.ini file (WS_BDCtrl_Test_EM_lite.ini) with [Emhaplofreq] set to either LociToEstHaplo=a:drb1 or LociToEstHaplo=a:drb1:drb3

;; comment out or change as desired
;; 1 = true, 0 = false

[General]
debug=0

[ParseGenotypeFile]
untypedAllele=****

;; designates field name that holds population name
popNameDesignator=+

;; designates field name that holds allele data
alleleDesignator=*

;; valid fields for sample data block
validSampleFields=SampleID
 +Disease
 *A_1
 *A_2
; *B_1
; *B_2
; *C_1
; *C_2
; *DPA1_1
; *DPA1_2
; *DPB1_1
; *DPB1_2
; *DQA1_1
; *DQA1_2
; *DQB1_1
; *DQB1_2
 *DRB1_1
 *DRB1_2
 *DRB3_1
 *DRB3_2
; *DRB4_1
; *DRB4_2
; *DRB5_1
; *DRB5_2
; *HLA-A_1
; *HLA-A_2
; *HLA-B_1
; *HLA-B_2
; *HLA-C_1
; *HLA-C_2
; *HLA-DPA1_1
; *HLA-DPA1_2
; *HLA-DPB1_1
; *HLA-DPB1_2
; *HLA-DQA1_1
; *HLA-DQA1_2
; *HLA-DQB1_1
; *HLA-DQB1_2
; *HLA-DRB1_1
; *HLA-DRB1_2
; *HLA-DRB3_1
; *HLA-DRB3_2

[Emhaplofreq] 

;; comma (',') separated haplotypes blocks for which to estimate
;; haplotypes, within each "block", each locus is separated by colons
;; (':') e.g. dqa1:dpb1,drb1:dqb1, means to est. of haplotypes for
;; 'dqa1' and 'dpb1' loci followed by est. of haplotypes for 'drb1'
;; and 'dqb1' loci.  A wildcard entry '*' means estimate haplotypes
;; for the entire loci as specified in the original file column order
LociToEstHaplo=a:drb1:drb3
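The comment block above describes the LociToEstHaplo syntax: comma-separated blocks, colon-separated loci within a block, and `*` as a wildcard. A minimal sketch of how such a value could be split is shown here. It only illustrates the documented syntax; the function name `parse_loci_blocks` is hypothetical and this is not PyPop's own parser.

```python
def parse_loci_blocks(value):
    """Split a LociToEstHaplo value into a list of blocks, each a list of loci.

    Returns None for the wildcard '*', meaning "all loci in file column order".
    """
    value = value.strip()
    if value == "*":
        return None
    return [
        [locus.strip() for locus in block.split(":")]
        for block in value.split(",")
        if block.strip()
    ]


print(parse_loci_blocks("a:drb1:drb3"))
print(parse_loci_blocks("dqa1:dpb1,drb1:dqb1"))
```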

When I have [Emhaplofreq] set to LociToEstHaplo=a:drb1, here are the sizes of the resulting .dat files:

:pypop sjmack$ ./bin/pypop.py --generate-tsv -c WS_BDCtrl_Test_EM_lite.ini BIGDAWG_SynthControl_Data.pop
LOG: Data file has no header data block
LOG: estimating haplotype frequencies for specific haplotypes: [a:drb1]
Generating TSV (.dat) files...
:pypop sjmack$ ls -al *.dat
-rw-r--r--  1 sjmack  admin  3084 Jul 26 14:18 1-locus-allele.dat
-rw-r--r--  1 sjmack  admin   358 Jul 26 14:18 1-locus-genotype.dat
-rw-r--r--  1 sjmack  admin   302 Jul 26 14:18 1-locus-hardyweinberg.dat
-rw-r--r--  1 sjmack  admin   104 Jul 26 14:18 1-locus-pairwise-fnd.dat
-rw-r--r--  1 sjmack  admin   939 Jul 26 14:18 1-locus-summary.dat
-rw-r--r--  1 sjmack  admin   146 Jul 26 14:18 2-locus-haplo.dat
-rw-r--r--  1 sjmack  admin   144 Jul 26 14:18 2-locus-summary.dat
-rw-r--r--  1 sjmack  admin   105 Jul 26 14:18 3-locus-haplo.dat
-rw-r--r--  1 sjmack  admin    31 Jul 26 14:18 3-locus-summary.dat
-rw-r--r--  1 sjmack  admin   105 Jul 26 14:18 4-locus-haplo.dat
-rw-r--r--  1 sjmack  admin   105 Jul 26 14:18 4-locus-summary.dat

None of the 2- or 3-locus TSV files have any data in them:

:pypop sjmack$ more 2-locus-summary.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex n.gametes       locus1  locus2  metaloci        ld.dprime       ld.wn   q.chisq q.df    lrt.pval        lrt.z
:pypop sjmack$ more 2-locus-haplo.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus   allele  allele.freq     allele.count    ld.d    ld.dprime       ld.chisq        obs     obs.freq        exp
:pypop sjmack$ more 3-locus-haplo.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus   allele  allele.freq     allele.count
:pypop sjmack$ more 3-locus-summary.dat 
n.gametes       locus1  locus2  locus3
:pypop sjmack$ 
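Rather than inspecting each file with `more`, header-only TSV files can be listed in one pass. The following sketch treats any .dat file with at most one non-blank line as empty. The function name `header_only_files` is hypothetical, and the demo writes two small stand-in files to a temporary directory instead of reading real PyPop output.

```python
import tempfile
from pathlib import Path


def header_only_files(directory, pattern="*.dat"):
    """Return the names of files that contain a header line but no data rows."""
    empty = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            rows = [line for line in handle if line.strip()]
        if len(rows) <= 1:
            empty.append(path.name)
    return empty


# Demo with stand-in files: one header-only, one with a data row.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "2-locus-haplo.dat").write_text("pop\tlabcode\tmethod\n")
    Path(tmp, "1-locus-allele.dat").write_text("pop\tlabcode\n0\t****\n")
    empty_files = header_only_files(tmp)
    print(empty_files)
```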

However, 2-locus results are generated in the BIGDAWG_SynthControl_Data-out.txt and *-out.xml files:

II. Multi-locus Analyses
========================

Haplotype / linkage disequilibrium (LD) statistics
__________________________________________________

Pairwise LD estimates
---------------------
Locus pair            D      D'        Wn   ln(L_1)   ln(L_0)         S # permu p-value


Haplotype frequency est. for loci: A:DRB1
-----------------------------------------
Number of individuals: 1002 (before-filtering)
Number of individuals: 991 (after-filtering)
Unique phenotypes: 303
Unique genotypes: 556
Number of haplotypes: 263
Loglikelihood under linkage equilibrium [ln(L_0)]: -7703.697189
Loglikelihood obtained via the EM algorithm [ln(L_1)]: -7061.268470
Number of iterations before convergence: 153

Haplotypes sorted by frequency        
haplotype                                 frequency# copies
01:01:01:01~01:01:01                      0.08110  160.7   
24:02:01:01~07:01:01:01                   0.07921  157.0   
32:02~03:01:02                            0.05701  113.0   
03:01:03~15:01:01:01                      0.05449  108.0   
31:01:02:01~03:01:01:01                   0.04338  86.0    
68:01:01~11:01:01                         0.04036  80.0    
68:06~15:01:01:01                         0.03075  61.0    
32:02~15:01:01:01                         0.03008  59.6    
11:01:01:01~04:01:01                      0.02977  59.0    
26:01:01~11:04:01                         0.02725  54.0    
02:05:01~13:01:01                         0.02371  47.0    
26:08~08:01:03                            0.02325  46.1    
03:01:01:01~08:01:03                      0.02288  45.3    
02:05:01~08:01:03                         0.01959  38.8    
03:01:01:01~15:01:01:01                   0.01863  36.9    
02:05:01~15:01:01:01                      0.01507  29.9    
68:06~08:01:03                            0.01395  27.6    
29:02:01:02~08:01:03                      0.01379  27.3    
.
.
.

and

<haplotypefreq>
<loginfo><![CDATA[
Percent of iterations with error_flag = 0: 100.000
Percent of iterations with error_flag = 2:   0.000
Percent of iterations with error_flag = 3:   0.000
Percent of iterations with error_flag = 4:   0.000
Percent of iterations with error_flag = 5:   0.000
Percent of iterations with error_flag = 6:   0.000
Percent of iterations with error_flag = 7:   0.000

--- Codes for error_flag ----------------------------------------------------------
0: Iterations Converged, no errors 
2: Normalization constant near zero. Est. HFs unstable 
3: Wrong # allocated for at least one phenotype based on est. HFs 
4: Phenotype freq., based on est. HFs, is 0 for an observed phenotype 
5: Log likelihood has decreased for more than 5 iterations 
6: Est. HFs do not sum to 1.0 
7: Log likelihood failed to converge in 400 iterations 
-----------------------------------------------------------------------------------

]]></loginfo>
<condition role="converged"/>
<iterConverged>114</iterConverged><loglikelihood>-7061.268462</loglikelihood>
<haplotype name="01:01:01:01~01:01:01"><frequency>0.08110</frequency><numCopies>160.7</numCopies></haplotype>
<haplotype name="24:02:01:01~07:01:01:01"><frequency>0.07921</frequency><numCopies>157.0</numCopies></haplotype>
<haplotype name="32:02~03:01:02"><frequency>0.05701</frequency><numCopies>113.0</numCopies></haplotype>
<haplotype name="03:01:03~15:01:01:01"><frequency>0.05449</frequency><numCopies>108.0</numCopies></haplotype>
<haplotype name="31:01:02:01~03:01:01:01"><frequency>0.04338</frequency><numCopies>86.0</numCopies></haplotype>
<haplotype name="68:01:01~11:01:01"><frequency>0.04036</frequency><numCopies>80.0</numCopies></haplotype>
<haplotype name="68:06~15:01:01:01"><frequency>0.03075</frequency><numCopies>61.0</numCopies></haplotype>
<haplotype name="32:02~15:01:01:01"><frequency>0.03008</frequency><numCopies>59.6</numCopies></haplotype>
<haplotype name="11:01:01:01~04:01:01"><frequency>0.02977</frequency><numCopies>59.0</numCopies></haplotype>
<haplotype name="26:01:01~11:04:01"><frequency>0.02725</frequency><numCopies>54.0</numCopies></haplotype>
<haplotype name="02:05:01~13:01:01"><frequency>0.02371</frequency><numCopies>47.0</numCopies></haplotype>
<haplotype name="26:08~08:01:03"><frequency>0.02325</frequency><numCopies>46.1</numCopies></haplotype>
<haplotype name="03:01:01:01~08:01:03"><frequency>0.02288</frequency><numCopies>45.3</numCopies></haplotype>
<haplotype name="02:05:01~08:01:03"><frequency>0.01959</frequency><numCopies>38.8</numCopies></haplotype>
<haplotype name="03:01:01:01~15:01:01:01"><frequency>0.01863</frequency><numCopies>36.9</numCopies></haplotype>
.
.
.
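Since the 2-locus estimates are present in the XML, they can be pulled out directly while the TSV export is broken. The sketch below parses a shortened copy of the `<haplotypefreq>` fragment above (closed by hand so that it is well-formed) and checks that each copy count matches frequency × number of gametes. That relationship, and the gamete count of 2 × 991, are inferred from the numbers shown on this page rather than from PyPop's source, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two entries copied from the XML excerpt above; the closing tag is added by hand.
fragment = """<haplotypefreq>
<haplotype name="01:01:01:01~01:01:01"><frequency>0.08110</frequency><numCopies>160.7</numCopies></haplotype>
<haplotype name="24:02:01:01~07:01:01:01"><frequency>0.07921</frequency><numCopies>157.0</numCopies></haplotype>
</haplotypefreq>"""


def haplotype_rows(xml_text):
    """Return (name, frequency, copies) tuples for every <haplotype> element."""
    root = ET.fromstring(xml_text)
    return [
        (hap.get("name"), float(hap.findtext("frequency")), float(hap.findtext("numCopies")))
        for hap in root.iter("haplotype")
    ]


n_gametes = 2 * 991  # individuals after filtering, as reported in the text output
rows = haplotype_rows(fragment)
for name, freq, copies in rows:
    print(name, round(freq * n_gametes, 1), copies)
```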

When [Emhaplofreq] is set to LociToEstHaplo=a:drb1:drb3 the 3-locus TSV files are populated.

./bin/pypop.py --generate-tsv -c WS_BDCtrl_Test_EM_lite.ini BIGDAWG_SynthControl_Data.pop
LOG: Data file has no header data block
LOG: estimating haplotype frequencies for specific haplotypes: [a:drb1:drb3]
Generating TSV (.dat) files...
:pypop sjmack$ ls -al *.dat
-rw-r--r--  1 sjmack  admin   3084 Jul 26 13:45 1-locus-allele.dat
-rw-r--r--  1 sjmack  admin    358 Jul 26 13:45 1-locus-genotype.dat
-rw-r--r--  1 sjmack  admin    302 Jul 26 13:45 1-locus-hardyweinberg.dat
-rw-r--r--  1 sjmack  admin    104 Jul 26 13:45 1-locus-pairwise-fnd.dat
-rw-r--r--  1 sjmack  admin    939 Jul 26 13:45 1-locus-summary.dat
-rw-r--r--  1 sjmack  admin    146 Jul 26 13:45 2-locus-haplo.dat
-rw-r--r--  1 sjmack  admin    144 Jul 26 13:45 2-locus-summary.dat
-rw-r--r--  1 sjmack  admin  10281 Jul 26 13:45 3-locus-haplo.dat
-rw-r--r--  1 sjmack  admin     96 Jul 26 13:45 3-locus-summary.dat
-rw-r--r--  1 sjmack  admin    105 Jul 26 13:45 4-locus-haplo.dat
-rw-r--r--  1 sjmack  admin    105 Jul 26 13:45 4-locus-summary.dat

Here are the contents of the pertinent files:

:pypop sjmack$ more 2-locus-summary.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex n.gametes       locus1  locus2  metaloci        ld.dprime       ld.wn   q.chisq q.df    lrt.pval        lrt.z
:pypop sjmack$ more 2-locus-haplo.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus   allele  allele.freq     allele.count    ld.d    ld.dprime       ld.chisq        obs     obs.freq        exp
:pypop sjmack$ more 3-locus-summary.dat 
n.gametes       locus1  locus2  locus3
0       ****    ****    ****    ****    ****    ****    ****    ****    1982    DRB3    A       DRB1    ****    
:pypop sjmack$ more 3-locus-haplo.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus   allele  allele.freq     allele.count
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     01:01:01:01~01:01:01~00:00      0.08110 160.7   
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     24:02:01:01~07:01:01:01~00:00   0.07921 157.0   
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     32:02~03:01:02~01:01:02:01      0.05701 113.0   
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     03:01:03~15:01:01:01~00:00      0.05449 108.0   
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     31:01:02:01~03:01:01:01~01:01:02:02     0.04338 86.0    
0       ****    ****    ****    ****    ****    ****    ****    ****    DRB3:A:DRB1     68:01:01~11:01:01~03:01:01      0.04036 80.0    
.
.
.

I'm not exactly sure where the problem lies. However, if I add allPairwiseLD=1 to the .ini, the log includes these two lines:

LOG: estimating all pairwise LD: with no permutation test
LOG: estimating haplotype frequencies for all two locus haplotypes, specific haplotypes: [a:drb1:drb3]

and the 2-locus TSV files are now full of data:

:pypop sjmack$ ls -al *.dat
-rw-r--r--  1 sjmack  admin   3084 Jul 26 14:28 1-locus-allele.dat
-rw-r--r--  1 sjmack  admin    358 Jul 26 14:28 1-locus-genotype.dat
-rw-r--r--  1 sjmack  admin    302 Jul 26 14:28 1-locus-hardyweinberg.dat
-rw-r--r--  1 sjmack  admin    104 Jul 26 14:28 1-locus-pairwise-fnd.dat
-rw-r--r--  1 sjmack  admin    939 Jul 26 14:28 1-locus-summary.dat
-rw-r--r--  1 sjmack  admin  21352 Jul 26 14:28 2-locus-haplo.dat
-rw-r--r--  1 sjmack  admin    453 Jul 26 14:28 2-locus-summary.dat
-rw-r--r--  1 sjmack  admin  10281 Jul 26 14:28 3-locus-haplo.dat
-rw-r--r--  1 sjmack  admin     96 Jul 26 14:28 3-locus-summary.dat
-rw-r--r--  1 sjmack  admin    105 Jul 26 14:28 4-locus-haplo.dat
-rw-r--r--  1 sjmack  admin    105 Jul 26 14:28 4-locus-summary.dat
:pypop sjmack$ more 2-locus-summary.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex n.gametes       locus1  locus2  metaloci        ld.dprime       ld.wn   q.chisq q.df    lrt.pval        lrt.z
0       ****    ****    ****    ****    ****    ****    ****    ****    1982    A       DRB1    ****    1.13712 2.45931 179813.51153    315     ****    ****    
0       ****    ****    ****    ****    ****    ****    ****    ****    1982    DRB3    A       ****    3.05593 1.80338 32228.97154     105     ****    ****    
0       ****    ****    ****    ****    ****    ****    ****    ****    1982    DRB3    DRB1    ****    2.95178 1.42168 20029.71037     75      ****    ****    

and

:pypop sjmack$ more 2-locus-haplo.dat 
pop     labcode method  ethnic  collect.site    region  latit   longit  complex locus   allele  allele.freq     allele.count    ld.d    ld.dprime       ld.chisq        obs     obs.freq        exp
0       ****    ****    ****    ****    ****    ****    ****    ****    A:DRB1  01:01:01:01~01:01:01    0.08110 160.7   ****    ****    ****    ****    0.08110 ****    
0       ****    ****    ****    ****    ****    ****    ****    ****    A:DRB1  24:02:01:01~07:01:01:01 0.07921 157.0   ****    ****    ****    ****    0.07921 ****    
0       ****    ****    ****    ****    ****    ****    ****    ****    A:DRB1  32:02~03:01:02  0.05701 113.0   ****    ****    ****    ****    0.05701 ****    
0       ****    ****    ****    ****    ****    ****    ****    ****    A:DRB1  03:01:03~15:01:01:01    0.05449 108.0   ****    ****    ****    ****    0.05449 ****    
0       ****    ****    ****    ****    ****    ****    ****    ****    A:DRB1  31:01:02:01~03:01:01:01 0.04338 86.0    ****    ****    ****    ****    0.04338 ****    
.
.
.
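One way to confirm which locus groups actually made it into 2-locus-haplo.dat is to count rows per value of the locus column. The following is a minimal sketch, assuming the file is tab-separated with the header shown above; the function names and the two in-memory sample rows are illustrative and are not part of PyPop.

```python
import csv
import io
from collections import Counter

# Column names copied from the 2-locus-haplo.dat header shown above.
COLUMNS = [
    "pop", "labcode", "method", "ethnic", "collect.site", "region",
    "latit", "longit", "complex", "locus", "allele", "allele.freq",
    "allele.count", "ld.d", "ld.dprime", "ld.chisq", "obs", "obs.freq", "exp",
]


def rows_per_locus_group(handle):
    """Count data rows for each value of the 'locus' column (e.g. 'A:DRB1')."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["locus"] for row in reader)


def make_row(locus, haplotype, freq, copies):
    """Build one tab-separated row, with '****' for empty fields as in the listing."""
    fields = ["0"] + ["****"] * 8 + [locus, haplotype, freq, copies]
    fields += ["****"] * 4 + [freq, "****"]
    return "\t".join(fields)


# Hypothetical in-memory stand-in for the real file, built from two rows above.
sample = "\n".join([
    "\t".join(COLUMNS),
    make_row("A:DRB1", "01:01:01:01~01:01:01", "0.08110", "160.7"),
    make_row("A:DRB1", "24:02:01:01~07:01:01:01", "0.07921", "157.0"),
]) + "\n"

counts = rows_per_locus_group(io.StringIO(sample))
print(dict(counts))
```

For a real run, pass `open("2-locus-haplo.dat", newline="")` in place of the `io.StringIO` object.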

However, the *-out.txt file only includes the haplotypes that were explicitly specified in the .ini file:

II. Multi-locus Analyses
========================

Haplotype / linkage disequilibrium (LD) statistics
__________________________________________________

Pairwise LD estimates
---------------------
Locus pair            D      D'        Wn   ln(L_1)   ln(L_0)         S # permu p-value
A:DRB1          0.01765 1.13712   2.45931  -7061.27  -7703.70   1284.86       - -      
A:DRB3          0.08141 3.05593   1.80338  -6055.85  -4657.18  -2797.34       - -      
DRB1:DRB3       0.13159 2.95178   1.42168  -4001.16  -3819.91   -362.50       - -      


Haplotype frequency est. for loci: A:DRB1:DRB3
----------------------------------------------
Number of individuals: 1002 (before-filtering)
Number of individuals: 991 (after-filtering)
Unique phenotypes: 303
Unique genotypes: 907
Number of haplotypes: 618
Loglikelihood under linkage equilibrium [ln(L_0)]: -8090.394069
Loglikelihood obtained via the EM algorithm [ln(L_1)]: -7061.268470
Number of iterations before convergence: 82

Haplotypes sorted by frequency        
haplotype                                 frequency# copies
01:01:01:01~01:01:01~00:00                0.08110  160.7   
24:02:01:01~07:01:01:01~00:00             0.07921  157.0   
32:02~03:01:02~01:01:02:01                0.05701  113.0   
03:01:03~15:01:01:01~00:00                0.05449  108.0   

I think the latter is fine (not including the 2-locus haplotypes in the *-out.txt if they weren't explicitly specified in the .ini), but popmeta seems to ignore 2-locus haplotypes unless allPairwiseLD is set to 1.
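To check quickly whether a given .ini has the setting switched on, something like the following could be used. This is a sketch using Python's standard configparser, not PyPop's own configuration handling. The .ini fragment is illustrative (the loci list is taken from the log line above, and the option spelling follows this report), and the function name is hypothetical.

```python
import configparser

# Illustrative fragment only; not the full configuration used in this report.
INI_TEXT = """
[Emhaplofreq]
LociToEstHaplo=a:drb1:drb3
allPairwiseLD=1
"""


def pairwise_ld_enabled(ini_text):
    """Return True if the [Emhaplofreq] section sets allPairwiseLD to 1."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(ini_text)
    value = parser.get("Emhaplofreq", "allPairwiseLD", fallback="0")
    return value.strip() == "1"


print(pairwise_ld_enabled(INI_TEXT))
```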

When [Emhaplofreq] is set to LociToEstHaplo=a:drb1 and allPairwiseLD=1, the <haplotypefreq> element in the -out.xml file has a different structure from the one that popmeta may be looking for:

<haplotypefreq>
<loginfo><![CDATA[
Percent of iterations with error_flag = 0: 100.000
Percent of iterations with error_flag = 2:   0.000
Percent of iterations with error_flag = 3:   0.000
Percent of iterations with error_flag = 4:   0.000
Percent of iterations with error_flag = 5:   0.000
Percent of iterations with error_flag = 6:   0.000
Percent of iterations with error_flag = 7:   0.000

--- Codes for error_flag ----------------------------------------------------------
0: Iterations Converged, no errors 
2: Normalization constant near zero. Est. HFs unstable 
3: Wrong # allocated for at least one phenotype based on est. HFs 
4: Phenotype freq., based on est. HFs, is 0 for an observed phenotype 
5: Log likelihood has decreased for more than 5 iterations 
6: Est. HFs do not sum to 1.0 
7: Log likelihood failed to converge in 400 iterations 
-----------------------------------------------------------------------------------

]]></loginfo>
<condition role="converged"/>
<iterConverged>140</iterConverged><loglikelihood>-7061.268470</loglikelihood>
<haplotype name="01:01:01:01~01:01:01"><frequency>0.08110</frequency><numCopies>160.7</numCopies></haplotype>
<haplotype name="24:02:01:01~07:01:01:01"><frequency>0.07921</frequency><numCopies>157.0</numCopies></haplotype>
<haplotype name="32:02~03:01:02"><frequency>0.05701</frequency><numCopies>113.0</numCopies></haplotype>
<haplotype name="03:01:03~15:01:01:01"><frequency>0.05449</frequency><numCopies>108.0</numCopies></haplotype>
<haplotype name="31:01:02:01~03:01:01:01"><frequency>0.04338</frequency><numCopies>86.0</numCopies></haplotype>
<haplotype name="68:01:01~11:01:01"><frequency>0.04036</frequency><numCopies>80.0</numCopies></haplotype>
<haplotype name="68:06~15:01:01:01"><frequency>0.03075</frequency><numCopies>61.0</numCopies></haplotype>
<haplotype name="32:02~15:01:01:01"><frequency>0.03008</frequency><numCopies>59.6</numCopies></haplotype>
<haplotype name="11:01:01:01~04:01:01"><frequency>0.02977</frequency><numCopies>59.0</numCopies></haplotype>
<haplotype name="26:01:01~11:04:01"><frequency>0.02725</frequency><numCopies>54.0</numCopies></haplotype>
<haplotype name="02:05:01~13:01:01"><frequency>0.02371</frequency><numCopies>47.0</numCopies></haplotype>
.
.
.

So, that may be the source of the problem.
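To see exactly what a <haplotypefreq> element contains, its <haplotype> children can be listed with Python's standard ElementTree. The following is a minimal sketch. The sample string reuses two entries from the listing above and closes the element itself, because the listing is truncated. The function name is hypothetical and this is not how popmeta reads the file.

```python
import xml.etree.ElementTree as ET

# Two entries copied from the listing above; the closing tag is added here
# because the pasted output is truncated.
SAMPLE = """<haplotypefreq>
<condition role="converged"/>
<iterConverged>140</iterConverged><loglikelihood>-7061.268470</loglikelihood>
<haplotype name="01:01:01:01~01:01:01"><frequency>0.08110</frequency><numCopies>160.7</numCopies></haplotype>
<haplotype name="24:02:01:01~07:01:01:01"><frequency>0.07921</frequency><numCopies>157.0</numCopies></haplotype>
</haplotypefreq>"""


def list_haplotypes(xml_text):
    """Return (name, frequency, numCopies) for each <haplotype> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for hap in root.iter("haplotype"):
        rows.append((
            hap.get("name"),
            float(hap.findtext("frequency")),
            float(hap.findtext("numCopies")),
        ))
    return rows


for name, freq, copies in list_haplotypes(SAMPLE):
    print(name, freq, copies)
```

Running the same function on the <haplotypefreq> element produced with and without allPairwiseLD=1 would show where the two structures differ.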

can't execute PyPop 0.7.0 with example files after installation

Hi there!

I am using Ubuntu 20.04 and I have installed PyPop v. 0.7.0.
I run PyPop on the provided example files using the following two commands:

./python-batch -c sample.ini sample.pop

and

./popmeta-batch -c sample.ini sample.pop

and I get the following message:

Traceback (most recent call last):
File "", line 38, in
File "/home/alex/src/python-installer/iu.py", line 277, in importHook
File "/home/alex/src/python-installer/iu.py", line 362, in doimport
File "/home/alex/ihwg/src/buildstandalone/out5.pyz/Meta", line 46, in
File "/home/alex/src/python-installer/iu.py", line 277, in importHook
File "/home/alex/src/python-installer/iu.py", line 362, in doimport
File "/home/alex/ihwg/src/buildstandalone/out5.pyz/libxslt", line 52, in
File "/home/alex/src/python-installer/iu.py", line 277, in importHook
File "/home/alex/src/python-installer/iu.py", line 347, in doimport
File "/home/alex/src/python-installer/iu.py", line 184, in getmod
File "/home/alex/src/python-installer/iu.py", line 46, in getmod
ImportError: libgcrypt.so.11: cannot open shared object file: No such file or directory
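The last line of the traceback says the dynamic loader cannot find libgcrypt.so.11. A minimal sketch of checking this from the system Python (not the interpreter bundled in the standalone binary) is given below. The function name is hypothetical, and the check only reports whether the library can be opened; it does not fix anything.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


# Library name taken from the ImportError above.
print("libgcrypt.so.11 loadable:", can_load("libgcrypt.so.11"))
```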

What is your suggestion?

Thanks in advance

Input file

Hi, does anyone have a sample input file for haplo.stats that I could look at?
I have the genotypes for 5 SNPs in 95 samples, plus 96 samples as controls. I want to know the prevalence of haplotypes in the samples and whether there is any correlation between these two groups. Is there an example input file that I can read with the read.table function in R? Thanks.

Problem with pypop-haplostats

I encountered a problem when trying to run the minimal test .ini file and the sample population dataset using the container Karl built. The problem seems to be an error in the source code:

image
