ksamuk / pixy Goto Github PK

Software for painlessly estimating average nucleotide diversity within and between populations

License: MIT License

Python 100.00%

pixy's Introduction

`pixy`

pixy is a command-line tool for painlessly estimating average nucleotide diversity within (π) and between (d_xy) populations from a VCF. In particular, pixy facilitates the use of VCFs containing invariant (monomorphic) sites, which are essential for the correct computation of π and d_xy in the face of missing data (i.e. always).

The manuscript describing pixy is now published in Molecular Ecology Resources.

Authors

Kieran Samuk (UC Riverside) and Katharine Korunes (Duke University)

Citation

If you use pixy in your research, please cite the manuscript below, and the Zenodo DOI of the specific version of pixy used for your project.

Manuscript
Korunes, K.L. and Samuk, K. (2021), pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data. Molecular Ecology Resources. https://doi.org/10.1111/1755-0998.13326

Zenodo DOI for various versions of pixy
Go to https://zenodo.org/badge/latestdoi/181987337 and find the DOI that matches the version used (the current version is shown first).

Supported Organisms and Data Formats

Currently, pixy only supports computation using biallelic SNPs (and invariant sites) from diploid organisms. VCFs need to be compressed with bgzip and indexed with tabix.

Documentation

https://pixy.readthedocs.io/

Installation

pixy is currently available for installation on Linux/OSX systems via conda, and hosted on conda-forge. To install pixy using conda, you will first need to add conda-forge as a channel (if you haven't already):

conda config --add channels conda-forge

Then, create and activate a new conda environment for pixy:

conda create -n "pixy" python=3.8
conda activate pixy

Then install pixy, htslib, and samtools 1.9:

conda install -c conda-forge pixy=1.2.5
conda install -c bioconda htslib
conda install -c bioconda samtools=1.9 --force-reinstall -y

You can test your pixy installation by running:

pixy --help

If you have trouble installing pixy in an environment using python 3.9, try rolling back to python 3.8.

For information in installing conda, see here:

anaconda (more features and initial modules): https://docs.anaconda.com/anaconda/install/

miniconda (lighter weight): https://docs.conda.io/en/latest/miniconda.html

A note on accuracy

We have made every effort to ensure that pixy provides accurate and unbiased results. As described in the paper, we use population genetic simulations, where the true value of parameters is exactly known, to assess the performance of pixy. However, because of the huge biological and methodological parameter space around preparing VCFs, it is not possible to guarantee that pixy will specifically work for your organism of interest. As such, it is ultimately up to the investigator to check that pixy is performing as expected for their use case, e.g. by simulating their data-generation process, including missingness.

Contribute to pixy

We are very open to pull requests for new features or bugfixes. If a pull request implements a new substantial feature or fixes a substantial bug, we would be happy to considering including contributors as authors on future manuscripts decscribing new versions of pixy.

Development Roadmap (Planned Features as of Feb 2024)

Update to handle GATK missing data formats - COMPLETE (as of version 1.2.11.beta1)
Simplified alternative to "All-Sites VCF" workflow
Support for arbitrary and variable ploidy levels (including sex chromosomes)
Computation of summary statistics from genotype likelihoods
Simplified contributor workflows
Computation of Tajima's D

pixy's People

Contributors

Stargazers

Watchers

Forkers

huangsunan maj18 thomasbrazier heroalone sandyplus wook2014 immortal2333 wangjie07070910 yancheer npb596 stsmall jiangchb jesuscs049

pixy's Issues

chromosome and zarr_path

--output_folder prepends working directory path onto paths from root

Here's a little one I noticed—If I try to specify a path from the root for the --output_folder, my working directory path is automatically pasted onto the beginning of it.

For example, if I'm in the working directory "/path/to/" and run pixy with the option "--output_folder /path/to/my/directory/", the output file gets written to "/path/to//path/to/my/directory/".

It doesn't seem to be happening with the other flags (--vcf and --populations are all fine when I specify a path from root).

Thanks for the great tool and excellent documentation!

Single site calculation of dxy

Hi , I want to calculate my dxy in single site with pixy ,but "--window_size" is required , so my command is "pixy --stats dxy pi fst --vcf hardfiltered.vcf.gz --window_size 1 --bypass_invariant_check 'yes' --populations pop.file --n_cores 12" . However , my result of dxy contains so much " NA " ,even some of avg_dxy shows 0.5 or 1 which is too high , so I want to know how to calculate single site calculation of dxy .Thanks a lot !

Missing comma in requirements for setup.py

Hi Kieran, thanks so much for developing this really great and user friendly tool! Sharing a very minor "bug" that there is a comma missing in the requirements section of setup.py between 'scikit-allel' and 'pandas', so this throws an error of Missing dependencies for scikit-allelpandas if you try to install from source.

For a little more background in case it's helpful for someone else, I was attempting to use pixy with a conda environment that has snakemake installed, but was running into issues with environment inconsistency (even in a fresh virtual environment). So I ended up installing from github with pip (via the virtual environment).

Output Fst variance components

Hi Kieran,

Following on from this issue #51, is there a chance you could add the variance components (summed per window) to the fst outputs so users can calculate the 'ratio of averages' over larger regions post-hoc?

Pretty please :)

Tajima's D

Hi Kieran,

Inspired by this previous feature request for Hudson's FST, I'd love to see Tajima's D implemented in Pixy so I can do all my summary stat calculations with the same (excellent) tool. Looks like scikit-allel has this worked out for windowed & single region calculations too: https://scikit-allel.readthedocs.io/en/stable/stats/diversity.html?#allel.tajima_d

Thanks for considering!
Val

Pi calculation error - ValueError: Cannot convert non-finite values (NA or inf) to integer

Running into what appears to be a numpy/pandas issue when attempting to calculate pi, which may be related to the size of the input VCF (see pandas-dev/pandas#35227). May also be a consequence of installing via pip. Any advice much appreciated!

Log info:

[pixy] pixy 1.0.0.beta1
[pixy] See documentation at https://pixy.readthedocs.io/en/latest/

[pixy] Validating VCF and input parameters...
[pixy] Checking write access...OK
[pixy] Checking CPU configuration...OK
[pixy] Checking for invariant sites...OK
[pixy] Checking chromosome data...OK
[pixy] Checking intervals/sites...OK
[pixy] Checking sample data...OK
[pixy] All initial checks past!

[pixy] Preparing for calculation of summary statistics: pi
[pixy] Data set contains 18 population(s), 1 chromosome(s), and 242 sample(s)
[pixy] Window size: 10000 bp

[pixy] Started calculations at 09:22:09 on 2021-03-25
[pixy] Using 16 out of 96 available CPU cores

[pixy] Processing chromosome/contig C9...
[pixy] Calculating statistics for region C9:1-63239560...
Traceback (most recent call last):
  File "/home/sdturner/.conda/envs/bo-demography/bin/pixy", line 8, in <module>
    sys.exit(main())
  File "/home/sdturner/.conda/envs/bo-demography/lib/python3.5/site-packages/pixy/__main__.py", line 323, in main
    outsorted[cols] = outsorted[cols].astype('Int64')
  File "/home/sdturner/.conda/envs/bo-demography/lib/python3.5/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/home/sdturner/.conda/envs/bo-demography/lib/python3.5/site-packages/pandas/core/generic.py", line 5001, in astype
    **kwargs)
  File "/home/sdturner/.conda/envs/bo-demography/lib/python3.5/site-packages/pandas/core/internals.py", line 3714, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/home/sdturner/.conda/envs/bo-demography/lib/python3.5/site-packages/pandas/core/internals.py", line 3581, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/sdturner/.conda/envs/bo-demography/lib/python3.5/site-packages/pandas/core/internals.py", line 575, in astype
    **kwargs)
  File "/home/sdturner/.conda/envs/bo-demography/lib/python3.5/site-packages/pandas/core/internals.py", line 664, in _astype
    values = astype_nansafe(values.ravel(), dtype, copy=True)
  File "/home/sdturner/.conda/envs/bo-demography/lib/python3.5/site-packages/pandas/core/dtypes/cast.py", line 702, in astype_nansafe
    raise ValueError('Cannot convert non-finite values (NA or inf) to '

OS information
linux-64

bed file as input

Hi @ksamuk,
I noticed that there is a bed option in the upcoming release, but it doesnt look like it is in the main code as of yet. Is there a way I can implement that in v0.95.0? Maybe pass the intervals by reading from a bed file?
thanks,
@stsmall

sites_file argument not working

Describe the bug
A clear and concise description of what the bug is.
I'm trying to use this program with the "--sites_file" argument but I keep getting the error: "Exception: [pixy] ERROR: In the absence of a BED file, a --window_size must be specified." I think this has to do with the error checking in the code in the lines 659-662 of pixy/core.py. I have no --bed_file in my command (like the tutorial for sites_file suggests) and when I try setting window_size 1 it gives me "pandas.errors.EmptyDataError: No columns to parse from file". Without a bed_file I think the code goes straight into checking "if args.window_size is None:" when I think it needs to allow for a sites_file argument

A reproducible example of the bug
Please include the following so we can debug the issue:
(1) The full command you used to run pixy, including all arguments
pixy --stats pi
--vcf 2018wgs3.ef.rmIndelRepeatsStar.chr4.vcf.gz
--populations populations.txt
--sites_file chr4_gene_locations.txt

I can email you a google drive link with my vcf/populations/sites file if needed.
OS information
I'm using Mac OS X

ANGSD/peñalba calc.dxy post-processing script

Hi,
Im working with genotype likelihoods, and would be very handy to see your custom script to process ANGSD/ calc.dxy output mentioned in L397 of the preprint, sorry for the slightly unrelated request.

Many thanks (and great package/prepint!)
Gabby

IndexError during Fst calc when chrom has no variants passing filter

Hello,
I am using a genome assembly that contains a few hundred small contigs in addition to large chromosome-scale scaffolds, in combination with very sparse RAD-data. Unfortunately, pixy crashes during the Fst calculation for some "chromosomes", presumably because there were no variants, or only a single one, left after filtering on the "chromosome":

Traceback (most recent call last):
File "XXX/pixy", line 13, in
sys.exit(main())
File "/XXX/pixy/main.py", line 626, in main
maf_array = allele_freqs[:,1] > args.fst_maf_filter
IndexError: index 1 is out of bounds for axis 1 with size 1

I propose to except IndexError in this case and simply return "NA".
You may argue that this "chromosome" should be removd from the input VCF before running pixy, but I am trying to integrate pixy in a pipeline that calculates various stats for all "chromosomes" in a genome assembly includind these small "junk" contigs, irrespective of how many variants they contain or not.

Cheers, Mathias

No invariant sites found in simulated VCF file

Describe the bug
I generated a number of VCF files from simulations done with msprime/tskit, but Pixy doesn't recognize that there variants in the file. As far as I can tell, it isn't an issue with the formatting of the VCF file (tab separation seems ok in BBedit). This same command works fine on "real" VCFs generated by GATK, so it's not an installation problem. I also tried adding ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> to the header of my simulated VCF with the same result.

The error message from the command line
$ pixy --stats pi --vcf 0-simple-africa.vcf.gz --population sim_pop.txt --window_size 1000 --output_prefix test-out

[pixy] pixy 1.0.4.beta1
[pixy] See documentation at https://pixy.readthedocs.io/en/latest/

[pixy] Validating VCF and input parameters...
[pixy] Checking write access...OK
[pixy] Checking CPU configuration...OK
[pixy] Checking for invariant sites...Exception: [pixy] ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'.

OS information
MacOS Big Sur 11.2.3

Sample files

##fileformat=VCFv4.2
##source=tskit 0.3.5
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=50000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2	tsk_3	tsk_4	tsk_5	tsk_6	tsk_7	tsk_8	tsk_9	tsk_10	tsk_11	tsk_12	tsk_13	tsk_14	tsk_15	tsk_16	tsk_17	tsk_18	tsk_19	tsk_20	tsk_21	tsk_22	tsk_23	tsk_24	tsk_25	tsk_26	tsk_27	tsk_28	tsk_29	tsk_30	tsk_31	tsk_32	tsk_33	tsk_34	tsk_35	tsk_36	tsk_37	tsk_38	tsk_39	tsk_40	tsk_41	tsk_42	tsk_43	tsk_44	tsk_45
1	409	.	A	C	.	PASS	.	GT	1	0	0	1	1	0	0	1	1	0	1	0	0	0	0	1	0	1	1	0	0	1	1	1	1	0	1	0	0	1	1	1	0	0	0	0	1	1	1	1	0	1	0	0	0	0
1	510	.	A	C	.	PASS	.	GT	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	599	.	A	G	.	PASS	.	GT	0	1	1	1	0	1	0	1	0	1	0	0	0	0	1	0	0	0	0	1	1	0	0	0	0	1	0	1	1	0	0	0	1	1	0	1	0	0	0	0	0	0	1	1	1	1
1	617	.	C	G	.	PASS	.	GT	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	675	.	T	C	.	PASS	.	GT	0	1	1	1	0	1	0	1	0	1	0	1	0	1	1	0	1	0	0	1	1	0	0	0	0	1	0	1	1	0	0	0	1	1	1	1	0	0	0	0	0	0	1	1	1	1
1	885	.	A	T	.	PASS	.	GT	0	0	0	0	1	0	1	0	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	0	0	0	0

--sites_file argument ouputs only NAs in avg_pi column

I decided to try out the --sites_file argument for some zero-fold and four-fold degenerate sites. I ran pixy v1.2.5b with the following arguments:
pixy --stats pi --vcf allsites.vcf.gz --sites_file zero_fold sites.txt --bed_file primary_genes.bed --populations popfile.txt

My sites file looks like this:
28166317 1056
28231171 1460
28317569 5681
29126773 4129
29253970 2056
29334608 228057
29334608 247168
29334608 276102
29334608 276433
29334608 276540
29334608 278115

(first few lines), is tab delimited and formatted exactly the same as the CHROM and POS columns in the VCF. The run goes to completion, but the output looks like this:

pop chromosome window_pos_1 window_pos_2 avg_pi no_sites count_diffs count_comparisons count_missing
diverse 29378234 211174 236465 NA 0 NA NA NA
diverse 29378234 313635 321061 NA 0 0 0 75040
diverse 29378234 663334 704075 NA 0 0 0 63168
diverse 29378234 870939 928737 NA 0 0 0 75488
diverse 29378234 967457 968068 NA 0 NA NA NA
diverse 29377609 296069 321271 NA 0 0 0 21504
diverse 29377609 1206104 1360768 NA 0 NA NA NA
diverse 29377609 1725734 1864601 NA 0 NA NA NA
diverse 29377609 2508452 2582283 NA 0 0 0 98784
diverse 29377609 3516420 3581643 NA 0 NA NA NA

So, some of these sites are not in a window, and we can see that in the lines where everything is NA. But the ones that show just missing should have some outcome. I know for certain that there are SNPs in those windows, and that there is at least some non-zero pi. For certain the data isn't all missing.

Does pixy have issues running --sites_file with nonstandard chromosome names?

Thanks,
Tal

permutation test

Dear all,

Could permutation test be included in this software? It will be helpful when defining a thresold. Thank you very much.

Best regards,
Zheng zhuqing

vcf file indexing error

Hello,

I have indexed the vcf file with
$tabix -Cp vcf filename command.
But I am getting "Exception: [pixy] ERROR: The vcf is not indexed with tabix." error.

Can you please help.

With best regards

Sandip

Filtering on site quality removes many invariant sites in a VCF from GATK

Deary pixy developers,
Thanks for this great tool! I wanted to just offer a comment on the site-level filtration guide.

I noticed that when I followed the site filtering guide, the vast majority of my invariant sites were dropped from the VCF. This had the effect of drastically inflating the estimates of pi. Here is the filtering command I ran (mostly following the guide):

#!/usr/bin/env bash

# Apply site filters to VCF

MISSING=0.8
QUAL=30
MINDP=5
MAXDP=100

vcftools --gzvcf myvariants.vcf.gz \
--remove-indels \
--remove-filtered-all \
--minQ $QUAL \
--max-missing $MISSING \
--min-meanDP $MINDP \
--max-meanDP $MAXDP \
--recode --stdout | gzip -c > myvariants.filtered.vcf.gz

I included invariant sites in my jointly genotyped data set with the current version of GATK GenotypeGVCFs tool using the --include-non-variant-sites flag.

I believe what is happening is that GATK GenotypeGVCFs only assigns QUAL scores to invariant sites if there are no missing genotypes. Otherwise, the variant site receives no QUAL score. For example, here is an invariant site that does have a QUAL score:

HiC_scaffold_1 937017 . G . 181.42 PASS DP=75;MLEAC=.;MLEAF=. GT:DP:RGQ 0/0:15:45 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:6:18 0/0:0:0 0/0:0:0 0/0:0:0 0/0:4:12 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:2:6 0/0:0:0 0/0:2:6 0/0:3:9 0/0:1:3 0/0:0:0 0/0:0:0 0/0:3:9 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:1:3 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0

And here is one that doesn't:

HiC_scaffold_1 567032 . T . . PASS DP=100 GT:DP:RGQ 0/0:18:54 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:11:33 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:00/0:23:69 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:6:18 ./.:0:0 0/0:1:3 0/0:2:6 0/0:6:18 0/0:7:21 0/0:5:15 0/0:5:15 ./.:0:0 0/0:1:3 0/0:3:9 0/0:1:3 0/0:2:6 0/0:4:12 0/0:4:12 ./.:0:0 ./.:0:0 0/0:1:3

Thus, when I applied the site filters to my VCF file, it dropped any invariant sites that had any missing genotypes, and I was left with a VCF file that had comparatively fewer invariant compared to variant sites. And this threw off my pi calculations (among other things).

I think the simplest solution is to just filter the variant and invariant sites separately, similarly to what is suggested in the guide for the optional population genetic filters.

But since this is not currently what the guide is recommending, I just wanted to let you know about this.

Feel free to let me know if you have any questions, or if I am missing something!
Thanks!
Dave

No such file

Hi, I've tried to run a quick analysis on my vcf, but I get a following error:
Traceback (most recent call last): File "/home/qnick/anaconda3/bin/pixy", line 11, in <module> sys.exit(main()) File "/home/qnick/anaconda3/lib/python3.8/site-packages/pixy/__main__.py", line 220, in main os.mkdir(re.sub(r"[^\/]+$", "", args.outfile_prefix)) FileNotFoundError: [Errno 2] No such file or directory: ''

My command is as follows:
pixy --vcf subSample_SNP_VarFiltr.annotated.locFlt.recode.vcf --populations SubSampl.SG.txt --outfile_prefix test --stats pi --zarr_path zarr --bypass_filtration yes --bypass_invariant_chec yes
I just wanted to test whether it works. I still need to create another VCF with invariant sites for this run.
I work on Ubuntu 20.04.1 LTS, 64-bit on Vbox, and python 3.8

window and step options

Hi,

I was wondering if pixy has options to use sliding windows.

Thanks.

CHROMOSOMES

Hi , --chromosomes this option is the key opton ?

VCF file is missing mandatory header line ("#CHROM...") error, inconsistent error timing

Hi @ksamuk

I'm trying to use pixy to calculate pi on a relatively small dataset (8 diploid individuals). Making an initial test run I used the following command:

pixy --stats pi --vcf for_theta.vcf.gz --zarr_path /theta_for_pw/zarr/ --window_size 10000 --populations pop.txt --bypass_filtration yes --outfile_prefix output/pixy_out

The used vcf was was generated using the instructions provided in the online tutorial for bcftools. The error I see is the following:

Traceback (most recent call last):
  File "/home/peter/miniconda3/envs/vcf-manip/bin/pixy", line 11, in <module>
    sys.exit(main())
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/pixy/__main__.py", line 293, in main
    allel.vcf_to_zarr(args.vcf, zarr_path, region=targ_region, fields='*', overwrite=True)
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/allel/io/vcf_read.py", line 918, in vcf_to_zarr
    fields, samples, headers, it = iter_vcf_chunks(
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/allel/io/vcf_read.py", line 1138, in iter_vcf_chunks
    fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/allel/io/vcf_read.py", line 1636, in _iter_vcf_stream
    headers = _read_vcf_headers(stream)
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/allel/io/vcf_read.py", line 1763, in _read_vcf_headers
    raise RuntimeError('VCF file is missing mandatory header line ("#CHROM...")')
RuntimeError: VCF file is missing mandatory header line ("#CHROM...")

The curious issue is that if I re-run this command the error message can arise after processing a different number of contigs. So I'm not entirely sure what might be going wrong. Please let me know if any additional information would be helpful to troubleshoot this error.

error says my vcf has no invariant

Hi,

I got an error multiple times when I tried to run dxy. the error says I do not have any invariants.

command:
pixy --stats dxy --vcf birds.vcf.gz --populations test.txt --window_size 5000 --n_cores 24 --output_prefix pixy_dxy_birds
I got this error multiple times when I ran the above command.

error:
[pixy] Checking for invariant sites...Exception: [pixy] ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'.

I am sure that I have invariants in my vcf file. but I followed and added the bypass option so the command looks like this now:
pixy --stats dxy --vcf birds.vcf.gz --populations test.txt --window_size 5000 --n_cores 24 --output_prefix pixy_dxy_birds --bypass_invariant_check 'yes'

now everything works but I was wondering if the analysis would be correct because I got this comment from pixy:
pixy] EXTREME WARNING: --bypass_invariant_check is set to 'yes'. Note that a lack of invariant sites will result in incorrect estimates.

Thanks.

Persite Fst - 'float' object is not iterable

Describe the bug

Hi there,

I receive an error when running with a window size of 1 using the same input that runs when I run the same command, but with a larger window size. i.e. the persite estimate won't calculate, but I am able to calculate windowed estimates. I receive the same error with another dataset (different VCF), which made me wonder if this could be a true bug.

Is it possible that sites without data could be giving a problem?

The pixy command and error message

pixy --stats fst \
  --vcf $VCF \
  --chromosomes $INTERVAL \
  --window_size 1 \
  --populations $POPS_FILE \
  --bypass_invariant_check yes \
  --output_folder output_pixy1.0_PERSITE \
  --output_prefix pixy_${INTERVAL}_persite_nomaf

stdout:

[pixy] pixy 1.0.0.beta1
[pixy] See documentation at https://pixy.readthedocs.io/en/latest/

[pixy] Validating VCF and input parameters...
[pixy] Checking write access...OK
[pixy] Checking CPU configuration...OK
[pixy] Checking for invariant sites...OK
[pixy] Checking chromosome data...OK
[pixy] Checking intervals/sites...OK
[pixy] Checking sample data...OK
[pixy] All initial checks past!

[pixy] Preparing for calculation of summary statistics: fst
[pixy] Data set contains 2 population(s), 1 chromosome(s), and 27 sample(s)
[pixy] Window size: 1 bp

[pixy] Started calculations at 19:04:30 on 2021-05-28
[pixy] Using 1 out of 20 available CPU cores

[pixy] Processing chromosome/contig independent_chr...
[pixy] Calculating statistics for region independent_chr:1-19330666...

error:

TypeError: 'float' object is not iterable```

OS information
LINUX (HPC)

Sample files

My VCF and population file are standard and include invariant sites and biallelic SNPs. I am happy to send these along though if it is helpful. I suspect formatting is not the problem, given that the dataset runs in windowed mode.

Thank you for any ideas! Other than this, the code is a breeze and works great thank you!

Erik

Runtimes

Hey, appreciate the software. I've been hitting some snags with runtimes and I'm wondering if you have any parallelization or scatter/gather recommendations.

I have GATK 4.x gvcf with 30 individuals, 1.1 Gb chromosome-level genome (28 scaffolds, about 37 GB filesize).

I first tried pixy v0.95.0 with default variant filtration on a single chromosome (around 70MB), and it ran for almost 48 hours on 3 CPUs and failed due to memory. I then tried it on 20 CPUs (120GB RAM) and it failed again due to memory.

Then I tried filtering the gvcf beforehand, and then dividing the genome into 25 MB bed intervals. But, the zarr database is created for each chromosome individually, so if I run 2 jobs for a single chromosome I get errors about a pre-existing directory for that chromosome.

My solution now is to create the zarr database in the scratch directory so that it can be parallelized on these bed intervals, but obviously this is not very efficient because the zarr database takes many hours to create...

For clarity, I provide a bed file like this:

chr1    50000000        75000000

And run pixy like this, specifying the prefix of the bed file:

lists=/crex/proj/uppstore2017162/b2016151_nobackup/justin/crow_reseq/interval_files


file=$1
chr=$(awk '{print $1}' ${lists}/${file}.bed)
start=$(awk '{print $2}' ${lists}/${file}.bed)
end=$(awk '{print $3}' ${lists}/${file}.bed)

mkdir $SNIC_TMP/data
mkdir $SNIC_TMP/data/${file}

pixy --stats pi fst dxy \
--vcf crow.popseq.SNPs.filtered.g.vcf.gz \
--zarr_path $SNIC_TMP/data/zarr/${file} \
--chromosomes ${chr} \
--window_size 250 \
--bypass_filtration yes \
--populations SPECIES_CROW.txt \
--outfile_prefix output/${file}_out

In general this works okay, but it will take around 38 hours on 2 CPUs to finish a 25 MB region. Therefore this will take several thousand CPU hours to calculate these metrics...which can be done with alternative software in a few dozen CPU hours.

Do you have any recommendations on how to first efficiently create this zarr database? My data is not very large and the genome is comparatively small, so I'm surprised.

Thanks!

Error in NUMEXPR dependency

Hello,
many thanks for going through a lot of work to spare a lot of people the trouble to force conventional tools into making meaningful statistics in the presence of missing data - I certainly have burned a lot of time doing that with cumbersome approaches, like fake VCFs that only contain read deth information, or bedtools etc to cheat PopGenome into not assuming reference alleles for regions without read coverage etc etc....

Unfortunately, I get this error (have pixy through conda) whatever I do, even with only "pixy --help", get:

Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)

Installing pixy using "pip install pixy" doesn't yield "pixy" executable

Describe the bug
Installing pixy using pip install pixy didn't yield a pixy executable in my PATH

The pixy command and error message
I tried installing pixy as follows:

$ sudo -H pip3 install pixy

It installed fine, which I can confirm because I can import the pixy Python package just fine:

>>> import pixy # this runs fine without any errors

However, when I try to run pixy from bash, there's no such executable in my PATH:

$ pixy

Command 'pixy' not found, did you mean:

  command 'pixz' from deb pixz

Try: sudo apt install <deb name>

It seems like the pixy version that is installed using pip is version 1.0.2, so perhaps that's why? I tried uninstalling pixy:

$ sudo -H pip3 uninstall pixy

And then installing using pip but from the commit of the most recent GitHub release:

$ sudo -H pip3 install git+https://github.com/ksamuk/pixy.git@478d8f230ed8060b96e4fcf77d80102707781a02

That didn't work either:

ERROR: Could not find a version that satisfies the requirement python (from pixy) (from versions: none)
ERROR: No matching distribution found for python

OS information

Ubuntu 18.04 in Windows Subsystem for Linux (WSL)
Python 3.6.9

Sample files
N/A

Multi-Chromosome VCF Support

This wasn't originally envisioned in our workflow, but we should probably add the ability to automatically iterate over all chromosomes in a multi-chromosome VCF.

Discrepency in pi estimates between vcftools and pixy

Hello,

I calculated pi diversity using both vcftools and pixy. The estimates from vcftools range from 0-0.0020 whereas the ones from pixy range from 0-0.4. Here is the pipleline I used to generate these.

Generated multi sample vcf file using haplotype caller, genomicsdbimport and genotypevcfs (I did not include the -all-sites option while running genotypegvcfs.

VCFfile pre-filteration:

vcftools --vcf input.vcf --max-missing 0.1 --minQ 30 --maf 0.1 --remove lowdepthindividuals --recode --recode-INFO-all --out output_filtered.vcf
bcftools +prune -l 0.2 -w 50kb output_filtered.vcf -Ov -o output_filtered_ldpruned.vcf

Pi calculations using VCFtools:
vcftools --vcf output_filtered_ldpruned.vcf --window-pi 10000 --out pi

Pi calculations using pixy:

pixy --stats pi --vcf output_filtered_ldpruned.vcf --zarr_path ./zarr \ --window_size 10000 --populations allpop.list --bypass_filtration yes \ --bypass-invariant-sites yes --outfile_prefix results/combined

Here are the resulting files:
From pixy:

From vcftools:

What could be causing such a drastic difference in pi estimates? Should I rerun pixy by including all sites again?

Thank you!

Hudson FST

Dear all,

Could Hudson FST be included in this package? I know this calculation can be achieved by using https://scikit-allel.readthedocs.io/en/stable/stats/fst.html. Thank you for your excellent tool.

Sincerely,
Zheng zhuqing

average fst across windows

Hello,

I've been using pixy v1.2.3.beta1 to calculate Fst (and will do dxy and pi as well later) both for individual sites and across 10kb windows. I was thinking it might be easier to calculate average windows from the per-site results, rather than recalculating everything again. However, I can't seem to get the averages from the per-site results to correspond to those in the windows: please see example below. The no_snps column corresponds to the number of snps in the per-site file, but the averages don't seem to correspond (assuming NA values aren't included; but setting these to 0 and taking the average doesn't get to the average in the 10kb window file, or setting any negative values to 0 or anything else I can think of). Is there something in calculating the across window average I'm missing here?

Thanks a lot!

Example:
i) per site estimate
command: pixy --bypass_invariant_check yes --vcf $vcf --window_size 1 --populations $popfile --n_cores 1 --stats fst --output $outdir --output_prefix pixy.1.$vcf

pop1 pop2 chromosome window_pos_1 window_pos_2 avg_wc_fst no_snps
GR SA 36 15970 15970 NA 1
GR SA 36 15974 15974 -0.0056030669419049 1
GR SA 36 15980 15980 NA 1
GR SA 36 16081 16081 0.0061909416748127 1
GR SA 36 16085 16085 0.0061909416748127 1
GR SA 36 17059 17059 8.673617379884031e-18 1
GR SA 36 17079 17079 8.673617379884031e-18 1
GR SA 36 17405 17405 NA 1
GR SA 36 17407 17407 NA 1
GR SA 36 17435 17435 8.673617379884031e-18 1
GR SA 36 17443 17443 NA 1
GR SA 36 45327 45327 0.2492405950619206 1
GR SA 36 46204 46204 0.1249999999999999 1
GR SA 36 46215 46215 NA 1
GR SA 36 46216 46216 NA 1
GR SA 36 46217 46217 0.3713905136202677 1
GR SA 36 46236 46236 0.2486373546511627 1
GR SA 36 46419 46419 0.2861453942334653 1
GR SA 36 46507 46507 0.5208333333333334 1
GR SA 36 46577 46577 0.3095238095238095 1
GR SA 36 48127 48127 NA 1
GR SA 36 48243 48243 NA 1
GR SA 36 48335 48335 -3.903127820947815e-18 1
GR SA 36 48345 48345 NA 1
GR SA 36 48638 48638 0.2913163685606273 1
GR SA 36 48666 48666 NA 1
GR SA 36 48707 48707 0.287586243782425 1
GR SA 36 48713 48713 NA 1
GR SA 36 48765 48765 NA 1
GR SA 36 48798 48798 -0.0056030669419049 1
GR SA 36 48849 48849 0.2397322003347496 1
GR SA 36 48888 48888 0.0148356443323959 1
GR SA 36 48949 48949 8.673617379884031e-18 1
GR SA 36 49019 49019 NA 1
GR SA 36 49044 49044 0.1628959276018099 1
GR SA 36 49047 49047 NA 1
GR SA 36 49148 49148 NA 1
GR SA 36 49415 49415 0.0917837213777178 1
GR SA 36 49427 49427 0.4370689311485866 1
GR SA 36 49443 49443 0.442158328592032 1
GR SA 36 49473 49473 0.442158328592032 1
GR SA 36 49572 49572 0.6096829599756782 1
GR SA 36 49604 49604 0.4509639564124057 1
GR SA 36 49617 49617 0.0148356443323959 1
GR SA 36 49650 49650 0.3750000000000001 1
GR SA 36 49661 49661 0.3725149673232485 1
GR SA 36 49674 49674 NA 1
GR SA 36 49677 49677 0.3750000000000001 1
GR SA 36 49682 49682 0.3750000000000001 1
GR SA 36 49694 49694 NA 1
GR SA 36 49796 49796 0.3412602942537244 1
GR SA 36 49842 49842 NA 1
GR SA 36 49843 49843 NA 1
GR SA 36 49866 49866 0.0061909416748127 1

ii) 10kb window estimate
command: pixy --bypass_invariant_check yes --vcf $vcf --window_size 10000 --populations $popfile --n_cores 3 --stats fst --output $outdir --output_prefix pixy.10kb.$vcf

pop1 pop2 chromosome window_pos_1 window_pos_2 avg_wc_fst no_snps
GR SA 36 10001 20000 0.0011668872027648 11
GR SA 36 40001 50000 0.362339934891549 43

Finishing time same as starting time

Thanks for the great tool! A minor bug I've noticed: the time given for "All calculations complete" is the same as the starting time. Here's an example from one log file:

[pixy] pixy version 0.95.0
[pixy] Validating VCF and input parameters (this may take some time)...
[pixy] Preparing for calculation of summary statistics: pi,fst,dxy
[pixy] Data set contains 4 population(s), 1 chromosome(s), and 4 sample(s)
[pixy] Started calculations at 23:07:45
[pixy] Building zarr array for chromosome 18...
[pixy] Calculating statistics for chromosome 18...
[pixy] Pi calculations for chromosome 18 complete and written to results/pixy/ssh_wgs_pixy_res_chr18_100kb_pi.txt
[pixy] Dxy calculations chromosome 18 complete and written to results/pixy/ssh_wgs_pixy_res_chr18_100kb_dxy.txt
[pixy] Fst calculations chromosome 18 complete and written to results/pixy/ssh_wgs_pixy_res_chr18_100kb_fst.txt

[pixy] All calculations complete at 23:07:45
[pixy] Time elapsed: 07:47:17

I'm using pixy version 0.95.0, Python 3.8.1, and could give the full conda environment if it is helpful.

Fst estimation error

Hello,

I am running pixy in a small dataset composed by 18 individuals, 1273 scaffolds and around 200,000 SNPs. I am trying to estimate in the same run pi, dxy and fst and also dividing the 18 individuals in different groups. While pi and dxy worked well, Fst estiamtion stops at scaffold number 550, so it is not estimated genome wide.

The error is reported as:
KeyError: ('fst', in 'nDi.2.2.scaf00550')

The code I used is:
pixy --stats pi fst dxy
--vcf D_immitis.cohort.ALL.10K.vcf.gz
--n_cores 44
--output_prefix aus_usa
--populations aus_usa.txt
--window_size 10000

I do not think it is an error in my popfiles nor my dataset, because pi and dxy are well estimated. Do you know any solution for this? Do you want me to send any files?

Thanks in advance.

TypeError: zip argument #3

Hello,
I am quite new to running this kind of program. So, I do not know whether my problem is a bug issue or whether it is because I am doing something wrong. I hope you can help me.

I am running pixy for different ORFs using the following command:

pixy --stats pi fst dxy --n_cores 24 --vcf Capture_ORF_45_filtered.vcf.gz --populations sampleIDs_popfile_45.txt --bed_file genomic_windows.bed

The program starts running, it can do some calculations and produce a .tmp file but at some point, it crashes and I got the following error:

Traceback (most recent call last):
TypeError: zip argument #3 must support iteration
"""

The above exception was the direct cause of the following exception:

TypeError: zip argument #3 must support iteration

OS information
Linux

Do you know what could be the reason?

Thank you very much

ERROR: the following samples are listed in the population file but not in the VCF

Hi Kieran!

I am trying to use pixy with the following command:

pixy --stats pi dxy fst --vcf diverse_pop.scaffolds_withgenes_final_annotation.recode.vcf.gz --zarr_path zarr --populations parents_popmap_pixy.txt --bypass_filtration yes --outfile_prefix output/pixy_out

However, I keep getting the following error:

`Traceback (most recent call last):
File "/mnt/c/Users/tal_shalev/software/anaconda3/lib/python3.8/site-packages/pixy/main.py", line 200, in main
samples_callset_index = [samples_list.index(s) for s in poppanel['ID']]
File "/mnt/c/Users/tal_shalev/software/anaconda3/lib/python3.8/site-packages/pixy/main.py", line 200, in
samples_callset_index = [samples_list.index(s) for s in poppanel['ID']]
ValueError: 367 is not in list

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/mnt/c/Users/tal_shalev/software/anaconda3/bin/pixy", line 11, in
sys.exit(main())
File "/mnt/c/Users/tal_shalev/software/anaconda3/lib/python3.8/site-packages/pixy/main.py", line 202, in main
raise Exception('[pixy] ERROR: the following samples are listed in the population file but not in the VCF: ', missing) from e
Exception: ('[pixy] ERROR: the following samples are listed in the population file but not in the VCF: ', [1, 2, 5, 10, 527, 528, 15, 19, 531, 20, 40, 44, 47, 49, 566, 1081, 58, 85, 87, 88, 94, 95, 97, 113, 116, 120, 121, 126, 128, 133, 648, 137, 138, 144, 148, 149, 150, 662, 156, 158, 164, 165, 171, 173, 175, 176, 181, 191, 202, 203, 204, 214, 220, 221, 230, 231, 232, 235, 241, 243, 253, 261, 263, 264, 266, 779, 784, 281, 285, 301, 306, 322, 325, 329, 331, 334, 336, 340, 343, 344, 347, 351, 366, 367, 368, 378, 381, 386, 392, 394, 402, 404, 411, 415, 421, 426, 430, 435, 438, 444, 446, 454, 461, 465, 466, 490, 493, 497, 501, 503, 505, 508])`

I'm a little confused as all these samples are present in the VCF. I don't have any problem identifying these sample names in the VCF using any other approach, e.g. VCFtools or BCFtools. Have you encountered this before?

Thanks,
Tal

The --reuse_zarr flag and memory use

Hi there,

Thanks for making this nice software, it is both convenient and accurate!

I have a large WGS dataset for which I am calculating diversity stats within and among many populations. E.g., I have 10 populations and I will run all pairwise combinations (1Gbp genome, 20x coverage per sample).

What I did was run a first pass calculation for a single population pair, which generated the zarr directories for the full dataset. The diversity statistics were accurate based on my own calculations. I subsequently ran other contrasts with the --reuse-zarr flag. I was surprised to find that for a single chromosome the run took nearly 8 hours and exceeded 120Gb memory.

This may just be a consequence of my dataset, but I wondered if this is the expected behavior? My limited experience with zarr databases is that they should be fairly efficient to open and run calculations. I was wondering if e.g. the VCF validation step (line 2 below) is part of the memory consumption?

Or otherwise, if you have any other recommendations for a workflow here, I would appreciate it! It would be useful to know if this is expected (I can add more memory) or if I am missing a key element here.

[pixy] pixy version 0.95.0
[pixy] Validating VCF and input parameters (this may take some time)...
[pixy] WARNING: --bypass_filtration is set to 'yes', genotype filtration will be not be performed.
[pixy] EXTREME WARNING: --bypass_invariant_check is set to 'yes', which assumes that your VCF contains invariant sites. Lack of invariant sites will result in incorrect estimates.
[pixy] Preparing for calculation of summary statistics: pi,fst,dxy
[pixy] Data set contains 2 population(s), 1 chromosome(s), and 64 sample(s)
[pixy] Started calculations at 00:36:04
[pixy] If a zarr array exists, it will be reused for chromosome chr1...
[pixy] Calculating statistics for chromosome chr1...
[pixy] Pi calculations for chromosome chr1 complete and written to /crex/proj/snic2020-2-19/private/wagtail/users/erikenbody/07_pixy/species/west_flava_citrine/pixy_west_flava_citrine_1_pi.txt
[pixy] Dxy calculations chromosome chr1 complete and written to /crex/proj/snic2020-2-19/private/wagtail/users/erikenbody/07_pixy/species/west_flava_citrine/pixy_west_flava_citrine_1_dxy.txt
[pixy] Fst calculations chromosome chr1 complete and written to /crex/proj/snic2020-2-19/private/wagtail/users/erikenbody/07_pixy/species/west_flava_citrine/pixy_west_flava_citrine_1_fst.txt

[pixy] All calculations complete at 00:36:04
[pixy] Time elapsed: 07:48:10

My commands with some bash variables included

  pixy --stats pi fst dxy \
  --vcf $VCF \
  --zarr_path $WORK_D/zarr_east_west_flava_pixy.txt_${SLURM_ARRAY_TASK_ID} \
  --reuse_zarr	yes \
  --window_size 5000 \
  --populations $POPS_FILE \
  --bypass_filtration yes \
  --bypass_invariant_check yes \
  --fst_maf_filter 0.05 \
  --outfile_prefix $WORK_D/${OUTNAME}/pixy_${OUTNAME}_${SLURM_ARRAY_TASK_ID}

max () arg is an empty sequence

Hello,

I am running pixy with the following commad:
pixy --stats pi fst dxy --vcf allsamples_ldpruned.vcf --zarr_path ./zarr --window_size 10000 --populations allpop.list --bypass_filtration yes --bypass_invariant_check yes --outfile_prefix results/

This is the error I am running into after the pipeline as run midway:

Traceback (most recent call last):
File "/home/nitinr/.conda/envs/pixy/bin/pixy", line 11, in
sys.exit(main())
File "/home/nitinr/.conda/envs/pixy/lib/python3.8/site-packages/pixy/main.py", line 482, in main
interval_end = max(pos_array)
ValueError: max() arg is an empty sequence

LD pruning before calculating Fst/dxy?

Hello,

I am planning to calculate Fst using pixy. I wanted to ask if you would recommend LD pruning before calculating Fst.
I understand from the paper and the docs that it isn't best to LD prune for calculating pi diversity.

Thank you!
Nitin

Error message for missing FORMAT fields

A warning/error should be produced if the user includes non-existent FORMAT fields in their filtration expression(s).

Make Pixy into a library for importing into Python scripts

I've been using Pixy from the command line to calculate summary statistics on my VCF files and it works great. However some of my data are from simulations generated in a Python script, so it would be fantastic if I could call Pixy functions from within the simulation script itself instead of having to print to VCF first and then read in those output files with yet another script to do analysis on the results.

I'm working on implementing this myself but it's been slow going :) I'm planning to have my simulated genotype data in array format as described by the tskit documentation here: https://tskit.dev/tutorials/getting_started.html#scikit-allel.

I know tskit has its own built-in summary statistics functions but I want to calculate my statistics with the same tools for both my simulated and observed datasets, and I haven't had success with trying to import my observed data into tskit format via tsinfer.

Thanks for considering this request!

Installation problem

Hi，
I managed to install pixy in Linux as your instruction, it seemed work:# All requested packages already installed., but when I input pixy --help, it said bash: pixy: command not found....
Any suggestions?
Thanks in advaced!

pi output gives NA values when there is viable data

I'm not 100% sure this is a bug, but I noticed that I'm getting NA values for certain sites that contain data. Not sure why this is happening?

For example, a NA occurs at site 1539:
main jcf7190000000318 1530 1530 0.0 1 0 10 15 main jcf7190000000318 1531 1531 0.0 1 0 10 15 main jcf7190000000318 1532 1532 0.0 1 0 10 15 main jcf7190000000318 1533 1533 0.0 1 0 10 15 main jcf7190000000318 1534 1534 0.0 1 0 10 15 main jcf7190000000318 1535 1535 0.0 1 0 10 15 main jcf7190000000318 1536 1536 0.0 1 0 10 15 main jcf7190000000318 1537 1537 0.0 1 0 10 15 main jcf7190000000318 1538 1538 0.0 1 0 10 15 main jcf7190000000318 1539 1539 NA 0 NA NA NA main jcf7190000000318 1540 1540 0.6 1 6 10 15 main jcf7190000000318 1541 1541 0.0 1 0 10 15 main jcf7190000000318 1542 1542 0.6 1 6 10 15 main jcf7190000000318 1543 1543 0.0 1 0 10 15 main jcf7190000000318 1544 1544 0.0 1 0 45 10 main jcf7190000000318 1545 1545 0.0 1 0 45 10 main jcf7190000000318 1546 1546 0.0 1 0 45 10

My command:
pixy --stats pi --vcf /path/allsites.g.vcf.gz --populations /path/pixy_population --window_size 1 --output_folder /path/pi_div --output_prefix pixy_output_test

VCF subset:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DDP10PL:illumina DDP1PL:illumina DDP2PL:illumina DDP3PL:illumina DDP4PL:illumina DDP5PL:illumina DDP6PL:illumina DDP7PL:illumina DDP8PL:illumina DDP9PL:illumina jcf7190000000318 1530 . A . . . AN=5;DP=97 GT:AD:DP:RGQ 0:7:7:99 .:14:14:0 .:10:10:0 0:12:12:99 .:12:12:0 .:9:9:0 0:6:6:99 0:9:9:99 .:6:6:0 0:12:12:99 jcf7190000000318 1531 . A . . . AN=5;DP=97 GT:AD:DP:RGQ 0:7:7:99 .:14:14:0 .:10:10:0 0:12:12:99 .:12:12:0 .:9:9:0 0:6:6:99 0:9:9:99 .:6:6:0 0:12:12:99 jcf7190000000318 1532 . A . . . AN=5;DP=99 GT:AD:DP:RGQ 0:7:7:99 .:14:14:0 .:10:10:0 0:14:14:99 .:12:12:0 .:9:9:0 0:6:6:99 0:9:9:99 .:6:6:0 0:12:12:99 jcf7190000000318 1533 . A . . . AN=5;DP=99 GT:AD:DP:RGQ 0:7:7:99 .:14:14:0 .:10:10:0 0:14:14:99 .:11:12:0 .:9:9:0 0:6:6:99 0:9:9:99 .:6:6:0 0:12:12:99 jcf7190000000318 1534 . T . . . AN=5;DP=101 GT:AD:DP:RGQ 0:7:7:99 .:13:14:0 .:10:10:0 0:14:14:99 .:11:12:0 .:9:9:0 0:6:6:99 0:10:11:99 .:6:6:0 0:11:12:99 jcf7190000000318 1535 . T . . . AN=5;DP=101 GT:AD:DP:RGQ 0:7:7:99 .:14:14:0 .:10:10:0 0:14:14:99 .:12:12:0 .:9:9:0 0:6:6:99 0:10:11:99 .:6:6:0 0:12:12:99 jcf7190000000318 1536 . G . . . AN=5;DP=101 GT:AD:DP:RGQ 0:7:7:99 .:14:14:0 .:10:10:0 0:14:14:99 .:12:12:0 .:9:9:0 0:6:6:99 0:11:11:99 .:6:6:0 0:12:12:99 jcf7190000000318 1537 . A . . . AN=5;DP=100 GT:AD:DP:RGQ 0:7:7:99 .:14:14:0 .:9:9:0 0:14:14:99 .:12:12:0 .:9:9:0 0:6:6:99 0:11:11:99 .:6:6:0 0:12:12:99 jcf7190000000318 1538 . A . . . AN=5;DP=100 GT:AD:DP:RGQ 0:7:7:99 .:14:14:0 .:9:9:0 0:12:14:99 .:12:12:0 .:9:9:0 0:6:6:99 0:11:11:99 .:6:6:0 0:12:12:99 jcf7190000000318 1539 . GAAT G 1528.68 . AC=4;AF=0.444;AN=9;DP=92;FS=0;MLEAC=4;MLEAF=0.444;MQ=34.84;QD=32.77;SOR=0.869 GT:AD:DP:GQ:PL 0:5,2:7:99:0,125 1:0,11:11:99:495,0 1:0,8:8:99:360,0 0:14,0:14:99:0,573 1:0,9:9:99:405,0 1:0,7:7:99:315,0 0:6,0:6:99:0,270 0:10,1:11:99:0,432 .:0,0:.:.:. 0:13,0:13:99:0,553 jcf7190000000318 1540 . A G 1043.8 . AC=2;AF=0.4;AN=5;DP=101;FS=0;MLEAC=2;MLEAF=0.4;MQ=28.95;QD=30.77;SOR=0.859 GT:AD:DP:GQ:PL 0:5,2:7:99:0,154 .:0,0:.:.:. .:0,0:.:.:. 0:14,0:14:99:0,578 .:0,0:.:.:. .:0,0:.:.:. 0:6,0:6:99:0,266 1:0,10:10:99:450,0 .:0,0:.:.:. 1:0,14:14:99:619,0 jcf7190000000318 1541 . A . . . AN=5;DP=102 GT:AD:DP:RGQ 0:5:7:99 .:0:14:0 .:0:9:0 0:14:14:99 .:0:11:0 .:0:9:0 0:6:6:99 0:11:11:99 .:0:6:0 0:15:15:99 jcf7190000000318 1542 . T G 1043.8 . AC=2;AF=0.4;AN=5;DP=100;FS=0;MLEAC=2;MLEAF=0.4;MQ=29.09;QD=28.07;SOR=0.859 GT:AD:DP:GQ:PL 0:5,2:7:99:0,159 .:0,0:.:.:. .:0,0:.:.:. 0:14,0:14:99:0,547 .:0,0:.:.:. .:0,0:.:.:. 0:6,0:6:99:0,234 1:0,10:10:99:450,0 .:0,0:.:.:. 1:0,14:14:99:619,0 jcf7190000000318 1543 . G . . . AN=5;DP=102 GT:AD:DP:RGQ 0:5:7:99 .:0:14:0 .:0:9:0 0:14:14:99 .:0:11:0 .:0:9:0 0:6:6:99 0:11:11:99 .:0:6:0 0:15:15:99 jcf7190000000318 1544 . A . . . AN=10;DP=102 GT:AD:DP:RGQ 0:7:7:99 0:14:14:99 0:9:9:99 0:14:14:99 0:11:11:99 0:9:9:99 0:6:6:99 0:11:11:99 0:5:6:99 0:15:15:99 jcf7190000000318 1545 . C . . . AN=10;DP=100 GT:AD:DP:RGQ 0:7:7:99 0:14:14:99 0:9:9:99 0:14:14:99 0:11:11:99 0:9:9:99 0:6:6:99 0:11:11:99 0:6:6:99 0:13:13:99 jcf7190000000318 1546 . T . . . AN=10;DP=102 GT:AD:DP:RGQ 0:7:7:99 0:14:14:99 0:9:9:99 0:14:14:99 0:11:11:99 0:9:9:99 0:6:6:99 0:10:11:99 0:6:6:99 0:15:15:99

Populations file:
DDP10PL:illumina main DDP1PL:illumina main DDP2PL:illumina main DDP3PL:illumina main DDP4PL:illumina main DDP5PL:illumina main DDP6PL:illumina main DDP7PL:illumina main DDP8PL:illumina main DDP9PL:illumina main

Using: Linux

Haploid VCF

Hi there,

Does pixy also work on the haploid VCF files? Do I need to convert the VCF or manipulate the output to get the correct results?

Best regards,

Wei

bed file regions (with chromosomes just as numbers)

Hello,

Thanks a lot for developing pixy, it's very neat! I just had a minor issue (on v1.1.0.beta1, which seemed the same in 1.0.0) when chromosomes in a vcf are named just numbers (0,1,2 etc.), and a bed file is specified to give windows - the bed file chromosomes can't be found in the VCF (WARNING: the following chromosomes in the BED file do not occur in the VCF [1,2,3 etc. for all of them]). It seems to read them in from the bed file as numbers, whereas they are strings from the vcf (this is at lines ~621-624 in core.py, presumably something to do with pandas_read.csv for the bed file at ~608). Works fine if I rename the chromosomes to something which is obviously a string, but just thought I should bring it up in case other people have the same issue. Thanks again!

Best wishes,

Daniel Wood

import pixy

I am connecting a pixy camera to one of the raspberry pi usb ports. I configured it on the pi, as well as pixymon. but I tried to import pixy library to get the xy coordinates and use these coordination to determine a location of the aim. So i want to know why the library does not work and shows error every time that i call it.

thank you

add option to always have intervals spanning entire chromosomes

Hello,
I am trying to have just a single pixy run over all chromosomes, and I would like the intervals (windows) to be automatically starting from position 0 (or is it 1-based?) to the end of each chromosome, independent of the VCF. The currently available options do ot seem to allow this, am I correct? I could only force this behaviour for one chromosome at a time, using a combination of these three arguments:

--chromosomes [CHROMOSOMES]
A single-quoted, comma separated list of chromosome(s) (e.g. 'X,1,2')
--interval_start [INTERVAL_START]
The start of the interval over which to calculate pi/dxy. Only valid when calculating over a single chromosome.
--interval_end [INTERVAL_END]
The end of the interval over which to calculate pi/dxy. Only valid when calculating over a single chromosome.

Alternatively, one might introduce a Pixy option to import a .BED file with regions (intervals, windows) to calculate the stats over; this could then be used very flexibly for any desired intervals.

with best regards,
Mathias

KeyError: nan

Dear Kieran Samuk

Thanks for making pixy, it is a very valuable tool.

I am having an error with running pixy that I can't diagnose. I would very much appreciate some guidance. At the moment pixy starts to calculate stats, then fails with: KeyError: nan. The only outputs file (pixy_tmpfile_8ba3a11fae0041409dabd3d62ea4f743.tmp) is blank. All the details are below.

best wishes,
Daniel Jeffares

I am using an all sites bgzipped, tabix-indexed vcf created with freebayes:
freebayes -f hg38_pvivP01.fa bams/*bam
--min-mapping-quality 30
--min-coverage 10
--strict-vcf
--gvcf
--report-monomorphic
--vcf PvP01_01_v1.vcf &

I am running pixy on a mac. My command is:
pixy --stats pi fst dxy
--vcf PvP01_01_v1.vcf.gz
--populations popfile2.txt
--window_size 10000
--output_prefix pixytest

The error code I get is:
[pixy] pixy 1.0.4.beta1
[pixy] See documentation at https://pixy.readthedocs.io/en/latest/

[pixy] Validating VCF and input parameters...
[pixy] Checking write access...OK
[pixy] Checking CPU configuration...OK
[pixy] Checking for invariant sites...OK
[pixy] Checking chromosome data...OK
[pixy] Checking intervals/sites...OK
[pixy] Checking sample data...OK
[pixy] All initial checks past!

[pixy] Preparing for calculation of summary statistics: pi, fst, dxy
[pixy] Data set contains 3 population(s), 1 chromosome(s), and 90 sample(s)
[pixy] Window size: 10000 bp

[pixy] Started calculations at 18:10:55 on 2021-05-07
[pixy] Using 1 out of 8 available CPU cores

[pixy] Processing chromosome/contig PvP01_01_v1...
[pixy] Calculating statistics for region PvP01_01_v1:1-156579...
KeyError: nan

How should I compare my pi results from pixy with studies that use other approaches?

I have been assessing nucleotide diversity in a species of conifer and recently was able to recall invariant sites to be used in pixy. I ran pixy using a number of different parameters, such as arbitrary windows, and using bed files for single-copy ortholog genes, primary transcripts, etc. My results end up being quite similar, but now I am unsure how to compare them to previously published results that do not use pixy.

For example, using a bed file of genes I get an average pi over all sites of 0.005. A study in Populus deltoides from 2017 (https://doi.org/10.1002/ece3.3466) found an average pi of 0.001 using PopGenome and considered this estimate to be "high nucleotide diversity". In my population, I am expecting diversity to be extremely low. How would you suggest I compare pixy's outcomes to other studies that possibly do not use invariant SNPs? The unfortunate reality is that most researchers do not archive or even call invariant SNPs, so rerunning the analysis of others becomes unfeasible.

I am aware that this does not constitute a fundamental "problem" with the software, but I am wondering whether you have thought about this issue and how comparisons could be made to older studies. Perhaps an approach to standardising results by length of the window being assessed?

Thanks!

Duplicated sites in pixy output using bed intervals?

Hi there,

I have a question about how pixy interprets bed intervals / concerns about potential duplication of sites in the output. Some context:

After using pixy to estimate Fst in 10kbp windows across the whole genome, I identified windows with significant outliers and would now like to calculate Fst for each site within those windows. To do this, I used bash/R to generate bed files with one bp intervals (0-indexed) for all sites within each window of interest. The bed files look like this (except properly tab-delimited):

VONY02000013.1 16930000 16930001
VONY02000013.1 16930001 16930002
VONY02000013.1 16930002 16930003

I'm then giving this bed file to pixy with the following command:

pixy --stats pi fst dxy \
--vcf 13-tissues-allsites-5dp-q30-75comp.recode.vcf.gz \
--populations tiss-phen.txt \
--bed_file sc13-sites-0.bed \
--output_folder tissues-pixy \
--output_prefix sc13-outliers-pixy \
--n_cores 20

Originally I tried providing a bed file with a single interval for each 10kbp window and setting the window_size to 1, but got an error that you cannot use both window_size and bed_file. The 1bp interval bed file strategy seemed more efficient to me than providing each interval separately because I have a bunch of intervals and they're not all adjacent (trying to avoid calculating for windows that I know are unimportant, basically).

I'm a little confused by the output, because it seems like most sites are listed twice with adjacent 1bp intervals, while others are listed once, but contain 2 SNPs (see screenshot). I didn't notice this until I filtered the output to include only those sites with significant Fst values and used those coordinates to output a vcf file of those sites. The vcf files always has ~half the number of sites than the number pixy identified (e.g. if I have 350 significant intervals in my pixy output, I get a vcf file with 181 sites). At first I thought it was an indexing issue, but after outputting a bunch of test cases I have confirmed that pixy says I have (identical Fst) outliers at, say, both sites 10-11 and 11-12, but when I filter my vcf it's just a single site at 10-11 (and 11-12 might be an invariant site, or might not be present in my dataset at all). Unless I'm missing something, it seems like based on pixy output alone I will over-estimate (by ~100%) the number of fixed SNPs in my dataset, for example.

I don't know if this is expected behavior and I'm just confused about how bed intervals work, or if this is some kind of bug, but any clarification you can offer would be much appreciated! Is there a better way to take the output of windowed Fst calculations and get site-specific estimates for those intervals?

Thanks so much in advance, I've really been enjoying using pixy and appreciate the great plotting tutorials!!

Jessie

File not found error

Hi,

I'm trying to run pixy for the first time but am encountering the following error message:

File "/home/anaconda3/bin/pixy", line 11, in
sys.exit(main())
File "/home/anaconda3/lib/python3.8/site-packages/pixy/main.py", line 220, in main
os.mkdir(re.sub(r"[^\/]+$", "", args.outfile_prefix))
FileNotFoundError: [Errno 2] No such file or directory: ''

Any idea what could be causing this issue?

The command that I have used is as follows
pixy --stats pi --vcf haploid_3_filtered_depth2.recode.vcf --populations popFile --zarr_path zarr --bypass_filtration yes --outfile_prefix 3xDP2

Many thanks!

Best practice for estimating πN/πS in pixy

When estimating the ratio of nonsynonymous to synonymous pi, I would normally estimate π separately for each SNP set containing nonsynonymous and synonymous SNPs and take the ratio. I would generally consider invariant SNPs to be synonymous (unless I'm mistaken), so how would one account for this when trying to estimate the ratio in pixy?

Issue in D xy calculation

Using pixy on Linux.
When trying pixy --stats dxy --vcf filtered.sans.outgroup.nomissing.recode.vcf.gz --populations pixy_Med_E_NS.txt --window_size 100000 --n_cores 80 --bypass_invariant_check yes --output_prefix Med_E_Atl

The command only calculate de Dxy on 5 chromosomes (+ half oh the 6th) out of the 10 chromosomes of my VCF and displays :

The above exception was the direct cause of the following exception:
UnboundLocalError: local variable 'pixy_output' referenced before assignment

Tried with and without the --output_prefix command without succes.

Any solution known for this kind of problem?

ksamuk / pixy Goto Github PK

pixy's Introduction

pixy

Authors

Citation

Supported Organisms and Data Formats

Documentation

Installation

A note on accuracy

Contribute to pixy

Development Roadmap (Planned Features as of Feb 2024)

pixy's People

Contributors

Stargazers

Watchers

Forkers

pixy's Issues

Recommend Projects

Recommend Topics

Recommend Org

`pixy`