Git Product home page Git Product logo

paralog-finder's Introduction

paralog-finder

DOI
Detects and blacklists paralog RAD loci analyzed in Stacks or ipyrad, based on the McKinney 2017 method (doi:10.1111/1755-0998.12613)

Description

These scripts allow the identification of paralog RAD loci based on the method of McKinney et al. 2017 (doi:10.1111/1755-0998.12613). However, we introduced some modifications and also made some additions. The main modification we made is the way in which the percentage of heterozygote individuals is calculated in datasets with varying degrees of missing data per locus. McKinney et al. (2017) divide the number of heterozygote individuals in the locus over the total number of individuals in the dataset, their results were not skewed because they either simulated data (0% missing data) or use very stringent filters for empirical data (at most 10% missing data per locus). Instead, our calculations consider only the number of individuals present at each locus since typical datasets usually have many loci represented by just a few individuals.

As an addition to their method, we also provide a script that creates a blacklist of paralog loci that can be used in Stacks to run the populations module and exclude the paralogs from the calculations and the output matrices for downstream analyses with other software.

Briefly, the scripts must be run in three steps:

  • Step 1: Calculate loci statistics from the VCF file produced by Stacks or ipyrad with the script HDplot_process_vcf.py
  • Step 2: Plot the statistics and decide limits for maximum heterozigosity (H), maximum and minimum read ratio deviation (D), and minimum of samples per locus with the R script HDplot_graphs.R
  • Step 3: Provide the parameters decided by the inspection of the graphs to the script blacklist_paralogs.py to create the blacklist of paralog loci and a whitelist of singleton loci.

Requirements

  • Python 2.7 with the packages: scipy, pandas, numpy, and statsmodels, these are easy to install if you use miniconda.
  • R with the CRAN packages: ggplot2, ggExtra, and argparse.

Usage

Step 1: HDplot_process_vcf.py

The script calculates heterozigosity and read ratio deviation from a VCF file produced by Stacks or ipyrad. Type python HDplot_process_vcf.py -h to show the help message:

usage: HDplot_process_vcf.py [-h] -i FILENAME

Processes a VCF input file produced by Stacks or ipyrad for plotting
heterozigosity and read depth deviations using the method of McKinney et al.
2017 (doi: 10.1111/1755-0998.12613)

optional arguments:
  -h, --help            show this help message and exit
  -i FILENAME, --input FILENAME
                        Name of VCF input file, must have read depth per
                        allele in each individual (Stacks or ipyrad format)

Example: This command creates a file with the same name as the VCF but with extension .depthsBias: batch_1.depthsBias

python HDplot_process_vcf.py -i batch_1.vcf

Step 2: HDplot_graphs.R

The script plots the loci information contained in the .depthsBias file produced by Step 1. Ttype Rscript HDplot_graphs.R -h to show the help:

usage: HDplot_graphs.R [-h] -i INPUT [--minD MIND] [--maxD MAXD]
                       [--transp TRANSP]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Name of the file with extension .depthsBias
  --minD MIND           Minimum D to display in graph, default = min from data
  --maxD MAXD           Maximum D to display in graph, default = max from data
  --transp TRANSP       Transparency for points in graph, default = 0.01

Example: To create the graphs displaying a range for D between -80 and 80:

Rscript HDplot_graphs.R -i batch_1.depthsBias --minD -80 maxD 80

This command will produce four graphs:
graph1_D_vs_Het.png: The graph shows the read ratio deviation plotted against percentage of heterozygotes in the locus (in this case a single SNP). Since the cloud of points can be misleading when plotting data from millions of SNPs we also added marginal boxplots to all graphs. According to this plot we would pick limits for D between -4 and -4 to retain ~95% of the SNPs.
graph1_D_vs_Het graph2_NumSamplesInLocus_vs_Het.png: We added this graph to show how the pecentage of heterozygotes varies according to the locus depth (individuals sharing the locus). The plot shows for example, that loci that have depths of 60 samples or less have excess of heterozigosity. We can also see the peak of heterozigosity at ~0.03 (with 50% of the data between 0.01 and 0.07) that agrees with what is known of the biology of this species (a mostly selfer grass).
graph2_NumSamplesInLocus_vs_Het graph3_D_vs_NumSamplesInLocus.png: Following the same idea, this graph shows how the D ratio behaves across the variation in locus depth. We can see that loci with 32-62 samples seem to have a bias towards a large positive D. The graph also confirms our putative limits for D between -4 and 4 by looking at the vertical boxplot on the right.
graph3_D_vs_NumSamplesInLocus graph4_AlleleRatio_vs_Het.png: In this case the cloud of points is definitely misleading, we can see that most of SNPs have a 1:1 allelic ratio by looking at the vertical boxplot.
graph4_AlleleRatio_vs_Het

Step 3: blacklist_paralogs.py

Based on the parameters we defined from the graphs we can now use this script to create a blacklist of paralogs. The blacklist can be used in Stacks to exclude these loci from analysis. Type python blacklist_paralogs.py -h for help:

usage: blacklist_paralogs.py [-h] -i FILENAME [--maxH MAX_HETPERC]
                             [--minN MIN_NUM_SAMPLES] [--minD MIN_Z]
                             [--maxD MAX_Z]

Creates blacklist of paralog loci (and whitelist of singletons) based on the
method McKinney et al. 2017 (doi: 10.1111/1755-0998.12613)

optional arguments:
  -h, --help            show this help message and exit
  -i FILENAME, --input FILENAME
                        Name of the input file, the table with extension
                        .depthsBias produced by HDplot.py
  --maxH MAX_HETPERC    Maximum proportion of heterozygotes in a locus,
                        default=0.6 taken from McKinney et al. 2017
  --minN MIN_NUM_SAMPLES
                        Minimum number of samples in locus, default=1
  --minD MIN_Z          Lower limit of read ratio deviation (D), default=-7
                        taken from McKinney et al. 2017
  --maxD MAX_Z          Upper limit of read ratio deviation (D), default=7
                        taken from McKinney et al. 2017

Example: Assume that loci with D between -4 and 4, and with a heterozigosity of at most 0.06 are singletons:

python blacklist_paralogs.py -i batch_1.depthsBias --maxH 0.06 --minD -4 --maxD 4

And the script will create the blacklist (and a whitelist of singletons) and show some information during execution:

Retaining loci with at least 1 sample and with proportion of heterozygotes ≤ 0.06 and D between -4 and 4
249575 loci written to blacklist of paralogs
105403 loci written to whitelist of singletons

References

  • McKinney, G.J, R.K. Waples, L.W. Seeb & J.E. Seeb. 2017. Paralogs are revealed by proportion of heterozygotes and deviations in read ratios in genotyping-by-sequencing data from natural populations. Molecular Ecology Resources 17(4):656-669, doi:10.1111/1755-0998.12613

Credits

Citation

DOI
Ortiz, E.M. 2018. paralog-finder v1.0: detect and blacklist paralog RAD loci. DOI:10.5281/zenodo.1209328

paralog-finder's People

Contributors

edgardomortiz avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

simo-n-maduna

paralog-finder's Issues

SNP loss in HDplot_process_vcf.py

Hi Edgardo,

Thanks for improving upon HDplot. In giving the py script a try with unfiltered snps out of stacks:populations, the snp count dropped from 656k to 520k, from the populations.snps.vcf to the populations.depthBias. We should be adding metrics, not dropping snps at this point?

I would prefer to apply the py script to the vcf after filtering, but vcftools rightly drops the INFO fields after filtering operations. With vcftools' filtered output, HDplot_process_vcf.py hangs on the NS being equivalent to ".". Following up on this, I used stacks:populations to recalc the INFO fields on the vcf after having filtered for GQ, MAC, mindepth, and minmeandepth. Of 80k snps, the py script now drops 35. I'm not seeing a pattern to the drops, except maybe a low MAF. I'll attach one in transposed format.

So why is the py script dropping snps?

depthBias-drop.txt

syntax error

Hello Edgardo,
I would like to let you know that ever since I upgraded to python 3.9 I have not been able to use your scripts.
Even when trying to get the help message using the command:

python HDplot_process_vcf.py -h

I get the following error:

File "/paralog-finder-1.0/HDplot_process_vcf.py", line 58
print 'skipped {} header lines'.format(header_lines)
^
SyntaxError: invalid syntax

Could you please let me know how to fix this?
Thank you very much!
Kind regards,
Marcos

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.