We composed a benchmark data set for evaluating SNP-based inversion detection methods. We collected variant data from three insect data sets. We provide links to the original data sources, a pipeline for processing the raw data into the final usable product for the benchmark, and labels of samples' inversion genotypes.
Genotype labels are provided under the inversion_genotypes
directory, and inversion boundaries are provided under the inversion_boundaries
directory.
Data Set | Citation | Populations (samples) |
---|---|---|
1000 Anopheles genomes | Miles, et al. (2017) | gambiae (81), coluzzii (69) |
DGRPv2 | Mackay, et al. (2012); Huang, et al. (2014) | all (198) |
UBC Sunflower Genome Project | Todesco, et al. (2020) | petiolaris (166), fallax (234), niveus (86) |
Peach | Guan, et al. (2021) | persica (149), kansuensis (37) |
Blue tit | Perrier, et al. (2020) | mainland (111), corsica (343) |
These positives examples were curated to select subpopulations with clear signals.
Chromosome | Population | Inversions | Genotypes Available |
---|---|---|---|
ag1000g_2L | An. gambiae (B. Faso) | 2La | Yes |
ag1000g_2R | An. gambiae (B. Faso) | 2Rb | Yes |
ag1000g_2R | An. coluzzii (B. Faso) | 2Rbc, 2Rd, 2Ru | Yes |
cyan_chrom03 | C. caeruleus (mainland) | chrom03.01 | Predicted |
dgrp2_2L | D. melanogaster | In(2L)t | Yes |
dgrp2_2R | D. melanogaster | In(2R)ns | Yes |
dgrp2_3R | D. melanogaster | In(3R)p, In(3R)mo, In(3R)k | Yes |
peach_pp06 | P. persica | pp06.01 | No |
sunflowers_pet17 | H. niveus canescens | pet17.03 | Predicted |
sunflowers_pet05 | H. petiolaris petiolaris | pet05.01 | Predicted |
sunflowers_pet09 | H. petiolaris petiolaris | pet09.01 | Predicted |
sunflowers_pet11 | H. petiolaris petiolaris | pet11.01 | Predicted |
sunflowers_pet17 | H. petiolaris petiolaris | pet17.01, pet17.02, pet17.04 | Predicted |
sunflowers_pet09 | H. petiolaris fallax | pet09.01 | Predicted |
sunflowers_pet11 | H. petiolaris fallax | pet11.01 | Predicted |
sunflowers_pet17 | H. petiolaris fallax | pet17.01, pet17.03 | Predicted |
Chromosome | Population |
---|---|
ag1000g_3L | An. coluzzii (B. Faso) |
ag1000g_3R | An. gambiae (B. Faso) |
ag1000g_3R | An. coluzzii (B. Faso) |
cyan_chrom03 | C. caeruleus (corsica) |
dgrp2_3L | D. melanogaster |
peach_pp06 | P. kansuensis |
sunflowers_pet05 | H. niveus canescens |
sunflowers_pet09 | H. niveus canescens |
sunflowers_pet11 | H. niveus canescens |
sunflowers_pet05 | H. petiolaris fallax |
Chromosome | Population |
---|---|
ag1000g_3L | An. gambiae (B. Faso) |
ag1000g_X | An. gambiae (B. Faso) |
ag1000g_X | An. coluzzii (B. Faso) |
There are several potential machine learning problems associated with these data.
Given variant data from a population of samples for a chromosome, determine if there are any inversions and their locations. Existing tools require quite a bit of manual parameter tuning to perform localization (segmentation). The goal would be a develop a method that is completely and capable of working even when data include "noise" resulting from pooling samples from multiple locations or closely-related species.
The input to the problem would be a 2D matrix of allele counts. Each position along the chromosome in which nucleotide variation was detected has a column in the matrix. (Positions with fixed nucleotides are excluded -- the matrix is effectively sparse.) The allele counts for each variant are stored in each row. The output would be a 1D vector with an entry for each column with a value of 0 indicating "no inversion" and 1 indicating "inversion detected". See the numpy output format description.
Some thoughts on the approach and associated challenges:
- Perform the evaluation with each chromosome held out once. There are 16 total chromosomes so there would 16 models trained and evaluated.
- The number of samples will vary across chromosomes. You can add rows with zeros (zero padding) so that all matrices have the same number of rows.
- The ordering of the samples is not relevant. The rows can be permuted in the matrix and you should get the same result. There are several ways to handle this. You can generate permutations of the matrices (increasing the training set size) or find a way to sort the entries in a canonical order, or explore deep learning methods like Deep Sets that are permutation invariant.
- The inversion frequencies will vary quite substantially in potential target data sets. You can use bootstrap sampling (sampling with replacement) to create new data sets for each chromosome that simulate different inversion frequencies. This is effectively a form of data augmentation.
- Given the small number of samples, methods such as RNNs and LSTMs might not work. Instead, you might want to consider dividing the chromosomes into overlapping windows and performing classification on each window. Combine the results by averaging the predictions from each window.
- Use the Sørensen–Dice coefficient, Jaccard Similarity, or precision / recall on the predicted and ground truth masks.
Given variant data in a known inversion region from a population of samples for a chromosome, determine the inversion genotypes of the samples (homozygeous inverted, homozygeous standard, and heterozygeous). Existing tools require quite a bit of manual parameter tuning to perform localization (segmentation). The goal would be a develop a method that is completely and capable of working even when data include "noise" resulting from pooling samples from multiple locations or closely-related species.
The input would be a 2D matrix of allele counts. Each position along the chromosome in which nucleotide variation was detected has a column in the matrix. (Positions with fixed nucleotides are excluded -- the matrix is effectively sparse.) The allele counts for each variant are stored in each row. The output would be a 1D vector with an entry for each row (sample) indicating the number of inversion copies (0, 1, or 2). A simpler problem would be to determine if the sample had at least 1 inversion; in which case, the output vector would only contain 0s or 1s. See the numpy output format description.
Some thoughts on the approach and associated challenges:
- When chromosomes only have a small number of samples of a particular inversion type, exclude that class. For example, only two of the Anopheles gambiae samples arehomozygous for the standard orientation of 2Rb, only 1 Anopheles coluzzii sample is homozygous for one of the inversion orientations for 2Rc and 2Rd, and only two of the blue tit samples are homozygous for one of the inversion genotypes.
You will need to download:
- Drosophila: The
dgrp2.vcf
VCF file from the Drosophila Genetics Reference Panel v2 - Anopheles: The biallelic 2L, 2R, 3L, 3R, and X VCF files from the FTP site for the 1000 Anopheles Genomes project phase 1 AR3 data release (
ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/AR3/variation/main/vcf/
). These files are named:ag1000g.phase1.ar3.pass.biallelic.{chrom}.vcf.gz
where chrom is 3L, 2R, or 2L. - H. petiolaris:
Petiolaris.pet_gwas.tranche90_snps_bi_AN50_AF99.vcf.gz
from the UBC Sunflower Genome project - Prunus:
SNP.vcf.gz
(for Prunus) from Figshare - Cyanistes:
BLUE2020VCF.vcf.gz
from Dryad
Place these files in the directory input_files
. The directory contains an empty file named PLACE_INPUT_FILES_HERE
.
To run the pipeline, you will need to install:
Once you've downloaded the data, you can process the individual data sets with the following commands:
$ snakemake prepare_all
Each task is assigned one thread. If you want to run multiple tasks concurrently, use Snakemake's --cores
flag:
$ snakemake --cores 4 prepare_all
The pipeline will convert the variants into three file formats:
- VCF (potentially gzipped): This file format can be read by Asaph
- Plink bed: This file format can be read by pcadapt
- inveRsion: This text file format can be read by inveRsion
- Numpy matrices (.npz): Each Numpy .npz file contains three objects. Allele counts are stored as a 2D matrix. Each position along the chromosome in which nucleotide variation was detected has a column in the matrix. (Positions with fixed nucleotides are excluded -- the matrix is effectively sparse.) The allele counts for each variant are stored in each row. Values are either 0 (homozygeous reference), 1 (heterozygous), 2 (homozygeous alternate), or -1 (unknown). A mapping of column indices to genomic coordinates is provided as a 1D array with one entry for each column of the allele counts matrix. A ground truth mask is provided as a 1D array (1 for inversion, 0 for no inverson) with one entry for each column of the allele counts matrix.
If you use this data set, please cite the original papers from which the data are derived:
- Anopheles 1000 Genomes: Miles, A., Harding, N., Bottà, G. et al. Genetic diversity of the African malaria vector Anopheles gambiae. Nature 552, 96–100 (2017).
- Anopheles 1000 Genomes: Corbett-Detig, R. B., Said, I., Calzetta, M., et al. Fine-Mapping Complex Inversion Breakpoints and Investigating Somatic Pairing in the Anopheles gambiae Species Complex Using Proximity-Ligation Sequencing, Genetics, Volume 213, Issue 4, 1 December 2019, Pages 1495–1511
- Drosophila Genetics Reference Panel: Mackay, T., Richards, S., Stone, E., et al. The Drosophila melanogaster Genetic Reference Panel. Nature 482, 173–178 (2012).
- Drosophila Genetics Reference Panel: Huang, W., Massouras, A., Inoue, Y., et al. Natural variation in genome architecture among 205 Drosophila melanogaster Genetic Reference Panel lines. Genome Research 24:1193-1208 (2014).
- UBC Sunflower Genome project: Todesco, et al. Massive haplotypes underlie ecotypic differentiation in sunflowers Nature (2020).
- Peach: Guan, et al. Genome structure variation analyses of peach reveal population dynamics and a 1.67 Mb causal inversion for fruit shape. Genome Biology. (2021)
- Blue tit: Perrier, et al. Demographic history and genomics of local adaptation in blue tit populations. Evolutionary Applications. 2020.
and our paper describing the test set:
- Nowling, R.J., Manke, K.R., and Emrich, S.J. Detection of Inversions with PCA in the Presence of Population Structure. PLOS One (2020).