Git Product home page Git Product logo

regens's Introduction

DOI

About REGENS ๐Ÿงฌ

REGENS (REcombinatory Genome ENumeration of Subpopulations) is an open source Python package ๐Ÿ“ฆ that simulates whole genomes from real genomic segments. REGENS recombines these segments in a way that simulates completely new individuals while simultaneously preserving the input genomes' linkage disequilibrium (LD) pattern with extremely high fidelity. REGENS can also simulate mono-allelic and epistatic single nucleotide variant (SNV) effects on a continuous or binary phenotype without perturbing the simulated LD pattern. REGENS was measured to be 88.5 times faster and require 6.2 times lower peak RAM on average than a similar algorithm called Triadsim.

๐ŸŒŸ IMPORTANT NOTICE (PLEASE READ) ๐ŸŒŸ

REGENS's simulated genomes are comprised entirely of concatenated segments from the input dataset's real genomes. If your input genomes are not available for public use, then you may not be allowed to publicly release the simulated dataset. Please consult the institutions that provide you access to your input genotype dataset for more information about this matter.

Instructions to Installing REGENS ๐Ÿ› ๏ธ

  1. Install conda if you haven't already installed either Anaconda or Miniconda
  2. Open your conda terminal. Type "Anaconda" or "Miniconda" into your search bar and open the terminal. It will look like this:
  3. Click the "Anaconda Prompt" app (left) to open the black terminal (right). The terminal's top must say "Anaconda prompt"
  4. Enter conda create --name regens python=3.7 in the terminal to create a new environment called regens with python version 3.7
  5. Enter conda activate regens in the terminal to enter your new environment. If that doesn't work, enter source activate regens
  6. Once in your regens environment (repeat step 5 if you close and reopen the conda terminal), enter pip install regens==0.2.0
  7. Run this command to allow regens to download the remaining files. It will write the simulated data into the examples folder that it downloads. If you experience permissions issues with this step, try these remedies:

Input ๐Ÿฆƒ

REGENS requires the following inputs:

  • YOU MUST PROVIDE: real genotype data formatted as a standard (bed, bim, fam) plink fileset, ideally containing a minimum of 80 unrelated individuals.
  • WE HAVE PROVIDED: a recombination map for every 1000 genomes population.

The provided recombination maps were created by the pyrho algorithm and modified by us to minimize required disk space. Recombination maps between related populations are highly correlated, so you could pair your input dataset with the recombination map of the most genetically similar 1000 genomes population, as is usually done for SNP imputation. If you wish to make your own recombination map, then it must be formatted as described here.

Output ๐Ÿ—

REGENS outputs a standard (bed, bim, fam) plink fileset with the simulated genotype data (and optional phenotype information). If plink is not available to you, please consider bed-reader, which reads (bed, bim, fam) plink filesets into the python environment quickly and efficiently.

In phenotype simulation, REGENS also outputs a file containing the R2 value of the phenotype/genotype correlation and the inferred beta coefficients (see example), which will most likely be close to but not equal to the input beta coefficients.

Simulate genotype data ๐Ÿ’ป

The following command uses ACB.bed, ACB.bim, and ACB.fam to simulate 10000 individuals without phenotypes. This command (or any other) will also complete the final installation step. Windows users should replace all \ linebreak characters with ^.

python -m regens \
  --in input_files/ACB \
  --out ACB_simulated \
  --simulate_nsamples 10000 \
  --simulate_nbreakpoints 4 \
  --population_code ACB \
  --human_genome_version hg19

Simulate genotype data with custom recombination rate dataframes ๐Ÿงฎ

The following command uses custom recombination rate files as input for regens instead of the ones provided in the hg19 and hg38 folders (though the content in input_files/hg19_ACB_renamed_as_custom is just a copy of the content in hg19/ACB).

python -m regens \
  --in input_files/ACB \
  --out ACB_simulated \
  --simulate_nsamples 10000 \
  --simulate_nbreakpoints 4 \
  --recombination_file_path_prefix input_files/hg19_ACB_renamed_as_custom/custom_chr_

Custom recombination rate files are to be named and organized as follows:

  • The recombination map must be a single folder (named hg19_ACB_renamed_as_custom in the example above) with one gzipped tab seperated dataframe per chromosome.
  • Every gzipped tab seperated dataframe must be named as prefix_chr_1.txt.gz, then prefix_chr_2.txt.gz all the way through prefix_chr_22.txt.gz(prefix is named custom_chr in the example above).
  • the .txt.gz files must actually be gzipped (as opposed to a renamed txt.gz extension).
  • Each chromosome's recombination map file must contain two tab separated columns named Position(bp) and Map(cM).

The Position(bp) column in each chromosome's recombination map is to be formatted as follows:

  • The ith row of "Position(bp)" contains the genomic position of the left boundary for the ith genomic interval.
  • The ith row of "Position(bp)" is also the genomic position of the right boundary for the (i-1)th genomic interval.
  • As such, the last row of "Position(bp)" is only a right boundary, and the first row is only a left boundary.
  • Genomic positions must increase monotonically from top to bottom.

The Map(cM) column in each chromosome's recombination map is to be formatted as follows:

  • The ith value of "Map(cM)" is the cumulative recombination rate from the first position to the ith position in CentiMorgans.
  • In other words, the recombination rate of the interval in between any two rows b and a must equal the Map(cM) value at row b minus the Map(cM) value at row a.
  • As such, the cumulative Map(cM) values must increase monotonically from top to bottom.
  • The value of Map(cM) in the first row must be 0.

An example of how this must be formatted is below (remember that there must be one per chromosome, and they must all be gzipped):

Position(bp)	Map(cM)
16050114	0.0
16058757	0.01366
16071986	0.03912
16072580	0.04013
16073197	0.04079

๐ŸŽ Simulate genotype data with phenotype associations ๐Ÿ

Given at least one set of one or more SNPs, REGENS can simulate a correlation between each set of SNPs and a binary or continuous phenotype. Different genotype encodings can be applied:

  • Normally, if A is the major allele and a is the minor allele, then (AA = 0, Aa = 1, and aa = 2). However, you can Swap the genotype values so that (AA = 2, Aa = 1, and aa = 0).
  • You can further transform the values so that they reflect no effect (I), a dominance effect (D), a recessive effect (R), a heterozygous only effect (He), or a homozygous only effect (Ho).

The table below shows how each combination of one step 1 function (columns) and one step 2 function (rows) transforms the original (AA = 0, Aa = 1, and aa = 2) values.

Input = {0, 1, 2} Identity (I) Swap
Identity (I) {0, 1, 2} {2, 1, 0}
Dominance (D) {0, 2, 2} {2, 2, 0}
Recessive (R) {0, 0, 2} {2, 0, 0}
Heterozygous only (He) {0, 2, 0} {0, 2, 0}
Homozygous only (Ho) {2, 0, 2} {2, 0, 2}

Example 1: a simple additive model โ†˜๏ธ

A full command for REGENS to simulate genomic data with correlated phenotypes would be formatted as follows:

python -m regens \
  --in input_files/ACB --out ACB_simulated \
  --simulate_nsamples 10000 --simulate_nbreakpoints 4 \
  --phenotype continuous --mean_phenotype 5.75 \
  --population_code ACB --human_genome_version hg19 \
  --causal_SNP_IDs_path input_files/causal_SNP_IDs.txt \
  --noise 0.5 --betas_path input_files/betas.txt

This command simulates genotype-phenotype correlations according to the following model. If we let y be an individual's phenotype, si be the ith genotype to influence the value of y such that (AA = 0, Aa = 1, and aa = 2), and B be the bias term. The goal is to simulate the following relationship between genotypes and phenotype:

y = 0.5s1 + 0.5s2 + 0.5s3 + B + ฮต

where ฮต ~ N(ฮผ = 0, ฯƒฮต = 0.5E[y]) and E[y] = 5.75.

Notice that the values of ฯƒฮต and E[y] are determined by the --noise and --mean_phenotype arguments.

The following files, formatted as displayed below, must exist in your working directory. input_files/causal_SNP_IDs.txt contains newline seperated SNP IDs from the input bim file input_files/ACB.bim:

rs113633859
rs6757623
rs5836360

input_files/betas.txt contains one (real numbered) beta coefficient for each row in input_files/causal_SNP_IDs.txt:

0.5
0.5
0.5

Example 2: inclusion of nonlinear single-SNP effects โคต๏ธ

python -m regens \
  --in input_files/ACB --out ACB_simulated \
  --simulate_nbreakpoints 4 --simulate_nsamples 10000 \
  --phenotype continuous --mean_phenotype 5.75 \
  --population_code ACB --human_genome_version hg19 --noise 0.5 \
  --causal_SNP_IDs_path input_files/causal_SNP_IDs.txt \
  --major_minor_assignments_path input_files/major_minor_assignments.txt \
  --SNP_phenotype_map_path input_files/SNP_phenotype_map.txt \
  --betas_path input_files/betas.txt

In addition to the notation from the first example, let Si = swap(si) be the ith genotype to influence the value of y such that (AA = 2, Aa = 1, and aa = 0). Also, we recall the definitions for the four nontrivial mapping functions (R, D, He, Ho) defined prior to the first example. The second example models phenotypes as follows:

y = 0.5s1+ 0.5R(S2) + 0.5He(s3) + B + ฮต

where ฮต ~ N(ฮผ = 0, ฯƒฮต = 0.5E[y]) and E[y] = 5.75.

Specifying that (AA = 2, Aa = 1, and aa = 0) for one or more alleles is optional and requires input_files/major_minor_assignments.txt.

0
1
0

Specifying the second genotype's recessiveness (AA = 0, Aa = 0, and aa = 2) and third genotype's heterozygosity only (AA = 0, Aa = 2, and aa = 0) is optional and requires input_files/SNP_phenotype_map.txt.

regular
recessive
heterozygous_only

Example 3: inclusion of epistatic effects ๐Ÿ”€

REGENS models epistasis between an arbitrary number of SNPs as the product of transformed genotype values in an individual.

python -m regens \
  --in input_files/ACB --out ACB_simulated \
  --simulate_nbreakpoints 4 --simulate_nsamples 10000 \
  --phenotype continuous --mean_phenotype 5.75 \
  --population_code ACB --human_genome_version hg19 --noise 0.5 \
  --causal_SNP_IDs_path input_files/causal_SNP_IDs2.txt \
  --major_minor_assignments_path input_files/major_minor_assignments2.txt \
  --SNP_phenotype_map_path input_files/SNP_phenotype_map2.txt \
  --betas_path input_files/betas.txt

y = 0.5s1 + 0.5D(s2)s3+ 0.2Ho(S4)s5s5 + B + ฮต

where ฮต ~ N(ฮผ = 0, ฯƒฮต = 0.5E[y]) and E[y] = 5.75.

Specifying that multiple SNPs interact (or that rs62240045 has a polynomic effect) requires placing all participating SNPs in the same tab seperated line of input_files/causal_SNP_IDs.txt

rs11852537
rs1867634	rs545673871
rs2066224	rs62240045	rs62240045

For each row containing one or more SNP IDs, input_files/betas.txt contains a corresponding beta coefficient. (Giving each SNP that participates in a multiplicative interaction its own beta coefficient would be pointless).

0.5
0.5
0.5

As before, both input_files/major_minor_assignments.txt and input_files/SNP_phenotype_map.txt are both optional. If they are not specified, then all genotypes of individual SNPs will have the standard (AA = 0, Aa = 1, and aa = 2) encoding.

For each SNP ID, input_files/major_minor_assignments.txt specifies whether or not untransformed genotypes of individual SNPs follow the (AA = 2, Aa = 1, and aa = 0) encoding. CAUTION: In general, if a SNP appears in two different effects, then it may safely have different major/minor assignments in different effects. However, if a SNP appears twice in the same effect, then make sure it has the same major/minor assignment within that effect, or that effect may equal 0 depending on the map functions that are used on the SNP.

0	
0	0
1	0	0

For each SNP ID, input_files/SNP_phenotype_map.txt specifies whether the SNP's effect is regular, recessive, dominant, heterozygous_only, or homozygous_only.

regular
dominant	regular
homozygous_only	regular	regular

In context to an epistatic interaction, first the Swap function is applied to SNPs specified by input_files/major_minor_assignments.txt, then map functions specified by input_files/SNP_phenotype_map.txt are applied to their respective SNPs. The transformed genotypes of SNPs in the same row of input_files/causal_SNP_IDs.txt are multiplied together and by the corresponding beta coefficient in input_files/betas.txt. Each individual's phenotype is then the sum of row products, the bias, and the random noise term.

REGENS simulates nearly flawless GWAS data ๐Ÿ’ฏ

The Triadsim algorithm simulates LD patterns that are almost indistinguishable from those of the input Dataset. REGENS uses Triadsim's method of recombining genomic segments to simulate equally realistic data, and we measured it to be 88.5 times faster (95%CI: [75.1, 105.0]) and require 6.2 times lower peak RAM (95%CI [6.04, 6.33]) than Triadsim on average. The following three figures show that REGENS nearly perfectly replicates the input dataset's LD pattern.

  1. For the 1000 genome project's ACB population, this figure compares (right) every SNP's real maf against it's simulated maf and (left) every SNP pair's real genotype pearson correlation coefficient against its simulated genotype pearson correlation coefficient for SNP pairs less than 200 kilobases apart. 100000 simulated individuals were used (please note that the analogous figures in the correctness testing folders were only made with 10000 samples, so they have slightly more noticable random noise than the figure displayed directly below).

Real and simulated R value vs. MAF

  1. For the 1000 genome project's ACB population, this figure plots SNP pairs' absolute r values against their distance apart (up to 200 kilobases apart) for both real and simulated populations. More specifically, SNP pairs were sorted by their distance apart and seperated into 4000 adjacent bins, so each datapoint plots one bin's average absolute r value against its average position. 100000 simulated individuals were used.

Real and simulated R value vs. distance_profile

  1. These figures compare TSNE plots of the first 10 principal components for real and simulated 1000 genomes subpopulations that are members of the AFR superpopulation. Principal components were computed from all twenty-six 1000 genomes population datasets, and the loadings were used to project the simulated individuals onto the PC space. These results demonstrate that REGENS replicates the the input data's overall population structure in simulated datasets. 10000 simulated individuals were used.

TSNE1 vs TSNE2 for 1000 genome African subpopulations

supplementary details ๐Ÿฐ

Thank you concerned reader (for making it this far)!

But our full analysis is in another repository!

Repository structure

Folders for REGENS to download ๐Ÿ“ฅ

  • correctness_testing_ACB: A directory containing bash scripts to test code correctness on the ACB subpopulation, as well as the output for those tests. Correctness testing part 2 is optional and requires plink version 1.90Beta.
  • correctness_testing_GBR: A directory containing bash scripts to test code correctness on the GBR subpopulation, as well as the output for those tests. Correctness testing part 2 is optional and requires plink version 1.90Beta.
  • examples: A directory containing bash scripts that run the data simulation examples in the README.
  • hg19: for each 1000 genomes project population, contains a folder with one gzipped recombination rate dataframe per hg19 reference human autosome.
  • hg38: for each 1000 genomes project population, contains a folder with one gzipped recombination rate dataframe per hg38 reference human autosome.
  • input_files: contains examples of regens input that is meant to be provided by the user. The example custom recombination rate information is copied from that of the hg19 mapped ACB population. Also contains input for the Triadsim algorithm. The genetic input labeled as "not_trio" for Triadsim is comprised of ACB population duplicates and is only meant to compare Triadsim's runtime.
  • runtime_testing_files: A directory containing files that were used to compute runtimes, max memory usage values, and improvement ratio bootstrapped confidence intervals.
  • unit_testing_files: A directory containing bash scripts to unit test code correctness on the ACB subpopulation, as well as the output for those tests.

Folders in the repository ๐Ÿ—„๏ธ

  • images: contains figures that are either displayed or linked to in this github README
  • paper: A directory containing the paper's md file, bib file, and figure
  • thinning_methods: All code that was used to select 500000 SNPs from the 1000 genomes project's genotype data

Files ๐Ÿ“

  • regens.py: the main file that runs the regens algorithm
  • regens_library.py: functions that the regens algorithm uses repeatedly.
  • regens_testers.py: functions used exclusively for correctness testing and unit testing
  • setup.py and _init_.py: allows regens to be installed with pip
  • requirements.txt: lists REGENS' dependencies
  • regens_tests_info.md: Installing REGENS also downloads four folders that test REGENS' functionality. The regens_tests_info.md file explains what they test.

Remedies to known permission issues ๐Ÿฉน

Try these steps if you had permissions issues with the final installation step:

  1. Right click the "Anaconda Prompt" app (left), then click run as administrator. Reinstalling conda in a different directory may fix this issue permenantly.

  2. Your antivirus software might block a file that Anaconda needs (Avast blocked Miniconda's python.exe for us). Try seeing if your antivirus software is blocking anything related to anaconda, and then allow it to stop blocking that file. You could also turn off your antivirus software, though we do not recommend this.

  3. In the worst case, you can download all of the required files with these links:

    1. input_files
    2. correctness_testng_ACB, correctness_testng_GBR, examples, runtime_testing, unit_testing_files
    3. hg19 and hg38

Download the three folders containing the aforementioned 8 folders and unzip all folders (only the folders, keep the recombination maps in hg19 and hg38 zipped). Then place everything in your working directory and run REGENS from your working directory. You should now be ready to use REGENS in your working directory if you have completed the installation steps.

Contributing ๐Ÿ‘

If you find any bugs or have any suggestions/questions, please feel free to post an issue! Please refer to our contribution guide for more details. Thanks for your support!

License

MIT + file LICENSE

regens's People

Contributors

danielskatz avatar greggj2016 avatar trangdata avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

greggj2016 fboehm

regens's Issues

Provide a high-level testing script

This relates to my JOSS review openjournals/joss-reviews#2743

regens contains a number of tests, spread across a few different working directories. At present, there is no clear instruction on how to run each test. In the least, it would be helpful to have

  • A high-level script for calling tests without needing to move from one subdirectory to another (perhaps defaulting to running all tests)
  • A README section clearly explaining how to run tests, and what each test is checking

The JOSS review guide says that software is strongly encouraged to include an automated test site, and that a good package will include 'An automated test suite hooked up to an external service such as Travis-CI or similar'. It would be much easier to move to an automated testing service if the current tests were translated into a python testing environment (e.g. pytest ,nose etc etc). If you chose to move the code to a standard package format for #1, the tests could be included in that structure and run automatically when the package is built.

Installation procedure / instructions

This relates to my JOSS review openjournals/joss-reviews#2743

Installation instructions should include minimum versions for required packages (and indeed python) where applicable, and should be provided as a requirements.txt file (https://pip.pypa.io/en/stable/user_guide/#requirements-files). Thse files make installation of the required packages via pip or anaconda very straighforward.

The JOSS review criteria says that "ideally" installation can be handled via "handled with an automated package management solution.. Creating a package that included regenens.py as an executable would make make installation of the code more straightforward. (Even if the package is not uploaded to something like pip, being able to install through this approach would be helpful).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.