IBDkin

IBDkin is a software package for kinship estimation based on identity by descent (IBD). IBDkin scales to hundreds of billions of IBD segments detected in hundreds of thousands of individuals.

If you use this software in your published analysis, please cite:

Ying Zhou, Sharon R Browning, Brian L Browning, IBDkin: fast estimation of kinship coefficients from identity by descent segments, Bioinformatics, btaa569, https://doi.org/10.1093/bioinformatics/btaa569

Last update: March 13, 2022, by Ying Zhou, yingzhou(at)ds.dfci.harvard(dot)edu

1 Installation

The following commands download the source code, change the working directory to the source code folder "IBDkin/src-v2.8.7.8/", and create the executable file, IBDkin:

git clone https://github.com/YingZhou001/IBDkin.git

cd IBDkin/src-v2.8.7.8/
make

IBDkin was compiled under Linux (CentOS 7.5) using gcc 4.8.5. If you encounter any problems compiling the program, please contact the author for assistance.

[top]

2 Running IBDkin

To run some example IBDkin analyses, change the working directory to "IBDkin/example.pub/run", copy the executable file "IBDkin" into this folder, and enter sh run.sh to run the examples in sections 2.1.1 to 2.1.3.

cd IBDkin/example.pub/run
cp ../../src-v2.8.7.8/IBDkin ./
sh run.sh

Alternatively, you can enter the commands in sections 2.1.1 to 2.1.3 to run the examples.

[top]

2.1 Example Analysis

We first change the working directory to the IBDkin/example.pub/run folder with the command:

cd IBDkin/example.pub/run

This folder contains four required IBDkin input files: "ibd.txt" is the list of files containing IBD segments, "ind.txt" is the list of individuals, "range.txt" gives the range of markers on each chromosome, and "plink.map" is the genetic map (see Required Parameters). Before running the following scripts, we need to copy the executable file IBDkin to the working directory with the command:

cp ../../src-v2.8.7.8/IBDkin ./

2.1.1 Computing kinship coefficients

We first run IBDkin with five threads (--nthreads 5) and output kinship coefficients between pairs of individuals who share at least one IBD segment of 4 cM or longer (see the '--cutcm' parameter in the Optional Parameters section). The output file prefix is specified with the --out flag. The output kinship coefficients are written to the gzip-compressed file "example-1.kinship.gz":

./IBDkin --ibdfile ibd.txt --map plink.map --ind ind.txt --range range.txt --nthreads 5 --out example-1

2.1.2 Computing IBD coverage and masked regions

In certain situations, we may want to output the masked regions (--outmask flag) and the IBD coverage across the genome (--outcoverage flag). The output masked regions will be written to the gzip-compressed file "example-2.mask.gz", and the output IBD coverage will be written to the gzip-compressed file "example-2.coverage.gz".

./IBDkin --ibdfile ibd.txt --map plink.map --ind ind.txt --range range.txt --nthreads 5 --out example-2 --outmask --outcoverage

2.1.3 Distributed analysis

If there is not enough memory to run an analysis on a huge data set, we can use the option --part to distribute computation across multiple compute nodes. In the following script, we divide the analysis into five parts and run each part separately by assigning integers, from 1 to 5, to the variable ${part}.

part=1 # can be set as 2, 3, 4, and 5
./IBDkin --ibdfile ibd.txt --map plink.map --ind ind.txt --range range.txt --nthreads 5 --out example-3.${part} --part 5 ${part}

Caution! Specify a different output file prefix for each part when using the --part option, or you will overwrite your output files.
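As a rough sketch of how the five runs might be organized, the loop below writes one command per partition to a file. The file name "run_parts.sh" is a hypothetical name chosen for illustration and is not part of the IBDkin package. The loop only writes and prints the commands, so each line can be submitted to a separate compute node with whatever scheduler is available.

```shell
#!/bin/sh
# Dry run: write one IBDkin command per partition instead of executing it.
# "run_parts.sh" is a hypothetical file name, not part of the IBDkin package.
nparts=5
cmdfile=run_parts.sh
: > "$cmdfile"

part=1
while [ "$part" -le "$nparts" ]; do
  # Each partition gets its own output prefix so that no run overwrites another.
  echo "./IBDkin --ibdfile ibd.txt --map plink.map --ind ind.txt --range range.txt --nthreads 5 --out example-3.${part} --part ${nparts} ${part}" >> "$cmdfile"
  part=$((part + 1))
done

cat "$cmdfile"
```

Because the partition number is part of the output prefix, the five runs write to five distinct sets of files.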

For advanced parameter settings, please refer to our manuscript (https://doi.org/10.1093/bioinformatics/btaa569) and the Required Parameters and Optional Parameters sections.

[top]

2.2 Required Parameters

  • --ibdfile [file] #<string> the [file] contains the pathnames of files that list the IBD segments on each chromosome (one pathname per line, and one line per chromosome). The IBD segments must be stored in gzip-compressed hap-IBD format.
  • --map [file] #<string> the [file] is a genetic map with cM distances in PLINK format, including all chromosomes having IBD segments. The chromosome identifiers must be consistent in the IBD inputs, the genetic map, and the range file.
  • --ind [file] #<string> the [file] includes a list of individuals to be analyzed (one individual per line).
  • --range [file] #<string> the [file] includes three columns: the chromosome identifier, starting bp position, and ending bp position (one chromosome per line).
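Before a long run, it may help to check that the range file and the genetic map agree on chromosome identifiers. The sketch below is one possible check, not part of IBDkin. It assumes a whitespace-delimited range file and a four-column PLINK .map file with the chromosome in the first column, and it uses toy files with made-up values.

```shell
#!/bin/sh
# Toy inputs with made-up values, written to a temporary directory.
tmp=$(mktemp -d)
printf '1\t1000\t900000\n2\t5000\t800000\n' > "$tmp/range.txt"
printf '1\trs1\t0.0\t1000\n1\trs2\t1.2\t900000\n2\trs3\t0.0\t5000\n2\trs4\t0.9\t800000\n' > "$tmp/plink.map"

# Count range rows that do not have exactly three fields
# or whose start position is not below the end position.
bad_rows=$(awk 'NF != 3 || $2 + 0 >= $3 + 0' "$tmp/range.txt" | wc -l | tr -d ' ')

# List chromosome identifiers present in the range file but absent from the map.
missing=$(awk 'NR == FNR { seen[$1] = 1; next } !($1 in seen) { print $1 }' "$tmp/plink.map" "$tmp/range.txt")

echo "malformed range rows: $bad_rows"
echo "chromosomes missing from map: ${missing:-none}"
rm -rf "$tmp"
```

To check real inputs, point the two awk commands at your own range and map files instead of the toy files.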

[top]

2.3 Optional Parameters

Each option is followed by its default value.

  • --out ./ # <string> output prefix.
  • --nthreads 2 #<int> number of threads.
  • --kinship 0.0 #<float> minimum output kinship coefficient.
  • --binkb 1000.0 #<float> bin size in kbp to calculate IBD coverage.
  • --fold 4.0 #<float> max permitted fold deviation from the genome-wide median IBD coverage. Regions having greater deviation will be excluded in the kinship estimation.
  • --part 1 1 #<int> <int> total partitions and current partition. The first integer is the total number of partitions that will be analyzed, and the second integer (starting from 1) determines the partition that will be analyzed in this run. This option enables distributed analysis across multiple compute nodes. Please see the Distributed analysis section for an example.
  • --cutcm 4.0 2.0 #<float> <float> The minimum long and short IBD segment cM lengths. The first float is the minimum long IBD segment length, and the second float is the minimum short IBD segment length. A kinship coefficient is estimated for each pair of individuals having at least one long IBD segment. All short and long IBD segments are included when estimating the kinship coefficient in pairs of individuals having at least one long IBD segment.
  • --merge 5.0 20.0 #<float> <float> max cM merge lengths for IBD1 and IBD2 regions. The first float is the maximum length of the IBD0 region between two IBD1 regions that will be merged, and the second float is the maximum length of the non-IBD2 region between two IBD2 regions that will be merged.

[top]

2.4 Flags

  • --outmask # output masked regions.
  • --outcoverage # output IBD coverage.

[top]

3 Output Files

IBDkin has three outputs: kinship coefficients, IBD coverage, and masked regions. These outputs are reported in three gzip-compressed files.

3.1 Kinship Coefficients

The kinship coefficient output file (ending with ".kinship.gz") has eight tab-delimited columns:

  1. Sample identifier-1
  2. Sample identifier-2
  3. Number of IBD segments used for calculation
  4. IBD0 proportion
  5. IBD1 proportion
  6. IBD2 proportion
  7. Kinship coefficient
  8. Degree of relationship (values: 0, 1, 2, 3, >3)
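The eight columns above can be filtered with standard tools. The sketch below builds a toy kinship file with made-up sample identifiers and values, then lists the pairs whose eighth column is 1. It assumes the file has no header line, which should be checked against real IBDkin output.

```shell
#!/bin/sh
# Toy kinship file with made-up values:
# id1, id2, segment count, IBD0, IBD1, IBD2, kinship, degree.
tmp=$(mktemp -d)
printf 'A\tB\t12\t0.00\t1.00\t0.00\t0.250\t1\nA\tC\t3\t0.95\t0.05\t0.00\t0.0125\t>3\nB\tC\t20\t0.25\t0.50\t0.25\t0.250\t1\n' | gzip > "$tmp/toy.kinship.gz"

# Keep pairs whose degree of relationship (column 8) is exactly "1".
first_degree=$(gzip -dc "$tmp/toy.kinship.gz" | awk -F '\t' '$8 == "1" { print $1 "-" $2 }')

echo "$first_degree"
rm -rf "$tmp"
```

Comparing the eighth column as a string keeps the ">3" category from being misread as a number.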

[top]

3.2 IBD Coverage

The IBD coverage output file (ending in ".coverage.gz") has four tab-delimited columns:

  1. Chromosome
  2. Start position
  3. End position
  4. IBD coverage

The IBD coverage, the fourth column in the output, is the total number of IBD segments intersecting the chromosome interval, with each segment weighted by the proportion of the interval that it covers.
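The weighting can be illustrated with a small worked example. The sketch below is not IBDkin's own code: it computes the weighted count for a single interval from three made-up segments, assuming base-pair coordinates and measuring each overlap as end minus start.

```shell
#!/bin/sh
# Interval of interest: 1,000,000 to 2,000,000 bp (made-up values).
# Each input line is the start and end position of one IBD segment.
coverage=$(printf '500000 1500000\n1000000 2000000\n1900000 3000000\n' |
  awk -v bs=1000000 -v be=2000000 '
    {
      # Clip the segment to the interval.
      s = ($1 > bs) ? $1 : bs
      e = ($2 < be) ? $2 : be
      # Add the fraction of the interval that the segment covers.
      if (e > s) total += (e - s) / (be - bs)
    }
    END { printf "%.2f\n", total }')

echo "$coverage"
```

The three segments cover 0.5, 1.0, and 0.1 of the interval, so the script prints 1.60.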

[top]

3.3 Masked Regions

The masked regions output file (ending in ".mask.gz") has three tab-delimited columns:

  1. Chromosome
  2. Start position of each masked region
  3. End position of each masked region
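One way to summarize this file is to total the masked length on each chromosome. The sketch below uses a toy mask file with made-up positions. It assumes that the length of a region is its end position minus its start position and that the file has no header line.

```shell
#!/bin/sh
# Toy mask file with made-up values: chromosome, start, end.
tmp=$(mktemp -d)
printf '1\t100\t600\n1\t1000\t1500\n2\t50\t250\n' | gzip > "$tmp/toy.mask.gz"

# Sum the masked length per chromosome and sort by chromosome identifier.
masked=$(gzip -dc "$tmp/toy.mask.gz" |
  awk -F '\t' '{ len[$1] += $3 - $2 } END { for (c in len) print c "\t" len[c] }' |
  sort)

echo "$masked"
rm -rf "$tmp"
```

For the toy data, this reports 1000 masked positions on chromosome 1 and 200 on chromosome 2.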

[top]

4 Potential memory failures

  • Extremely long sample ID strings

If a sample ID is longer than 32 bytes, you may encounter this problem. You are welcome to blame the author and send an email for help. If you are familiar with C, try changing the values of the variables BUFF_id and BUFF_col in the file "head.h".

  • Small machine memory (<4 GB)

Ask the author for help or try to change the values of the variables BUFF_row and BUFF_col in the file "head.h".
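To see whether the first problem applies to your data, you can scan the individual list for long identifiers before running IBDkin. The sketch below uses a toy list with made-up sample names and the 32-byte threshold mentioned above. Setting LC_ALL=C is intended to make awk count bytes rather than multi-byte characters.

```shell
#!/bin/sh
# Toy individual list with made-up sample names, one per line.
tmp=$(mktemp -d)
printf 'sample_0001\nan_extremely_long_sample_identifier_0002\n' > "$tmp/ind.txt"

# Print every identifier longer than 32 bytes.
long_ids=$(LC_ALL=C awk 'length($1) > 32 { print $1 }' "$tmp/ind.txt")

echo "IDs longer than 32 bytes: ${long_ids:-none}"
rm -rf "$tmp"
```

If the command reports any identifiers for your own "ind.txt", shortening them may be simpler than editing "head.h".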

[top]

5 License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

[top]
