szpiech / selscan

Haplotype based scans for selection

License: GNU General Public License v3.0

C 70.61% C++ 21.24% Makefile 0.12% R 0.34% TeX 7.68%

selscan's Issues

.hap file of selscan input

Each variant consists of two or more alleles, but the .hap file requires 0 or 1 for each variant. For example, for a SNP with A and G alleles, how do I code the genotypes (AA, AG, GG) in the .hap file?
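
A minimal sketch of one way to build .hap rows, assuming the manual's description that the file holds one row per haplotype and one column per variant, so a diploid individual contributes two rows of 0/1 rather than one genotype code. The function name `genotypes_to_hap_rows`, the allele-to-0/1 mapping, and the toy genotypes are made up for illustration, and the genotypes must already be phased.

```python
def genotypes_to_hap_rows(phased_genotypes, allele_codes):
    """Turn phased diploid genotypes into 0/1 haplotype rows.

    phased_genotypes: one list per individual, one two-letter string per SNP,
                      e.g. "AG" means copy 1 carries A and copy 2 carries G.
    allele_codes:     one dict per SNP mapping each allele to 0 or 1
                      (hypothetical coding chosen by the user).
    """
    rows = []
    for individual in phased_genotypes:
        for copy in (0, 1):  # each individual yields two haplotype rows
            codes = [allele_codes[j][genotype[copy]]
                     for j, genotype in enumerate(individual)]
            rows.append(" ".join(str(c) for c in codes))
    return rows


# Toy data: 3 individuals, 2 SNPs (A/G and C/T).
codes = [{"A": 0, "G": 1}, {"C": 0, "T": 1}]
individuals = [["AA", "CT"], ["AG", "TT"], ["GG", "CC"]]
hap_rows = genotypes_to_hap_rows(individuals, codes)
print("\n".join(hap_rows))
```

Under this layout an AG heterozygote is simply one haplotype carrying 0 and one carrying 1 at that column.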

How to determine the core marker?

I am a newcomer to EHH-based scans for positive selection. I have a question: how do I determine the core markers? If we have some background on the genes under consideration, we can set the core markers ourselves. However, in most cases, when we scan for positive selection along the genome, we do not have such information for each gene. Is that right?
If I only have physical positions for SNPs and no genetic map information, can I still use selscan to scan for positive selection? How should I arrange the map file? Should I set the genetic position to 0? Thanks.
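
One common workaround when no genetic map is available is to derive genetic positions from physical ones under a constant recombination rate. The sketch below assumes a four-column map layout (chromosome, locus ID, genetic position, physical position) and a rate of 1 cM per Mb; the function name `constant_rate_map_lines` and the SNP IDs are hypothetical, and the unit selscan expects for the genetic column should be checked against the manual.

```python
def constant_rate_map_lines(chrom, snps, cm_per_mb=1.0):
    """Build map-file lines from physical positions only.

    snps: list of (snp_id, physical_position_bp) tuples, sorted by position.
    cm_per_mb: assumed constant recombination rate (an assumption, not a
               property of the data).
    """
    lines = []
    for snp_id, bp in snps:
        genetic_cm = bp / 1_000_000 * cm_per_mb
        lines.append(f"{chrom} {snp_id} {genetic_cm:.6f} {bp}")
    return lines


lines = constant_rate_map_lines(
    "1", [("snp1", 9875), ("snp2", 10048), ("snp3", 67556)]
)
print("\n".join(lines))
```

A constant rate ignores recombination hotspots, so scores in regions with unusual recombination should be read with some caution.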

Window size selection

Dear Sir,
I am working with SNP data. We ran XP-EHH using selscan and it works seamlessly. I am now looking for a window size with which to rerun the analysis. Is there any way to set the window size to 5 kb or so?
I would be grateful for any help in this regard.
Regards
Devender

Some questions about XP-EHH calculation

Hi, I used selscan to calculate XP-EHH recently, and I ran into two questions.

1). I did not get any useful results after the "norm" step (almost all windows have a fracCrit value of 0 [column 4] in the 'norm' output). After checking every file manually, I found some very large XP-EHH values (e.g. 525.xxx; the exact value differs between files) in the input files to 'norm', and also 'inf' at some sites. I think this is abnormal, but I do not know what is wrong.

Because the data set is large, I split the input file by window.
My command lines are:
$selscan --xpehh --hap $$hap2s[$i] --ref $$hap1s[$i] --map $$maps[$i] --out $out --threads $thread_xpehh
norm --xpehh --files *.xpeff --bp-win --winsize 10000 --min-snps 10

2). In the absence of a good-quality recombination map, I assumed a recombination rate of 1 cM (centiMorgan) per Mbp across the genome. I don't know what effect this has on the XP-EHH calculation.

I'm a newbie in population genetics and can't solve this by myself. I would appreciate it a lot if you could give me some advice.

Include latest version of norm in tagged release?

It looks like the latest version of selscan supporting the --ihs-detail flag made it into the 1.1.0-tagged release, but the version of norm which will tolerate the detailed ten-column ihs data did not. Would it be possible to include the latest version of norm in a new tagged release?

Some questions about Selscan

Dear Szpiech,

Recently, I have been doing genome-wide scans for positive selection on my data sets with your software selscan. In my analysis, I used nSL (--nsl) to identify candidate selected regions (the input files are phased VCF files), and following your manual I obtained the output files (selscan + norm pipeline). But some questions still confuse me, and I need your help.
First, when I use the command "norm --nsl --files <file1..out> ... <fileN..out>" to normalize selscan output across frequency bins, the output files generated by norm have no header describing each column. For example:
. 180387 0.0571429 100.917 25.766 1.36524 1.54632 0
. 181142 0.385714 28.9231 49.4042 -0.535396 -0.191878 0
. 182751 0.228571 55.5 33.1852 0.514279 1.15106 0
. 183083 0.385714 29.5385 49.5183 -0.516649 -0.16627 0
. 184456 0.371429 30.8031 47.1786 -0.426327 -0.048742 0

What does the last column (e.g. the value "0") mean?
Second, if I use the command "norm --nsl --files <file1..out>...<fileN..out> --bp-win --winsize 20000" to normalize my selscan results, the normalized files look like this:
1 20001 0 -1 -1
20001 40001 0 -1 -1
40001 60001 0 -1 -1
60001 80001 0 -1 -1
80001 100001 0 -1 -1
100001 120001 0 -1 -1
120001 140001 0 -1 -1
140001 160001 0 -1 -1
160001 180001 0 -1 -1
180001 200001 19 0 100
200001 220001 8 0 -1
220001 240001 89 0 100
240001 260001 22 0.0454545 100
260001 280001 30 0 100

For the first two columns, I know they are the start and end positions of each window, but I have no idea what the remaining three columns mean.

The last question is how to identify candidate regions of positive selection from the selscan output files. For example, if I use the normalized output file generated by norm (without the --bp-win parameter), should I simply pick as candidates the loci whose normalized values fall in the extreme 1% or 99% tails? And if I use the normalized results generated by norm with --bp-win and --winsize, what is the best way to pick candidate loci/regions from the genome-wide windowed values?

Looking forward to your reply!
Thank you in advance
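
For the per-locus question, a minimal sketch of picking the most extreme loci by absolute normalized score. It assumes the normalized score is the seventh column (index 6) of the headerless rows quoted above, which is an assumption about the layout rather than something the output states; `extreme_loci` is a hypothetical helper, and the fraction is set to 0.2 only because the toy input has five rows.

```python
def extreme_loci(rows, score_col=6, fraction=0.01):
    """Return the rows with the largest |score|, keeping `fraction` of them.

    rows: list of already-split lines from a normalized output file.
    score_col: assumed index of the normalized score column.
    """
    ranked = sorted(rows, key=lambda r: abs(float(r[score_col])), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# The five rows quoted in the question above.
text = """. 180387 0.0571429 100.917 25.766 1.36524 1.54632 0
. 181142 0.385714 28.9231 49.4042 -0.535396 -0.191878 0
. 182751 0.228571 55.5 33.1852 0.514279 1.15106 0
. 183083 0.385714 29.5385 49.5183 -0.516649 -0.16627 0
. 184456 0.371429 30.8031 47.1786 -0.426327 -0.048742 0"""
rows = [line.split() for line in text.splitlines()]
top = extreme_loci(rows, fraction=0.2)
print([r[1] for r in top])
```

In practice the rows from all chromosomes would be pooled before ranking, so that the cutoff is genome-wide.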

Core dump when using --nsl in norm

Using the --nsl flag in norm causes the program to crash when the normalisation starts. Using the --ihs flag on the same data works:

ERROR: No such file or directoryterminate called after throwing an instance of 'int'
Aborted (core dumped)

Interpretation of plot

Hi dear,

First, I just want to thank you for answering my previous inquiry,

I have a little confusion that I do hope you would clarify for me, please?

In your paper, I have read the following paragraph:

When the rate of EHH decay is similar on the ancestral and
derived alleles, iHHA/iHHD ≈ 1, and hence the unstandardized
iHS is ≈ 0. Large negative values indicate unusually
long haplotypes carrying the derived allele; large positive
values indicate long haplotypes carrying the ancestral allele.

So, using the example data, I tried to plot the SNPs with extreme values (the most negative and most positive iHS) as well as a SNP with an iHS of zero.

This plot was produced from the locus with the highest iHS value, where, according to the above paragraph, I expected to see extended haplotypes on the ancestral allele. However, I see the opposite (extended haplotypes on the derived allele):
https://www.dropbox.com/s/aa5lu2wsw0rpl58/example.Locus5138.ehh.Locus5138.out.ehh.eps?dl=0

This plot is for the locus with the most negative iHS value. Similarly, I see an extended derived allele and not the reverse:
https://www.dropbox.com/s/7zmj8ro3o49rqcb/example.Locus6691.ehh.Locus6691.out.ehh.eps?dl=0

This plot is for a SNP with an iHS value near zero:
https://www.dropbox.com/s/azilt4kjh09szhw/example.Locus3712.ehh.Locus3712.out.ehh.eps?dl=0

I would much appreciate it if you could explain this.

Thanks

Salha

Makefile issues

Hi Folks,

I am trying to get selscan up and running on our lab computer. I am running Ubuntu through Windows, as the make command doesn't work in Windows. When I go to the src folder and run make, I get this error:

collect2: error: ld returned 1 exit status
Makefile:50: recipe for target 'norm' failed
make: *** [norm] Error 1

There is probably a simple fix for this, but I wasn't sure what that would be.
Any help would be greatly appreciated.

-Matt

Segmentation fault/bus error when using --ehh flag v1.1.0b

Under some (currently unclear) circumstances, use of the --ehh flag can cause segmentation faults, bus errors, or output inconsistent with what is expected. v1.1.0 does not appear to be affected.

ihh12 normalization

Hi
I tried to normalize ihh12 files, but the distribution I finally obtained was not zero-centered and did not look normal. So I do not know what's going on. Could you help me?
Thank you in advance
All the best
christophe

normalize XP-nSL

Hi,

I'm currently trying to perform your new method XP-nSL on my data. I have normalized the results with non-overlapping windows of 20kb, and the command I used is norm --xpnsl --files hm_eas_cen_chr1.xpnsl.out --bp-win --min-snps 20 --winsize 20000.
However I'm not sure what the results from column 4 to column 7 in the *.windows file represent. The manual describes it like this:

For XP stats:
<# scores in win> <frac scores gt threshold> <frac scores lt threshold> <approx percentile for gt threshold wins> <approx percentile for lt threshold wins>

I am a bit confused about the meaning of "gt" and "lt". Does the former one mean the positive scores, and the latter one mean negative scores?

The *.windows file I got is as follows:
1 20001 273 0.014652 0 100 5 2.15695 0.360257
20001 40001 259 0.169884 0 100 5 2.90022 0.62959
40001 60001 250 0.868 0 5 5 4.83393 1.36376
60001 80001 239 1 0 1 100 3.38311 2.06651
80001 100001 235 0.595745 0 5 100 3.63697 0.842613

Thanks in advance!

about the xpehh output file

I have a question about the xpehh method: what causes the result files to be empty?

The command:
selscan --xpehh --tped LW.xpehh.1.tped --tped-ref AWB.xpehh.1.tped --out xpehh

selscan v1.2.0a
Opening LW.xpehh.1.tped...
Loading 40 haplotypes and 1147166 loci...
Opening AWB.xpehh.1.tped...
Loading 42 haplotypes and 1147166 loci...
Opening LW.xpehh.1.tped...
Loading map data for 1147166 loci
Starting XP-EHH calculations.
|==============================================================================|
Finished.

The files:
$ head LW.xpehh.1.tped
1 . 0 9875 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 . 0 10048 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 . 0 67556 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
1 . 0 95330 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 0 0 0 0
1 . 0 121258 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 0 0 0 0
1 . 0 353207 0 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
1 . 0 743023 0 0 1 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 1 1 1 0 1 0 1
1 . 0 753016 1 1 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1
1 . 0 841793 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1
1 . 0 881051 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1

$ head AWB.xpehh.1.tped
1 . 0 9875 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
1 . 0 10048 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
1 . 0 67556 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 . 0 95330 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 . 0 121258 0 0 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 . 0 353207 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 1 1 1
1 . 0 743023 0 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
1 . 0 753016 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 0 1 1
1 . 0 841793 1 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1
1 . 0 881051 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Do I need to provide additional information?

understand the out file of selscan

The iHS output file includes the following columns: ID, pos, '1' freq, ihh1, ihh0, unstandardized iHS. I want to know which allele was selected when iHS (the last column) is positive or negative. I did not encode the input file using 0 for the ancestral allele and 1 for the derived allele. My guess is that a positive iHS means the allele encoded as '1' was selected, whether that allele is ancestral or derived. Is my understanding correct?

allow norm to specify arbitrary iHS score cutoff

It would be nice to have a command line option to set a percentile cutoff for iHS scores instead of having a hard 2.0 cutoff.
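
To illustrate what a percentile-based cutoff would mean, here is a minimal sketch that derives the cutoff from the scores themselves instead of using a fixed 2.0. This is not norm's code; the function name `percentile_cutoff` and the toy scores are hypothetical.

```python
def percentile_cutoff(scores, top_fraction):
    """Return the |score| value such that roughly `top_fraction` of the
    scores are at or above it in absolute value."""
    ranked = sorted(abs(s) for s in scores)
    n_extreme = max(1, round(len(ranked) * top_fraction))
    return ranked[len(ranked) - n_extreme]


# Toy normalized scores (made up for illustration).
scores = [0.1, -0.5, 2.3, -1.2, 0.7, -2.8, 1.9, 0.3, -0.9, 1.1]
cutoff = percentile_cutoff(scores, top_fraction=0.2)
extreme = [s for s in scores if abs(s) >= cutoff]
print(cutoff, extreme)
```

With a data-driven cutoff the number of flagged SNPs is fixed by the chosen fraction, whereas a fixed cutoff lets that number vary with the score distribution.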

Support for VCF files

It would be helpful to have direct support for VCF files, with tabix indices

selscan 1.0.2 XPEHH calculation broken

It has come to my attention that I have introduced an error into XPEHH calculations in selscan version 1.0.2. Please use selscan version 1.0.1 or earlier for XPEHH calculations until this is patched!

The question about 'selscan' and 'norm' result

Dear Prof. Szpiech ZA,

My name is Wang Zezhao, a graduate student at the Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China. I used the ‘selscan’ software to calculate iHS values across the cattle genome to detect selection signals.

We obtained a total of 478,903 SNPs with estimated iHS scores (selscan --ihs --hap chr1.hap --map chr1.map --out chr1_result). All SNP sites were normalized using the ‘norm’ software and then used to identify candidate regions (norm --ihs --files chr1_result.ihs.out --bp-win).

In the output file from ‘norm’, 24,820 regions were detected (Attachment 1). We found that the last column of the output file takes four values (-1, 1, 5, 100). Could you please tell me how the indicator values (1, 5, 100) are defined? Does the indicator value ‘1’ represent the top 1% threshold?

In our analysis, only 24,820 non-overlapping windows (100 kb) were examined, but based on the indicator value ‘1’ we identified 710 regions in total. So the results confuse us: are all 710 of those in the top 1%? The top 1% of 24,820 non-overlapping windows would be 248 windows, so I'm not sure what the top 1% criterion means.

Could you please help explain this, or tell us whether there is anything else we should be concerned about in our results?

We would very much appreciate any help from you.

Thank you for your time, and I look forward to hearing from you.

Best wishes !

Zezhao Wang
result.ihs.100bins.norm.100kb.merge.windows.txt

Error when using physical map sorted based on physical position and ID

I'm running into a silly problem. My physical map file has repeated physical positions with unique IDs. The data is sorted by physical position and then by ID. Example at the end of the message.

When I try to use iHS, I get the following problem: ERROR: Variant physical position must be strictly increasing.
rs201044430 216605 comes after rs112068709 216605
My data is already sorted so that 'rs201044430 216605' comes after 'rs112068709 216605'. So I'm not sure what to do differently.

Best,
Vanessa

Sample file

7 rs28527214 216426 216426
7 rs66644650 216512 216512
7 rs148463803 216515 216515
7 rs28485819 216569 216569
7 rs28498692 216570 216570
7 rs112068709 216605 216605
7 rs201044430 216605 216605
7 rs188651719 216660 216660
7 rs193275413 216662 216662
7 rs137869704 216672 216672
7 rs139968177 216735 216735
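
The error message asks for strictly increasing positions, so two variants sharing position 216605 would trip it regardless of how ties are ordered. A minimal sketch for locating such pairs before running selscan, assuming the four-column layout of the sample above; `non_increasing_positions` is a hypothetical helper, not part of selscan.

```python
def non_increasing_positions(map_lines):
    """Return (previous, current) pairs where the physical position does not
    strictly increase. Each item is a (snp_id, position) tuple."""
    problems = []
    previous = None
    for line in map_lines:
        _chrom, snp_id, _genetic_pos, physical_pos = line.split()
        current = (snp_id, int(physical_pos))
        if previous is not None and current[1] <= previous[1]:
            problems.append((previous, current))
        previous = current
    return problems


sample = [
    "7 rs28498692 216570 216570",
    "7 rs112068709 216605 216605",
    "7 rs201044430 216605 216605",
    "7 rs188651719 216660 216660",
]
problems = non_increasing_positions(sample)
print(problems)
```

How to resolve a flagged pair (dropping one variant, for instance) is a judgment call that depends on what the duplicated position represents in the data.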

Why does XP-EHH skip roughly the first 100 SNPs?

Hi
I have been running selscan for a while. I just noticed that if I run selscan with the default settings, the output skips the first 100 SNPs. I don't know why this happens. Do you have any suggestions?

Could Selscan be faster?

Hi,
Thanks for this great software, but I found that selscan is more than 100 times slower than rehh and hapbin. Could you update it to make it faster?

ihh12 norm file

Hi,
When normalizing iHH12 the given output has the header inserted in the first line:

Thank you in advance,

Aina

norm logfile errors

Hi Zachary,

When submitting my scripts for norm I keep getting the following error:

You have provided 3 output files for joint normalization.
ERROR: logfile No such file or directory

code:
./selscan-linux-1.3.0/norm --ihs --files ch1selscangm.ihs.out ch2selscangm.ihs.out ch5selscangm.ihs.out --bins 50
norm v1.3.0

I have tried using the --log option but still get the same error. The code was working previously and suddenly stopped.

Many thanks

Mark

norm: --crit-percent command not working?

Hi!
I was trying to normalize my iHS values using the "--crit-percent" parameter to find the SNPs in the most extreme tails. My first question: if a SNP has an extreme value, we expect a "1" in the last column, right? If so, while testing different --crit-percent values (0.5, 0.10, ...., 0.90, 0.95, 0.99) I found no difference in the number of 'extreme' SNPs, so this parameter does not seem to be working. I also found that all SNPs with a "1" in the last column are those with iHS > 2, so I guess the default "--crit-val" is the one actually being applied.
My command is:
$ norm --ihs --files SNPs.ihs.out --bins 10 --crit-percent 0.95
Any idea of what is going on?
Thanks!

Missing one column in XPEHH output header

The XPEHH output contains 8 columns but the header has only 7; the first column (locusID) is missing from the header.

norm gives ERROR: There are no int flags named --crit-val

This error has been fixed in the latest push to norm.cpp. Binaries will be recompiled on Monday December 22, 2014.

norm Issue

Plot iHS from selscan

Dear Zachary A Szpiech,

Can you please help me with this matter?

I managed to run the test for signatures of selection using selscan, and now I am trying to make a plot.

I am trying to use the R script you provide in the src folder. Is this how it should be run?

plotHaplotypes(example.ihs.out), of course after first loading the function.

I am looking forward to hearing from you.

Thanks

Interpreting positive and negative XP-EHH values

Hi there,

I wonder what negative and positive XP-EHH values mean. In the original publication of Sabeti et al. (2007), they state that positive values reflect recent selection in population A and negative values reflect selection in population B. In selscan, the input files are given by --hap and --ref. So would a negative XP-EHH value mean likely selection in the reference population?

Thanks for clarification.

And btw thank you for this great program!

Best wishes

returning iHH0/iHH1 values for right vs. left?

Currently, selscan returns iHH for each variant at a site, as well as the unstandardized iHS. In calculating the iHH values, selscan presumably first calculates iHH to the left and to the right of each variant. Is it possible to return these intermediate values (iHH0_left, iHH0_right, iHH1_left, iHH1_right) as well?

get ancestral allele

Hi,

Very nice tool. After reading the manual, I still have some questions about the input files. I do not understand how to determine the ancestral allele when I only have a domesticated population. Can I just use the reference allele in the VCF file instead of the ancestral allele?

Also, if I have an ancestral allele file, can I include heterozygous SNPs in the hap file? In the example files in the selscan directory, I only saw a single haploid copy per individual. Does this mean the derived allele should be homozygous in a diploid individual?

I hope you can help me with above questions.

Many thanks!

selscan reporting scores close to chromosome edges

When the edge of a chromosome is reached and EHH > EHH_CUTOFF selscan should abort computation at that locus and move to the next. However, selscan presently integrates the truncated curves and reports scores at those loci.

This does not affect loci near centromeres, which are handled as large gaps and are appropriately skipped.
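
The intended behaviour described above can be sketched as follows. This is an illustration of the rule only, not selscan's implementation: `integrate_ehh` is a hypothetical function that integrates an EHH curve with the trapezoid rule and returns None when the data run out before EHH drops below the cutoff.

```python
def integrate_ehh(distances, ehh_values, cutoff=0.05):
    """Trapezoid-integrate EHH outward from the core locus.

    Returns the area up to the first point where EHH < cutoff, or None if
    the chromosome edge (end of the arrays) is reached first, signalling
    that the locus should be skipped rather than scored.
    """
    area = 0.0
    for i in range(1, len(ehh_values)):
        width = distances[i] - distances[i - 1]
        area += 0.5 * (ehh_values[i - 1] + ehh_values[i]) * width
        if ehh_values[i] < cutoff:
            return area
    return None  # truncated curve: EHH never decayed below the cutoff


decayed = integrate_ehh([0, 1, 2, 3], [1.0, 0.5, 0.2, 0.04])
truncated = integrate_ehh([0, 1, 2], [1.0, 0.8, 0.6])
print(decayed, truncated)
```

Returning None instead of a partial area keeps truncated loci out of downstream normalization.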

*** buffer overflow detected ***: ./selscan terminated

I am currently investigating a bug in the Linux version that results in a buffer overflow error. It is currently unknown what conditions cause it.

*** buffer overflow detected ***: ./selscan terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7ff0e0b98f47]
/lib/x86_64-linux-gnu/libc.so.6(+0x109e40)[0x7ff0e0b97e40]
/lib/x86_64-linux-gnu/libc.so.6(+0x1092a9)[0x7ff0e0b972a9]
/lib/x86_64-linux-gnu/libc.so.6(_IO_default_xsputn+0xdd)[0x7ff0e0b0a13d]
/lib/x86_64-linux-gnu/libc.so.6(_IO_vfprintf+0x1d42)[0x7ff0e0ad8702]
/lib/x86_64-linux-gnu/libc.so.6(__vsprintf_chk+0x94)[0x7ff0e0b97344]
/lib/x86_64-linux-gnu/libc.so.6(__sprintf_chk+0x7d)[0x7ff0e0b9728d]
./selscan[0x406606]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7ff0e0e55e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7ff0e0b823fd]
======= Memory map: ========
00400000-00423000 r-xp 00000000 08:01 55393                              /home/szpiech/code/selscan/src/selscan
00622000-00623000 r--p 00022000 08:01 55393                              /home/szpiech/code/selscan/src/selscan
00623000-00624000 rw-p 00023000 08:01 55393                              /home/szpiech/code/selscan/src/selscan
013f3000-01529000 rw-p 00000000 00:00 0                                  [heap]
7ff0dc000000-7ff0dc021000 rw-p 00000000 00:00 0 
7ff0dc021000-7ff0e0000000 ---p 00000000 00:00 0 
7ff0e028d000-7ff0e028e000 ---p 00000000 00:00 0 
7ff0e028e000-7ff0e0a8e000 rw-p 00000000 00:00 0                          [stack:2282]
7ff0e0a8e000-7ff0e0c43000 r-xp 00000000 08:01 13133                      /lib/x86_64-linux-gnu/libc-2.15.so
7ff0e0c43000-7ff0e0e43000 ---p 001b5000 08:01 13133                      /lib/x86_64-linux-gnu/libc-2.15.so
7ff0e0e43000-7ff0e0e47000 r--p 001b5000 08:01 13133                      /lib/x86_64-linux-gnu/libc-2.15.so
7ff0e0e47000-7ff0e0e49000 rw-p 001b9000 08:01 13133                      /lib/x86_64-linux-gnu/libc-2.15.so
7ff0e0e49000-7ff0e0e4e000 rw-p 00000000 00:00 0 
7ff0e0e4e000-7ff0e0e66000 r-xp 00000000 08:01 13151                      /lib/x86_64-linux-gnu/libpthread-2.15.so
7ff0e0e66000-7ff0e1065000 ---p 00018000 08:01 13151                      /lib/x86_64-linux-gnu/libpthread-2.15.so
7ff0e1065000-7ff0e1066000 r--p 00017000 08:01 13151                      /lib/x86_64-linux-gnu/libpthread-2.15.so
7ff0e1066000-7ff0e1067000 rw-p 00018000 08:01 13151                      /lib/x86_64-linux-gnu/libpthread-2.15.so
7ff0e1067000-7ff0e106b000 rw-p 00000000 00:00 0 
7ff0e106b000-7ff0e1080000 r-xp 00000000 08:01 5873                       /lib/x86_64-linux-gnu/libgcc_s.so.1
7ff0e1080000-7ff0e127f000 ---p 00015000 08:01 5873                       /lib/x86_64-linux-gnu/libgcc_s.so.1
7ff0e127f000-7ff0e1280000 r--p 00014000 08:01 5873                       /lib/x86_64-linux-gnu/libgcc_s.so.1
7ff0e1280000-7ff0e1281000 rw-p 00015000 08:01 5873                       /lib/x86_64-linux-gnu/libgcc_s.so.1
7ff0e1281000-7ff0e137c000 r-xp 00000000 08:01 13153                      /lib/x86_64-linux-gnu/libm-2.15.so
7ff0e137c000-7ff0e157b000 ---p 000fb000 08:01 13153                      /lib/x86_64-linux-gnu/libm-2.15.so
7ff0e157b000-7ff0e157c000 r--p 000fa000 08:01 13153                      /lib/x86_64-linux-gnu/libm-2.15.so
7ff0e157c000-7ff0e157d000 rw-p 000fb000 08:01 13153                      /lib/x86_64-linux-gnu/libm-2.15.so
7ff0e157d000-7ff0e165f000 r-xp 00000000 08:01 144528                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7ff0e165f000-7ff0e185e000 ---p 000e2000 08:01 144528                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7ff0e185e000-7ff0e1866000 r--p 000e1000 08:01 144528                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7ff0e1866000-7ff0e1868000 rw-p 000e9000 08:01 144528                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7ff0e1868000-7ff0e187d000 rw-p 00000000 00:00 0 
7ff0e187d000-7ff0e189f000 r-xp 00000000 08:01 13154                      /lib/x86_64-linux-gnu/ld-2.15.so
7ff0e1a88000-7ff0e1a8d000 rw-p 00000000 00:00 0 
7ff0e1a9c000-7ff0e1a9f000 rw-p 00000000 00:00 0 
7ff0e1a9f000-7ff0e1aa0000 r--p 00022000 08:01 13154                      /lib/x86_64-linux-gnu/ld-2.15.so
7ff0e1aa0000-7ff0e1aa2000 rw-p 00023000 08:01 13154                      /lib/x86_64-linux-gnu/ld-2.15.so
7fffad9a3000-7fffad9c4000 rw-p 00000000 00:00 0                          [stack]
7fffad9d1000-7fffad9d2000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted (core dumped)

Implementation of the 'ihs' version of iHS

As noted in the Sabeti et al. (2007) Supplemental Information, the program 'ihs' computes an undocumented modified iHS statistic. At moderate to large sample sizes (>50 haplotypes), 'ihs' and selscan are highly correlated (r > 0.99), but at very low sample sizes (~10 haplotypes) the two diverge somewhat.

I plan to add an option to compute the 'ihs' version of iHS.

Are the methods included in selscan suitable for multi-allelic markers?

Dear author:
I am working on detecting selection signals with a kind of marker that has more than two alleles, so I am wondering whether there is a method suitable for this kind of marker. Below are example lines for my markers in VCF format (please ignore the allele frequency).
1 4485 BLK_1_4485_26943 0 1,2,3,4,5,6,7,8 . . . GT 0|0 1|1 2|2 2|2 6|6 1|1 7|7 5|5 6|6 0|0 8|8 8|8 7|7 6|6 7|7 0|0 0|0 0|0 6|6 0|0 4|4 6|6 5|5 1|1 0|0 1|1 1|1 4|4 2|2 7|7 7|7
1 75460 BLK_1_75460_88853 0 1,2,3,4,5 . . . GT 0|0 1|1 2|2 2|2 0|0 1|1 0|0 1|1 2|2 0|0 3|3 0|0 0|0 2|2 0|0 0|0 0|0 0|0 2|2 0|0 4|4 0|0 1|1 1|1 0|0 1|1 1|1 5|5 5|5 3|3 4|4

This kind of marker, which we call SNPLDB, was constructed from whole-genome SNP data based on the LD (D') values between SNPs (Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D (2002) The structure of haplotype blocks in the human genome. Science 296:2225–2229).
Looking forward to your reply; any help will be appreciated.

Physical vs genetic map

This may be a trivial problem for seasoned veterans, but I keep having issues with getting a genetic map for my VCF file. In your manuscript, you hinted that you ran selscan with only a physical map on the CEU22 data? From the command line parameters, it doesn't look like you can do that? Is it possible?
Also, where/how can I get a genetic map for my VCF files (from the 1000 Genomes Project)? The 1000 Genomes Project has an omni recombination rate file, but it doesn't contain a rate or genetic distance for every SNP. How can I interpolate that for each SNP in my VCF file? Any advice would be much appreciated. Surprisingly, there is little forum conversation on this.
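
One straightforward option is linear interpolation between the two nearest positions in the reference map. The sketch below assumes the reference map has already been read into two parallel sorted lists (physical positions in bp and genetic positions in cM); `interpolate_genetic_position` and the toy map are hypothetical, and clamping positions outside the map to the nearest end is a simplification.

```python
from bisect import bisect_right


def interpolate_genetic_position(bp, map_bp, map_cm):
    """Linearly interpolate a genetic position for physical position `bp`.

    map_bp: sorted physical positions from a reference genetic map.
    map_cm: genetic positions at those same map points.
    Positions outside the map are clamped to the first or last map value.
    """
    if bp <= map_bp[0]:
        return map_cm[0]
    if bp >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_right(map_bp, bp)  # first map point strictly right of bp
    left_bp, right_bp = map_bp[i - 1], map_bp[i]
    fraction = (bp - left_bp) / (right_bp - left_bp)
    return map_cm[i - 1] + fraction * (map_cm[i] - map_cm[i - 1])


# Toy reference map (made-up numbers).
map_bp = [1000, 2000, 4000]
map_cm = [0.0, 0.1, 0.5]
positions = [1500, 3000, 5000]
interpolated = [interpolate_genetic_position(p, map_bp, map_cm) for p in positions]
print(interpolated)
```

The interpolated values can then be written into the genetic-position column of the map file, one line per SNP in the VCF.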

error message in norm

Just a heads-up about an error flag in norm: I was using norm and getting an error about not being able to find the log file. It turns out that the log file was fine, and the problem was that one of the 1000 files I was passing with the --files flag was not there.
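
A quick pre-flight check along these lines can catch the missing file before norm is called. This is a generic sketch; `missing_files` is a hypothetical helper, and the file names are created in a temporary directory purely for demonstration.

```python
import os
import tempfile


def missing_files(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


# Demonstration with one present and one absent file.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "chr1.ihs.out")
    absent = os.path.join(tmp, "chr2.ihs.out")
    open(present, "w").close()
    missing = missing_files([present, absent])
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)
```

Running the check on the exact list that will be passed to --files makes a misleading log file error easier to rule out.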

XP-EHH normalization and analysis of non-overlapping windows

Hi,

I used the following command to normalize and analyze XP-EHH results in non-overlapping windows of fixed bp size (100kb)
norm --xpehh --files ./Results/XP-EHH/BEN_${p}.chr*.xpehh.out --bp-win --min-snps 20 --winsize 100000 --crit-percent 0.01;
I was wondering whether I am doing the right thing (I want to identify the windows with the 5% most extreme percentage of extreme scores), and what this function really does.
When I look at the final results, I see some windows with 0% extreme values that are significant at 5%. I also see in the log file that the most extreme windows seem to be identified within bins defined by the number of SNPs per window, and that for bins with a low number of SNPs (<40) the threshold for a significant signal is 0. Is this normal?

Thanks,
Jacqueline

norm_xpehh_win100kb.pdf

|iHS| >2 uses standardized iHS or unstandardized iHS?

I want to know whether the extreme positive or extreme negative iHS criterion (|iHS| > 2) uses standardized or unstandardized iHS.

--crit-val: Set the critical value such that a SNP with |iHS| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al.
Default: 2.00

Thank you in advance

Weiwei Fu

Request: alternate version of XP-EHH for better performance on incomplete sweeps

The current implementation of XP-EHH uses the formula from Sabeti et al. (2007) for EHH, with (N choose 2) in the denominator. There is an alternate version that can be found in the supplement of Wagh et al. 2012 (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0044751#s6) (supplementary note here: http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0044751.s011&type=supplementary), and I wonder if you would be willing to implement it as an optional flag? The difference is in the denominator of EHH: (N_A choose 2) + (N_a choose 2) where N_A and N_a are the numbers of chromosomes with each of the alleles at the core SNP. The benefit of this definition is that the denominator is the number of pairwise comparisons that could possibly be identical given the alleles at the core SNP, instead of just the total number of possible pairwise comparisons. This makes it much more sensitive for ongoing sweeps in which the beneficial allele has not yet swept through most of the population (and so the number of pairwise comparisons is greater than the number of those comparisons that could possibly result in a match). I have already implemented this in my forked repository by adding in a few lines in selscan-main.cpp if this is helpful. Thanks!!
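
The difference between the two denominators can be seen on a toy example. The sketch below is not taken from selscan or from the fork mentioned above; `ehh_both_denominators` is a hypothetical function that counts identical haplotype pairs and divides once by (N choose 2) and once by (N_A choose 2) + (N_a choose 2). It needs Python 3.8 or later for `math.comb`.

```python
from collections import Counter
from math import comb


def ehh_both_denominators(haplotypes, core_index):
    """Return (standard EHH, alternate EHH) for haplotypes given as strings
    of '0'/'1' covering the region from the core SNP to the current marker."""
    # Pairs of chromosomes that are identical across the whole region.
    identical_pairs = sum(comb(c, 2) for c in Counter(haplotypes).values())
    # Standard denominator: all possible pairs of chromosomes.
    all_pairs = comb(len(haplotypes), 2)
    # Alternate denominator: only pairs that share the core allele.
    core_counts = Counter(h[core_index] for h in haplotypes)
    matchable_pairs = sum(comb(c, 2) for c in core_counts.values())
    return identical_pairs / all_pairs, identical_pairs / matchable_pairs


# Toy sample: three identical haplotypes on core allele '1' (a partial
# sweep) and three distinct haplotypes on core allele '0'.
haps = ["111", "111", "111", "010", "001", "000"]
standard, alternate = ehh_both_denominators(haps, core_index=0)
print(standard, alternate)
```

Here the swept allele is at 50% frequency, and the alternate definition gives 0.5 where the standard one gives 0.2, which is the extra sensitivity to incomplete sweeps the request describes.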

Unable to reproduce example file scores

Hello!

I am running selscan on Ubuntu 18 LTS. I am unable to get the same iHS values as the example.ihs.out file here. My log looks like this:

../src/./selscan --ihs --hap example.hap --map example.map --out test.example --threads 10
v1.3.0
Calculating iHS.
Input filename: example.hap
Map filename: example.map
Output file: test.example.ihs.out
Threads: 10
Scale parameter: 20000
Max gap parameter: 200000
EHH cutoff value: 0.05
Alt flag set: no
--skip-low-freq set. Removing all variants < 0.05.
Removed 2750 low frequency variants.
WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at Locus5
WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at Locus2
WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at Locus18
WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at Locus13
WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at Locus34
WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at Locus39
WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at Locus4

while the scores in both files look like this:

test.example.ihs.out:Locus188 108585 0.3 0.0263938 0.020108 0.118132
test.example.ihs.out:Locus1880 1168591 0.14 0.33279 0.0168633 1.29523
test.example.ihs.out:Locus1881 1169264 0.06 0.998161 0.0138525 1.85767
test.example.ihs.out:Locus1882 1170860 0.12 0.327554 0.0168659 1.28827
test.example.ihs.out:Locus1883 1172545 0.24 0.146858 0.0156062 0.9736
test.example.ihs.out:Locus1884 1173195 0.06 1.12876 0.0113121 1.99906
test.example.ihs.out:Locus1885 1173749 0.12 0.327569 0.0127553 1.40961
test.example.ihs.out:Locus1887 1174701 0.34 0.0489453 0.0206153 0.375522
test.example.ihs.out:Locus1888 1174777 0.34 0.0489453 0.0206153 0.375522
test.example.ihs.out:Locus1889 1175022 0.06 0.562215 0.00961574 1.76692

example.ihs.out:Locus188 108585 0.3 0.0173648 0.0150461 0.143324
example.ihs.out:Locus1880 1168591 0.14 0.157038 0.0110095 2.65773
example.ihs.out:Locus1881 1169264 0.06 0.573633 0.0092492 4.12745
example.ihs.out:Locus1882 1170860 0.12 0.20044 0.010789 2.92199
example.ihs.out:Locus1883 1172545 0.24 0.0475636 0.0105927 1.5019
example.ihs.out:Locus1884 1173195 0.06 0.498914 0.00738917 4.21242
example.ihs.out:Locus1885 1173749 0.12 0.200548 0.00833627 3.18044
example.ihs.out:Locus1887 1174701 0.34 0.029491 0.00829141 1.26886
example.ihs.out:Locus1888 1174777 0.34 0.029491 0.00829141 1.26886
example.ihs.out:Locus1889 1175022 0.06 0.123439 0.00563158 3.08735

The same applies to other scores like XP-EHH, though I'm primarily interested in iHS.

The hap and map files I have downloaded have the same number of lines as here.

How can I fix this?
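
One thing the rows pasted above do suggest is that the last column is computed with different logarithm bases in the two files. The sketch below checks a few of the quoted rows; `matching_log_base` is a hypothetical helper. This is only an observation from the quoted numbers and does not explain why the ihh1 and ihh0 columns themselves differ.

```python
import math


def matching_log_base(ihh1, ihh0, reported, tol=1e-3):
    """Say whether `reported` matches log10(ihh1/ihh0), ln(ihh1/ihh0),
    or neither, within an absolute tolerance."""
    ratio = ihh1 / ihh0
    if abs(math.log10(ratio) - reported) < tol:
        return "log10"
    if abs(math.log(ratio) - reported) < tol:
        return "ln"
    return "neither"


# (ihh1, ihh0, last column) copied from the rows quoted above.
new_rows = [(0.0263938, 0.020108, 0.118132), (0.33279, 0.0168633, 1.29523)]
example_rows = [(0.0173648, 0.0150461, 0.143324), (0.157038, 0.0110095, 2.65773)]

results = [matching_log_base(*row) for row in new_rows + example_rows]
print(results)
```

If the base really did change between versions, unstandardized scores would differ by a constant factor, though standardized scores from norm should be unaffected by such a rescaling.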

Issue with --trunc-ok

Dear Zachary,

Sorry to bother you again, but I am really having problems running selscan on my data.

These are my command options:

./selscan --ihs --vcf gHaplotypeCaller_bialllelicfinalSNPs.Autosome.chr1.phased_biallelic_ch1_AF_0.05_snp_195275175.vcf.recode.vcf --map gHaplotypeCaller_bialllelicfinalSNPs.Autosome.chr1.phased_biallelic_AF_0.05.map --max-gap 230000 --cutoff 0.9 --trunc-ok --out test

But no success yet; this is what I see in the log file. These SNPs are at both ends of chromosome 1, and it seems that --trunc-ok is not taking effect: the SNPs are still excluded from the calculation.
Please see the following example of the output:

WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P402
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P436
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P442
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P520
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P603
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P696
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P195275687
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P195275689
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P195275713
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P195275864
WARNING: Reached chromosome edge before EHH decayed below 0.9. Skipping calculation at C1P195275865

I would appreciate your advice on this matter.

I am looking forward to hearing from you.

Thanks

Compilation failure

At some point I introduced a compilation error into the master branch.

make
g++ -O3 -m64 -mmmx -msse -msse2 -c selscan-main.cpp -I../include
selscan-main.cpp: In function ‘int main(int, char**)’:
selscan-main.cpp:253: error: cannot convert ‘std::vector<int, std::allocator<int> >’ to ‘int’ in initialization
make: *** [selscan-main.o] Error 1

about the ehh output file

According to the documentation, the EHH output file includes 'physicalPos', 'geneticPos', '1ehh' and '0ehh', but my output has 5 columns. I want to know what the last column means. My output file looks like this:
-7622 -0.007622 0.790850 0.346774 0.239184
-7398 -0.007398 0.790850 0.356855 0.243265
-1961 -0.001961 0.790850 0.502016 0.302041
-1143 -0.001143 1.000000 0.502016 0.328163
-549 -0.000549 1.000000 0.514113 0.333061
-153 -0.000153 1.000000 0.824597 0.458776
0 0.000000 1.000000 1.000000 0.124898
667 0.000667 1.000000 0.612903 0.373061
890 0.000890 1.000000 0.580645 0.360000

EHH output clarification

Hi there,

Thanks so much for this software - I've found it extremely useful in my research.

I have a clarifying question about the EHH output. I'm currently using selscan v1.2.0a, and I have checked that the documentation for the newest version describes the EHH output the same way. While the documentation describes 4 columns of output for EHH, my output file has 5. For example:

-214    -0.000089       0.404762        0.650116        0.518347
-154    -0.000061       0.404762        0.756315        0.602225
-140    -0.000057       0.404762        0.834955        0.664335
-133    -0.000055       0.404762        0.860409        0.684440
-117    -0.000052       0.404762        0.905633        0.720158
-75     -0.000042       0.753968        0.932153        0.745308
-55     -0.000036       0.753968        0.959059        0.766560
-47     -0.000032       0.753968        1.000000        0.798895
0       0.000000        1.000000        1.000000        0.012040
59      0.000089        1.000000        0.965762        0.774816
61      0.000090        0.944444        0.688323        0.555020
148     0.000104        0.944444        0.660617        0.533138
250     0.000179        0.468254        0.512098        0.410102

I assume the first column is physical distance and the second is genetic distance. But then there are three remaining columns, only two of which can be the EHH for allele 1 and allele 0.

Can you please reiterate what these columns represent?
Thanks,
Julia

Request: option to allow missing data

In the release 1.0.1, you removed the support for missing data. I would like to know if it could be possible to provide an option when you have a lot of missing data, because in some datasets it is not possible to have any SNP without at least one haplotype with missing data.

Thank you very much.

P-value calculation for xpehh

Dear Sir,
I have run selscan --xpehh on my data and obtained XP-EHH values from the respective program. I am interested in p-values. Is there any way to calculate them?
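
A minimal sketch of one common approximation, not a selscan feature: if the normalized XP-EHH scores (after running norm) are assumed to be roughly standard normal under neutrality, each score can be converted to a two-sided normal tail probability. The normality assumption and the function name two_sided_p are both assumptions made for illustration.

```python
import math

def two_sided_p(z):
    """Two-sided tail probability of a standard normal at score z."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1).
    return math.erfc(abs(z) / math.sqrt(2))

for z in (0.0, 1.96, 3.0):
    print(z, two_sided_p(z))
```

Values obtained this way are only as good as the normality assumption, so an empirical rank-based approach may be preferable when the score distribution has heavy tails.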

Column Header info

Hello, I just need information about the column header produced by norm --ihs. From parts of the manual and different questions, I can tell that it's

WINSTART WINSTOP. N.SNPS.WINDOW probably.fraction.gt.threshold. PERCENTILE ????

What is the last column that ranges in values from 1-4? Is the fourth column interpreted correctly as being the fraction greater than critical threshold? Apologies ahead of time if this is written elsewhere.

COMMAND RUN:
norm --ihs --bp-win --files ./chr1.ihs.out ./chr2.ihs.out ./chr3.ihs.out ./chr4.ihs.out ./chr5.ihs.out ./chr6.ihs.out ./chr7.ihs.out ./chr8.ihs.out ./chr9.ihs.out ./chr10.ihs.out ./chr11.ihs.out ./chr12.ihs.out ./chr13.ihs.out ./chr14.ihs.out ./chr15.ihs.out ./chr16.ihs.out ./chr17.ihs.out ./chr18.ihs.out ./chr19.ihs.out ./chr20.ihs.out ./chr21.ihs.out ./chr22.ihs.out

OUTPUT
[ rsltSelscan]$ head chr2.ihs.out.100bins.norm.100kb.windows
1 100001 15 0 100 1
100001 200001 302 0.013245 100 2
200001 300001 180 0.00555556 100 2

Normalize XPEHH with overlap window

Hi,
I have calculated the unstandardized XP-EHH for each site. I want to summarize the normalized XP-EHH values in overlapping windows, but norm only supports non-overlapping windows. How can I do this with overlapping windows?

best
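
A minimal sketch of one way to do this outside of norm, assuming the per-site normalized scores have already been loaded as (position, score) pairs sorted by position. The function name, the window size and step, and the |score| > 2 threshold are all hypothetical choices for illustration, not norm's own behavior.

```python
def sliding_windows(sites, size, step, crit=2.0):
    """Yield (start, end, n_sites, frac_extreme) for overlapping windows.

    sites: list of (position, normalized_score), sorted by position.
    Windows are half-open: [start, start + size).
    """
    if not sites:
        return
    last_pos = sites[-1][0]
    start = 0
    while start <= last_pos:
        end = start + size
        scores = [s for pos, s in sites if start <= pos < end]
        if scores:  # skip windows that contain no sites
            n_extreme = sum(1 for s in scores if abs(s) > crit)
            yield start, end, len(scores), n_extreme / len(scores)
        start += step

# Toy data: windows of 2000 bp advancing by 1000 bp overlap by half.
sites = [(100, 0.5), (1500, 2.4), (2500, -2.1), (4200, 0.1)]
for window in sliding_windows(sites, size=2000, step=1000):
    print(window)
```

The inner scan over all sites keeps the sketch short; for genome-scale data a two-pointer sweep or a sorted-array search would avoid rescanning every site for every window.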

no standardized IHS output?

Hi again Zack,

Sorry to bug you again but I've gone through the manual and the paper and am still looking for clarification.

The manual says the format of the iHS output is <locusID> <physicalPos> <'1' freq> <ihh1> <ihh0> <unstandardized iHS>. From the paper and manual, I would have expected a standardized iHS column as well. Is this not explicitly output by selscan?

Thanks!
Julia
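
For context, standardization is usually a separate step in which scores are grouped into allele-frequency bins and each score is centered and scaled by its bin's mean and standard deviation. The sketch below illustrates that general idea only; it is not norm's exact procedure, and the function name, the equal-width binning and the bin count are assumptions.

```python
import statistics
from collections import defaultdict

def standardize_by_freq_bin(rows, n_bins=4):
    """Return z-scores of iHS within equal-width allele-frequency bins.

    rows: list of (freq, unstandardized_ihs), with 0 <= freq <= 1.
    """
    def bin_index(freq):
        # freq == 1.0 is placed in the last bin.
        return min(int(freq * n_bins), n_bins - 1)

    bins = defaultdict(list)
    for freq, score in rows:
        bins[bin_index(freq)].append(score)

    # Per-bin mean and population standard deviation.
    stats = {
        idx: (statistics.mean(scores), statistics.pstdev(scores))
        for idx, scores in bins.items()
    }

    out = []
    for freq, score in rows:
        mean, sd = stats[bin_index(freq)]
        # A bin with a single value or no spread cannot be scaled.
        out.append((score - mean) / sd if sd > 0 else float("nan"))
    return out

rows = [(0.1, 1.0), (0.1, 3.0), (0.6, 5.0), (0.6, 9.0)]
print(standardize_by_freq_bin(rows))
```

In practice the bins should be estimated from genome-wide scores, which is why normalization is typically run across all chromosome files together.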
