odelaneau / shapeit5 Goto Github PK

View Code? Open in Web Editor NEW

55.0 4.0 8.0 1.24 GB

Segmented HAPlotype Estimation and Imputation Tool

Home Page: https://odelaneau.github.io/shapeit5/

License: MIT License

Makefile 0.84% C++ 63.89% C 0.15% Shell 23.60% Dockerfile 0.18% R 9.98% Python 1.36%

bioinformatics haplotype phasing sequencing

shapeit5's People

Stargazers

Watchers

Forkers

barneyhill 23andme-jaredo rwk-unil eyherabh hihg-um victorasua belowlab hihg-um

shapeit5's Issues

ERROR: Opening [info/target.haploid.txt]: file format not supported by HTSlib

Good morning,
I am trying to use your program using the data in the test folder. Following the commands present in phasing chromosome X data (link) and using the container quay.io/biocontainers/shapeit5:1.0.0--h0c8ee15_0 I get the following error:

SHAPEIT5_phase_common --input array/target.haploid.bcf --reference info/target.haploid.txt --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

ERROR: Opening [info/target.haploid.txt]: file format not supported by HTSlib

Can you help me figuring out which is the problem?
Thank you

SHAPEIT5_ligate Segmentation fault

Hi !
I'm developing Shapeit5 on nf-core now and I currently on a Segmentation fault when using SHAPEIT5_ligate.
The problem is that this error arise randomly and I don't quite understand why for the moment.
You can find all the code here.
But the command is simply as follow:

SHAPEIT5_ligate \                                                                                                                                                    
                \                                                                                                                                                               
               --input all_files.txt \                                                                                                                                          
               --thread 2 \                                                                                                                                                     
               --output input.vcf.gz

Where all_files.txt as just two files generated by SHAPEIT5_phase_common.

Do you have an idea which kind of error is behind ?

`[W::bcf_record_check] Bad BCF record` error when `phase_rare` indexes malformed output

phase_rare is failing while attempting to index the output, with errors like the following printed to stderr across all chunks:

[W::bcf_record_check] Bad BCF record at 28101822: Invalid CONTIG id -1

Upon trying to index the output bcf myself, it's clear the file is malformed, with the same error happening. Looking in the output, I have isolate the place in the file where it first errors (truncated from ~ 36k samples):

chr22   28101479        .       C       A       .       .       AC=1;AN=72722   GT:PP   0|0:.   0|0:.
[W::bcf_record_check] Bad BCF record at 28101822: Invalid CONTIG id -1
chr22   28101482        .       C       T       .       .       AC=4;AN=72722   GT:PP   0|0:.   0|0:.

This was obtained by running bcftools view [output_file].bcf | less -S, searching for 28101 and scrolling down until I saw the error.

Looking at the input file, there are no variants in between:

chr22	28101479	.	C	A	614.31	PASS	AC=1;AF=1.3751e-05;AN=72722;BaseQRankSum=-0.788;DP=1232241;ExcessHet=3.0103;FS=1.137;InbreedingCoeff=0.013;MLEAC=1;MLEAF=1.348e-05;MQ=60;MQRankSum=0;QD=13.65;ReadPosRankSum=0.206;SOR=0.904;VQSLOD=13.52;culprit=MQRankSum	GT	0/0	0/0
chr22	28101482	.	C	T	1831.32	PASS	AC=4;AF=5.5004e-05;AN=72722;BaseQRankSum=1.65;DP=1231891;ExcessHet=3.0115;FS=2.381;InbreedingCoeff=0.0061;MLEAC=4;MLEAF=5.392e-05;MQ=60;MQRankSum=0;QD=15.52;ReadPosRankSum=0.228;SOR=0.482;VQSLOD=13.29;culprit=MQRankSum	GT	0/0	0/0

This was obtained by running bcftools view -r chr22:28101479-28101482 [input_file].bcf > dagnostic.vcf

The full command for phase_rare was as follows:

SHAPEIT5_phase_rare   --input ADSP.WGS.SNPs.compact_filter.combined.chr22.filtered.bcf   --map resources/shapeit/maps/chr22.b38.gmap.gz   --output ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.bcf   --thread 42   --log ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.log   --scaffold ADSP.WGS.SNPs.compact_filter.combined.chr22.phased_common.bcf   --scaffold-region chr22:39664334-46825259   --input-region chr22:42325256-46325257

I got the following log (the HTSlib error was printed to stderr):


[SHAPEIT5] Phase_rare (phase rare variants onto a haplotype scaffold)
  * Authors       : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected] & [email protected]
  * Version       : 5.1.1 / commit = 5.1.1 / release = 2023-05-16
  * Run date      : 31/05/2023 - 16:24:16

Files:
  * Unphased data : [ADSP.WGS.SNPs.compact_filter.combined.chr22.filtered.bcf]
  * Scaffold data : [ADSP.WGS.SNPs.compact_filter.combined.chr22.phased_common.bcf]
  * Genetic Map   : [resources/shapeit/maps/chr22.b38.gmap.gz]
  * Output data   : [ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.bcf]
  * Output LOG    : [ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.log]

Parameters:
  * Seed    : 15052011
  * Threads : 42 threads
  * PBWT    : [depth = 2,2 / modulo = 0.1 / mac = 2 / mdr = 0.1]
  * HMM     : [Ne = 15000 / Recombination rates given by genetic map]

Parsing specified genomic regions
  * Scaffold region  [chr22:39664334-46825259]
  * Input region  [chr22:42325256-46325257]

Reading genotype data
  * BCF scanning (10.49s)
      + Variant sites: [#scaffold=61709 | #rare=472804 | #all=534513]
      + 364951 rare variants in buffer regions excluded
      + 58173 variants found in both files [priority to scaffold]
  * HAP allocation [#scaffold=61709 / #samples=36361] (0.00s)
  * Genotype set allocation [#rare=472804 / #samples=36361] (0.04s)
  * Plain VCF/BCF parsing (111.70s)
      + Scaffold : [0/0=1953131181 0/1=194380741 1/1=96289027]
      + Rare     : [0/0=17157202585 0/1=1964700 1/1=947608 ./.=31511351]

Initializing sparse genotypes
  * 235020 missing genotypes imputed at monomorphic sites
  * Genotype set transpose V2I (0.50s)

Setting up genetic map
  * GMAP parsing [n=45141] (0.02s)
  * cM interpolation [s=7236 / i=527277] (0.04s)
  * Region length [7160576 bp / 11.9 cM]
  * HMM parameters [Ne=15000 / Error=0.0001]
  * V2H transpose (2.30s)

PBWT pass
  * PBWT initialization [#eval=61690 / #select=120] (0.00s)
  * IBD2 constraints [#inds=95 / #pairs=99] (27.08s)
  * PBWT forward selection (12.85s)
  * PBWT backward selection (14.08s)
      + #states=333.86+/-117.28
      + #collisions = 19466214 / #pushes = 15440439 / rate = 44.23%

HMM computations
  * Processing (3414.52s)
  * Genotype set transpose merge I2V [n=372364] (15.51s)
  * Solving summary:
      + #Hets=1964700 / Phased by HMM=1592336, PED=0, SING=372364
      + #Miss=31511351 / Imputed by HMM=31276331, PED=0, MONO=235020

Finalization:
  * BCF writing [Compressed / N=36361 / L=534513] (162.53s)
  * Indexing [ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.bcf]

ERROR: Fail to index file

Error when trying to export graph

I'm trying to get shapeit5 to export graphs for subsequent determination of recombination sites in trios using duohmm. However, I get the following error:

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 04/06/2023 - 16:32:08

Files:
  * Input         : [input.bcf.gz]
  * Genetic Map   : [PAR2.gmap.gz]
  * Output        : [output.graph]
  * Output format : [graph]

Parameters:
  * Seed    : 15052011
  * Threads : 64 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 5cM / depth = 8 / modulo = 0.1 / mac = 5 / missing = 0.1]
  * HMM     : [window = 5cM / Ne = 15000 / Recombination rates given by genetic map]
  * FILTERS : [snp only = 0 / MAF = 0.001]

Reading genotype data:
  * VCF/BCF scanning done (0.02s)
      + Variants [#sites=3018 / region=chrX:153925834-154259566]
         - 5290 sites removed in main panel [below MAF threshold]
      + Samples [#target=3202 / #reference=0]
  * VCF/BCF parsing done (0.09s)
      + Genotypes [0/0=86.805%, 0/1=8.267%, 1/1=4.854%, ./.=0.075%, 0|1=0.000%]

Setting up genetic map:
  * GMAP parsing [n=179] (0.02s)
  * cM interpolation [s=178 / i=2840] (0.00s)
  * Region length [319083 bp / 0.4 cM]
  * HMM parameters [Ne=15000 / Error=0.0001 / #rare=0]

....

Finalization:
  * HAP solving (0.04s)
  * HAP update (0.02s)
  * H2V transpose (0.01s)
terminate called after throwing an instance of 'boost::wrapexcept<boost::bad_any_cast>'
  what():  boost::bad_any_cast: failed conversion using boost::any_cast
Aborted (core dumped)

I am using the PAR2 region as a short test region, but the error applies to any chromosomal region I test. Can this bug be fixed?

Segmentation faultng... | phase_common

for i in {22..22};
do
shapeit5 phase_common \
          --input ${WD}/0601/ABCD.chr${i}.vcf.gz \
          --map ${RF}/SHAPEIT5MAPS/chr${i}.b37.gmap.gz \
          --region ${i} \
          --reference ${RF}/1000Gp3hg37/chr${i}.vcf.gz \
          --thread 8 \
          --mcmc-iterations 10b,1p,1b,1p,1b,1p,1b,1p,10m \
          --pbwt-depth 8 \
          --log ${WD}/0602/ABCD.chr${i}.phased.log \
          --output ${WD}/0602/ABCD.chr${i}.phased.xcf \
          --output-format "bh"
done

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.0 / commit = 744384b / release = 2023-03-31
  * Run date      : 22/04/2023 - 02:39:53

Files:
  * Input         : [/home/fivankov/DRIVE/ABCD/dADHD/0601/ABCD.chr22.vcf.gz]
  * Reference     : [/home/fivankov/DRIVE/ReferenceFiles/1000Gp3hg37/chr22.vcf.gz]
  * Genetic Map   : [/home/fivankov/DRIVE/ReferenceFiles/SHAPEIT5MAPS/chr22.b37.gmap.gz]
  * Output        : [/home/fivankov/DRIVE/ABCD/dADHD/0602/ABCD.chr22.phased.xcf]
  * Output format : [bh]
  * Output LOG    : [/home/fivankov/DRIVE/ABCD/dADHD/0602/ABCD.chr22.phased.log]

Parameters:
  * Seed    : 15052011
  * Threads : 8 threads
  * MCMC    : 27 iterations [10b + 1p + 1b + 1p + 1b + 1p + 1b + 1p + 10m]
  * PBWT    : [window = 4cM / depth = 8 / modulo = 0.1 / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]

Reading genotype data:
Segmentation faultng ...

issue with makefile for phase_common

make is able to run all commands up to building the final executable, but for some reason excludes the -lcurl and -lcrypto flags, even though both are listed in the DYN_LIBS variable. I'm not sure why this is happening, but I am able to successfully compile the final executable simply by adding these two flags to the command that make attempts to use.

What make tries to run:

g++ -std=c++17 -O3 obj/bitmatrix.o obj/bitvector.o obj/conditioning_set_managment.o obj/conditioning_set_selection.o obj/conditioning_set_solve.o obj/genotype_set.o obj/haplotype_set.o obj/ibd2_tracks.o obj/variant_map.o obj/window_set.o obj/genotype_reader_managment.o obj/genotype_reader_reading.o obj/genotype_reader_scaning.o obj/gmap_reader.o obj/graph_writer.o obj/haploid_reader.o obj/haplotype_writer.o obj/pedigree_reader.o obj/main.o obj/haplotype_segment_double.o obj/haplotype_segment_single.o obj/genotype_builder.o obj/compute_job.o obj/genotype_build.o obj/genotype_managment.o obj/genotype_mendel.o obj/genotype_prune.o obj/genotype_sweep.o obj/hmm_parameters.o obj/variant.o obj/phaser_algorithm.o obj/phaser_finalise.o obj/phaser_initialise.o obj/phaser_management.o obj/phaser_parameters.o /net/snowwhite/home/beckandy/Tools/htslib/lib/libhts.a /usr/lib/x86_64-linux-gnu/libboost_iostreams.a /usr/lib/x86_64-linux-gnu/libboost_program_options.a -o bin/phase_common -lz -lpthread -lbz2 -llzma

What works:

g++ -std=c++17 -O3 obj/bitmatrix.o obj/bitvector.o obj/conditioning_set_managment.o obj/conditioning_set_selection.o obj/conditioning_set_solve.o obj/genotype_set.o obj/haplotype_set.o obj/ibd2_tracks.o obj/variant_map.o obj/window_set.o obj/genotype_reader_managment.o obj/genotype_reader_reading.o obj/genotype_reader_scaning.o obj/gmap_reader.o obj/graph_writer.o obj/haploid_reader.o obj/haplotype_writer.o obj/pedigree_reader.o obj/main.o obj/haplotype_segment_double.o obj/haplotype_segment_single.o obj/genotype_builder.o obj/compute_job.o obj/genotype_build.o obj/genotype_managment.o obj/genotype_mendel.o obj/genotype_prune.o obj/genotype_sweep.o obj/hmm_parameters.o obj/variant.o obj/phaser_algorithm.o obj/phaser_finalise.o obj/phaser_initialise.o obj/phaser_management.o obj/phaser_parameters.o /net/snowwhite/home/beckandy/Tools/htslib/lib/libhts.a /usr/lib/x86_64-linux-gnu/libboost_iostreams.a /usr/lib/x86_64-linux-gnu/libboost_program_options.a -o bin/phase_common -lz -lpthread -lbz2 -llzma -lcurl -lcrypto

Rare variants with reference scaffold

Hi there,

I just want to clarify something about the algorithm. When not using a reference panel, does SHAPEIT5 only draw haplotypes from samples that are shared between the scaffold and the main dataset? Or can it draw haplotypes from samples that are present in the scaffold but not in the main dataset?

In other words, if was trying to phase a vcf with 100 variant calls in samples [A,B,C] onto a scaffold of 50 variant calls from samples [A,B,C,D,E,F,G,H,I], would SHAPEIT5 use the scaffold information from [A,B,C,D,E,F,G,H,I] to determine the likely haplotypes? Or would it only use information from the records [A,B,C] in the scaffold, and discard records [D,E,F,G,H,I]?

Specifically, I'm wondering about this portion of the SHAPEIT5 paper:

For a specific rare variant, these conditioning haplotypes are chosen so that (1) they belong to samples being locally identical-by-descent (IBD) with the target sample and (2) they are polymorphic at the rare variant (that is, at least a few carry a copy of the minor allele). To comply with the first requirement, SHAPEIT5 uses a positional Burrows-Wheeler transform (PBWT) data structure[23](https://www.nature.com/articles/s41588-023-01415-w#ref-CR23) built on all the scaffold haplotypes at common variants.

Do these scaffold haplotypes need to be present in the input vcf to be considered for this process?

This seems particularly important if trying to phase small datasets with singleton alleles, since the singleton algorithm is based on "leverag[ing] IBD sharing patterns between haplotypes...", and learning IBD sharing patterns from out-of-sample haplotypes seems like it could help with that process.

Thanks!
-Joe

random_shuffle() not available in C++17

Hello again,

I ran into this build error when building phase rare tool.

g++ -std=c++17 -O3 -mavx2 -mfma -D__COMMIT_ID__=\"914aaf4\" -D__COMMIT_DATE__=\"2022-12-16\" -c src/containers/conditioning_set/conditioning_set_solve.cpp -o obj/conditioning_set_solve.o -Isrc -I/install/usr/local/include/htslib -I/install/usr/local/include
src/containers/conditioning_set/conditioning_set_solve.cpp:37:2: error: use of undeclared identifier 'random_shuffle'
        random_shuffle(A.begin(), A.end());
        ^
1 error generated.
make: *** [obj/conditioning_set_solve.o] Error 1

shapeit5/phase_rare/src/containers/conditioning_set/conditioning_set_solve.cpp

Line 37 in 6647bc3

random_shuffle(A.begin(), A.end());

and similar error with src/containers/conditioning_set/conditioning_set_selection.cpp.

shapeit5/phase_rare/src/containers/conditioning_set/conditioning_set_selection.cpp

Line 38 in 6647bc3

random_shuffle(A.begin(), A.end());

It seems the Makefile used C++17 compiler option

shapeit5/phase_rare/makefile

Line 2 in acda0f5

CXX=g++ -std=c++17

while random_shuffle() is removed in C++17 according to cppreference

I got around with the following patch

diff --git a/phase_rare/src/containers/conditioning_set/conditioning_set_selection.cpp b/phase_rare/src/containers/conditioning_set/conditioning_set_selection.cpp
index d92a97b..2adf58c 100644
--- a/phase_rare/src/containers/conditioning_set/conditioning_set_selection.cpp
+++ b/phase_rare/src/containers/conditioning_set/conditioning_set_selection.cpp
@@ -35,7 +35,10 @@ void conditioning_set::select(variant_map & V, genotype_set & G) {
        vector < int > R = vector < int > (n_haplotypes, 0);
        vector < int > M = vector < int > (depth_common * n_haplotypes, -1);
        iota(A.begin(), A.end(), 0);
-       random_shuffle(A.begin(), A.end());
+
+    std::random_device rd;
+    std::mt19937 g(rd());
+    std::shuffle(A.begin(), A.end(), g);

        //Select new sites at which to trigger storage
        vector < vector < int > > candidates = vector < vector < int > > (sites_pbwt_grouping.back() + 1);
diff --git a/phase_rare/src/containers/conditioning_set/conditioning_set_solve.cpp b/phase_rare/src/containers/conditioning_set/conditioning_set_solve.cpp
index 0217232..01684fc 100644
--- a/phase_rare/src/containers/conditioning_set/conditioning_set_solve.cpp
+++ b/phase_rare/src/containers/conditioning_set/conditioning_set_solve.cpp
@@ -34,7 +34,10 @@ void conditioning_set::solve(variant_map & V, genotype_set & G) {
        vector < int > D = vector < int > (n_haplotypes, 0);
        vector < int > R = vector < int > (n_haplotypes, 0);
        iota(A.begin(), A.end(), 0);
-       random_shuffle(A.begin(), A.end());
+
+    std::random_device rd;
+    std::mt19937 g(rd());
+    std::shuffle(A.begin(), A.end(), g);

        //Get cM positions of the scaffold sites
        vector < float > vs_cm = vector < float > (n_scaffold_variants, 0.0);

The git version is v1.0.0-beta-20-gacda0f5.

Please suggest if there are alternative ways to fix this. Thanks!

Best,
Zhuoyi

can pedigree information be fed into SHAPEIT5?

UKB WGS: possible to phase per independent block instead of per chromosome?

Hi, thanks for the great tool! On UKB WGS data, since concatenating all chunks per chromosome will produce an extremely large bcf, I wonder whether it is feasible to segment the hg38 into 1381 independent block by LDetect (by this paper) and do the phasing per block? The workflow will look like

for block in BLOCK: do
QC and concatenate all chunks in block;
phase_common in block;
####skip ligation step
phase_rare in block, no smaller chunk;
done

With the saaumption that 1)SHAPEIT5 supports multi-threading when analyzing one large chunk; 2) haplotype scaffolds of an independent block would be accurate, I think this workflow will perform similarly to the tutorial. Do you think this workflow make sense? Thanks for your help!

Question about reproducibility

Hi,

Thanks for developing such an excellent tool.

I noticed that if I want to make the phase_common results reproducible, I have to use a single thread (and keep the seed, although the default seed is fixed).
This was described in the SHAPEIT4's document as follows:

Reproducibility
Making reproducible runs can sometimes be useful. To do so, you need to specify the random generator seed using --seed and to use a single thread. Using multi-threading prevents reproducibility.

Regarding ligate and phase_rare, should I use a single thread to make it reproducible?

Although I want to use multiple threads to speed up, I am also concerned about reproducibility and its potential effects on my downstream analysis.

Best,

phase common failed to build

Hello,

I noticed in a recent change in phase common module, it seems the method haplotype_writer:: writeHaplotypes() has inconsistent signature between the declaration and definition.

shapeit5/phase_common/src/io/haplotype_writer.h

Line 46 in 5680ff4

void writeHaplotypes(std::string foutput);

shapeit5/phase_common/src/io/haplotype_writer.cpp

Line 39 in 5680ff4

void haplotype_writer::writeHaplotypes(string fname, double filter_maf) {

This made it failed to build phase common tool.

$ make
g++ -std=c++17 -O3 -mavx2 -mfma -D__COMMIT_ID__=\"acda0f5\" -D__COMMIT_DATE__=\"2022-12-09\" -c src/io/haplotype_writer.cpp -o obj/haplotype_writer.o -Isrc -I/usr/local/include/htslib -I/usr/local/include
src/io/haplotype_writer.cpp:39:24: error: out-of-line definition of 'writeHaplotypes' does not match any declaration in 'haplotype_writer'
void haplotype_writer::writeHaplotypes(string fname, double filter_maf) {
                       ^~~~~~~~~~~~~~~
1 error generated.
make: *** [obj/haplotype_writer.o] Error 1

The git version is v1.0.0-beta-20-gacda0f5.

Maybe I missed something. Please let me know if there is a workaround for this. Thanks!

Best,
Zhuoyi

Segmentation fault

Hello,

I am using the static binaries to run phase_common on a ~1 million bp locus on chromosome 9. I am using a cloud environment with 16 CPUs and 104 GB RAM.

When I run phase_common (./phase_common_static --input mt_chr9_serology_subset_locus_srwgs_v7.bcf --filter-maf 0.001 --region chr9 --map chr9.b38.gmap.gz --output tmp/target.scaffold.bcf --thread 16), I get the following output and error:

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 02/08/2023 - 20:06:51

Files:
  * Input         : [mt_chr9_serology_subset_locus_srwgs_v7.bcf]
  * Genetic Map   : [chr9.b38.gmap.gz]
  * Output        : [tmp/target.scaffold.bcf]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 16 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]
  * FILTERS : [snp only = 0 / MAF = 0.001]

Reading genotype data:
  * VCF/BCF scanning done (3.00s)
      + Variants [#sites=17931 / region=chr9]
         - 25623 sites removed in main panel [below MAF threshold]
      + Samples [#target=18929 / #reference=0]
  * VCF/BCF parsing done (13.97s)
      + Genotypes [0/0=63.313%, 0/1=7.923%, 1/1=3.768%, ./.=24.996%, 0|1=0.000%]

Setting up genetic map:
  * GMAP parsing [n=0] (0.00s)
bash: line 1:  2076 Segmentation fault      (core dumped) ./phase_common_static --input mt_chr9_serology_subset_locus_srwgs_v7.bcf --filter-maf 0.001 --region chr9 --map chr9.b38.gmap.gz --output tmp/target.scaffold.bcf --thread 16
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
/tmp/ipykernel_329/2112237477.py in <module>
----> 1 get_ipython().run_cell_magic('bash', '', './phase_common_static --input mt_chr9_serology_subset_locus_srwgs_v7.bcf --filter-maf 0.001 --region chr9 --map chr9.b38.gmap.gz --output tmp/target.scaffold.bcf --thread 16\n')

/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2460             with self.builtin_trap:
   2461                 args = (magic_arg_s, cell)
-> 2462                 result = fn(*args, **kwargs)
   2463             return result
   2464 

/opt/conda/lib/python3.7/site-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
    140             else:
    141                 line = script
--> 142             return self.shebang(line, cell)
    143 
    144         # write a basic docstring:

<decorator-gen-103> in shebang(self, line, cell)

/opt/conda/lib/python3.7/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

/opt/conda/lib/python3.7/site-packages/IPython/core/magics/script.py in shebang(self, line, cell)
    243             sys.stderr.flush()
    244         if args.raise_error and p.returncode!=0:
--> 245             raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
    246 
    247     def _run_script(self, p, cell, to_close):

CalledProcessError: Command 'b'./phase_common_static --input mt_chr9_serology_subset_locus_srwgs_v7.bcf --filter-maf 0.001 --region chr9 --map chr9.b38.gmap.gz --output tmp/target.scaffold.bcf --thread 16\n'' returned non-zero exit status 139.

Any suggestions on how to resolve this issue would be greatly appreciated.

Error phasing(phase_common):AC field is needed in file

Hi,

I'm using shapeit5 to phase HGDP chr14 VCF file and I get the following error:

Reading genotype data:
  * VCF/BCF scanning ...
ESC[31mERROR: ESC[0mAC field is needed in file

Not entirely sure, where the AC field is missing. Any help would be appreciated.

How to simulate SNP data

Hi,

I would like to know how you simulated your data ?
Do you have a tool or a script for doing that ?

Thanks a lot !

Getting the posterior probability distribution of haplotype estimations?

Hello! Thanks for developing such a great tool!

I was wondering if it would be possible to get the posterior probability distribution of each SNP's haplotype estimation. I am trying to calculate a haplotype frequency based on a set of phased SNPs. I read that phasing is done by sampling a posterior distribution P(D|H) based on an LSM, and I think it'd be beneficial to incorporate the probability of phasing into my calculation.

I understand the probability estimation is an intermediate step that you might not have outputted, but I was wondering if you had any insights on how I might be able to access this and / or if my thought makes sense to you.

Thank you so much for your time and efforts!

Best,
Emilia

Needs of AVX2

Hi,

As I'm working on the integration of SHAPEIT5 in nf-core, I was wondering which modules of SHAPEIT5 needs AVX2 to work.
Unfortunately their is no fallback possible in your code to run SHAPEIT5 and GLIMPSE2 on host without the AVX2 feature available.

So I added some test in the nf-core process to check if AVX2 is available or not, before running.

Could you tell me which functions needs AVX2 ?
phase_rare and phase_common seems to need it.

Thanks

Compiled static binaries of update?

Hi there @odelaneau, I'd like to rerun the 1000G phasing T2T with the new update. I'm having trouble compiling it on my computer. Could you provide static binaries, as you did for 1.0? Those worked very well for me.

Thanks,
Joe Lalli

Reference files

Hi,

I downloaded the reference dataset from 1000G [GRCh37]:

https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html

However, while running shapeit5 I got this error:
Opening 1000GP_Phase3_chr22.hap.gz: file format not supported by HTSlib.

Could you please help with this issue?
Thanks!
Anahit

Issue with the start

Hello. I am trying to mimic the process. However, there are a lot of steps to build from source code. If I choose the static version, should I just replace the command like ../phase_common/bin/SHAPEIT5_phase_common --input 10k/msprime.nodup.bcf --filter-maf 0.001 --output 10k/msprime.common.phased.bcf --region 1 --thread 8 bcftools index 10k/msprime.common.phased.bcf --threads 8 to ../phase_common_static --input 10k/msprime.nodup.bcf --filter-maf 0.001 --output 10k/msprime.common.phased.bcf --region 1 --thread 8 bcftools index 10k/msprime.common.phased.bcf --threads 8. Or what should I do make the file run?

using phase_common_static: loosing phase with bcftools convert

I'm having a problem with loss of phase after running bcftools 'convert --hapsample' on a bcf created via
phase_common_static --thread 12 --input chr19.19.filtered.vcf.gz --map map.txt.gz --output chr19.19.phased.bcf --region 19 --pbwt-depth 8 --pbwt-modulo 8 --log chr19.19.shapeit.log 2> chr19.19.shapeit.err

Unfortunately, the system won't let me upload the bcf file, but I think it can be recreated from any bop output from phase_common_static.

When I run this using the bcftools 1.15 (also tested on1.17-56-g402de74a)
bcftools convert --hapsample test6b test6.bcf
The phase is lost in the hap file

19 19:171762_T_C 171762 T C 0* 0* 0* 0*
19 19:209360_G_T 209360 G T 0* 0* 0* 0*
19 19:229783_G_A 229783 G A 1* 0* 1* 0*

When I convert that same bcf to vcf, the phase is preserved
test6b.vcf.txt

When I convert that vcf to bcf and convert to hapsample

    bcftools convert --output-type b -o test6b2.bcf test6.vcf
    bcftools convert --hapsample test6b2 test6b2.bcf

The phase is preserved in the hap file.

19 19:171762_T_C 171762 T C 0 0 0 0
19 19:209360_G_T 209360 G T 0 0 0 0
19 19:229783_G_A 229783 G A 1 0 1 0

When I try converting the original bcf to hap using bcftools 1.3.1 (based on this thread)
bcftools-1.3.1/bcftools convert --hapsample test6b131 test6.bcf
The phase is preserved in the hap file.

So, it seems the way phase_common_static creates phase in bcf is compatible with bcftools 1.3.1, but not the recent versions.(?) Or maybe it is a bcftools issue?

Assertion `GRvar_genotypes[vr][tidx].prob >= 0.0f' failed, when phasing rare alleles

Hello, @odelaneau ,

Thanks for the great work with shapeit5!!

I'm looking into implementing it in a nextflow pipeline (from modules made by @LouisLeNezet, thanks also to him!).
I have a problem when phasing rare variants with shapeit5 either 5.0 or 5.1.
Indeed the HMM computations are not done. Do you have any idea why this might be?

``
[SHAPEIT5] Phase_rare (phase rare variants onto a haplotype scaffold)

Authors : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
Contact : [email protected] & [email protected]
Version : 5.1.0 / commit = e4f5e58 / release = 2023-03-29
Run date : 04/04/2023 - 13:02:34

Files:

Unphased data : [split.chr18.vcf.gz]
Scaffold data : [chr18.ref.common.phased.vcf.gz]
Output data : [chr18_11000001_12000000.vcf.gz]

Parameters:

Seed : 15052011
Threads : 8 threads
PBWT : [depth = 2,2 / modulo = 0.1 / mac = 2 / mdr = 0.3]
HMM : [Ne = 15000 / Constant recombination rate of 1cM per Mb]

Parsing specified genomic regions

Scaffold region [chr18:10500001-12500000]
Input region [chr18:11000001-12000000]

Reading genotype data

BCF scanning ...
[W::hts_idx_load3] The index file is older than the data file: split.chr18.vcf.gz.tbi
[W::hts_idx_load3] The index file is older than the data file: chr18.ref.common.phased.vcf.gz.tbi
BCF scanning (35.15s)
- Variant sites: [#scaffold=37997 | #rare=13362 | #all=51359]
- 21696 rare variants in buffer regions excluded
- 13362 variants found in both files [priority to scaffold]
- 5265 multi-allelic variants excluded
HAP allocation [#scaffold=13362 / #samples=1842] (0.00s)
Genotype set allocation [#rare=37997 / #samples=1842] (0.00s)
BCF parsing ...
BCF parsing [100%]
Plain VCF/BCF parsing (36.04s)
- Scaffold : [0/0=23191904 0/1=821979 1/1=598921]
- Rare : [0/0=65436271 0/1=1925356 1/1=1326764 ./.=1302083]

Initializing sparse genotypes

0 missing genotypes imputed at monomorphic sites
Genotype set transpose V2I (0.05s)

Setting up genetic map

Region length [1499962 bp / 1.5 cM (assuming 1cM per Mb)]
HMM parameters [Ne=15000 / Error=0.0001]
V2H transpose (0.03s)

PBWT pass

PBWT initialization [#eval=10283 / #select=6] (0.00s)
IBD2 constraints [100%]
IBD2 constraints [#inds=0 / #pairs=0] (0.42s)
PBWT forward selection [100%]
PBWT forward selection (2.65s)
PBWT backward selection [100%]
PBWT backward selection (2.87s)
- #states=229.97+/-88.68
- #collisions = 17789 / #pushes = 70642 / rate = 79.88%

HMM computations

Processing [0%]
phase_rare_static: src/containers/genotype_set/genotype_set_phasing.cpp:63: void genotype_set::phaseLiAndStephens(unsigned int, unsigned int, aligned_vector32&, aligned_vector32&, std::vector&, float): Assertion `GRvar_genotypes[vr][tidx].prob >= 0.0f' failed.
``

How can I solve "makefile:135: recipe for target 'obj/haplotype_segment_double.o' failed"

Hello! I hope this message finds you well. I'm currently working on setting up Shapeit5 on a new server. The server is running Ubuntu 18.04 with gcc and g++ versions 7.5.0-3, htslib version 1.9, and BOOST version 1.81.

I'm encountering an error when I run 'make' in the phase_common directory. The error message is as follows:

[In file included from src/models/haplotype_segment_double.cpp:23:0:
src/models/haplotype_segment_double.h: In member function ‘bool haplotype_segment_double::TRANS_HAP()’:
src/models/haplotype_segment_double.h:403:10: error: ‘isnan’ was not declared in this scope
return (isnan(sumHProbs) || isinf(sumHProbs) || sumHProbs < std::numeric_limits::min());
^~~~~
src/models/haplotype_segment_double.h:403:10: note: suggested alternatives:
In file included from src/utils/otools.h:39:0,
from src/models/haplotype_segment_double.h:26,
from src/models/haplotype_segment_double.cpp:23:
/usr/include/c++/7/cmath:639:5: note: ‘std::isnan’
isnan(_Tp __x)
^~~~~
/usr/include/c++/7/cmath:618:3: note: ‘std::isnan’
isnan(float __x)
^~~~~
In file included from src/models/haplotype_segment_double.cpp:23:0:
src/models/haplotype_segment_double.h:403:30: error: ‘isinf’ was not declared in this scope
return (isnan(sumHProbs) || isinf(sumHProbs) || sumHProbs < std::numeric_limits::min());
^~~~~
src/models/haplotype_segment_double.h:403:30: note: suggested alternatives:
In file included from src/utils/otools.h:39:0,
from src/models/haplotype_segment_double.h:26,
from src/models/haplotype_segment_double.cpp:23:
/usr/include/c++/7/cmath:612:5: note: ‘std::isinf’
isinf(_Tp __x)
^~~~~
/usr/include/c++/7/cmath:591:3: note: ‘std::isinf’
isinf(float __x)
^~~~~
In file included from src/models/haplotype_segment_double.cpp:23:0:
src/models/haplotype_segment_double.h: In member function ‘bool haplotype_segment_double::TRANS_DIP_MULT()’:
src/models/haplotype_segment_double.h:421:10: error: ‘isnan’ was not declared in this scope
return (isnan(sumDProbs) || isinf(sumDProbs) || sumDProbs < std::numeric_limits::min());
^~~~~
src/models/haplotype_segment_double.h:421:10: note: suggested alternatives:
In file included from src/utils/otools.h:39:0,
from src/models/haplotype_segment_double.h:26,
from src/models/haplotype_segment_double.cpp:23:
/usr/include/c++/7/cmath:639:5: note: ‘std::isnan’
isnan(_Tp __x)
^~~~~
/usr/include/c++/7/cmath:618:3: note: ‘std::isnan’
isnan(float __x)
^~~~~
In file included from src/models/haplotype_segment_double.cpp:23:0:
src/models/haplotype_segment_double.h:421:30: error: ‘isinf’ was not declared in this scope
return (isnan(sumDProbs) || isinf(sumDProbs) || sumDProbs < std::numeric_limits::min());
^~~~~
src/models/haplotype_segment_double.h:421:30: note: suggested alternatives:
In file included from src/utils/otools.h:39:0,
from src/models/haplotype_segment_double.h:26,
from src/models/haplotype_segment_double.cpp:23:
/usr/include/c++/7/cmath:612:5: note: ‘std::isinf’
isinf(_Tp __x)
^~~~~
/usr/include/c++/7/cmath:591:3: note: ‘std::isinf’
isinf(float __x)
^~~~~
In file included from src/models/haplotype_segment_double.cpp:23:0:
src/models/haplotype_segment_double.h: In member function ‘bool haplotype_segment_double::TRANS_DIP_ADD()’:
src/models/haplotype_segment_double.h:439:10: error: ‘isnan’ was not declared in this scope
return (isnan(sumDProbs) || isinf(sumDProbs) || sumDProbs < std::numeric_limits::min());
^~~~~
src/models/haplotype_segment_double.h:439:10: note: suggested alternatives:
In file included from src/utils/otools.h:39:0,
from src/models/haplotype_segment_double.h:26,
from src/models/haplotype_segment_double.cpp:23:
/usr/include/c++/7/cmath:639:5: note: ‘std::isnan’
isnan(_Tp __x)
^~~~~
/usr/include/c++/7/cmath:618:3: note: ‘std::isnan’
isnan(float __x)
^~~~~
In file included from src/models/haplotype_segment_double.cpp:23:0:
src/models/haplotype_segment_double.h:439:30: error: ‘isinf’ was not declared in this scope
return (isnan(sumDProbs) || isinf(sumDProbs) || sumDProbs < std::numeric_limits::min());
^~~~~
src/models/haplotype_segment_double.h:439:30: note: suggested alternatives:
In file included from src/utils/otools.h:39:0,
from src/models/haplotype_segment_double.h:26,
from src/models/haplotype_segment_double.cpp:23:
/usr/include/c++/7/cmath:612:5: note: ‘std::isinf’
isinf(_Tp __x)
^~~~~
/usr/include/c++/7/cmath:591:3: note: ‘std::isinf’
isinf(float __x)
^~~~~
makefile:135: recipe for target 'obj/haplotype_segment_double.o' failed
make: *** [obj/haplotype_segment_double.o] Error 1 ]

I would greatly appreciate your assistance and guidance in resolving this issue. Thank you in advance for your help!

where to download liftovervcf_static?

Hi,
it seems that your release page is updated and the loftovervcf tool mentioned in the tutorial is not found in the release page. Could you update the release page for this tool? Also, sine I'm running on local machine instead of dnanexus, I wonder there would be executable rather than a docker image. Thanks for your help!

Installation error in make process

Hello, I used shapeit4 in last year, but I want to use shapeit5 also.
When I do shapeit5 installation process, fatal error popped up and cosuming my time...
I'm working with this version,

(base) jhk0709@R750xs:~/Public/230424/shapeit5/phase_rare$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.5 LTS
Release: 20.04
Codename: focal

and Centos also. (I will install shapeit5 in Centos later when this error solved and successfully installed.)

Make error occurred here..
/home/jhk0709/Public/230424/htslib/hfile_s3_write.c:859: undefined reference to curl_global_init' /usr/bin/ld: /home/jhk0709/Public/230424/htslib/hfile_s3_write.c:866: undefined reference to curl_share_init'
/usr/bin/ld: /home/jhk0709/Public/230424/htslib/hfile_s3_write.c:874: undefined reference to curl_share_setopt' /usr/bin/ld: /home/jhk0709/Public/230424/htslib/hfile_s3_write.c:875: undefined reference to curl_share_setopt'
/usr/bin/ld: /home/jhk0709/Public/230424/htslib/hfile_s3_write.c:876: undefined reference to curl_share_setopt' /usr/bin/ld: /home/jhk0709/Public/230424/htslib/hfile_s3_write.c:885: undefined reference to curl_version_info'
/usr/bin/ld: /home/jhk0709/Public/230424/htslib/hfile_s3_write.c:879: undefined reference to curl_share_cleanup' /usr/bin/ld: /home/jhk0709/Public/230424/htslib/hfile_s3_write.c:880: undefined reference to curl_global_cleanup'
/usr/bin/ld: /usr/local/lib/libhts.a(hfile_s3_write.o): in function s3_write_exit': /home/jhk0709/Public/230424/htslib/hfile_s3_write.c:831: undefined reference to curl_global_cleanup'
collect2: error: ld returned 1 exit status
make: *** [makefile:129: bin/phase_rare] Error 1

when I install with htslib, it doesn't process well so I check link all *c file - and same error message popped up!
(base) jhk0709@R750xs:~/Public/230424/htslib-1.17$ gcc hfile_s3_write.c -lcurl
hfile_s3_write.c:67:10: fatal error: config.h: No such file or directory
67 | #include <config.h>
| ^~~~~~~~~~
compilation terminated.

when I found the message for your installation guide, when the htslib successfully done with make process,
it has no libhts.a (so I make link with ln -s to /usr/local/lib/)
and CPATH to solve htslib and samtools.

What can I do for solve this make solution?

phase_common is removing all variants

Hello– I'm running phase_common and it is removing all my sites (says that they are multi-allelic) and then exits because there are no variants to phase. This is commit 5cc58e4 (4 Apr 2023). I also have phase_common from commit 1d357ac (2 Feb 2023) compiled and it successfully phases the same vcf. Any idea what's going on here?

Edit: I saw another post with the same problem where you had them post the output of bcftools view -G -H [vcf]. Here are the first few lines of that output (and I can attach the whole file if needed)

How are the chunking files made? Can I chunk by samples?

I need to chunk for some large VCF files, so I took a look at the chunking files at resources/chunks/b38/20cM/chunks_chr*.txt, but they don't make sense to me. It looks like column 2 is overlapping chunks and column 3 is non-overlapping, but the chunks don't appear to be 20 cM. They seem closer to 37 based on the map files provided.

Also, is there any reason not to chunk based on groups of samples if I have to do that type of chunking downstream?

ligater_initialise.cpp typo?

When looking at the log for ligate I noticed something that doesn't seem right. It appears that when using a pedigree, it prints the file name from the file listing the chunks to ligate rather than the pedigree file name. It looks like this is in ligater_initialise.cpp lines 43-59. Everything works fine, it just looks like it is printing the wrong file name.

cat TEST051.P.1.25.ligate.log

Files:
  * Input LIST     : [TEST051.P.1.25.common.chunks.txt]
  * Output VCF     : [test.bcf]
  * Index output   : [NO]
  * Output LOG    : [TEST051.P.1.25.ligate.log]

Parameters:
  * Seed           : [15052011]
  * #Threads       : [4]

Read filenames in [TEST051.P.1.25.common.chunks.txt]
  * #files = 2

Read filenames in [TEST051.P.1.25.common.chunks.txt] <-- should be file from --pedigree [FileName]
  * PED file parsing [n=630]

Whats the source of genetic map (GRCh38)?

Hi,
Thanks for your published a well prepared genetic map for GRCh38.
I am a bit confusing about this resources, as I know, there is a very old source of orginal hapmap genetic map (Its based on b35 and liftover to GRCh37) geneticMap-GRCh37, and 1kg project seems to be following this version (with some additional updates for some groups) 1000-genomes-genetic-maps. So there seem to be some different sources here. Could you please share some additional clarification about the source of the attached gentic map?

Moreover, if this file was gradually liftover from the original b35 version, can I further liftover this file if I wish to do some work based on a newer reference genome (such as CHM13)? Will this bring some negative effects?

Please feel free to correct me if I'm wrong.

Regrads
Zhang

phase_rare_static: Assertion `!isnan(GRvar_genotypes[vr][tidx].prob)' failed.

I ran into this when doing some testing and I haven't been able to figure out what the issue is so I figured I'd post. These are cows. Below is the beginning of the log for phase_common which runs fine.

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 18/05/2023 - 22:48:22

Files:
  * Input         : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT_INDIV/25/WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Genetic Map   : [/cluster/VAST/schnabeltest/GENETIC_MAP/9913_ars1.2/recombination_map.25.gmap]
  * Output        : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT/25/4.Common.WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Output format : [bcf]
  * Output LOG    : [TEST007.25.4.32.Common.log]

Parameters:
  * Seed    : 15052011
  * Threads : 32 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 1500 / Recombination rates given by genetic map]
  * FILTERS : [snp only = 0 / MAF = 0.002]

Reading genotype data:
  * VCF/BCF scanning done (39.92s)
      + Variants [#sites=198884 / region=25:19885090-25996793]
         - 62304 sites removed in main panel [multi-allelic]
         - 222099 sites removed in main panel [below MAF threshold]
      + Samples [#target=6320 / #reference=0]
  * VCF/BCF parsing done (31.16s)
      + Genotypes [0/0=88.448%, 0/1=5.162%, 1/1=5.277%, ./.=1.112%, 0|1=0.000%]

Setting up genetic map:
  * GMAP parsing [n=1165] (0.02s)
  * cM interpolation [s=138 / i=198746] (0.02s)
  * Region length [6111615 bp / 11.1 cM]
  * HMM parameters [Ne=1500 / Error=0.0001 / #rare=0]

Initializing data structures:
  * Impute monomorphic [n=0] (0.00s)
  * HAP update (5.75s)
  * H2V transpose (1.81s)
  * PBWT parameters auto setting : [modulo = 0.058 / depth = 5]
  * PBWT initialization [#eval=197995 / #select=192 / #chunk=1] (0.03s)
  * PBWT phasing sweep (37.69s)
  * Build genotype graphs [seg=21630267] (0.49s)

Below is the log from phase_rare

[SHAPEIT5] Phase_rare (phase rare variants onto a haplotype scaffold)
  * Authors       : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected] & [email protected]
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 18/05/2023 - 23:32:17

Files:
  * Unphased data : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT_INDIV/25/WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Scaffold data : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT/25/4.Common.WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Genetic Map   : [/cluster/VAST/schnabeltest/GENETIC_MAP/9913_ars1.2/recombination_map.25.gmap]
  * Output data   : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT/25/4.Common.WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Output LOG    : [TEST007.25.4.32.Rare.log]

Parameters:
  * Seed    : 15052011
  * Threads : 32 threads
  * PBWT    : [depth = 2,2 / modulo = 0.1 / mac = 2 / mdr = 0.1]
  * HMM     : [Ne = 1500 / Recombination rates given by genetic map]

Parsing specified genomic regions
  * Scaffold region  [25:19885090-25996793]
  * Input region  [25:20385092-25385099]

Reading genotype data
  * BCF scanning (17.22s)
      + Variant sites: [#scaffold=198884 | #rare=182603 | #all=381487]
      + 39496 rare variants in buffer regions excluded
      + 198884 variants found in both files [priority to scaffold]
      + 62304 multi-allelic variants excluded
  * HAP allocation [#scaffold=198884 / #samples=6320] (0.00s)
  * Genotype set allocation [#rare=182603 / #samples=6320] (0.02s)
  * Plain VCF/BCF parsing (35.08s)
      + Scaffold : [0/0=1122420439 0/1=65933761 1/1=68592680]
      + Rare     : [0/0=1143404645 0/1=793882 1/1=178000 ./.=9674433]

Initializing sparse genotypes
  * 174774 missing genotypes imputed at monomorphic sites
  * Genotype set transpose V2I (0.09s)

Setting up genetic map
  * GMAP parsing [n=1165] (0.22s)
  * cM interpolation [s=138 / i=381349] (0.02s)
  * Region length [6111615 bp / 11.1 cM]
  * HMM parameters [Ne=1500 / Error=0.0001]
  * V2H transpose (1.62s)

PBWT pass
  * PBWT initialization [#eval=198884 / #select=112] (0.01s)
  * IBD2 constraints [#inds=22 / #pairs=22] (15.05s)
  * PBWT forward selection (5.87s)
  * PBWT backward selection (6.06s)
      + #states=388.05+/-203.65
      + #collisions = 1604575 / #pushes = 4058727 / rate = 71.67%

HMM computations

The full error printed to screen but not the log is:

phase_rare_static: src/containers/genotype_set/genotype_set_phasing.cpp:66: void genotype_set::phaseLiAndStephens(unsigned int, unsigned int, aligned_vector32<float>&, aligned_vector32<float>&, std::vector<unsigned int>&, float): Assertion `!isnan(GRvar_genotypes[vr][tidx].prob)' failed.
Aborted (core dumped)

I'm in the process of normalizing these files to remove the multiallelic issue but any other suggestions for things to look into would be appreciated.
Bob

Silent imputation of missing variants

Hello,

As an experiment, I've tried take the vcf of the draft human pangenome, use bcftools +setGT to unphase the pangenome vcf, and then rephase the vcf using different reference panels. I am using shapeit5's switch tool to assess phasing accuracy. I am using the re-phased dataset as estimated vcf, and the original pangenome vcf as the verification vcf. The advantage of this method is the elimination of genotyping as a source of error, as the input vcf was generated from the verification vcf.

Switch, however, is detecting sporatic genotyping errors. These errors are occurring at sites with a ./1 genotype.

Variants with a combined indel/snp, like so:

Ref:
AATCGTCTGTC
Sample:
AA------GTC
AATTGTCTGTC

After using bcftools norm to split multiallelic sites, the pangenome VCF represents this region as:
chr1 2 ATTGTCT A 1|0
chr1 4 T C .|1

Shapeit seems to be interpreting the .|1 call as a missing allele, and imputing the genotype at the site as 0/0. I don't think that is functioning as intented.

What is the best way to handle sites like this? If I used the atomize option of bcftools norm, I could represent the deletion allele as "*". Would shapeit recognize this?

Thanks!

phase_common problem.

Hi there, I am studying shapeit5 and impute5 and I have a question.
I successfully performed phasing using shapeit4 on macOS and completed phasing using chunks for SNP arrays.

As I mentioned last time, I have been making efforts to use the shapeit5 version and have finally installed shapeit5 today by creating an environment with ubuntu:20.04 using Docker on macOS (after several twists and turns).
Then, I used PLINK to convert a bed file to a vcf file for each chromosome, and converted it to gz format using bgzip and indexed it using bcftools.
After that, I proceeded with phasing using options similar to those used in shapeit4, but I keep getting the following error:
"ERROR: No variants to be phased!"
Below is the phasing log file.
How can I solve this problem?

P.S. I am experiencing the same error even without using chunks.

Phase set information

I would like to phase and impute my dataset with shapeit5. I originally planned on using shapeit4, incorporating read-supported phase sets as calculated by whatshap4. (See #13 from https://odelaneau.github.io/shapeit4/#documentation). I have a diverse WGS dataset, and being able to phase singletons/rare variants would be beneficial.

Does Shapeit5 use phase sets, like Shapeit4? If so, is there anything I need to do to ensure it uses that information?

No rule to make target error

make: *** No rule to make target 'src/containers/bitvector.h', needed by 'obj/genotype_builder.o'. Stop.

test/test.sh :: phase_rare test missing input files

Hello,

I tried to run the test.sh test suite and phase_rare test fail due to lack of input file

index: failed to open 10k/msprime.rare.test1.bcf

regards

Eric

Avg phaseQ value in usage document

Hi Olivier,

I found a potential inconsistency between the Avg phaseQ value in the usage and the actual values observed in the logs.

The Documentation states, "We usually expect values above 0.8 or even 0.9." However, I noticed that my actual log values are much larger (over 90).

Upon examining the code, the phaseQ calculation appears to produce values in the range of 0 to 99.

I suspect the description of the phaseQ value range might be "above 80 or even 90" instead of "0.8 or even 0.9."
Is my interpretation correct?

I'm using v5.1.0.

Many thanks again.

Best,
Masaru

How to fix 'make" problem

Hi. I have used shapeit4, and have been using it well without any problems.
In order to use another data set this time, I access the homepage of shapeit4 and see the message to use shapeit5, so I am installing it.
I modified the path of HTSSRC, BOOST in each tool's makefile file, and replaced the makefile of all tools. Then, when executing the "make" command, the following error occurs.

"make: *** No rule to make target src/containers/bitvector.h', needed by obj/haplotype_writer.o'. Stop."
or
"make -C phase_common
make[1]: *** No rule to make target src/containers/bitvector.h', needed by obj/haplotype_writer.o'. Stop.
make: *** [phase_common] Error 2"

There was no problem using shapeit4, but shapeit5 has not been installed since.
I'm using the Macbook Pro Ventura version.
How can we solve this problem?

Installation using docker

hi，dear Delaneau.
Sorry to bother you, could you please instruct me about the installation method of this version of docker? After downloading and decompressing shapeit5_v5.1.1.docker.tar.gz, I don't know what to do next, I am a novice in this aspect, I hope to get your help!

phase_rare might be filling INFO/AC incorrectly

For example try:

../SHAPEIT5_phase_common --input 10k/msprime.nodup.bcf --filter-maf 0.001  --output 10k/msprime.common.phased.bcf --region  1:4000001-5000000 --thread 8 
bcftools index 10k/msprime.common.phased.bcf --threads 8

#chunk1
../SHAPEIT5_phase_rare --input-plain 10k/msprime.nodup.bcf --scaffold 10k/msprime.common.phased.bcf --output 10k/msprime.rare.chr1_1_1500000.bcf --scaffold-region 1:4000000-5000000 --input-region 1:4500000-4600000 --thread 8
bcftools index 10k/msprime.rare.chr1_1_1500000.bcf --threads 8

I am seeing:

[jaredo@download1 test]$ bcftools view -H -G 10k/msprime.rare.chr1_1_1500000.bcf  | head
1	4500006	.	0	1	.	.	AC=446
1	4500015	.	0	1	.	.	AC=2
1	4500042	.	0	1	.	.	AC=8
1	4500111	.	0	1	.	.	AC=1
1	4500157	.	0	1	.	.	AC=1
1	4500174	.	0	1	.	.	AC=1
1	4500191	.	0	1	.	.	AC=4
1	4500388	.	0	1	.	.	AC=1
1	4500394	.	0	1	.	.	AC=1
1	4500474	.	0	1	.	.	AC=1
[jaredo@download1 test]$ bcftools +fill-tags 10k/msprime.rare.chr1_1_1500000.bcf -- -t AC | bcftools view -H -G | head
1	4500006	.	0	1	.	.	AC=223
1	4500015	.	0	1	.	.	AC=2
1	4500042	.	0	1	.	.	AC=8
1	4500111	.	0	1	.	.	AC=1
1	4500157	.	0	1	.	.	AC=1
1	4500174	.	0	1	.	.	AC=1
1	4500191	.	0	1	.	.	AC=4
1	4500388	.	0	1	.	.	AC=1
1	4500394	.	0	1	.	.	AC=1
1	4500474	.	0	1	.	.	AC=1

I think what is happening is that the variants in the scaffold are having their allele count doubled.

Maybe it is some interaction here:

$ find phase_rare/ -name '*.cpp' -exec grep -Hn -i count_alt {} \;
phase_rare//src/io/haplotype_writer.cpp:88:			int count_alt = 0;
phase_rare//src/io/haplotype_writer.cpp:96:					count_alt += 2 * major_allele;
phase_rare//src/io/haplotype_writer.cpp:104:					count_alt -= 2 * major_allele;
phase_rare//src/io/haplotype_writer.cpp:105:					count_alt += a0+a1;
phase_rare//src/io/haplotype_writer.cpp:112:					count_alt += a0+a1;
phase_rare//src/io/haplotype_writer.cpp:115:					count_alt += a0+a1;
phase_rare//src/io/haplotype_writer.cpp:119:			bcf_update_info_int32(hdr, rec, "AC", &count_alt, 1);

phase_common_static：ERROR: Opening [1Z.vcf]: unknown error

hi，dear Delaneau.
Sorry to bother you, I have a problem with phase_common that bothers me.
as follows:

[root@localhost LD-Decay]# docker run shapeit5_2023-03-23_a4a1818 phase_common_static -I 1Z.vcf -R 1 -O shapeitphase_1Z.vcf --thread 6
[SHAPEIT5] phase_common (jointly phase multiple common markers)

Author : Olivier DELANEAU, University of Lausanne
Contact : [email protected]
Version : 5.1.0 / commit = 492781e / release = 2023-04-13
Run date : 15/05/2023 - 14:17:53
Files:
Input : [1Z.vcf]
Output : [shapeitphase_1Z.vcf]
Output format : [bcf]
Parameters:
Seed : 15052011
Threads : 6 threads
MCMC : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
PBWT : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
HMM : [window = 4cM / Ne = 15000 / Constant recombination rate of 1cM per Mb]
Reading genotype data:
[E::hts_open_format] Failed to open file "1Z.vcf" : No such file or directory
ERROR: Opening [1Z.vcf]: unknown error

I'm pretty sure 1Z.vcf is in the current running directory. But these errors still occur.
I still have this problem when I try to add absolute paths.
How can I solve this problem?
Looking forward to your reply!

phase_common error when output provided without a correct extension

If I don't include an extension in the output, I get some weird behavior from SHAPEIT5:

$ phase_common --input input.bcf --region chr1:0-100000 --output outp

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 18/07/2023 - 12:00:00

Files:
  * Input         : [input.bcf]
  * Output        : [outp]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 1 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Constant recombination rate of 1cM per Mb]
...
...
...
Finalization:
  * HAP solving (0.03s)
  * HAP update (0.02s)
  * H2V transpose (0.00s)
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 4)
Aborted (core dumped)

I believe that the problem arises from xcftools/common/src/utils/xcf.h line:

} else if (hts_fname.size() > 3 && hts_fname.substr(hts_fname.size()-5) == "vcf") {

where hts_fname.size()-5 underflows as the length of hts_fname is only 4 here

Similarly, I find suboptimal the behavior of SHAPEIT5 of throwing an error when not providing the correct extension:

$ phase_common --input input.bcf --region chr1:0-100000 --output output

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 18/07/2023 - 12:00:00

Files:
  * Input         : [input.bcf]
  * Output        : [output]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 1 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Constant recombination rate of 1cM per Mb]
...
...
...
Finalization:
  * HAP solving (0.15s)
  * HAP update (0.06s)
  * H2V transpose (0.01s)

ERROR: Filename extension of [output] not recognized

Where the error comes from xcftools/common/src/utils/xcf.h which is called by shapeit5/phase_common/src/io/haplotype_writer.cpp. Ideally the error should be issued before the whole phasing computation is performed so that the user can correct the problem without delay. I also find it odd that an error is issued at all as it seems from the initial line Output format : [bcf] that SHAPEIT5 had decided that the output would be bcf regardless of the extension

multi-allelic sites defaults supported？

Hi,
Thanks for development shapeit5, I know that shapeit is a widely recognized phased software, but earlier versions of shapeit such as shapeit2 don't seem to accept multi-allelic sites. Has the lastest version of shapeit5 added this support?

Regrads
Zhang

Rare phasing error: `Assertion V.vec_full[vt]->type == VARTYPE_SCAF failed`

Hi,

I'm trying to phase rare variants, which goes well for the greatest part. I used the files and instructions you provided in the documentation and GitHub. As mentioned, the majority works very well but some rare phasing chunks fail, as I show in this example below. Do you have any idea why this might be?

[SHAPEIT5] Phase_rare (phase rare variants onto a haplotype scaffold)
  * Authors       : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected] & [email protected]
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 30/06/2023 - 17:19:27

Files:
  * Unphased data : [chr22.exome_array.full.sorted.bcf]
  * Scaffold data : [chr22.exome_array.shapeit5.common_0.001.bcf]
  * Genetic Map   : [shapeit_maps/chr22.b38.gmap.gz]
  * Output data   : [chr22.exome_array.22_28090430_33113736.shapeit5.rares_0.001.bcf]
  * Output LOG    : [chr22.exome_array.22_28090430_33113736.shapeit5.rares_0.001.log]

Parameters:
  * Seed    : 15052011
  * Threads : 32 threads
  * PBWT    : [depth = 2,2 / modulo = 0.1 / mac = 2 / mdr = 0.1]
  * HMM     : [Ne = 15000 / Recombination rates given by genetic map]

Parsing specified genomic regions
  * Scaffold region  [chr22:27590423-33613738]
  * Input region  [chr22:28090430-33113736]

Reading genotype data
[E::bgzf_read_block] Invalid BGZF header at offset 706792049
[E::bgzf_read] Read block operation failed with error 6 after 384803 of 755137 bytes
  * BCF scanning (23.38s)
      + Variant sites: [#scaffold=2546 | #rare=5581 | #all=8127]
      + 4715 rare variants in buffer regions excluded
      + 322 variants found in both files [priority to scaffold]
  * HAP allocation [#scaffold=2546 / #samples=377567] (0.00s)
  * Genotype set allocation [#rare=5581 / #samples=377567] (0.00s)
[E::bgzf_read_block] Invalid BGZF header at offset 706792049
[E::bgzf_read] Read block operation failed with error 6 after 168546 of 755137 bytes
phase_rare_static: src/io/genotype_reader/genotype_reader_reading.cpp:50: void genotype_reader::readGenotypes(): Assertion `V.vec_full[vt]->type == VARTYPE_SCAF' failed.

The commands for common and rare phasing I use are as follows:

# phasing common variants
phase_common_static --input chr22.exome_array.full.sorted.bcf --map shapeit_maps/chr22.b38.gmap.gz --output chr22.exome_array.shapeit5.common_0.001.bcf --thread 64 --log chr22.exome_array.shapeit5.common_0.001.log --filter-maf 0.001 --region chr22

# phasing rare variants
phase_rare_static --input chr22.exome_array.full.sorted.bcf --scaffold chr22.exome_array.shapeit5.common_0.001.bcf --map shapeit_maps/chr22.b38.gmap.gz --output chr22.exome_array.22_28090430_33113736.shapeit5.rares_0.001.bcf --log chr22.exome_array.22_28090430_33113736.shapeit5.rares_0.001.log --scaffold-region chr22:27590423-33613738 --input-region chr22:28090430-33113736 --thread 32

Thanks in advance!

ERROR: Opening [NC_053450.1.vcf.gz]: unknown error

Hello,

I am getting an issue where my unphased vcf is in the current directory, but shapeit is saying it's not there: ERROR: Opening [NC_053450.1.vcf.gz]: unknown error
[E::hts_open_format] Failed to open file "NC_053450.1.vcf.gz" : No such file or directory

I am using the following command: docker run shapeit5_2023-05-05_d6ce1e2 phase_common_static --input NC_053450.1.vcf.gz --region NC_053450.1 -O NC_053450.1.phased

I have been unable to compile from source or to use the precompiled static binaries.

Please advise on what I should do, thank you.

convert :: to build or not to build ?

Hello,

shapeit5 makefile does not build convert project

is this a feature ?

one ca build convert by make -C convert

shall I propose convert to our users ?

regards

Eric

tutorials/ukb_wgs: What is qc_metrics_graphtyper_v2.7.1_qc.tab.gz ?

Hi all, thank you for shapeit.

In https://odelaneau.github.io/shapeit5/docs/tutorials/ukb_wgs/#quality-control :

Unless I'm wrong, I cannot see a description of the file

/mnt/project/Bulk/Whole\ genome\ sequences/Whole\ genome\ GraphTyper\ joint\ call\ pVCF/QC/qc_metrics_graphtyper_v2.7.1_qc.tab.gz

what is that file please ?

Thanks!

Genetic map file for T2T genome (CHM13)

Hi,

Thanks for this wonderful tool! I was trying to use Shapeit5 for phasing a set of samples processed based on the new released T2T human genome (CHM13). Could you please help pointing the right genetic mapping files? I can only find b37 and b38 genomes in the folder.

Thanks very much for your help!

Hang

Phase_common: AC field is needed in file

I am running phase_common on a BCF file (as well as variants to exclude in another BCF) acquired from an Axiom array.

While running, it gives an error stating that AC field is needed in file for each of the chromosomes. A sample of this is included below.

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 03/07/2023 - 15:32:31

Files:
  * Input         : [PATH/output/CANPREDDICT.FINAL_GENOTYPING_QC_2021_03_25.AxiomGT1.MoChA.chrX.bcf]
  * Reference     : [PATH/GRCh37/ALL.chrX.phase3_integrated.20130502.genotypes.bcf]
  * Genetic Map   : [PATH/output/genetic_map.chrX.txt]
  * Output        : [PATH/output/CANPREDDICT.FINAL_GENOTYPING_QC_2021_03_25.AxiomGT1.MoChA.chrX.pgt.bcf]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 4 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]

Reading genotype data:

ESC[31mERROR: ESC[0mAC field is needed in file

Since I am still fairly new to genotyping pipelines, I am unfamiliar with what the AC field is and how I can make sure it is present in my file (as well as which file (aka the input BCF or one of the references/genetic maps) needs it).

Would appreciate any advice, thank you!

Phasing WES Workflow

Hi Oliver,

Really great tool you've made here -- love the paper as well.

I have a question about the best strategy to phase data our group has generated. We currently have 33K WES samples, 7K WGS samples and about 15K genotypes (that correspond to the sequenced samples).

What should be our workflow in carrying out this phasing? I know in your tutorial you merged and subsetted to overlapping individuals in UKBB, however in our case I fear losing power by restricting ourselves to only half the individuals in our sequenced cohort.

Could you advise please?

Does phase_common use phase sets in WGS-derived VCF files?

Hello, Thank you for providing this wonderful tool. I am currently working with VCF files that have phased genotypes on certain variant sites when they are close. The majority of variants in the VCF files are unphased, while only a small portion are phased by GATK. I have noticed that shapeit4 can utilize the phase sets from the PS field in VCF files. Since phase_common has replaced shapeit4, I am wondering if it can also make use of this phasing set information.

odelaneau / shapeit5 Goto Github PK

shapeit5's People

Stargazers

Watchers

Forkers

shapeit5's Issues

Recommend Projects

Recommend Topics

Recommend Org