paulhager / smart-phase
A comprehensive and intelligent clinical phasing tool
License: GNU General Public License v3.0
Hello,
I am glad I found your software for the advantages it offers over other phasing methods.
I am using it to phase de novo variants on a human cohort in multiple families. As far as I can tell, the software is running as expected, and outputs a TSV file successfully for each of my trios.
However - the main thing I'm interested in doing is determining which de novo variants can be assigned to their maternal or paternal origin.
My first question is - in the VCF output, should I expect it to conform to the convention where the first allele is paternal and the second allele is maternal?
Second... I'm not sure exactly what is happening but the VCF output only gets the SPGT format tag assigned to chromosome 1. Here is the code I'm running:
java -jar smartPhase.jar \
  -a ${vcf} \
  -p ${kid} \
  -g validation_positions.500bp.up_and_downstream.bed \
  -r ${bam1},${bam2} \
  -m 30,30 \
  -d ${family}.ped \
  -o ${kid}.tsv \
  -x \
  -t \
  -vcf \
  -c 0.1 \
  --physical-phasing 2>&1 | tee ${kid}.smartphase.log
A sample ped file:
1004_21 1004021 0 0 1 0
1004_21 1004022 0 0 2 0
1004_21 1004023 1004021 1004022 2 0
I have confirmed that the BAM files have coverage in all regions of the genome, and that the VCF listing all genotype variants for the trio also includes an even distribution... I'm quite puzzled by all this.
Anyway I'm not sure that anything can be done at your end because there are no errors thrown... the VCF gets written as expected, but just doesn't have the phasing information anywhere except chromosome 1.
The VCF is from GATK HC, and the BAM files are produced according to GATK best practices.
Let me know if you have any idea what might be happening!
Thanks,
Matt
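One way to narrow down a report like the one above is to count, per chromosome, how many records in the output VCF actually carry the SPGT tag in their FORMAT column. The following is only a minimal sketch: the function name spgt_counts_by_contig and the example file name are hypothetical, and it assumes a standard tab-separated VCF (plain or gzip/bgzip compressed) with FORMAT in the ninth column.

```python
import gzip
from collections import Counter


def spgt_counts_by_contig(vcf_path):
    """Return {contig: (records_with_SPGT, total_records)} for a VCF file.

    Hypothetical helper, not part of smart-phase. Assumes FORMAT is the
    ninth tab-separated column, as in a standard VCF.
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    tagged, total = Counter(), Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the #CHROM header line
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # record has no FORMAT column
            contig = fields[0]
            total[contig] += 1
            if "SPGT" in fields[8].split(":"):
                tagged[contig] += 1
    return {contig: (tagged[contig], total[contig]) for contig in total}


if __name__ == "__main__":
    import sys

    # Example: python spgt_counts.py kid_sp.vcf.gz  (file name is illustrative)
    for contig, (n_tagged, n_total) in spgt_counts_by_contig(sys.argv[1]).items():
        print(f"{contig}\t{n_tagged}\t{n_total}")
```

If only chromosome 1 shows a non-zero tagged count, comparing the contig names in the BED, VCF, and BAM files for the remaining chromosomes would be a reasonable next step.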
Hello,
I am trying to run smart-phase in explorative mode for a trio. My script is below:
java -jar ~/tools/smart-phase/smartPhase.jar \
  -g ~/tools/smart-phase/BED/allGeneRegionsCanonical.HG19.GRCh37.bed \
  -a 489.vcf.gz \
  -p 489_TR0011 \
  -r 489_TR0011.bam \
  -m 30 \
  -d 489.ped -o 489_output.tsv -x -t -vcf -c 0.1
And I get this error:
------------------
INTERVAL CONTIG: 1
INTERVAL START: 58946390
INTERVAL END: 59012446
Filtered Variants found in interval: 2
Exception in thread "main" java.lang.NullPointerException
at smartPhase.SmartPhase.trioPhase(SmartPhase.java:1998)
at smartPhase.SmartPhase.main(SmartPhase.java:651)
When I remove the -t flag (i.e. I don't run in trio mode), I don't get errors. I'm not sure why it doesn't work in trio mode. Help would be appreciated!
Hi,
I am attempting to use smart-phase on VCFs from cat and dog tumour samples. However, when I do so, I receive a "contig not in header" error message for both species. After some investigation, I noticed that this error occurs any time smart-phase encounters a non-human chromosome name (e.g. A2 in cat, or 23 in dog). Would you possibly be willing to adapt your code so that it can handle any chromosome name? Any help with this issue would be greatly appreciated. I will put my error messages and their associated files below:
Cat
Command:
smart-phase -g CATD0161a_vs_CATD0161b.muts.ids.smartphase.bed -p CATD0161b -r CATD0161b.sample.dupmarked.bam -m0 -x -o CATD0161a_vs_CATD0161b.phased -a CATD0161a_vs_CATD0161b.muts.ids.vcf.gz
Error:
Exception in thread "main" java.lang.Exception: Exception while reading bed file:
Cannot add interval A2:51272440-51272441 - ., contig not in header
at smartPhase.SmartPhase.main(SmartPhase.java:462)
BED file (head):
A2 51272439 51272441
A2 56410588 56410591
A2 72116584 72116586
A2 77899144 77899148
A2 158214164 158214166
A3 35335863 35335865
A3 45021035 45021037
AANG04000872.1 25070 25072
AANG04001062.1 45355 45357
AANG04002062.1 1472 1474
Dog
Command:
smart-phase -g DD1461a_vs_DD1461b.adjacent_snvs.bed -p DD1461b -r DD1461a.sample.dupmarked.bam -m0 -x -o DD1461a_vs_DD1461b.phased -a DD1461a_vs_DD1461b.muts.ids.vcf.gz
Error:
Exception in thread "main" java.lang.Exception: Exception while reading bed file:
Cannot add interval 23:49633551-49633552 - ., contig not in header
at smartPhase.SmartPhase.main(SmartPhase.java:462)
BED file (head):
1 102322808 102322810
1 112405800 112405802
14 11037964 11037966
16 13842451 13842453
18 8273010 8273012
20 44582709 44582711
23 49633550 49633552
27 5548500 5548502
30 7409939 7409941
35 19083488 19083490
Many thanks,
Bailey
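A quick way to see which BED contigs would trigger a "contig not in header" message is to compare the contig names in the BED file against those declared in a header. The sketch below is an assumption-laden illustration: it assumes the header in question is the VCF header (##contig lines), which the error message does not confirm, and the function names are hypothetical. If the tool checks the BAM header instead, the same comparison would apply to its @SQ lines.

```python
import gzip
import re

# Matches VCF meta-information lines such as ##contig=<ID=A2,length=...>
CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+)")


def vcf_header_contigs(vcf_path):
    """Collect contig IDs declared in the ## header lines of a VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    contigs = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM line or the first record
            match = CONTIG_RE.match(line)
            if match:
                contigs.add(match.group(1))
    return contigs


def bed_contigs(bed_path):
    """Collect the contig names used in the first column of a BED file."""
    contigs = set()
    with open(bed_path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue  # skip blank lines, comments, and BED header lines
            contigs.add(line.split()[0])
    return contigs


def missing_contigs(bed_path, vcf_path):
    """Return the sorted BED contigs that the VCF header does not declare."""
    return sorted(bed_contigs(bed_path) - vcf_header_contigs(vcf_path))
```

An empty result would suggest that the contig names themselves are consistent between the two files and that the problem lies elsewhere.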
Dear Tim:
I have now put a very small test dataset at https://github.com/jielab/ukb/tree/master/data: 1687346.vcf.gz and 1687346.bam.
I ran the following smart-phase command to phase the two APOE SNPs:
java -jar /mnt/d/software_lin/smartPhase.jar -a 1687346.vcf.gz -p 1687346 -g apoe.b38.bed -r 1687346.bam -m 60 -x -vcf -c 0.1 -o 1687346.tsv
The phased haplotype in the output file 1687346_sp.vcf.gz (tag "SPGT") is the same as the original haplotype (tag "GT").
However, I actually found that the original GT is wrong. When I load the BAM file into IGV, as you see below, the first SNP is C/T, while the second SNP is C/C (i.e., non-polymorphic). Therefore, the GT for the second SNP rs7412 should be "1|1" instead of "1|0".
I then changed the GT for the second SNP rs7412 to "1|1" in the 1687346.vcf.gz file and re-ran the above command. This time, there is no genotype data at all in the output file 1687346_sp.vcf.gz, so it seems that smart-phase considers the corrected VCF file to be impossible.
Could you please run this test dataset on your side and let me know if I did something wrong here?
Thank you & best regards,
Jie
Hi, thanks for your tool, seems useful.
However, I could not open any VCF file. I get the following error:
Exception in thread "main" java.lang.Exception: While parsing filtered variants, tried to split using '\t' but failed. Please ensure your file is tab-seperated, or ends in .csv if it is comma separated.
at smartPhase.FilteredVariantReader.<init>(FilteredVariantReader.java:94)
My VCF is generated by GATK and the version is:
##fileformat=VCFv4.2
Thanks for your help.
Hi, there:
I am trying to use smart-Phase to phase a BAM file, without using any family data or existing phase information from a VCF. The following command works:
java -jar smartPhase.jar \
  -g ./BED/allGeneRegionsCanonical.HG19.GRCh37.bed \
  -a ./UseCase/CEU_UseCase.vcf.gz -p NA12878 \
  -r ./UseCase/CEU_UseCase.bam -m 60 \
  -o output.tsv
And I got the following output.
ACTRT2-1-2938045-2939467 1-2938924-T-G 1-2938989-G-A 2 0.8081305751896183
CLDN19-1-43198763-43205925 1-43201534-G-A 1-43201614-C-T 1 0.8050514884016794
PCSK4-19-1481426-1490407 19-1487195-G-A 19-1490285-G-A 4 0.0
Denovo count: 0
Cis count: 1
Avg dist between cis: 80.0
Trans count: 1
Avg dist between trans: 65.0
Newblock count: 1
Contradiction count: 0
I saw that the CEU_UseCase.vcf.gz file includes 6 SNPs for 3 samples, and these are the same 6 SNPs listed in the output above. So, my understanding is that the VCF is used only to list the SNPs to be phased, correct? If I exclude the -a ./UseCase/CEU_UseCase.vcf.gz -p NA12878 options from the above command, it won't run. Instead, it simply gives me the usage: Welcome to SmartPhase! page. Isn't there a simpler way to run smart-Phase without a VCF file? Can't I simply specify a list of SNPs in a text file?
I found that there are > 30,000 entries in the BED file. I guess this file lists all genes in the human genome. Again, I have no idea why we need a file like this if I simply want to phase the 6 SNPs used in the example. If I exclude the -g ./BED/allGeneRegionsCanonical.HG19.GRCh37.bed option from the above command, I get an error: Exception in thread "main" java.lang.Exception: While parsing filtered variants
Your reply would be greatly appreciated!
Best regards,
Jie
Hi, there:
Please refer to the screenshot below. I got an error message like this.
Can you please kindly let me know what I might have done wrong?
Also, must I use the "-a XXX.vcf" option to specify which variants to phase? What if I use the "-g file.bed" option instead?
Thank you & best regards,
Jie
Hi there,
Would it be possible to have a tagged release of smart-phase as it is at the moment, please, or is there one planned in the very near future?
We're looking to make a Docker image including smart-phase so linking to a specific release would be preferable to download so the version doesn't have the potential to change under our feet when building.
Regards,
David
Hello,
I wanted to alert you to another potential file writing issue - I was just reading some of the TSVs into R and I could not figure out for the life of me why it wasn't happy with the number of columns, as everything looked fine. It happened to be another edge case I think... in which only two columns were written because the length of an indel took up the whole line? See below for the problematic example. It only happened in one of my ~50 individuals I think.
All the best!
Matt
1-1247563-1248563 chr1-1248091-GTGGGCAGCCCTGGGAGGCTGGACTGAGGGAGGCTGGACTTCCCACTCAGGCCTACACGCAGGAAAA-G
1-1378594-1379594 chr1-1378635-C-T chr1-1378792-C-T 1 1.0
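Rows like the two-column one above can be located before loading a TSV into R by scanning for lines whose field count differs from the expected one. This is a minimal sketch only: the function name ragged_rows is hypothetical, and the default of 5 columns is an assumption taken from the well-formed example line shown above.

```python
def ragged_rows(tsv_path, expected_columns=5):
    """Return (line_number, column_count) for rows with an unexpected width.

    Hypothetical helper; expected_columns=5 is assumed from the well-formed
    example row (interval, variant 1, variant 2, flag, confidence).
    """
    bad = []
    with open(tsv_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            n_columns = len(line.split("\t"))
            if n_columns != expected_columns:
                bad.append((line_number, n_columns))
    return bad


if __name__ == "__main__":
    import sys

    # Example: python ragged_rows.py kid.tsv  (file name is illustrative)
    for line_number, n_columns in ragged_rows(sys.argv[1]):
        print(f"line {line_number}: {n_columns} columns")
```

Any summary lines written into the same file would also be reported, so the output is best read as a list of candidates to inspect rather than confirmed errors.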