smart-phase's People

Contributors

dependabot[bot], paulhager

smart-phase's Issues

VCF output question

Hello,

I am glad I found your software for the advantages it offers over other phasing methods.

I am using it to phase de novo variants on a human cohort in multiple families. As far as I can tell, the software is running as expected, and outputs a TSV file successfully for each of my trios.

However - the main thing I'm interested in doing is determining which de novo variants can be assigned to their maternal or paternal origin.

My first question is - in the VCF output, should I expect it to conform to the convention where the first allele is paternal and the second allele is maternal?

Second... I'm not sure exactly what is happening but the VCF output only gets the SPGT format tag assigned to chromosome 1. Here is the code I'm running:

java -jar smartPhase.jar \
  -a ${vcf} \
  -p ${kid} \
  -g validation_positions.500bp.up_and_downstream.bed \
  -r ${bam1},${bam2} \
  -m 30,30 \
  -d ${family}.ped \
  -o ${kid}.tsv \
  -x \
  -t \
  -vcf \
  -c 0.1 \
  --physical-phasing 2>&1 | tee ${kid}.smartphase.log

A sample ped file:
1004_21 1004021 0 0 1 0
1004_21 1004022 0 0 2 0
1004_21 1004023 1004021 1004022 2 0

I have confirmed that the BAM files have coverage in all regions of the genome, and that the VCF listing all genotype variants for the trio also includes an even distribution... I'm quite puzzled by all this.

Anyway I'm not sure that anything can be done at your end because there are no errors thrown... the VCF gets written as expected, but just doesn't have the phasing information anywhere except chromosome 1.

The VCF is from GATK HC, and the BAM files are produced according to GATK best practices.

Let me know if you have any idea what might be happening!

Thanks,
Matt
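
One way to narrow down a report like this is to tabulate, per contig, how many records in the output VCF actually list SPGT in their FORMAT column. The following is a minimal sketch and not part of smart-phase: the function names are hypothetical, and it assumes a standard tab-delimited VCF (plain, or gzip/bgzip-compressed with a .gz suffix) that has a FORMAT column.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip/bgzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def format_key_counts(vcf_path, format_key="SPGT"):
    """Count records per contig, and those whose FORMAT column lists format_key."""
    total, tagged = Counter(), Counter()
    with open_text(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # sites-only record: no FORMAT column
            contig, fmt = fields[0], fields[8]
            total[contig] += 1
            if format_key in fmt.split(":"):
                tagged[contig] += 1
    return total, tagged
```

Calling format_key_counts on the output VCF and comparing the two counters shows at a glance whether contigs other than chromosome 1 have variant records but no SPGT entries, or no records at all.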

Error when running in trio mode

Hello,

I am trying to run smart-phase in explorative mode for a trio. My script is below:
java -jar ~/tools/smart-phase/smartPhase.jar \
  -g ~/tools/smart-phase/BED/allGeneRegionsCanonical.HG19.GRCh37.bed \
  -a 489.vcf.gz \
  -p 489_TR0011 \
  -r 489_TR0011.bam \
  -m 30 \
  -d 489.ped -o 489_output.tsv -x -t -vcf -c 0.1

And I get this error:

------------------
INTERVAL CONTIG: 1
INTERVAL START: 58946390
INTERVAL END: 59012446
Filtered Variants found in interval: 2
Exception in thread "main" java.lang.NullPointerException
	at smartPhase.SmartPhase.trioPhase(SmartPhase.java:1998)
	at smartPhase.SmartPhase.main(SmartPhase.java:651)

When I remove the -t flag (i.e. I don't run in trio mode), I don't get errors. I'm not sure why it doesn't work in trio mode. Help would be appreciated!
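
Before digging into the stack trace, it can help to rule out an inconsistency between the PED file and the VCF. The sketch below is only an input sanity check and makes no claim about what causes the NullPointerException. It assumes the usual six-column PED layout (family, individual, father, mother, sex, phenotype, with 0 for an unknown parent), and all function names are hypothetical.

```python
def parse_ped(lines):
    """Map each individual ID to its (father ID, mother ID) from PED lines."""
    parents = {}
    for line in lines:
        parts = line.split()
        if line.startswith("#") or len(parts) < 6:
            continue  # skip comments, blank lines and truncated rows
        _family, individual, father, mother = parts[:4]
        parents[individual] = (father, mother)
    return parents


def vcf_samples(header_line):
    """Return the sample names from a VCF '#CHROM' header line."""
    return header_line.rstrip("\n").split("\t")[9:]


def trio_problems(parents, proband, samples):
    """List reasons why the proband cannot be treated as a complete trio."""
    if proband not in parents:
        return [f"proband {proband} has no row in the PED file"]
    problems = []
    if proband not in samples:
        problems.append(f"proband {proband} is not a sample in the VCF")
    for role, sample in zip(("father", "mother"), parents[proband]):
        if sample == "0":
            problems.append(f"{role} is missing (0) in the PED file")
        elif sample not in samples:
            problems.append(f"{role} {sample} is not a sample in the VCF")
    return problems
```

An empty list from trio_problems means the proband has a PED row, both parents are named there, and all three IDs appear as samples in the VCF header.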

Using smart-phase on non-human samples

Hi,

I am attempting to use smart-phase on VCFs from cat and dog tumour samples. However, when I do so, I receive a "contig not in header" error message for both species. After some investigation, I noticed that this error occurs any time smart-phase encounters a non-human chromosome name (e.g. A2 in cat, or 23 in dog). Would you possibly be willing to adapt your code so that it can handle any chromosome name? Any help with this issue would be greatly appreciated. I will put my error messages and their associated files below:

Cat
Command:
smart-phase -g CATD0161a_vs_CATD0161b.muts.ids.smartphase.bed -p CATD0161b -r CATD0161b.sample.dupmarked.bam -m0 -x -o CATD0161a_vs_CATD0161b.phased -a CATD0161a_vs_CATD0161b.muts.ids.vcf.gz

Error:

Exception in thread "main" java.lang.Exception: Exception while reading bed file: 
Cannot add interval A2:51272440-51272441	-	., contig not in header
	at smartPhase.SmartPhase.main(SmartPhase.java:462)

BED file (head):

A2	51272439	51272441
A2	56410588	56410591
A2	72116584	72116586
A2	77899144	77899148
A2	158214164	158214166
A3	35335863	35335865
A3	45021035	45021037
AANG04000872.1	25070	25072
AANG04001062.1	45355	45357
AANG04002062.1	1472	1474

Dog
Command:
smart-phase -g DD1461a_vs_DD1461b.adjacent_snvs.bed -p DD1461b -r DD1461a.sample.dupmarked.bam -m0 -x -o DD1461a_vs_DD1461b.phased -a DD1461a_vs_DD1461b.muts.ids.vcf.gz

Error:

Exception in thread "main" java.lang.Exception: Exception while reading bed file: 
Cannot add interval 23:49633551-49633552	-	., contig not in header
	at smartPhase.SmartPhase.main(SmartPhase.java:462)

BED file (head):

1	102322808	102322810
1	112405800	112405802
14	11037964	11037966
16	13842451	13842453
18	8273010	8273012
20	44582709	44582711
23	49633550	49633552
27	5548500	5548502
30	7409939	7409941
35	19083488	19083490

Many thanks,
Bailey
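
To see which contig names would trigger this message, the BED file can be compared against the contigs declared in a header. The sketch below is an illustration, not smart-phase code. It assumes the relevant header is the VCF's set of ##contig lines, which the error text does not confirm (it could equally be the BAM header), and the function names are hypothetical.

```python
import re

# Matches VCF meta-information lines such as ##contig=<ID=A2,length=...>
CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]+)")


def header_contigs(vcf_header_lines):
    """Collect contig IDs declared in '##contig=<ID=...>' header lines."""
    contigs = set()
    for line in vcf_header_lines:
        match = CONTIG_ID.match(line)
        if match:
            contigs.add(match.group(1))
    return contigs


def bed_contigs(bed_lines):
    """Return contig names (first BED column) in order of first appearance."""
    seen = []
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        contig = line.split()[0]
        if contig not in seen:
            seen.append(contig)
    return seen


def missing_contigs(bed_lines, vcf_header_lines):
    """List BED contigs that the VCF header does not declare."""
    declared = header_contigs(vcf_header_lines)
    return [contig for contig in bed_contigs(bed_lines) if contig not in declared]
```

If missing_contigs returns an empty list for the cat or dog files, the names are declared in the VCF header, and the rejection would have to come from a contig list kept elsewhere.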

suspicious haplotypes derived from smart-phase

Dear Tim:

I have now put a very small test dataset at https://github.com/jielab/ukb/tree/master/data: 1687346.vcf.gz and 1687346.bam.

I ran the following smart-phase command to phase the two APOE SNPs:
java -jar /mnt/d/software_lin/smartPhase.jar -a 1687346.vcf.gz -p 1687346 -g apoe.b38.bed -r 1687346.bam -m 60 -x -vcf -c 0.1 -o 1687346.tsv

The phased haplotype in the output file 1687346_sp.vcf.gz (tag "SPGT") is the same as the original haplotype (tag "GT").

However, I actually found that the original GT is wrong. When I load the BAM file into IGV, as you see below, the first SNP is C/T, while the second SNP is C/C (i.e., non-polymorphic). Therefore, the GT for the second SNP rs7412 should be "1|1" instead of "1|0".

[Screenshot: IGV view of the two APOE SNPs]

I then changed the GT for the second SNP, rs7412, to "1|1" in 1687346.vcf.gz and re-ran the above command. This time there is no genotype data in the output file 1687346_sp.vcf.gz, so it seems that smart-phase considers the corrected VCF file impossible.

Can you please kindly run this testing dataset on your side and let me know if I did something wrong here?

Thank you & best regards,
Jie

Impossible to open VCF v4.2 files..

Hi, thanks for your tool, seems useful.
However, I could not open any VCF file. I get the following error:
Exception in thread "main" java.lang.Exception: While parsing filtered variants, tried to split using '\t' but failed. Please ensure your file is tab-seperated, or ends in .csv if it is comma separated.
at smartPhase.FilteredVariantReader.<init>(FilteredVariantReader.java:94)

My VCF is generated by GATK and the version is:
##fileformat=VCFv4.2
Thanks for your help.

Exception in thread "main" java.lang.Exception: While parsing filtered variants

Hi, there:

I am trying to use smart-Phase to phase a BAM file, without using any family data or existing phased data that exist in VCF. The following command works:
java -jar smartPhase.jar \
  -g ./BED/allGeneRegionsCanonical.HG19.GRCh37.bed \
  -a ./UseCase/CEU_UseCase.vcf.gz -p NA12878 \
  -r ./UseCase/CEU_UseCase.bam -m 60 \
  -o output.tsv

And I got the following output.
ACTRT2-1-2938045-2939467 1-2938924-T-G 1-2938989-G-A 2 0.8081305751896183
CLDN19-1-43198763-43205925 1-43201534-G-A 1-43201614-C-T 1 0.8050514884016794
PCSK4-19-1481426-1490407 19-1487195-G-A 19-1490285-G-A 4 0.0
Denovo count: 0
Cis count: 1
Avg dist between cis: 80.0
Trans count: 1
Avg dist between trans: 65.0
Newblock count: 1
Contradiction count: 0

I saw that the CEU_UseCase.vcf.gz file includes 6 SNPs for 3 samples. These are the same 6 SNPs listed in the above output. So my understanding is that the VCF is used only to list the SNPs to be phased, correct? If I exclude the -a ./UseCase/CEU_UseCase.vcf.gz -p NA12878 options from the above command, it won't run. Instead, it simply gives me a usage: Welcome to SmartPhase! page. Isn't there a simpler way to run smart-Phase, without using a VCF file? Can't I simply specify a list of SNPs in a text file?

I found there are > 30,000 SNPs in the BED file. I guess this file lists all genes in the human genome. Again, I have no idea why we need a file like this if I simply want to phase the 6 SNPs used in the example. If I exclude the -g ./BED/allGeneRegionsCanonical.HG19.GRCh37.bed option from the above command, I get this error: Exception in thread "main" java.lang.Exception: While parsing filtered variants

Your reply would be greatly appreciated!

Best regards,
Jie
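
For the BED part of this question, a small region file covering only the variants of interest can be generated instead of using the genome-wide gene list. This is a minimal sketch under stated assumptions: the function name and the flank size are hypothetical, the output is plain three-column BED, intervals are not merged when they overlap, and the end coordinate is not clamped to the contig length. It does not remove the need for the -a VCF.

```python
def flanking_bed_lines(variants, flank=500):
    """Yield 3-column BED lines covering `flank` bases either side of each variant.

    `variants` holds (contig, position) pairs with 1-based positions, as in a VCF.
    BED intervals are 0-based and half-open, so a variant at 1-based position p
    alone occupies [p - 1, p).
    """
    for contig, pos in variants:
        start = max(0, pos - 1 - flank)  # clamp at the start of the contig
        end = pos + flank
        yield f"{contig}\t{start}\t{end}"


# Two of the variants from the output above, as (contig, 1-based position).
variants = [("1", 2938924), ("1", 2938989)]
for bed_line in flanking_bed_lines(variants, flank=100):
    print(bed_line)
```

The printed lines can be redirected to a file and supplied through the -g option in place of allGeneRegionsCanonical.HG19.GRCh37.bed.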

Exception in thread "main" java.lang.NullPointerException

Hi, there:

Please refer to the screenshot below. I got an error message like this.
Can you please kindly let me know what I might have done wrong?

[Screenshot: error message]

Also, must I use the "-a XXX.vcf" option to specify which variants to phase? What if I use the "-g file.bed" option instead?

Thank you & best regards,
Jie

New Tagged Release

Hi there,

Would it be possible to have a tagged release of smart-phase as it is at the moment, please, or is there one planned in the very near future?
We're looking to make a Docker image that includes smart-phase, so we would prefer to download a specific release. That way the version can't change under our feet when building.

Regards,

David

Possible length issue with indels in TSV file

Hello,

I wanted to alert you to another potential file writing issue - I was just reading some of the TSVs into R and I could not figure out for the life of me why it wasn't happy with the number of columns, as everything looked fine. It happened to be another edge case I think... in which only two columns were written because the length of an indel took up the whole line? See below for the problematic example. It only happened in one of my ~50 individuals I think.

All the best!
Matt

1-1247563-1248563       chr1-1248091-GTGGGCAGCCCTGGGAGGCTGGACTGAGGGAGGCTGGACTTCCCACTCAGGCCTACACGCAGGAAAA-G
1-1378594-1379594       chr1-1378635-C-T        chr1-1378792-C-T        1       1.0
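
Rows like the first one above can be located before loading the file into R by counting fields per line. The following is a rough sketch: the function name is hypothetical, and the expected width of five tab-separated columns is an assumption taken from the complete row shown above.

```python
def rows_with_unexpected_width(lines, expected=5):
    """Return (line number, field count, first field) for rows whose
    tab-separated field count differs from `expected`."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # ignore blank lines
        if len(fields) != expected:
            flagged.append((number, len(fields), fields[0]))
    return flagged
```

Passing an open file handle for the TSV gives the line numbers of the short rows, which makes it easier to tell whether the fields are truly absent or merely hidden by a long indel allele.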
