tkzeng / pangolin Goto Github PK
View Code? Open in Web Editor NEWPangolin is a deep-learning method for predicting splice site strengths.
License: GNU General Public License v3.0
Pangolin is a deep-learning method for predicting splice site strengths.
License: GNU General Public License v3.0
Pangolin provides both maximum scores and minimum scores. However, determining the cutoff value is crucial. We need to establish the specific values that indicate the significance of a splice variant for our purposes. Since I'm conducting whole genome sequencing (WGS), filtering out numerous variants is necessary. Could you guide me on where to find the cutoff scores?
Hi, I am trying to predit the splice site of an RNA virus artificially expressed in host genome. Because the organism I studied is not model organism, there is no VCF file. The pangolin seem to must use a vcf file as the positional arguments. Could pangolin predict the splice site ab initio?
Hi,
I had some issues finding the right versions for pangolin to run.
It worked by running:
conda install python=3.6
pip3 install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cpu
conda install -c conda-forge pyvcf
pip install gffutils biopython pandas pyfastx
Here are the conda and pip dependencies:
https://gist.github.com/MartGro/dfe979e79800a210faf0b85b43db62b0
Hello,
Is the training code for Pangolin available?
Hello,
I used to be able to use the Google collab notebook just fine, but with the latest updates from a couple of weeks ago, it doesn't seem to work anymore (even when I use the new db file gencode.v38lift37.annotation.Ensembl_canonical.db
). I would appreciate it if you could help me figure this one out!
The Brca example dataset doesn't return any predictions at all. This is what the output VCF file looks like:
##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=Pangolin,Number=.,Type=String,Description="Pangolin splice scores. Format: gene|pos:score_change|pos:score_change|...">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM POS ID REF ALT QUAL FILTER INFO
chr17 41276135 . TTAA GTCT . . Pangolin=
chr17 41276135 . T G . . Pangolin=
chr17 41276135 . T C . . Pangolin=
chr17 41276135 . T A . . Pangolin=
chr17 41276134 . T G . . Pangolin=
chr17 41276134 . T C . . Pangolin=
chr17 41276134 . T A . . Pangolin=
chr17 41276133 . C T . . Pangolin=
chr17 41276133 . C G . . Pangolin=
chr17 41276133 . C A . . Pangolin=
chr17 41276132 . A T . . Pangolin=
chr17 41276132 . A G . . Pangolin=
chr17 41276132 . A C . . Pangolin=
chr17 41276131 . A T . . Pangolin=
chr17 41276131 . A G . . Pangolin=
chr17 41276131 . A C . . Pangolin=
chr17 41276130 . G T . . Pangolin=
chr17 41276130 . G C . . Pangolin=
chr17 41276130 . G A . . Pangolin=
chr17 41276129 . T G . . Pangolin=
chr17 41276129 . T C . . Pangolin=
chr17 41276129 . T A . . Pangolin=
chr17 41276128 . A T . . Pangolin=
chr17 41276128 . A C . . Pangolin=
chr17 41276127 . A T . . Pangolin=
chr17 41276127 . A G . . Pangolin=
chr17 41276127 . A C . . Pangolin=
chr17 41276126 . C T . . Pangolin=
chr17 41276126 . C G . . Pangolin=
chr17 41276126 . C A . . Pangolin=
chr17 41276125 . C T . . Pangolin=
chr17 41276125 . C G . . Pangolin=
chr17 41276125 . C A . . Pangolin=
chr17 41276124 . T G . . Pangolin=
chr17 41276124 . T C . . Pangolin=
chr17 41276124 . T A . . Pangolin=
The other problem I am having is that Pangolin is complaining about the genome positions I am actually interested in (not just the example dataset) not being in a gene body. My input VCF file is the following:
##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr12 52082542 G1A G A . . .
chr12 52082542 G1T G T . . .
chr12 52082542 G1C G C . . .
chr12 52082543 T2A T A . . .
chr12 52082543 T2G T G . . .
chr12 52082543 T2C T C . . .
chr12 52082576 T35A T A . . .
chr12 52082576 T35G T G . . .
chr12 52082576 T35C T C . . .
These are point mutations in the SCN8A gene. These mutations fall inside the gene according to the annotations, but Pangolin returns the following error:
Using GPU
[Line 30] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 31] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 32] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 33] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 34] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 35] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 36] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 37] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 38] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
Dear Team,
We are trying to run the Pangolin tool with one of the GiaB (Genome in a Bottle) WGS sample VCF file. Is there a system configuration recommendation for running Pangolin efficiently for such a large sample ? In other words; Kindly let us know, for running Pangolin for a larger sample, what system/computing server configuration would you recommend ?
Thanks in advance.
Siva
Hi! I have a question whether the prediciton performance of 5'ss usage and 3'ss usage are same? I think the 3ss usage is more diffcult to predict. So if is ok, please offer your test set. I will be very appreciate about it!
Hi Tony,
Apologies for bothering you again with this. I was running the latest version of pangolin on some previous exon I already did a while back, and interestingly, I get very different results.
These are the results I obtain after using pangolin with default parameters in early October:
##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=Pangolin,Number=.,Type=String,Description="Pangolin splice scores. Format: gene|pos:score_change|pos:score_change|...">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM POS ID REF ALT QUAL FILTER INFO
10 90770511 A2C A C . . Pangolin=ENSG00000026103.22_1|-2:0.0|-1:-0.12
10 90770511 A2G A G . . Pangolin=ENSG00000026103.22_1|-3:0.0|-1:-0.14
10 90770511 A2T A T . . Pangolin=ENSG00000026103.22_1|-1:0.18|-7:-0.0
10 90770512 T3A T A . . Pangolin=ENSG00000026103.22_1|-8:0.0|-2:-0.05
10 90770512 T3C T C . . Pangolin=ENSG00000026103.22_1|-8:0.0|-2:-0.12
10 90770512 T3G T G . . Pangolin=ENSG00000026103.22_1|1:0.0|-2:-0.11
10 90770513 C4A C A . . Pangolin=ENSG00000026103.22_1|4:0.0|-3:-0.15
10 90770513 C4G C G . . Pangolin=ENSG00000026103.22_1|4:0.0|-3:-0.1
10 90770513 C4T C T . . Pangolin=ENSG00000026103.22_1|-50:-0.0|-3:-0.18
10 90770514 C5A C A . . Pangolin=ENSG00000026103.22_1|-4:0.08|3:-0.0
10 90770514 C5G C G . . Pangolin=ENSG00000026103.22_1|-4:0.05|3:-0.0
10 90770514 C5T C T . . Pangolin=ENSG00000026103.22_1|-18:0.0|-4:-0.26
10 90770515 A6C A C . . Pangolin=ENSG00000026103.22_1|-5:0.02|-5:-0.01
10 90770515 A6G A G . . Pangolin=ENSG00000026103.22_1|-5:0.18|2:-0.0
10 90770515 A6T A T . . Pangolin=ENSG00000026103.22_1|13:0.0|-5:-0.06
10 90770516 G7A G A . . Pangolin=ENSG00000026103.22_1|2:0.0|-6:-0.15
10 90770516 G7C G C . . Pangolin=ENSG00000026103.22_1|-8:0.0|-6:-0.06
10 90770516 G7T G T . . Pangolin=ENSG00000026103.22_1|17:0.0|-6:-0.19
10 90770517 A8C A C . . Pangolin=ENSG00000026103.22_1|-13:0.0|-7:-0.04
10 90770517 A8G A G . . Pangolin=ENSG00000026103.22_1|0:0.0|-7:-0.19
10 90770517 A8T A T . . Pangolin=ENSG00000026103.22_1|44:0.0|-7:-0.09
10 90770518 T9A T A . . Pangolin=ENSG00000026103.22_1|-8:0.02|-8:-0.0
10 90770518 T9C T C . . Pangolin=ENSG00000026103.22_1|-8:0.06|-1:-0.0
10 90770518 T9G T G . . Pangolin=ENSG00000026103.22_1|-8:0.03|-8:-0.01
10 90770519 C10A C A . . Pangolin=ENSG00000026103.22_1|-11:0.0|-9:-0.07
10 90770519 C10G C G . . Pangolin=ENSG00000026103.22_1|-9:0.06|-15:-0.0
10 90770519 C10T C T . . Pangolin=ENSG00000026103.22_1|-11:0.0|-9:-0.07
10 90770520 T11A T A . . Pangolin=ENSG00000026103.22_1|-10:0.23|41:-0.0
10 90770520 T11C T C . . Pangolin=ENSG00000026103.22_1|-10:0.28|41:-0.0
10 90770520 T11G T G . . Pangolin=ENSG00000026103.22_1|-10:0.4|41:-0.0
This is what I obtain after running pangolin on the same exon today:
##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=Pangolin,Number=.,Type=String,Description="Pangolin splice scores. Format: gene|pos:score_change|pos:score_change|...">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM POS ID REF ALT QUAL FILTER INFO
chr10 90770511 A2C A C . . Pangolin=ENSG00000026103.23_1|22:0.0|-1:-0.14|Warnings:
chr10 90770511 A2G A G . . Pangolin=ENSG00000026103.23_1|22:0.0|-1:-0.1|Warnings:
chr10 90770511 A2T A T . . Pangolin=ENSG00000026103.23_1|6:0.01|-50:0.0|Warnings:
chr10 90770512 T3A T A . . Pangolin=ENSG00000026103.23_1|6:0.0|-2:-0.05|Warnings:
chr10 90770512 T3C T C . . Pangolin=ENSG00000026103.23_1|21:0.0|-2:-0.12|Warnings:
chr10 90770512 T3G T G . . Pangolin=ENSG00000026103.23_1|1:0.0|-2:-0.12|Warnings:
chr10 90770513 C4A C A . . Pangolin=ENSG00000026103.23_1|4:0.0|-3:-0.17|Warnings:
chr10 90770513 C4G C G . . Pangolin=ENSG00000026103.23_1|11:0.0|-3:-0.09|Warnings:
chr10 90770513 C4T C T . . Pangolin=ENSG00000026103.23_1|-16:0.0|-3:-0.21|Warnings:
chr10 90770514 C5A C A . . Pangolin=ENSG00000026103.23_1|4:0.0|-50:0.0|Warnings:
chr10 90770514 C5G C G . . Pangolin=ENSG00000026103.23_1|10:0.0|-50:0.0|Warnings:
chr10 90770514 C5T C T . . Pangolin=ENSG00000026103.23_1|-37:0.0|-4:-0.27|Warnings:
chr10 90770515 A6C A C . . Pangolin=ENSG00000026103.23_1|18:0.0|-50:0.0|Warnings:
chr10 90770515 A6G A G . . Pangolin=ENSG00000026103.23_1|-48:0.0|-50:0.0|Warnings:
chr10 90770515 A6T A T . . Pangolin=ENSG00000026103.23_1|18:0.0|-5:-0.0|Warnings:
chr10 90770516 G7A G A . . Pangolin=ENSG00000026103.23_1|45:0.0|-6:-0.17|Warnings:
chr10 90770516 G7C G C . . Pangolin=ENSG00000026103.23_1|17:0.0|-6:-0.05|Warnings:
chr10 90770516 G7T G T . . Pangolin=ENSG00000026103.23_1|17:0.0|-6:-0.21|Warnings:
chr10 90770517 A8C A C . . Pangolin=ENSG00000026103.23_1|7:0.0|-7:-0.02|Warnings:
chr10 90770517 A8G A G . . Pangolin=ENSG00000026103.23_1|27:0.0|-7:-0.17|Warnings:
chr10 90770517 A8T A T . . Pangolin=ENSG00000026103.23_1|7:0.0|-7:-0.09|Warnings:
chr10 90770518 T9A T A . . Pangolin=ENSG00000026103.23_1|43:0.0|-49:0.0|Warnings:
chr10 90770518 T9C T C . . Pangolin=ENSG00000026103.23_1|9:0.0|-50:0.0|Warnings:
chr10 90770518 T9G T G . . Pangolin=ENSG00000026103.23_1|0:0.0|-8:-0.02|Warnings:
chr10 90770519 C10A C A . . Pangolin=ENSG00000026103.23_1|6:0.0|-9:-0.08|Warnings:
chr10 90770519 C10G C G . . Pangolin=ENSG00000026103.23_1|3:0.0|-50:0.0|Warnings:
chr10 90770519 C10T C T . . Pangolin=ENSG00000026103.23_1|5:0.0|-9:-0.08|Warnings:
chr10 90770520 T11A T A . . Pangolin=ENSG00000026103.23_1|3:0.0|-49:0.0|Warnings:
chr10 90770520 T11C T C . . Pangolin=ENSG00000026103.23_1|2:0.0|-50:0.0|Warnings:
chr10 90770520 T11G T G . . Pangolin=ENSG00000026103.23_1|2:0.01|-49:0.0|Warnings:
Essentially, mutations with predicted positive effects (e.g. T11G) are no longer predicted to have a positive effect. This is troublesome because I have experimental mutagenesis data for this particular exon, and the 'old' pangolin was amazing at predicting the effects of mutations in my dataset, incuding those mutations with a positive score. However, it now seems like positive scores default to 0 in most situations.
Do you know if something changed between October and now that could have affected this?
Thanks again for all the troubleshooting!
Trying to run pangolin with the UCSC hg38 genome, which has some lowercase sequences. "[Line 64] WARNING, skipping variant: Mismatch between FASTA (ref base: g) and variant file (ref base: G)." error subsequently occurs as a result of the if statement at line 110. Attempts have been made to make seq uppercase using built in Python function however this has been unsuccessful in resolving the issue.
Would appreciate accommodations made to the script to support lowercase sequences - if resolved in the meantime, will update issue with the solution.
When using version 1.0.1, it encounters errors for some of the variants with the following error message, for example the first variant below worked fine but the second produced the error. Used GRCh38 annotation file gencode.v38.annotation.db provided. Could you please help to solve the error? Thanks.
CHROM,POS,REF,ALT
chr3,66167878,G,C
chr3,66192184,T,C
Traceback (most recent call last):
File "/opt/conda/bin/pangolin", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 241, in main
scores = process_variant(lnum+1, str(chr), int(pos), ref, alt, gtf, models, args)
File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 127, in process_variant
loss_pos, gain_pos = compute_score(ref_seq, alt_seq, '+', d, models)
File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 30, in compute_score
ref_seq = one_hot_encode(ref_seq, strand).T
File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 22, in one_hot_encode
seq = np.asarray(list(map(int, list(seq))))
ValueError: invalid literal for int() with base 10: 'Y'
Hi,
I'm setting up pangolin in our cluster, and I am struggling to make it work for the VCF example file (csv works fine).
pangolin examples/brca.vcf GRCh37.primary_assembly.genome.fa gencode.v38lift37.annotation.db brca
Using CPU
Traceback (most recent call last):
File "/home/pedro.barbosa/software/miniconda3/envs/pangolin/bin/pangolin", line 8, in <module>
sys.exit(main())
File "/home/pedro.barbosa/software/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin/pangolin.py", line 240, in main
"Format: gene|pos:score_change|pos:score_change|...",'.','.')
TypeError: __new__() missing 1 required positional argument: 'type_code'
Was there any update in the pyvcf dependency that was not accounted for ? I didn't specify version for the dependencies, just followed the github instructions. Please find below the versions I'm using:
pangolin (1.0.2)
pyvcf (0.6.8)
pyfaidx (0.7.0)
gffutils (0.11.0)
biopython (1.79)
Best,
Pedro
Hi!
thanks for making such a great tool!
After reading the paper and as I have started using pangolin, I have some questions on how one should interpret the results.
For example, following your usage example, I ran the following version of your BRCA example: (Note that I have added a line without a mutation.)
gene,CHROM,POS,REF,ALT
BRCA1,17,41276135,T,T
BRCA1,17,41276135,T,G
BRCA1,17,41276135,T,C
This gives the following output:
gene,CHROM,POS,REF,ALT,Pangolin
BRCA1,17,41276135,T,T,ENSG00000012048.23_1|-50:0.0|-50:0.0|Warnings:
BRCA1,17,41276135,T,G,ENSG00000012048.23_1|-3:0.15|-16:-0.04|Warnings:
BRCA1,17,41276135,T,C,ENSG00000012048.23_1|-1:0.86|-3:-0.83|Warnings:
From your README (gene|pos:largest_increase|pos:largest_decrease|
), I understand pos
are the positions with maximum and minimum splicing scores increase and decrease relative to the site where we introduced the mutation.
Then, in the first case (no mutation), should pos
be 0? (of course, this is a special case, I just want to be sure that I understand the output)
How should one interpret the scores for a mutation when both lead to similarly high values?
For example, the 2nd mutation BRCA1,17,41276135,T,G,ENSG00000012048.23_1|-3:0.15|-16:-0.04|Warnings:
generates a splicing score gain at position -3 (w.r.t. the mutation site 41276135) larger than the splicing score loss at position -16. So, in this case, I would expect an increase in PSI right?
However, for the 3rd case, BRCA1,17,41276135,T,C,ENSG00000012048.23_1|-1:0.86|-3:-0.83|Warnings:
, the mutation changes de splice core similarly in both directions and in close sites. Then, should I interpreted that this mutation will change the PSI because splicing sites of similar strength appear one compensating for the other?
Finally, in your paper, you show two examples predicting the effect of mutations in splicing for the Julien et al. and the Baeza-Centurion et al. datasets. In the methods, you explained that, for each variant, you scored the 5' and 3' splice sites. How did you obtain the score gains/losses in both cases separately? Did you do it by increasing the distance
parameter enough to span both sites?
Thank you very much in advance for your help! Cheers,
Miquel
Great tool!
Just a future suggestion - maybe give the user the option to output pangolin results in separate columns rather than just in the INFO column of VCF files as it makes it difficult for users to read?
Alternatively, an even better method would be to write a VEP plugin.
Thank you! :)
The Pangolin seems to be computing intensive and takes a long time even with GPU, is it possible to provide an hg38 genome wide score file similar to spliceAI? Thanks.
Hello, I am getting this kind of error. The input files are hg38 based, db and and reference used were the same as the ones suggested in theGoogle Colab code. My input vcf file at least visually appears to be formatted exactly the same as your example vcf.
Using CPU
Traceback (most recent call last):
File "anaconda3/envs/pytorch/bin/pangolin", line 8, in
sys.exit(main())
File "anaconda3/envs/pytorch/lib/python3.10/site-packages/pangolin/pangolin.py", line 238, in main
variants.infos["Pangolin"] = vcf.parser._Info(
TypeError: Info.new() missing 1 required positional argument: 'type_code'
Could you help me with this please?
Asta
Dear Authors,
Thank you for the great tool.
I want to implement an option to predict scores with batch size larger than 1. During my first tests I noticed, that the predictions differ depending on the batch size. Could you check what might be the reason for this behaviour of the model? Below, I provide the example variant (chr12-110435045-G-A), for which the score differs when it's predicted for the single variant and for the provided batch of size 4: 0.5400000214576721 in the original version against 0.5299999713897705 on the batch. I also provide my code to reproduce the issue. To make the question more compact, I give an example with a prediction mismatch for just one of the models.
import torch
import numpy as np
import pyfastx
from pkg_resources import resource_filename
from pangolin.model import *
###############################################################################################
test_variants = [
'chr12-110435044-T-C',
'chr12-110435044-T-G',
'chr12-110435045-G-A',
'chr12-110435045-G-C',
]
atol = 0.000001 # tolerance value to be used in np.allclose()
d = 50
reference_fasta_path = 'GRCh38.primary_assembly.genome.fa'
###############################################################################################
# the same as in the original version
IN_MAP = np.asarray([[0, 0, 0, 0],
[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]])
def one_hot_encode(seq, strand):
seq = seq.upper().replace('A', '1').replace('C', '2')
seq = seq.replace('G', '3').replace('T', '4').replace('N', '0')
if strand == '+':
seq = np.asarray(list(map(int, list(seq))))
elif strand == '-':
seq = np.asarray(list(map(int, list(seq[::-1]))))
seq = (5 - seq) % 5 # Reverse complement
return IN_MAP[seq.astype('int8')]
models = []
for i in [0,2,4,6]:
for j in range(1,4):
model = Pangolin(L, W, AR)
if torch.cuda.is_available():
model.cuda()
weights = torch.load(resource_filename(__name__,"models/final.%s.%s.3.v2" % (j, i)))
else:
weights = torch.load(resource_filename(__name__,"models/final.%s.%s.3.v2" % (j, i)), map_location=torch.device('cpu'))
model.load_state_dict(weights)
model.eval()
models.append(model)
###############################################################################################
# process variants
def prepare_variant_for_batch(lnum, chr, pos, ref, alt, fasta, d):
seq = fasta[chr][pos-5001-d:pos+len(ref)+4999+d].seq
ref_seq = seq
alt_seq = seq[:5000+d] + alt + seq[5000+d+len(ref):]
return ref_seq, alt_seq
fasta = pyfastx.Fasta(reference_fasta_path)
batch_chroms = []
batch_positions = []
batch_refs = []
batch_alts = []
for test_variant in test_variants:
chr = test_variant.split('-')[0]
pos = int(test_variant.split('-')[1])
ref = test_variant.split('-')[2]
alt = test_variant.split('-')[3]
ref_seq, alt_seq = prepare_variant_for_batch(0, chr, pos, ref, alt, fasta, d)
batch_chroms.append(chr)
batch_positions.append(pos)
batch_refs.append(ref_seq)
batch_alts.append(alt_seq)
model = models[0]
strand = '-'
# predict on batch
encoded_refs = [] # store encoded reference sequences in a list
encoded_alts = [] # store encoded alternative sequences in a list
for i in range(len(batch_refs)):
ref_seq = torch.from_numpy(one_hot_encode(batch_refs[i], strand).T).float()
alt_seq = torch.from_numpy(one_hot_encode(batch_alts[i], strand).T).float()
encoded_refs.append(ref_seq)
encoded_alts.append(alt_seq)
batch_ref = torch.stack(encoded_refs) # create a tensor with multiple ref sequences
batch_alt = torch.stack(encoded_alts) # create a tensor with multiple alt sequences
if torch.cuda.is_available():
batch_ref = batch_ref.to(torch.device("cuda"))
batch_alt = batch_alt.to(torch.device("cuda"))
with torch.no_grad():
pred_ref = model(batch_ref)[:,[1,4,7,10][j],:].cpu().numpy() # [0][[1,4,7,10][j],:].cpu().numpy() modify indexing
pred_alt = model(batch_alt)[:,[1,4,7,10][j],:].cpu().numpy() # [0][[1,4,7,10][j],:].cpu().numpy() modify indexing
# predict single
i=2
ref_seq = one_hot_encode(batch_refs[i], strand).T
ref_seq = torch.from_numpy(np.expand_dims(ref_seq, axis=0)).float()
alt_seq = one_hot_encode(batch_alts[i], strand).T
alt_seq = torch.from_numpy(np.expand_dims(alt_seq, axis=0)).float()
if torch.cuda.is_available():
ref_seq = ref_seq.to(torch.device("cuda"))
alt_seq = alt_seq.to(torch.device("cuda"))
with torch.no_grad():
pred_ref_single = model(ref_seq)[0][[1,4,7,10][j],:].cpu().numpy()
pred_alt_single = model(ref_seq)[0][[1,4,7,10][j],:].cpu().numpy()
# compare
print(np.allclose(pred_ref_single, pred_ref[i], atol=atol)) # Switches from True to False between atol=0.00001 and atol=0.000001
Hey,
These are the versions for packages I used:
pangolin (1.0.2)
pyvcf (0.6.8)
pyfaidx (0.8.1.1)
gffutils (0.12)
biopython (1.83)
I used this command with vcf as input file:
pangolin /Pangolin/examples/cosmic_pangolin.vcf /gencode.v43.pc_transcripts.fa /Pangolin/gencode.v38.annotation.db cosmic_pangolin
The error I got:
Using CPU
Traceback (most recent call last):
File "/Users/sruthisrinivasan/miniconda3/bin/pangolin", line 8, in <module>
sys.exit(main())
File "/Users/sruthisrinivasan/miniconda3/lib/python3.10/site-packages/pangolin/pangolin.py", line 244, in main
variants.infos["Pangolin"] = vcf.parser._Info(
TypeError: Info.__new__() missing 1 required positional argument: 'type_code'
Please help in solving this issue.
Thanks,
Sruthi
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.