tkzeng / pangolin Goto Github PK

View Code? Open in Web Editor NEW

58.0 58.0 31.0 190.05 MB

Pangolin is a deep-learning method for predicting splice site strengths.

License: GNU General Public License v3.0

Python 84.72% Jupyter Notebook 15.28%

pangolin's People

Contributors

Stargazers

Watchers

pangolin's Issues

Minscores & Maxscores value?

Pangolin provides both maximum scores and minimum scores. However, determining the cutoff value is crucial. We need to establish the specific values that indicate the significance of a splice variant for our purposes. Since I'm conducting whole genome sequencing (WGS), filtering out numerous variants is necessary. Could you guide me on where to find the cutoff scores?

Could pangolin predict the splice site ab initio?

Hi, I am trying to predit the splice site of an RNA virus artificially expressed in host genome. Because the organism I studied is not model organism, there is no VCF file. The pangolin seem to must use a vcf file as the positional arguments. Could pangolin predict the splice site ab initio?

Dependency Versions

Hi,
I had some issues finding the right versions for pangolin to run.
It worked by running:

conda install python=3.6
pip3 install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cpu
conda install -c conda-forge pyvcf
pip install gffutils biopython pandas pyfastx

Here are the conda and pip dependencies:
https://gist.github.com/MartGro/dfe979e79800a210faf0b85b43db62b0

training code for Pangolin?

Hello,

Is the training code for Pangolin available?

WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?

Hello,

I used to be able to use the Google collab notebook just fine, but with the latest updates from a couple of weeks ago, it doesn't seem to work anymore (even when I use the new db file gencode.v38lift37.annotation.Ensembl_canonical.db). I would appreciate it if you could help me figure this one out!

The Brca example dataset doesn't return any predictions at all. This is what the output VCF file looks like:

##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=Pangolin,Number=.,Type=String,Description="Pangolin splice scores. Format: gene|pos:score_change|pos:score_change|...">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr17	41276135	.	TTAA	GTCT	.	.	Pangolin=
chr17	41276135	.	T	G	.	.	Pangolin=
chr17	41276135	.	T	C	.	.	Pangolin=
chr17	41276135	.	T	A	.	.	Pangolin=
chr17	41276134	.	T	G	.	.	Pangolin=
chr17	41276134	.	T	C	.	.	Pangolin=
chr17	41276134	.	T	A	.	.	Pangolin=
chr17	41276133	.	C	T	.	.	Pangolin=
chr17	41276133	.	C	G	.	.	Pangolin=
chr17	41276133	.	C	A	.	.	Pangolin=
chr17	41276132	.	A	T	.	.	Pangolin=
chr17	41276132	.	A	G	.	.	Pangolin=
chr17	41276132	.	A	C	.	.	Pangolin=
chr17	41276131	.	A	T	.	.	Pangolin=
chr17	41276131	.	A	G	.	.	Pangolin=
chr17	41276131	.	A	C	.	.	Pangolin=
chr17	41276130	.	G	T	.	.	Pangolin=
chr17	41276130	.	G	C	.	.	Pangolin=
chr17	41276130	.	G	A	.	.	Pangolin=
chr17	41276129	.	T	G	.	.	Pangolin=
chr17	41276129	.	T	C	.	.	Pangolin=
chr17	41276129	.	T	A	.	.	Pangolin=
chr17	41276128	.	A	T	.	.	Pangolin=
chr17	41276128	.	A	C	.	.	Pangolin=
chr17	41276127	.	A	T	.	.	Pangolin=
chr17	41276127	.	A	G	.	.	Pangolin=
chr17	41276127	.	A	C	.	.	Pangolin=
chr17	41276126	.	C	T	.	.	Pangolin=
chr17	41276126	.	C	G	.	.	Pangolin=
chr17	41276126	.	C	A	.	.	Pangolin=
chr17	41276125	.	C	T	.	.	Pangolin=
chr17	41276125	.	C	G	.	.	Pangolin=
chr17	41276125	.	C	A	.	.	Pangolin=
chr17	41276124	.	T	G	.	.	Pangolin=
chr17	41276124	.	T	C	.	.	Pangolin=
chr17	41276124	.	T	A	.	.	Pangolin=

The other problem I am having is that Pangolin is complaining about the genome positions I am actually interested in (not just the example dataset) not being in a gene body. My input VCF file is the following:

##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr12	52082542	G1A	G	A	.	.	.
chr12	52082542	G1T	G	T	.	.	.
chr12	52082542	G1C	G	C	.	.	.
chr12	52082543	T2A	T	A	.	.	.
chr12	52082543	T2G	T	G	.	.	.
chr12	52082543	T2C	T	C	.	.	.
chr12	52082576	T35A	T	A	.	.	.
chr12	52082576	T35G	T	G	.	.	.
chr12	52082576	T35C	T	C	.	.	.

These are point mutations in the SCN8A gene. These mutations fall inside the gene according to the annotations, but Pangolin returns the following error:

Using GPU
[Line 30] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 31] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 32] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 33] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 34] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 35] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 36] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 37] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 38] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?

System/server configuration recommendation for running WGS sample with Pangolin

Dear Team,

We are trying to run the Pangolin tool with one of the GiaB (Genome in a Bottle) WGS sample VCF file. Is there a system configuration recommendation for running Pangolin efficiently for such a large sample ? In other words; Kindly let us know, for running Pangolin for a larger sample, what system/computing server configuration would you recommend ?

Thanks in advance.
Siva

Prediction of 5'ss usage and 3'ss usage

Hi! I have a question whether the prediciton performance of 5'ss usage and 3'ss usage are same? I think the 3ss usage is more diffcult to predict. So if is ok, please offer your test set. I will be very appreciate about it!

Problems predicting positive scores

Hi Tony,

Apologies for bothering you again with this. I was running the latest version of pangolin on some previous exon I already did a while back, and interestingly, I get very different results.

These are the results I obtain after using pangolin with default parameters in early October:

##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=Pangolin,Number=.,Type=String,Description="Pangolin splice scores. Format: gene|pos:score_change|pos:score_change|...">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
10	90770511	A2C	A	C	.	.	Pangolin=ENSG00000026103.22_1|-2:0.0|-1:-0.12
10	90770511	A2G	A	G	.	.	Pangolin=ENSG00000026103.22_1|-3:0.0|-1:-0.14
10	90770511	A2T	A	T	.	.	Pangolin=ENSG00000026103.22_1|-1:0.18|-7:-0.0
10	90770512	T3A	T	A	.	.	Pangolin=ENSG00000026103.22_1|-8:0.0|-2:-0.05
10	90770512	T3C	T	C	.	.	Pangolin=ENSG00000026103.22_1|-8:0.0|-2:-0.12
10	90770512	T3G	T	G	.	.	Pangolin=ENSG00000026103.22_1|1:0.0|-2:-0.11
10	90770513	C4A	C	A	.	.	Pangolin=ENSG00000026103.22_1|4:0.0|-3:-0.15
10	90770513	C4G	C	G	.	.	Pangolin=ENSG00000026103.22_1|4:0.0|-3:-0.1
10	90770513	C4T	C	T	.	.	Pangolin=ENSG00000026103.22_1|-50:-0.0|-3:-0.18
10	90770514	C5A	C	A	.	.	Pangolin=ENSG00000026103.22_1|-4:0.08|3:-0.0
10	90770514	C5G	C	G	.	.	Pangolin=ENSG00000026103.22_1|-4:0.05|3:-0.0
10	90770514	C5T	C	T	.	.	Pangolin=ENSG00000026103.22_1|-18:0.0|-4:-0.26
10	90770515	A6C	A	C	.	.	Pangolin=ENSG00000026103.22_1|-5:0.02|-5:-0.01
10	90770515	A6G	A	G	.	.	Pangolin=ENSG00000026103.22_1|-5:0.18|2:-0.0
10	90770515	A6T	A	T	.	.	Pangolin=ENSG00000026103.22_1|13:0.0|-5:-0.06
10	90770516	G7A	G	A	.	.	Pangolin=ENSG00000026103.22_1|2:0.0|-6:-0.15
10	90770516	G7C	G	C	.	.	Pangolin=ENSG00000026103.22_1|-8:0.0|-6:-0.06
10	90770516	G7T	G	T	.	.	Pangolin=ENSG00000026103.22_1|17:0.0|-6:-0.19
10	90770517	A8C	A	C	.	.	Pangolin=ENSG00000026103.22_1|-13:0.0|-7:-0.04
10	90770517	A8G	A	G	.	.	Pangolin=ENSG00000026103.22_1|0:0.0|-7:-0.19
10	90770517	A8T	A	T	.	.	Pangolin=ENSG00000026103.22_1|44:0.0|-7:-0.09
10	90770518	T9A	T	A	.	.	Pangolin=ENSG00000026103.22_1|-8:0.02|-8:-0.0
10	90770518	T9C	T	C	.	.	Pangolin=ENSG00000026103.22_1|-8:0.06|-1:-0.0
10	90770518	T9G	T	G	.	.	Pangolin=ENSG00000026103.22_1|-8:0.03|-8:-0.01
10	90770519	C10A	C	A	.	.	Pangolin=ENSG00000026103.22_1|-11:0.0|-9:-0.07
10	90770519	C10G	C	G	.	.	Pangolin=ENSG00000026103.22_1|-9:0.06|-15:-0.0
10	90770519	C10T	C	T	.	.	Pangolin=ENSG00000026103.22_1|-11:0.0|-9:-0.07
10	90770520	T11A	T	A	.	.	Pangolin=ENSG00000026103.22_1|-10:0.23|41:-0.0
10	90770520	T11C	T	C	.	.	Pangolin=ENSG00000026103.22_1|-10:0.28|41:-0.0
10	90770520	T11G	T	G	.	.	Pangolin=ENSG00000026103.22_1|-10:0.4|41:-0.0

This is what I obtain after running pangolin on the same exon today:

##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=Pangolin,Number=.,Type=String,Description="Pangolin splice scores. Format: gene|pos:score_change|pos:score_change|...">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr10	90770511	A2C	A	C	.	.	Pangolin=ENSG00000026103.23_1|22:0.0|-1:-0.14|Warnings:
chr10	90770511	A2G	A	G	.	.	Pangolin=ENSG00000026103.23_1|22:0.0|-1:-0.1|Warnings:
chr10	90770511	A2T	A	T	.	.	Pangolin=ENSG00000026103.23_1|6:0.01|-50:0.0|Warnings:
chr10	90770512	T3A	T	A	.	.	Pangolin=ENSG00000026103.23_1|6:0.0|-2:-0.05|Warnings:
chr10	90770512	T3C	T	C	.	.	Pangolin=ENSG00000026103.23_1|21:0.0|-2:-0.12|Warnings:
chr10	90770512	T3G	T	G	.	.	Pangolin=ENSG00000026103.23_1|1:0.0|-2:-0.12|Warnings:
chr10	90770513	C4A	C	A	.	.	Pangolin=ENSG00000026103.23_1|4:0.0|-3:-0.17|Warnings:
chr10	90770513	C4G	C	G	.	.	Pangolin=ENSG00000026103.23_1|11:0.0|-3:-0.09|Warnings:
chr10	90770513	C4T	C	T	.	.	Pangolin=ENSG00000026103.23_1|-16:0.0|-3:-0.21|Warnings:
chr10	90770514	C5A	C	A	.	.	Pangolin=ENSG00000026103.23_1|4:0.0|-50:0.0|Warnings:
chr10	90770514	C5G	C	G	.	.	Pangolin=ENSG00000026103.23_1|10:0.0|-50:0.0|Warnings:
chr10	90770514	C5T	C	T	.	.	Pangolin=ENSG00000026103.23_1|-37:0.0|-4:-0.27|Warnings:
chr10	90770515	A6C	A	C	.	.	Pangolin=ENSG00000026103.23_1|18:0.0|-50:0.0|Warnings:
chr10	90770515	A6G	A	G	.	.	Pangolin=ENSG00000026103.23_1|-48:0.0|-50:0.0|Warnings:
chr10	90770515	A6T	A	T	.	.	Pangolin=ENSG00000026103.23_1|18:0.0|-5:-0.0|Warnings:
chr10	90770516	G7A	G	A	.	.	Pangolin=ENSG00000026103.23_1|45:0.0|-6:-0.17|Warnings:
chr10	90770516	G7C	G	C	.	.	Pangolin=ENSG00000026103.23_1|17:0.0|-6:-0.05|Warnings:
chr10	90770516	G7T	G	T	.	.	Pangolin=ENSG00000026103.23_1|17:0.0|-6:-0.21|Warnings:
chr10	90770517	A8C	A	C	.	.	Pangolin=ENSG00000026103.23_1|7:0.0|-7:-0.02|Warnings:
chr10	90770517	A8G	A	G	.	.	Pangolin=ENSG00000026103.23_1|27:0.0|-7:-0.17|Warnings:
chr10	90770517	A8T	A	T	.	.	Pangolin=ENSG00000026103.23_1|7:0.0|-7:-0.09|Warnings:
chr10	90770518	T9A	T	A	.	.	Pangolin=ENSG00000026103.23_1|43:0.0|-49:0.0|Warnings:
chr10	90770518	T9C	T	C	.	.	Pangolin=ENSG00000026103.23_1|9:0.0|-50:0.0|Warnings:
chr10	90770518	T9G	T	G	.	.	Pangolin=ENSG00000026103.23_1|0:0.0|-8:-0.02|Warnings:
chr10	90770519	C10A	C	A	.	.	Pangolin=ENSG00000026103.23_1|6:0.0|-9:-0.08|Warnings:
chr10	90770519	C10G	C	G	.	.	Pangolin=ENSG00000026103.23_1|3:0.0|-50:0.0|Warnings:
chr10	90770519	C10T	C	T	.	.	Pangolin=ENSG00000026103.23_1|5:0.0|-9:-0.08|Warnings:
chr10	90770520	T11A	T	A	.	.	Pangolin=ENSG00000026103.23_1|3:0.0|-49:0.0|Warnings:
chr10	90770520	T11C	T	C	.	.	Pangolin=ENSG00000026103.23_1|2:0.0|-50:0.0|Warnings:
chr10	90770520	T11G	T	G	.	.	Pangolin=ENSG00000026103.23_1|2:0.01|-49:0.0|Warnings:

Essentially, mutations with predicted positive effects (e.g. T11G) are no longer predicted to have a positive effect. This is troublesome because I have experimental mutagenesis data for this particular exon, and the 'old' pangolin was amazing at predicting the effects of mutations in my dataset, incuding those mutations with a positive score. However, it now seems like positive scores default to 0 in most situations.

Do you know if something changed between October and now that could have affected this?

Thanks again for all the troubleshooting!

Reference genome mismatch due to lowercase sequence

https://github.com/tkzeng/Pangolin/blob/5cf94b8db938c658391b4305cd7ce33297d44ff7/pangolin/pangolin.py#LL110C1-L111C1

Trying to run pangolin with the UCSC hg38 genome, which has some lowercase sequences. "[Line 64] WARNING, skipping variant: Mismatch between FASTA (ref base: g) and variant file (ref base: G)." error subsequently occurs as a result of the if statement at line 110. Attempts have been made to make seq uppercase using built in Python function however this has been unsuccessful in resolving the issue.

Would appreciate accommodations made to the script to support lowercase sequences - if resolved in the meantime, will update issue with the solution.

Error for some variants

When using version 1.0.1, it encounters errors for some of the variants with the following error message, for example the first variant below worked fine but the second produced the error. Used GRCh38 annotation file gencode.v38.annotation.db provided. Could you please help to solve the error? Thanks.

CHROM,POS,REF,ALT
chr3,66167878,G,C
chr3,66192184,T,C

Traceback (most recent call last):
File "/opt/conda/bin/pangolin", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 241, in main
scores = process_variant(lnum+1, str(chr), int(pos), ref, alt, gtf, models, args)
File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 127, in process_variant
loss_pos, gain_pos = compute_score(ref_seq, alt_seq, '+', d, models)
File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 30, in compute_score
ref_seq = one_hot_encode(ref_seq, strand).T
File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 22, in one_hot_encode
seq = np.asarray(list(map(int, list(seq))))
ValueError: invalid literal for int() with base 10: 'Y'

Error running pangolin with vcf files

Hi,

I'm setting up pangolin in our cluster, and I am struggling to make it work for the VCF example file (csv works fine).

pangolin examples/brca.vcf GRCh37.primary_assembly.genome.fa gencode.v38lift37.annotation.db brca

Using CPU
Traceback (most recent call last):
  File "/home/pedro.barbosa/software/miniconda3/envs/pangolin/bin/pangolin", line 8, in <module>
    sys.exit(main())
  File "/home/pedro.barbosa/software/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin/pangolin.py", line 240, in main
    "Format: gene|pos:score_change|pos:score_change|...",'.','.')
TypeError: __new__() missing 1 required positional argument: 'type_code'

Was there any update in the pyvcf dependency that was not accounted for ? I didn't specify version for the dependencies, just followed the github instructions. Please find below the versions I'm using:

pangolin (1.0.2)
pyvcf (0.6.8)
pyfaidx (0.7.0)
gffutils (0.11.0)
biopython (1.79)

Best,
Pedro

interpretation of pangolin results

Hi!

thanks for making such a great tool!

After reading the paper and as I have started using pangolin, I have some questions on how one should interpret the results.

For example, following your usage example, I ran the following version of your BRCA example: (Note that I have added a line without a mutation.)

gene,CHROM,POS,REF,ALT
BRCA1,17,41276135,T,T
BRCA1,17,41276135,T,G
BRCA1,17,41276135,T,C

This gives the following output:

gene,CHROM,POS,REF,ALT,Pangolin
BRCA1,17,41276135,T,T,ENSG00000012048.23_1|-50:0.0|-50:0.0|Warnings:
BRCA1,17,41276135,T,G,ENSG00000012048.23_1|-3:0.15|-16:-0.04|Warnings:
BRCA1,17,41276135,T,C,ENSG00000012048.23_1|-1:0.86|-3:-0.83|Warnings:

From your README (gene|pos:largest_increase|pos:largest_decrease|), I understand pos are the positions with maximum and minimum splicing scores increase and decrease relative to the site where we introduced the mutation.

Then, in the first case (no mutation), should pos be 0? (of course, this is a special case, I just want to be sure that I understand the output)
How should one interpret the scores for a mutation when both lead to similarly high values?

For example, the 2nd mutation BRCA1,17,41276135,T,G,ENSG00000012048.23_1|-3:0.15|-16:-0.04|Warnings: generates a splicing score gain at position -3 (w.r.t. the mutation site 41276135) larger than the splicing score loss at position -16. So, in this case, I would expect an increase in PSI right?

However, for the 3rd case, BRCA1,17,41276135,T,C,ENSG00000012048.23_1|-1:0.86|-3:-0.83|Warnings:, the mutation changes de splice core similarly in both directions and in close sites. Then, should I interpreted that this mutation will change the PSI because splicing sites of similar strength appear one compensating for the other?
Finally, in your paper, you show two examples predicting the effect of mutations in splicing for the Julien et al. and the Baeza-Centurion et al. datasets. In the methods, you explained that, for each variant, you scored the 5' and 3' splice sites. How did you obtain the score gains/losses in both cases separately? Did you do it by increasing the distance parameter enough to span both sites?

Thank you very much in advance for your help! Cheers,

Miquel

Suggestion for alternative output/VEP plugin?

Great tool!

Just a future suggestion - maybe give the user the option to output pangolin results in separate columns rather than just in the INFO column of VCF files as it makes it difficult for users to read?

Alternatively, an even better method would be to write a VEP plugin.

Thank you! :)

Possible to provide an hg38 genome wide score file?

The Pangolin seems to be computing intensive and takes a long time even with GPU, is it possible to provide an hg38 genome wide score file similar to spliceAI? Thanks.

missing 1 required positional argument: 'type_code'

Hello, I am getting this kind of error. The input files are hg38 based, db and and reference used were the same as the ones suggested in theGoogle Colab code. My input vcf file at least visually appears to be formatted exactly the same as your example vcf.

Using CPU
Traceback (most recent call last):
File "anaconda3/envs/pytorch/bin/pangolin", line 8, in
sys.exit(main())
File "anaconda3/envs/pytorch/lib/python3.10/site-packages/pangolin/pangolin.py", line 238, in main
variants.infos["Pangolin"] = vcf.parser._Info(
TypeError: Info.new() missing 1 required positional argument: 'type_code'

Could you help me with this please?
Asta

Inconsistency of predictions depending on batch size

Dear Authors,

Thank you for the great tool.

I want to implement an option to predict scores with batch size larger than 1. During my first tests I noticed, that the predictions differ depending on the batch size. Could you check what might be the reason for this behaviour of the model? Below, I provide the example variant (chr12-110435045-G-A), for which the score differs when it's predicted for the single variant and for the provided batch of size 4: 0.5400000214576721 in the original version against 0.5299999713897705 on the batch. I also provide my code to reproduce the issue. To make the question more compact, I give an example with a prediction mismatch for just one of the models.

import torch
import numpy as np
import pyfastx
from pkg_resources import resource_filename
from pangolin.model import *

###############################################################################################
test_variants = [
    'chr12-110435044-T-C',
    'chr12-110435044-T-G',
    'chr12-110435045-G-A',
    'chr12-110435045-G-C',
]

atol = 0.000001 # tolerance value to be used in np.allclose()
d = 50
reference_fasta_path = 'GRCh38.primary_assembly.genome.fa'
###############################################################################################
# the same as in the original version

IN_MAP = np.asarray([[0, 0, 0, 0],
                     [1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1]])


def one_hot_encode(seq, strand):
    seq = seq.upper().replace('A', '1').replace('C', '2')
    seq = seq.replace('G', '3').replace('T', '4').replace('N', '0')
    if strand == '+':
        seq = np.asarray(list(map(int, list(seq))))
    elif strand == '-':
        seq = np.asarray(list(map(int, list(seq[::-1]))))
        seq = (5 - seq) % 5  # Reverse complement
    return IN_MAP[seq.astype('int8')]

models = []
for i in [0,2,4,6]:
    for j in range(1,4):
        model = Pangolin(L, W, AR)
        if torch.cuda.is_available():
            model.cuda()
            weights = torch.load(resource_filename(__name__,"models/final.%s.%s.3.v2" % (j, i)))
        else:
            weights = torch.load(resource_filename(__name__,"models/final.%s.%s.3.v2" % (j, i)), map_location=torch.device('cpu'))
        model.load_state_dict(weights)
        model.eval()
        models.append(model)

###############################################################################################
# process variants

def prepare_variant_for_batch(lnum, chr, pos, ref, alt, fasta, d):
    
    seq = fasta[chr][pos-5001-d:pos+len(ref)+4999+d].seq
    
    ref_seq = seq
    alt_seq = seq[:5000+d] + alt + seq[5000+d+len(ref):]
    
    return ref_seq, alt_seq

fasta = pyfastx.Fasta(reference_fasta_path)

batch_chroms = []
batch_positions = []
batch_refs = []
batch_alts = []

for test_variant in test_variants:
    chr = test_variant.split('-')[0]
    pos = int(test_variant.split('-')[1])
    ref = test_variant.split('-')[2]
    alt = test_variant.split('-')[3]
    
    ref_seq, alt_seq = prepare_variant_for_batch(0, chr, pos, ref, alt, fasta, d)
    
    batch_chroms.append(chr)
    batch_positions.append(pos)
    batch_refs.append(ref_seq)
    batch_alts.append(alt_seq)

model = models[0]

strand = '-'

# predict on batch

encoded_refs = [] # store encoded reference sequences in a list
encoded_alts = [] # store encoded alternative sequences in a list
    
for i in range(len(batch_refs)):
    ref_seq = torch.from_numpy(one_hot_encode(batch_refs[i], strand).T).float()
    alt_seq = torch.from_numpy(one_hot_encode(batch_alts[i], strand).T).float()
    encoded_refs.append(ref_seq)
    encoded_alts.append(alt_seq)
        
batch_ref = torch.stack(encoded_refs) # create a tensor with multiple ref sequences
batch_alt = torch.stack(encoded_alts) # create a tensor with multiple alt sequences
    
if torch.cuda.is_available():
    batch_ref = batch_ref.to(torch.device("cuda"))
    batch_alt = batch_alt.to(torch.device("cuda"))

with torch.no_grad():
    pred_ref = model(batch_ref)[:,[1,4,7,10][j],:].cpu().numpy() # [0][[1,4,7,10][j],:].cpu().numpy() modify indexing
    pred_alt = model(batch_alt)[:,[1,4,7,10][j],:].cpu().numpy() # [0][[1,4,7,10][j],:].cpu().numpy() modify indexing

# predict single

i=2

ref_seq = one_hot_encode(batch_refs[i], strand).T
ref_seq = torch.from_numpy(np.expand_dims(ref_seq, axis=0)).float()
alt_seq = one_hot_encode(batch_alts[i], strand).T
alt_seq = torch.from_numpy(np.expand_dims(alt_seq, axis=0)).float()

if torch.cuda.is_available():
    ref_seq = ref_seq.to(torch.device("cuda"))
    alt_seq = alt_seq.to(torch.device("cuda"))

with torch.no_grad():
    pred_ref_single = model(ref_seq)[0][[1,4,7,10][j],:].cpu().numpy()
    pred_alt_single = model(ref_seq)[0][[1,4,7,10][j],:].cpu().numpy()

# compare

print(np.allclose(pred_ref_single, pred_ref[i], atol=atol)) # Switches from True to False between atol=0.00001 and atol=0.000001

error while running the pangolin script

Hey,

These are the versions for packages I used:
pangolin (1.0.2)
pyvcf (0.6.8)
pyfaidx (0.8.1.1)
gffutils (0.12)
biopython (1.83)

I used this command with vcf as input file:
pangolin /Pangolin/examples/cosmic_pangolin.vcf /gencode.v43.pc_transcripts.fa /Pangolin/gencode.v38.annotation.db cosmic_pangolin

The error I got:

Using CPU
Traceback (most recent call last):
  File "/Users/sruthisrinivasan/miniconda3/bin/pangolin", line 8, in <module>
    sys.exit(main())
  File "/Users/sruthisrinivasan/miniconda3/lib/python3.10/site-packages/pangolin/pangolin.py", line 244, in main
    variants.infos["Pangolin"] = vcf.parser._Info(
TypeError: Info.__new__() missing 1 required positional argument: 'type_code'

Please help in solving this issue.

Thanks,
Sruthi

tkzeng / pangolin Goto Github PK

pangolin's People

Contributors

Stargazers

Watchers

Forkers

pangolin's Issues

Recommend Projects

Recommend Topics

Recommend Org