murphycj / agfusion

Python package to annotate and visualize gene fusions.

Home Page: https://www.agfusion.app

License: MIT License

gene-fusion protein fusion rna-seq python cancer cancer-genomics chimera bioinformatics structural-variation

agfusion's Introduction

Annotate Gene Fusion (AGFusion)

Check out the web app: https://www.agfusion.app

AGFusion (pronounced 'A G Fusion') is a Python package for annotating gene fusions from the human or mouse genomes. AGFusion needs only the reference genome, the two gene partners, and the fusion junction coordinates as input, and outputs the following:

FASTA files of cDNA, CDS, and protein sequences.
Visualizations of the protein domain and exon architectures of the fusion transcripts.
Tables listing the coordinates of protein features and exons included in the fusion.
Optionally, exon structure and protein domain visualizations of the wild-type versions of the fusion gene partners.

Some other things to know:

AGFusion automatically predicts the functional effect of the gene fusion (e.g. in-frame, out-of-frame, etc.).
By default, only canonical gene isoforms are annotated, but there is an option to annotate all combinations of gene isoforms, including non-canonical ones.
All gene and protein annotation is from Ensembl.
Supports up to Ensembl release 95.
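
To make the in-frame versus out-of-frame distinction concrete, here is a minimal sketch of the reading-frame arithmetic behind such a call. It is a simplified model, not AGFusion's actual implementation: the function name and both parameters are hypothetical, and it ignores stop codons, junctions in UTRs, and non-coding partners.

```python
def fusion_frame(five_prime_cds_bases: int, three_prime_cds_offset: int) -> str:
    """Classify a coding-to-coding junction by reading-frame arithmetic.

    five_prime_cds_bases: coding bases retained from the 5' partner.
    three_prime_cds_offset: coding bases of the 3' partner that lie
        upstream of the junction and are therefore lost.
    Both names are illustrative assumptions, not AGFusion parameters.
    """
    if five_prime_cds_bases < 0 or three_prime_cds_offset < 0:
        raise ValueError("base counts must be non-negative")
    # The 3' partner keeps its original frame only if its first retained
    # base lands on the same codon position it had in the wild-type CDS.
    same_phase = five_prime_cds_bases % 3 == three_prime_cds_offset % 3
    return "in-frame" if same_phase else "out-of-frame"


print(fusion_frame(300, 150))  # both counts are multiples of 3
print(fusion_frame(301, 150))  # one extra 5' base shifts the frame
```

Under this model, shifting either junction by one coding base changes the call, while shifting it by a whole codon does not.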

Installation
Dependencies
Examples
Troubleshooting
License
Citing AGFusion

Installation

Step 1: Install AGFusion.

pip install agfusion

Step 2: Download your desired pyensembl reference genome database. For example:

For GRCh38/hg38:
pyensembl install --species homo_sapiens --release 95

For GRCh37/hg19:
pyensembl install --species homo_sapiens --release 75

For GRCm38/mm10:
pyensembl install --species mus_musculus --release 87

Step 3: Finally, download your desired AGFusion database.

For GRCh38/hg38:
agfusion download -g hg38

For GRCh37/hg19:
agfusion download -g hg19

For GRCm38/mm10:
agfusion download -g mm10

You can view all supported species and Ensembl releases with agfusion download -a.
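
If you script your environment setup, steps 2 and 3 can be generated from a single lookup table. The following is a minimal sketch that only builds the command strings listed above; GENOME_BUILDS and setup_commands are hypothetical names, and the build-to-release pairs are simply the three examples from this section.

```python
import shlex
from typing import List

# Build name -> (species, Ensembl release), copied from the steps above.
GENOME_BUILDS = {
    "hg38": ("homo_sapiens", 95),
    "hg19": ("homo_sapiens", 75),
    "mm10": ("mus_musculus", 87),
}


def setup_commands(build: str) -> List[str]:
    """Return the shell commands for steps 2 and 3 for one genome build."""
    species, release = GENOME_BUILDS[build]
    pyensembl = ["pyensembl", "install", "--species", species,
                 "--release", str(release)]
    agfusion = ["agfusion", "download", "-g", build]
    # Quote each argument so the strings are safe to paste into a shell.
    return [" ".join(shlex.quote(arg) for arg in cmd)
            for cmd in (pyensembl, agfusion)]


for command in setup_commands("mm10"):
    print(command)
```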

Dependencies

Python 3.7 or higher
Python package dependencies are listed in requirements.txt.

Examples

Basic Usage

You just need to provide the two fusion gene partners (gene symbol, Ensembl ID, or Entrez gene ID), their predicted fusion junctions in genomic coordinates, and the genome build. You can also specify particular transcripts by Ensembl transcript ID or RefSeq ID.

Example usage from the command line:

agfusion annotate \
  --gene5prime DLG1 \
  --gene3prime BRAF \
  --junction5prime 31684294 \
  --junction3prime 39648486 \
  -db agfusion.mus_musculus.87.db \
  -o DLG1-BRAF

The protein domain structure of the DLG1-BRAF fusion:

The exon structure of the DLG1-BRAF fusion:

Plotting wild-type protein and exon structure

You can additionally plot the wild-type proteins and exon structures for each gene with the --WT flag.

agfusion annotate \
   -g5 ENSMUSG00000022770 \
   -g3 ENSMUSG00000002413 \
   -j5 31684294 \
   -j3 39648486 \
   -db agfusion.mus_musculus.87.db \
   -o DLG1-BRAF \
   --WT

Canonical gene isoforms

By default, AGFusion plots only the canonical gene isoforms, but you can tell it to include non-canonical isoforms with the --noncanonical flag.

agfusion annotate \
  -g5 ENSMUSG00000022770 \
  -g3 ENSMUSG00000002413 \
  -j5 31684294 \
  -j3 39648486 \
  -db agfusion.mus_musculus.87.db \
  -o DLG1-BRAF \
  --noncanonical
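
The annotate invocations above differ only in their optional flags, so when calling AGFusion from a script it can help to assemble the argument list in one place. This is a minimal sketch that uses only the command-line flags shown in this section; annotate_args is a hypothetical helper, not part of the AGFusion package.

```python
import shlex
from typing import List


def annotate_args(gene5: str, gene3: str, junction5: int, junction3: int,
                  db: str, out: str, wild_type: bool = False,
                  noncanonical: bool = False) -> List[str]:
    """Build the argument list for one `agfusion annotate` call."""
    args = ["agfusion", "annotate",
            "-g5", gene5, "-g3", gene3,
            "-j5", str(junction5), "-j3", str(junction3),
            "-db", db, "-o", out]
    if wild_type:
        args.append("--WT")            # also plot the wild-type partners
    if noncanonical:
        args.append("--noncanonical")  # include non-canonical isoforms
    return args


example = annotate_args("ENSMUSG00000022770", "ENSMUSG00000002413",
                        31684294, 39648486,
                        "agfusion.mus_musculus.87.db", "DLG1-BRAF",
                        wild_type=True)
print(" ".join(shlex.quote(arg) for arg in example))
# To run it (requires AGFusion and the database to be installed):
#   import subprocess; subprocess.run(example, check=True)
```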

Input from fusion-finding algorithms

You can provide the output files of fusion-finding algorithms as input. The currently supported algorithms are:

Arriba
Bellerophontes
BreakFusion
ChimeraScan
ChimeRScope
deFuse
EricScript
FusionCatcher
FusionHunter
FusionMap
InFusion
JAFFA
LongGF
MapSplice (only if --gene-gtf specified)
STAR-Fusion
TopHat-Fusion

Below is an example for FusionCatcher.

agfusion batch \
  -f final-list_candidate-fusion-genes.txt \
  -a fusioncatcher \
  -o test \
  -db agfusion.mus_musculus.87.db

Graphical parameters

You can change domain names and colors:

agfusion annotate \
  -g5 ENSMUSG00000022770 \
  -g3 ENSMUSG00000002413 \
  -j5 31684294 \
  -j3 39648486 \
  -db agfusion.mus_musculus.87.db \
  -o DLG1-BRAF \
  --recolor "Pkinase_Tyr;red" --recolor "L27_1;blue" \
  --rename "Pkinase_Tyr;Kinase" --rename "L27_1;L27"
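
Each --recolor and --rename value is a "domain;value" pair, and the flag is repeated once per domain. When the styling lives in a script, the pairs can be generated from dictionaries. The sketch below assumes only that flag format; domain_style_args is a hypothetical helper.

```python
from typing import Dict, List


def domain_style_args(colors: Dict[str, str],
                      names: Dict[str, str]) -> List[str]:
    """Turn {domain: color} and {domain: new_name} into repeated CLI flags."""
    args: List[str] = []
    for flag, mapping in (("--recolor", colors), ("--rename", names)):
        for domain, value in mapping.items():
            if ";" in domain or ";" in value:
                raise ValueError("';' is the separator and cannot appear in a value")
            args += [flag, f"{domain};{value}"]
    return args


style = domain_style_args(
    colors={"Pkinase_Tyr": "red", "L27_1": "blue"},
    names={"Pkinase_Tyr": "Kinase", "L27_1": "L27"},
)
print(style)
```

When the list is passed to subprocess without a shell, no extra quoting is needed; the quotes in the shell example above are there because ";" is a command separator in the shell.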

You can rescale the protein length so that images of two different fusions have appropriate relative lengths when plotted side by side:

agfusion annotate \
  -g5 ENSMUSG00000022770 \
  -g3 ENSMUSG00000002413 \
  -j5 31684294 \
  -j3 39648486 \
  -db agfusion.mus_musculus.87.db \
  -o DLG1-BRAF \
  --recolor "Pkinase_Tyr;red" --recolor "L27_1;blue" \
  --rename "Pkinase_Tyr;Kinase" --rename "L27_1;L27" \
  --scale 2000

agfusion annotate \
  -g5 FGFR2 \
  -g3 DNM3 \
  -j5 130167703 \
  -j3 162019992 \
  -db agfusion.mus_musculus.87.db \
  -o FGFR2-DNM3 \
  --recolor "Pkinase_Tyr;red" \
  --rename "Pkinase_Tyr;Kinase" \
  --scale 2000
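
One way to pick a shared --scale value is to take the longest protein among the fusions you want to compare and round it up. The sketch below reads the Protein_length column of AGFusion's fusion_transcripts.txt tables; it assumes that --scale is expressed in amino acids and that a value at least as long as the longest protein is appropriate, neither of which is stated in this README. common_scale is a hypothetical helper and the example rows are illustrative.

```python
import csv
import io
import math
from typing import Iterable


def common_scale(tables: Iterable[str], round_to: int = 100) -> int:
    """Return the longest Protein_length across tables, rounded up."""
    longest = 0
    for text in tables:
        for row in csv.DictReader(io.StringIO(text)):
            value = (row.get("Protein_length") or "").strip()
            if value.isdigit():  # skip rows without a protein length
                longest = max(longest, int(value))
    return math.ceil(longest / round_to) * round_to


HEADER = ("5'_gene,3'_gene,5'_transcript,3'_transcript,5'_strand,3'_strand,"
          "5'_transcript_biotype,3'_transcript_biotype,Fusion_effect,"
          "Protein_length,Protein_weight_(kD)\n")
table_a = HEADER + ("CHMP1A,DPEP1,ENST00000397901,ENST00000393092,-,+,"
                    "protein_coding,protein_coding,in-frame,446,49.6940567\n")
table_b = HEADER + ("TRIB1,NSMCE2,ENST00000311922,ENST00000517315,+,+,"
                    "protein_coding,protein_coding,out-of-frame,138,14.450595\n")
print(common_scale([table_a, table_b]))
```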

Troubleshooting

(1) Problem: I get a warning message like the following:

2017-08-28 15:02:51,377 - AGFusion - WARNING - No cDNA sequence available for AC073283.4! Will not print cDNA sequence for the AC073283.4-MSH2 fusion. You might be working with an outdated pyensembl. Update the package and rerun 'pyensembl install'

Solution: Run the following to update pyensembl package and database:

git clone git@github.com:hammerlab/pyensembl.git
cd pyensembl
sudo pip install .
pyensembl install --release (your-release) --species (your-species)

(2) Problem: Cannot run agfusion download due to URLError. When downloading the database you may run into this error:

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)>

Solution: A potential fix for Mac users is to run the certificate installer bundled with the python.org Python installation:

/Applications/Python\ 3.8/Install\ Certificates.command
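
Before or after applying the fix, it can be useful to see where your Python interpreter looks for CA certificates, since CERTIFICATE_VERIFY_FAILED usually means that none were found there. This is a minimal diagnostic sketch using only the standard library; ca_bundle_report is a hypothetical helper and does not contact the AGFusion download server.

```python
import ssl
from typing import Dict, Optional


def ca_bundle_report() -> Dict[str, Optional[str]]:
    """Report the CA locations the default SSL context will use."""
    paths = ssl.get_default_verify_paths()
    return {
        # Resolved locations; None means the file or directory was not found.
        "cafile": paths.cafile,
        "capath": paths.capath,
        # Compiled-in locations that OpenSSL expects to exist.
        "openssl_cafile": paths.openssl_cafile,
        "openssl_capath": paths.openssl_capath,
    }


for key, value in ca_bundle_report().items():
    print(f"{key}: {value}")
```

If both cafile and capath are None, the interpreter has no certificate bundle to verify against.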

License

MIT license

Citing AGFusion

You can cite bioRxiv: http://dx.doi.org/10.1101/080903

agfusion's Issues

issue with batch

Hi,

I tried running batch on STAR-Fusion output that contains a known fusion that works with AGFusion "annotate", but the output folder and .stderr were empty, which made it difficult to diagnose the problem.

I also tried running batch on FusionCatcher output that contains a known fusion that also works with AGFusion annotate. The output file only contained 2 (out of 15 total) fusions: 1 fusion that was CDS truncated and 1 fusion that was in-frame (out of 4 total in-frame fusions).
The .stderr read the following:

2017-08-16 12:11:42,101 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No cds.fa file will be written
2017-08-16 12:11:42,101 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No proteins.fa file will be written
2017-08-16 12:11:42,236 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No cds.fa file will be written
2017-08-16 12:11:42,237 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No proteins.fa file will be written
2017-08-16 12:11:42,239 - AGFusion - WARNING - The following output directory already exists! /data/poirier/chuk/fusion_project/AGFusion/AGFusion_all_fusions-filtered/TENM4-79297488_NARS2-78493195
2017-08-16 12:11:42,362 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No cds.fa file will be written
2017-08-16 12:11:42,362 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No proteins.fa file will be written
2017-08-16 12:11:42,363 - AGFusion - WARNING - The following output directory already exists! /data/poirier/chuk/fusion_project/AGFusion/AGFusion_all_fusions-filtered/TENM4-79297488_NARS2-78478683
2017-08-16 12:11:42,486 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No cds.fa file will be written
2017-08-16 12:11:42,486 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No proteins.fa file will be written
2017-08-16 12:11:42,611 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No cds.fa file will be written
2017-08-16 12:11:42,611 - AGFusion - INFO - The TENM4-NARS2 fusion does not produce any protein coding transcripts. No proteins.fa file will be written
Traceback (most recent call last):
File "/home/chuk/bin/miniconda3/bin/agfusion", line 5, in <module>
cli.main()
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/cli.py", line 517, in main
args=args
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/cli.py", line 110, in annotate
exclude=args.exclude_domain
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/model.py", line 419, in save_images
pplot.draw()
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/plot.py", line 844, in draw
self._scale(self.transcript.protein_length)
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/plot.py", line 48, in _scale
if self.scale == -1 or self.scale < seq_length:
TypeError: unorderable types: NoneType() < int()

Commands:

/home/chuk/bin/miniconda3/bin/agfusion batch \
  --file /data/poirier/chuk/fusion_project/Rudin2012/STAR-Fusion/637122/star-fusion.fusion_candidates.final.abridged \
  -a STAR-Fusion \
  -o /data/poirier/chuk/fusion_project/AGFusion/test \
  --db /home/chuk/bin/miniconda3/bin/agfusion.homo_sapiens.87.db

/home/chuk/bin/miniconda3/bin/agfusion batch \
  --file /data/poirier/chuk/fusion_project/George2015/FusionCatcher/_EGAR00001231913_SCLC_Julie_SCLC_RNAseq_S01512/final-list_candidate-fusion-genes.txt \
  -a fusioncatcher \
  -o /data/poirier/chuk/fusion_project/AGFusion/AGFusion_all_fusions-filtered \
  --db /home/chuk/bin/miniconda3/bin/agfusion.homo_sapiens.87.db

Please let me know what you think may be the problem. Thanks!
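
The last frame of the traceback compares self.scale with an integer while self.scale is None, which Python 3 refuses to order. A minimal sketch of a None-safe version of that comparison follows. effective_scale is a hypothetical stand-in, and falling back to the sequence length is an assumption about the intended behavior, not AGFusion's actual code.

```python
from typing import Optional


def effective_scale(scale: Optional[int], seq_length: int) -> int:
    """Return the plotting length, tolerating a missing scale value."""
    # Treat None like the -1 sentinel seen in the traceback, and never
    # use a scale shorter than the sequence being drawn (assumed intent).
    if scale is None or scale == -1 or scale < seq_length:
        return seq_length
    return scale


print(effective_scale(None, 500))
print(effective_scale(2000, 500))
```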

can we use AGFusion for annotating gene fusions detected in plant?

Is AGFusion applicable for annotating gene fusions in plants that are listed in the Ensembl database? AGFusion is commonly used for characterizing gene fusions in various organisms, but its compatibility with the plant species available in Ensembl still needs to be determined.

STAR-Fusion input

I think that, since a recent STAR-Fusion update, AGFusion batch no longer accepts its output as input. Help please.

Traceback (most recent call last):
  File "/software/python/Python-3.5.5/bin/agfusion", line 5, in <module>
    cli.main()
  File "/software/python/Python-3.5.5/lib/python3.5/site-packages/agfusion/cli.py", line 619, in main
    batch_mode(args, agfusion_db, pyensembl_data, rename, colors)
  File "/software/python/Python-3.5.5/lib/python3.5/site-packages/agfusion/cli.py", line 161, in batch_mode
    agfusion_db.logger):
  File "/software/python/Python-3.5.5/lib/python3.5/site-packages/agfusion/parsers.py", line 47, in __init__
    assert line[4] == 'LeftGene', 'Unrecognized STAR-Fusion input'
AssertionError: Unrecognized STAR-Fusion input
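
The assertion in the traceback expects LeftGene at a fixed position (index 4), so it fails as soon as a newer file format inserts columns before it. A minimal sketch of a header-driven lookup that tolerates such shifts is shown below. column_index is a hypothetical helper, not AGFusion's parser, and both header lines are illustrative; check the actual columns written by your STAR-Fusion version.

```python
from typing import Dict, Sequence


def column_index(header_line: str, wanted: Sequence[str]) -> Dict[str, int]:
    """Map each wanted column name to its position in a tab-separated header."""
    columns = header_line.rstrip("\n").lstrip("#").split("\t")
    missing = [name for name in wanted if name not in columns]
    if missing:
        raise ValueError(f"missing columns {missing}; found {columns}")
    return {name: columns.index(name) for name in wanted}


# Illustrative headers only: the second has two extra columns before LeftGene.
older = "#FusionName\tJunctionReadCount\tSpanningFragCount\tSpliceType\tLeftGene\tLeftBreakpoint"
newer = "#FusionName\tJunctionReadCount\tSpanningFragCount\tExtraA\tExtraB\tSpliceType\tLeftGene\tLeftBreakpoint"

print(column_index(older, ["LeftGene", "LeftBreakpoint"]))
print(column_index(newer, ["LeftGene", "LeftBreakpoint"]))
```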

Ensembl 96 support

Hi,
As the docs state, the current maximal Ensembl release is 92. It would be great to have Ensembl release 96 supported, so as to make the AGFusion package compatible with new releases of STAR-Fusion (which uses GENCODE v29). Would that be possible in the future?

Best,
Sergei

Can I use the input from FusionInspector.

I am running the rnafusion pipeline (https://nf-co.re/rnafusion/2.1.0), which integrates several callers. Can I use the final results from FusionInspector? Or do you have any ideas on how to parse them for AGFusion input?

Include all types of Ensembl domain annotations

Not just Pfam

what file is used for infusion algorithm?

Dear Charlie,
what file is used as input for AGFusion from the output files generated by Infusion?
There are several output files generated by Infusion, such as fusions.txt, fusions.detailed.txt and fusions.detailed.full.txt. Thanks in advance.

Plot wild-type proteins

Add a flag to plot the protein domain structures of the wild-type proteins for each gene partner.

Using agfusion build command for ensembl >v95?

Hi Dr. Charles Murphy,

Thank you for developing AGFusion and making it publicly available.

I would like to use agfusion for ensembl >v95, and it seems

1. I will have to build the database myself using agfusion build

usage: agfusion build [-h] -d DIR -s SPECIES -r RELEASE --pfam PFAM [--server SERVER]

May I know if documentation of this command is available? (ie. what sort of file is expected as input for the --pfam option? I looked at http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/ but unsure which file to use)

2. I will need to manually edit the 'max ensembl version' from 95 to 109 in utils.py

Would this be a correct thing to do - as the current allowed max is v95?

Thank you,
Min

Pyensembl Version

I just tried to follow your instructions to install. Maybe you can update your installation instructions to install pyensembl 95, not 87. Because otherwise an error occurs.

Frame issues

Hi,

I have Arriba fusion caller results (run with Gencode v27 equivalent to Ensembl 90) and wanted to test the following out-of-frame fusion:

#gene1	gene2	strand1(gene/fusion)	strand2(gene/fusion)	breakpoint1	breakpoint2	site1	site2	type	direction1	direction2	split_reads1	split_reads2	discordant_mates	coverage1	coverage2	confidence	closest_genomic_breakpoint1	closest_genomic_breakpoint2	filters	fusion_transcript	reading_frame	peptide_sequence	read_identifiers	sample_name
CHMP1A	DPEP1	-/-	+/-	16:89651569	16:89625844	splice-site	intron	deletion/read-through/5'-5'	upstream	downstream	7	6	1	118	36	high	.	.	duplicates(4),low_entropy(1)	GCTTGGTCGGTTCGATCGCCGCCGGGACCTGACACCGCCCGGAGTTGGCGTCCCTTCTCCCTCTCCGAGTGCTGCTCCTGTCATTGTGGCCATGGACG___ATACCCTGTTCCAGTTGAAG___TTCACGGCGAAGCAGCTGGAGAAGCTGGCCAAGAAGGCGGAGAAGGACTCCAAGGCGGAGCAGGCCAAAGTGAAGAAG\|TCTCTTCGTCTGTCAAGTAGAAGTCATGGATGTACCTACTTTACGACAGTCCTGCTGTGAGCACAAAGGTACCAGCCACTCAGATCACCCTCGGATCAAG___CTCTGTCTCCCAGAACCACACAGAAGCCCCATCCACAGCCAACATGCAGACGCCACCGTGGCCATTTCGGATGATACCACTTCTGCACAGGCAGCCACAGGCAATGCGGACCTACCCACGCCCCCCACACACAGATCATCGCAG	out-of-frame	MDDTLFQLKFTAKQLEKLAKKAEKDSKAEQAKVKK\|slrlssrshgctyfttvll*	.	0062396f-103a-4c25-bfdc-a746782768d9

I ran agfusion like this:

agfusion annotate \
--gene5prime CHMP1A \
--gene3prime DPEP1 \
--junction5prime 89651569 \
--junction3prime 89625844 \
-db agfusion.homo_sapiens.90.db \
-o CHMP1A-DPEP1

I got quite the opposite result from agfusion, predicting it to be an in-frame fusion:

$ cat CHMP1A-DPEP1.fusion_transcripts.txt
5'_gene,3'_gene,5'_transcript,3'_transcript,5'_strand,3'_strand,5'_transcript_biotype,3'_transcript_biotype,Fusion_effect,Protein_length,Protein_weight_(kD)
CHMP1A,DPEP1,ENST00000397901,ENST00000393092,-,+,protein_coding,protein_coding,in-frame,446,49.6940567

I am not sure why that is happening, any ideas/suggestions would be much appreciated.
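
For reference, the manual step of turning an Arriba row into an annotate call can be scripted. The sketch below uses only the column names visible in the Arriba header above (#gene1, gene2, breakpoint1, breakpoint2) and the annotate flags from the command in this post. It assumes gene1 is the 5' partner, and arriba_to_annotate_args is a hypothetical helper. It does not address the strand or frame discrepancy itself.

```python
import csv
import io
from typing import List


def arriba_to_annotate_args(tsv_text: str, db: str) -> List[List[str]]:
    """Build one `agfusion annotate` argument list per Arriba row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    commands = []
    for row in reader:
        gene5, gene3 = row["#gene1"], row["gene2"]
        # Breakpoints look like "16:89651569"; keep only the position.
        pos5 = row["breakpoint1"].rsplit(":", 1)[1]
        pos3 = row["breakpoint2"].rsplit(":", 1)[1]
        commands.append([
            "agfusion", "annotate",
            "--gene5prime", gene5, "--gene3prime", gene3,
            "--junction5prime", pos5, "--junction3prime", pos3,
            "-db", db, "-o", f"{gene5}-{gene3}",
        ])
    return commands


# Reduced to the four columns the sketch needs.
tsv = ("#gene1\tgene2\tbreakpoint1\tbreakpoint2\n"
       "CHMP1A\tDPEP1\t16:89651569\t16:89625844\n")
for cmd in arriba_to_annotate_args(tsv, "agfusion.homo_sapiens.90.db"):
    print(" ".join(cmd))
```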

Add gene direction to the exons output file

We would like to add AGFusion support to our immunotherapy toolkit pVACtools (pvactools.org, https://github.com/griffithlab/pVACtools). We would like to output not only the genomic coordinates, which we can grab from the exons files, but also the direction of transcription for the fusion (i.e. the orientation of each gene at the junction site). If I'm interpreting the output correctly, the orientation can only be inferred from the current results when more than one exon is involved on both sides of the fusion, by checking whether the coordinates increase or decrease. Would it be possible to formally add this information to the AGFusion output to help in situations where only 1 exon is reported on one side of the fusion?

Saving plots as pdf files

Hi,
I am trying to save the plots as .pdf files, but the output is always a .png. I have tried to pass --type pdf or --type PDF, but I did not succeed.
How should I pass the --type argument to save as pdf?
Thanks
Best
Massimo

issue for running agfusion

Hi there,
Here is the error message I got. I followed the exact instructions, and I hope you can help me figure out what is going on. Thanks in advance.
ERROR:AGFusion:No Ensembl ID found for ENSG00000285530! Check its spelling and if you are using the right genome build.
2020-04-02 09:17:07,026 - AGFusion - ERROR - No Ensembl ID found for IGH! Check its spelling and if you are using the right genome build.

Strange behavior calling fusion frame

Hi,

I recently started using your tool and noticed a strange behaviour when calling the frame conservation of a fusion.

If I run the command below, I get that my fusion is out-of-frame, as expected.
agfusion annotate \
--gene5prime TMEM87B \
--gene3prime MERTK \
--junction5prime 112843681 \
--junction3prime 112722768 \
-db agfusion.homo_sapiens.75.db \
-o TMEM87B-MERTK

This is out-of-frame because the donor exon has 2 bp of the last codon and the acceptor exon has 2 bp of the first codon.

Case 1:
I would expect that if I remove 1 base from the 5prime side I will get an in-frame fusion, and when I run the command below that is the case:
agfusion annotate \
--gene5prime TMEM87B \
--gene3prime MERTK \
--junction5prime 112843680 \
--junction3prime 112722768 \
-db agfusion.homo_sapiens.75.db \
-o TMEM87B-MERTK

Case 2:
Now if I try the other way around and instead I add a position to the 3prime junction I would also expect a similar in-frame fusion but that is not the case and I still get an out-of-frame result.

agfusion annotate \
--gene5prime TMEM87B \
--gene3prime MERTK \
--junction5prime 112843681 \
--junction3prime 112722769 \
-db agfusion.homo_sapiens.75.db \
-o TMEM87B-MERTK

Case 3:
What is also surprising is that if I move one entire codon up, the result is again in-frame:
agfusion annotate \
--gene5prime TMEM87B \
--gene3prime MERTK \
--junction5prime 112843681 \
--junction3prime 112722771 \
-db agfusion.homo_sapiens.75.db \
-o TMEM87B-MERTK

My guess is that in case 2, even though it should be an in-frame fusion, the resulting fused codon is a stop codon, but when I checked the sequence that does not seem to be the case.

Am I missing something?

Thanks.

please tag releases

I got a request to install this on our cluster, but it would help me to keep it up to date if the repository had tagged releases. Thanks for your consideration.

Images not created for GRCh37

Hi There,

Great tool. I am trying to make plots for GRCh37, but all I get is the *1_cdna.fa file; none of the other files and images get created.

This is the command that i run:
agfusion --gene5prime ENSG00000197157 --gene3prime ENSG00000157764 --junction5prime 127389835 --junction3prime 140490224 --genome GRCh37 --out SND1-BRAF

I am able to replicate your example for the mouse genome.

Thank you for your help in advance.

Best,
Ronak

Certificate expired on www.agfusion.app

Just FYI, right now it's not possible to open the website (HSTS is configured) 🙂

Can this tool be used for DNAseq?

Hello, I have tumor pair data (tumor/normal) with somatic SVs (not RNA-seq, just gDNA sequence), so I'm wondering whether I can use AGFusion to visualize my SVs.
Also, sometimes there's no PNG output. Here's my command:
agfusion annotate --gene5prime BCR --gene3prime ABL1 --junction5prime 23634672 --junction3prime 133729677 --out BCR-ABL1/ -db /gpfs/users/yanghao/database/bundle/agfusion/agfusion.homo_sapiens.75.db --debug

here's my log:
2018-03-28 07:26:33,778 - AGFusion - DEBUG - Connected to the database /gpfs/users/yanghao/database/bundle/agfusion/agfusion.homo_sapiens.75.db
DEBUG:AGFusion:Connected to the database /gpfs/users/yanghao/database/bundle/agfusion/agfusion.homo_sapiens.75.db
2018-03-28 07:26:33,901 - AGFusion - DEBUG - Found gene symbol entry for BCR: ENSG00000186716
DEBUG:AGFusion:Found gene symbol entry for BCR: ENSG00000186716
2018-03-28 07:26:33,901 - AGFusion - DEBUG - SQLite - SELECT * FROM homo_sapiens_75 WHERE stable_id=="ENSG00000186716"
DEBUG:AGFusion:SQLite - SELECT * FROM homo_sapiens_75 WHERE stable_id=="ENSG00000186716"
2018-03-28 07:26:33,915 - AGFusion - DEBUG - SQLite - SELECT * FROM homo_sapiens_75_transcript WHERE transcript_id=="2386152"
DEBUG:AGFusion:SQLite - SELECT * FROM homo_sapiens_75_transcript WHERE transcript_id=="2386152"
2018-03-28 07:26:33,954 - AGFusion - DEBUG - Found gene symbol entry for ABL1: ENSG00000097007
DEBUG:AGFusion:Found gene symbol entry for ABL1: ENSG00000097007
2018-03-28 07:26:33,954 - AGFusion - DEBUG - SQLite - SELECT * FROM homo_sapiens_75 WHERE stable_id=="ENSG00000097007"
DEBUG:AGFusion:SQLite - SELECT * FROM homo_sapiens_75 WHERE stable_id=="ENSG00000097007"
2018-03-28 07:26:33,964 - AGFusion - DEBUG - SQLite - SELECT * FROM homo_sapiens_75_transcript WHERE transcript_id=="2365065"
DEBUG:AGFusion:SQLite - SELECT * FROM homo_sapiens_75_transcript WHERE transcript_id=="2365065"
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/yanghao/.cache/pyensembl/GRCh37/ensembl75/Homo_sapiens.GRCh37.75.cdna.all.fa.gz.pickle
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/yanghao/.cache/pyensembl/GRCh37/ensembl75/Homo_sapiens.GRCh37.75.ncrna.fa.gz.pickle
2018-03-28 07:26:34,760 - AGFusion - DEBUG - The BCR-ABL1 fusion does not produce any protein coding transcripts. No cds.fa file will be written
DEBUG:AGFusion:The BCR-ABL1 fusion does not produce any protein coding transcripts. No cds.fa file will be written
2018-03-28 07:26:34,760 - AGFusion - DEBUG - The BCR-ABL1 fusion does not produce any protein coding transcripts. No proteins.fa file will be written
DEBUG:AGFusion:The BCR-ABL1 fusion does not produce any protein coding transcripts. No proteins.fa file will be written

here is output files:
[yanghao@c01sn02 AGFusion]$ ll -ht BCR-ABL1/
total 0
-rw-r--r-- 1 yanghao eulerbioinfo 1.8K Mar 28 07:31 BCR-ABL1.exons.txt
-rw-r--r-- 1 yanghao eulerbioinfo 111 Mar 28 07:31 BCR-ABL1.protein_domains.txt
-rw-r--r-- 1 yanghao eulerbioinfo 228 Mar 28 07:31 BCR-ABL1.fusion_transcripts.txt
-rw-r--r-- 1 yanghao eulerbioinfo 6.9K Mar 28 07:31 BCR-ABL1_cdna.fa

As you can see, there's no PNG.
And in another situation:
agfusion annotate --gene5prime ROS1 --gene3prime TPM3 --junction5prime 117642092 --junction3prime 154142762 --out ROS1-TPM3 -db /gpfs/users/yanghao/database/bundle/agfusion/agfusion.homo_sapiens.75.db --debug

[yanghao@c01sn02 AGFusion]$ ll -ht ROS1-TPM3/
total 0
-rw-r--r-- 1 yanghao eulerbioinfo 2.8K Mar 28 07:28 ROS1-TPM3.exons.txt
-rw-r--r-- 1 yanghao eulerbioinfo 543 Mar 28 07:28 ROS1-TPM3.protein_domains.txt
-rw-r--r-- 1 yanghao eulerbioinfo 239 Mar 28 07:28 ROS1-TPM3.fusion_transcripts.txt
-rw-r--r-- 1 yanghao eulerbioinfo 14K Mar 28 07:28 ENST00000368508-ENST00000368530.exon.png
-rw-r--r-- 1 yanghao eulerbioinfo 14K Mar 28 07:28 ENST00000368508-ENST00000368530.png
-rw-r--r-- 1 yanghao eulerbioinfo 2.1K Mar 28 07:28 ROS1-TPM3_protein.fa
-rw-r--r-- 1 yanghao eulerbioinfo 5.9K Mar 28 07:28 ROS1-TPM3_cds.fa
-rw-r--r-- 1 yanghao eulerbioinfo 6.7K Mar 28 07:28 ROS1-TPM3_cdna.fa

As you can see, I got 2 PNGs.

*P.S.: these two situations are from the same sample.

So I have 2 questions:

Can this tool be used for DNAseq?
Why can't I get a PNG sometimes?

Many thanks, and this is a great tool.
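
When comparing runs like the two above, it helps to list which output types are absent from a directory. The following is a minimal sketch built from the file-name endings visible in the two listings in this post; OUTPUT_SUFFIXES and missing_outputs are hypothetical names, and the demo recreates the BCR-ABL1 listing in a temporary directory rather than reading real AGFusion output.

```python
import tempfile
from pathlib import Path
from typing import List

# File-name endings taken from the directory listings in this post.
OUTPUT_SUFFIXES = (".exons.txt", ".protein_domains.txt",
                   ".fusion_transcripts.txt", "_cdna.fa",
                   "_cds.fa", "_protein.fa", ".png")


def missing_outputs(out_dir: str) -> List[str]:
    """Return the suffixes for which the directory contains no file."""
    names = [p.name for p in Path(out_dir).iterdir() if p.is_file()]
    return [suffix for suffix in OUTPUT_SUFFIXES
            if not any(name.endswith(suffix) for name in names)]


# Demo on a throwaway directory that mimics the BCR-ABL1 listing.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("BCR-ABL1.exons.txt", "BCR-ABL1.protein_domains.txt",
                 "BCR-ABL1.fusion_transcripts.txt", "BCR-ABL1_cdna.fa"):
        (Path(tmp) / name).touch()
    demo_missing = missing_outputs(tmp)

print(demo_missing)
```

In the BCR-ABL1 run, the missing CDS and protein FASTA files line up with the log message that the fusion does not produce any protein coding transcripts.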

rename and type not working

Hi,
Thanks for the tool, it is very useful. I am running the latest version (1.21) with Python 2.7.14 and I am having issues with the "rename" and "type" arguments.
Running the basic usage examples I fail to change the name of the protein domains.
Any help on this issue will be highly appreciated.
Thanks
Best,
Massimo

trouble getting agfusion.homo_sapiens.87.db to work with AGFusion

Hi,

I installed AGFusion and I got it to work in your example for DLG1-BRAF for agfusion.mus_musculus.87.db.

However, when I try fusions (ie. RLF-MYCL) using agfusion.homo_sapiens.87.db, I get the following error:
Traceback (most recent call last):
File "/home/chuk/bin/miniconda3/bin/agfusion", line 5, in <module>
cli.main()
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/cli.py", line 486, in main
args=args
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/cli.py", line 83, in annotate
noncanonical=args.noncanonical
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/model.py", line 325, in __init__
protein_databases=protein_databases,
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/model.py", line 852, in __init__
self.predict_effect()
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/model.py", line 1550, in predict_effect
self._fetch_protein()
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/agfusion/model.py", line 1028, in _fetch_protein
self.cds.seq = self.cds.seq[0:3*(len(self.cds.seq)/3)]
File "/home/chuk/bin/miniconda3/lib/python3.5/site-packages/Bio/Seq.py", line 240, in __getitem__
return Seq(self._data[index], self.alphabet)
TypeError: slice indices must be integers or None or have an __index__ method

Please let me know what I can do to try to resolve this issue. Thank you!
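
The failing line slices with len(self.cds.seq)/3. Under Python 3 the / operator always returns a float, and floats are not valid slice indices, which matches the TypeError above. A minimal sketch of the same trimming step with integer division follows; it uses a plain str in place of the Biopython Seq object, and trim_to_codons is a hypothetical name.

```python
def trim_to_codons(seq: str) -> str:
    """Drop trailing bases so the length is a multiple of three."""
    # // is floor division and yields an int, so the slice is valid
    # on both Python 2 and Python 3.
    return seq[0:3 * (len(seq) // 3)]


print(trim_to_codons("ATGAAAT"))  # 7 bases in, 6 bases out
```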

Unsupported species for H.sapiens ensembl 92

Hi, I'm getting an unsupported species error. Looking at the docs and code this appears to be referencing the pyensembl database correct? I'm using the latest release 1.2 with ensembl version 92. Any suggestions would be appreciated!

Zach

Error

Traceback (most recent call last):
  File "/usr/local/bin/agfusion", line 5, in <module>
    cli.main()
  File "/usr/local/lib/python3.6/dist-packages/agfusion/cli.py", line 535, in main
    assert species in AVAILABLE_ENSEMBL_SPECIES, 'unsupported species!'
AssertionError: unsupported species!

Command

agfusion batch -f star-fusion.fusion_predictions.tsv -a starfusion -db /opt/pyensembl/pyensembl/GRCh38/ensembl92/Homo_sapiens.GRCh38.92.gtf.db -o agfusion

Issues with agfusion batch: TypeError: save_images() got an unexpected keyword argument 'plot_WT'

Hello,

AGFusion is a wonderful tool!
I ran into an issue with running agfusion batch with my own star-fusion file.

My command:
(agfusion) shpc_100668@shpc44:~$ agfusion batch \
  -f /home/shpc_100668/fusion/star-fusion.fusion_predictions.abridged \
  -a starfusion \
  -db agfusion.homo_sapiens.95.db \
  -o /home/shpc_100668/fusion/starfusion2agfusion/ \
  --middlestar \
  --noncanonical

Error:
2022-11-02 16:06:34,939 - AGFusion - WARNING - Output directory /home/shpc_100668/fusion/starfusion2agfusion/ already exists! Overwriting...
WARNING:AGFusion:Output directory /home/shpc_100668/fusion/starfusion2agfusion/ already exists! Overwriting...
2022-11-02 16:06:34,940 - AGFusion - INFO - Read 12 fusions from the file.
INFO:AGFusion:Read 12 fusions from the file.
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/shpc_100668/.cache/pyensembl/GRCh38/ensembl95/Homo_sapiens.GRCh38.cdna.all.fa.gz.pickle
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/shpc_100668/.cache/pyensembl/GRCh38/ensembl95/Homo_sapiens.GRCh38.ncrna.fa.gz.pickle
Traceback (most recent call last):
File "/home/shpc_100668/anaconda3/envs/agfusion/bin/agfusion", line 5, in <module>
cli.main()
File "/home/shpc_100668/anaconda3/envs/agfusion/lib/python3.6/site-packages/agfusion/cli.py", line 616, in main
batch_mode(args, agfusion_db, pyensembl_data, rename, colors)
File "/home/shpc_100668/anaconda3/envs/agfusion/lib/python3.6/site-packages/agfusion/cli.py", line 171, in batch_mode
batch_out_dir=args.out,
File "/home/shpc_100668/anaconda3/envs/agfusion/lib/python3.6/site-packages/agfusion/cli.py", line 140, in annotate
exclude=args.exclude_domain,
TypeError: save_images() got an unexpected keyword argument 'plot_WT'

I'm really at a loss as to how to proceed, and any guidance would be much appreciated!
Thank you for your kind help!

Unable to download database(agfusion.homo_sapiens.94.db)

Hi Charlie,

When I tried to download database (agfusion.homo_sapiens.94.db) with the command below:
agfusion download --species homo_sapiens --release 94

I got the following output:
Downloading the AGFusion database to agfusion.homo_sapiens.94.db.gz...
Was unable to download the file https://s3.amazonaws.com/agfusion/agfusion.homo_sapiens.94.db.gz!

What can be the reasons?

Though command "agfusion download -a" shows homo_sapiens.94 db is available.

Thanx in advance!

Regards,
Rajendra

Failing on gene names containing '/'

Hi!

This might sound impossible (and impractical, and comical) but there are actually genes in databases whose names contain slashes. This seems to create quite some problems for AGFusion when it's trying to save files using gene names in filenames. So trying to annotate THRA--THRA1/BTR fusion produces an error:
agfusion annotate -g5 ENST00000546243 -g3 ENST00000621191 -j5 40086853 -j3 48294347 -db /agfusion.homo_sapiens.95.db -o agfusion --recolor "Pkinase_Tyr;red" --dpi 300 --WT

pyensembl.sequence_data:Loaded sequence dictionary from /root/.cache/pyensembl/GRCh38/ensembl95/Homo_sapiens.GRCh38.cdna.all.fa.gz.pickle
pyensembl.sequence_data:Loaded sequence dictionary from /root/.cache/pyensembl/GRCh38/ensembl95/Homo_sapiens.GRCh38.ncrna.fa.gz.pickle
Traceback (most recent call last):
File "/usr/local/bin/agfusion", line 5, in <module>
cli.main()
File "/usr/local/lib/python3.6/site-packages/agfusion/cli.py", line 616, in main
scale=args.scale
File "/usr/local/lib/python3.6/site-packages/agfusion/cli.py", line 117, in annotate
middlestar=args.middlestar
File "/usr/local/lib/python3.6/site-packages/agfusion/model.py", line 596, in save_transcript_cdna
'w'
FileNotFoundError: [Errno 2] No such file or directory: '....agfusion/THRA-THRA1/BTR_cdna.fa'

All the best,
Sergei
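
The FileNotFoundError arises because the "/" in the gene name is read as a directory separator when the output path is built. A minimal sketch of sanitizing a name before it goes into a file name is shown below; safe_filename_part is a hypothetical helper, and replacing the separator with an underscore is just one possible convention, not AGFusion's behavior.

```python
import re


def safe_filename_part(gene_name: str, replacement: str = "_") -> str:
    """Replace path separators so a gene name can be used in a file name."""
    return re.sub(r"[\\/]", replacement, gene_name)


five_prime, three_prime = "THRA", "THRA1/BTR"
filename = f"{safe_filename_part(five_prime)}-{safe_filename_part(three_prime)}_cdna.fa"
print(filename)
```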

Overlapping text

The text on some output graphics (e.g. the protein domains) sometimes overlap.

Documentation

Thank you for the tool. I really enjoy the ability to know the fusion parameters. Is it possible to get some documentation for the fusions_transcripts.txt? Specifically, how does exon-exon differ from in-frame or out-of-frame?

Frameshift fusions warning

Hi!

I'm having a bit of a problem with out-of-frame fusions. When I annotate an out-of-frame fusion with:
agfusion annotate -g5 ENST00000311922 -g3 ENST00000517315 -j5 125431262 -j3 125357219 -db agfusion.homo_sapiens.95.db --recolor "Pkinase_Tyr;red" --dpi 300 --WT

I catch a warning:
AGFusion - WARNING - Fusion isoform effect is not out-of-frame but CDS is not a multiple of 3!
Even though the fusion_transcripts.txt file explicitly says fusion isoform effect is out-of-frame (and it should be, and the length of CDS is indeed 686, not divisible by 3):

5'_gene,3'_gene,5'_transcript,3'_transcript,5'_strand,3'_strand,5'_transcript_biotype,3'_transcript_biotype,Fusion_effect,Protein_length,Protein_weight_(kD)
TRIB1,NSMCE2,ENST00000311922,ENST00000517315,+,+,protein_coding,protein_coding,out-of-frame,138,14.450594999999986

The warning does not affect the files in the sense that domain and exon structure still get reported (and the sequence stops at the first out-of-frame stop codon found), but still the warning seems to be a bit misinforming.

All the best,
Sergei

HTML output and downloaded figure on web version have different font size or size

E.g. specify on the web version that images should be plotted at 12 by 3 inches or something, but the image gets downloaded at a different dimension.

TypeError: iter() returned non-iterator of type

Hello,

I want to use AGfusion on STAR-Fusion and FusionCatcher output files.

But I get this error:

"Traceback (most recent call last):
File "/illumina/software/Miniconda2/envs/MyPython27Env/bin/agfusion", line 5, in <module>
cli.main()
File "/illumina/software/Miniconda2/envs/MyPython27Env/lib/python2.7/site-packages/agfusion/cli.py", line 432, in main
for fusion in agfusion.parsersargs.algorithm:
TypeError: iter() returned non-iterator of type 'STARFusion'"

Command line used:
agfusion batch --file star_fusion_outdir/star-fusion.fusion_candidates.final -a starfusion -o test -g GRCh38 --dbpath /illumina/databases/Human_hg38/agfusion/agfusion.db

Do you have a solution?

Thanks,
Best Regards,

Steven
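
One common cause of this message under Python 2.7 is a class whose `__iter__` returns `self` but which only defines `__next__` (the Python 3 name) and not `next` (the Python 2 name). Whether that is what happens inside AGFusion's parser is an assumption; the class below is a hypothetical stand-in, not the real `STARFusion` class. A minimal sketch of an iterator that works on both versions:

```python
class FusionParser(object):
    """Hypothetical stand-in for a fusion-caller output parser."""

    def __init__(self, records):
        self._records = list(records)
        self._index = 0

    def __iter__(self):
        # The object is its own iterator.
        return self

    def __next__(self):
        # Python 3 iterator protocol.
        if self._index >= len(self._records):
            raise StopIteration
        record = self._records[self._index]
        self._index += 1
        return record

    # Python 2 looks for next() instead of __next__().
    next = __next__


for fusion in FusionParser(["GENE1--GENE2", "GENE3--GENE4"]):
    print(fusion)
```

Running under Python 3 avoids the problem regardless of whether the alias is present.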

Build database for Ensembl 100

I tried to build a database for Ensembl release 100 using the following commands, but the build fails with the traceback below.

pyensembl install --species homo_sapiens --release 100
agfusion build --dir . --species homo_sapiens --release 100 --pfam pfamA.txt.gz --server ensembldb.ensembl.org

Traceback (most recent call last):
  File "/data/user/software/anaconda3/bin/agfusion", line 5, in <module>
    cli.main()
  File "/data/user/software/anaconda3/lib/python3.7/site-packages/agfusion/cli.py", line 520, in main
    builddb(args)
  File "/data/user/software/anaconda3/lib/python3.7/site-packages/agfusion/cli.py", line 204, in builddb
    args.server
  File "/data/user/software/anaconda3/lib/python3.7/site-packages/agfusion/database.py", line 68, in __init__
    self.table = ENSEMBL_MYSQL_TABLES[self.species][self.release]
KeyError: 100

Does anyone know how to fix this problem?
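
The traceback suggests that `ENSEMBL_MYSQL_TABLES` is a nested mapping of species to release, and that it has no entry for release 100. A minimal sketch of that lookup with a more descriptive error, assuming that shape; the dictionary contents and the `lookup_table` helper are hypothetical placeholders, not values copied from AGFusion:

```python
# Hypothetical mapping in the shape the traceback suggests:
# species -> release -> MySQL database name (placeholder value).
ENSEMBL_MYSQL_TABLES = {
    "homo_sapiens": {95: "homo_sapiens_core_95_38"},
}


def lookup_table(species, release, tables=ENSEMBL_MYSQL_TABLES):
    """Return the entry for a species and release, or raise a clear error."""
    try:
        return tables[species][release]
    except KeyError:
        known = sorted(tables.get(species, {}))
        raise ValueError(
            "No entry for %s release %s; known releases: %s"
            % (species, release, known)
        )


print(lookup_table("homo_sapiens", 95))
try:
    lookup_table("homo_sapiens", 100)
except ValueError as error:
    print(error)
```

If the mapping really is hard-coded per release, supporting a new release would mean adding an entry for it to that mapping.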

Include exon number

In the exon plots, add an option to print exon numbers.

Error plotting WT protein

agfusion annotate --gene5prime PML --gene3prime RARA --junction5prime 74315749 --junction3prime 38504566 --genome GRCh37 --out PML-RARA --WT

Produces the following error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/agfusion", line 5, in <module>
    cli.main()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/agfusion/cli.py", line 428, in main
    args=args
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/agfusion/cli.py", line 82, in annotate
    exclude=args.exclude_domain
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/agfusion/model.py", line 499, in save_images
    pplot.draw()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/agfusion/plot.py", line 869, in draw
    self._draw_protein_length_markers(len(self.ensembl_transcript.coding_sequence)/3)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/agfusion/plot.py", line 621, in _draw_protein_length_markers
    self.offset+self.protein_frame_length
AttributeError: 'PlotWTProtein' object has no attribute 'protein_frame_length'

support for LongGF (new fusion finder for long read sequences)

It would be great if AGFusion were compatible with the output of the LongGF method.

https://github.com/WGLab/LongGF

Thanks

using annotate --WT option causes exception

Hello

I'm running agfusion annotate with the --WT option and I got the error:

File "/Users/davelahr/miniconda2/envs/pyensembl/lib/python3.6/site-packages/agfusion/plot.py", line 657, in _draw_protein_length_markers
    for i in range(1, protein_length+1):
TypeError: 'float' object cannot be interpreted as an integer

I modified the code in question to be:
for i in range(1, int(protein_length)+1):

And then it worked. I'm not sure whether this is the best way to fix it.

Thank you for the very useful software!
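
The float most likely comes from dividing a sequence length by 3 with `/`, which returns a float on Python 3 (the earlier traceback in this list shows `len(...coding_sequence)/3` being passed to the same method). An alternative to wrapping the value in `int()` is to use floor division at the point where the length is computed. A minimal sketch; the function names are hypothetical and this is not AGFusion's actual code:

```python
def protein_length_from_cds(cds_length):
    """Number of whole codons in a CDS; // keeps the result an int on Python 3."""
    return cds_length // 3


def marker_positions(protein_length):
    """Positions 1..protein_length, as the failing range() call intends."""
    return list(range(1, protein_length + 1))


length = protein_length_from_cds(414)
print(type(length).__name__, length)
print(len(marker_positions(length)))
```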

5' UTR sequence not included when 3' gene junction is in 5' UTR

murphycj/agfusionweb-react#4

Potential dependency conflicts between agfusion and numpy

Hi, as shown in the following full dependency graph of agfusion, pyensembl requires numpy >=1.7, while the installed version of gtfparse (1.2.0) requires numpy >=1.7,<2.0.

According to pip's “first found wins” installation strategy, numpy 1.18.0 is the version actually installed.

Although the first-found version, numpy 1.18.0, happens to satisfy the later dependency constraint (numpy >=1.7,<2.0), the install will break once numpy releases a version outside that range (2.0 or later).

Dependency tree--------

agfusion - 1.252
| +- biopython(install version:1.76 version range:>=1.67)
| +- future(install version:0.18.2 version range:>=0.16.0)
| +- matplotlib(install version:3.2.0rc1 version range:>=1.5.0)
| | +- cycler(install version:0.10.0 version range:>=0.10)
| | | +- six(install version:1.13.0 version range:*)
| | +- kiwisolver(install version:1.1.0 version range:>=1.0.1)
| | | +- setuptools(install version:42.0.2 version range:*)
| | +- numpy(install version:1.18.0 version range:>=1.11)
| | +- pyparsing(install version:2.4.5 version range:>=2.0.1)
| | +- python-dateutil(install version:2.8.1 version range:>=2.1)
| +- pandas(install version:0.25.3 version range:>=0.18.1)
| +- pyensembl(install version:1.8.4 version range:>=1.1.0)
| | +- datacache(install version:1.1.5 version range:>=1.1.4)
| | | +- appdirs(install version:1.4.3 version range:>=1.4.0)
| | | +- mock(install version:3.0.5 version range:*)
| | | +- pandas(install version:0.25.3 version range:>=0.15.2)
| | | +- progressbar33(install version:2.4 version range:>=2.4)
| | | +- requests(install version:2.22.0 version range:>=2.5.1)
| | | | +- certifi(install version:2019.11.28 version range:>=2017.4.17)
| | | | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | | | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | | | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| | | +- typechecks(install version:0.1.0 version range:>=0.0.2)
| | +- gtfparse(install version:1.2.0 version range:>=1.1.0)
| | | +- numpy(install version:1.18.0 version range:>=1.7,<2.0)
| | | +- pandas(install version:0.25.3 version range:>=0.15)
| | +- memoized-property(install version:1.0.3 version range:>=1.0.2)
| | +- numpy(install version:1.18.0 version range:>=1.7)
| | +- pandas(install version:0.25.3 version range:>=0.15)
| | +- serializable(install version:0.2.1 version range:*)
| | +- six(install version:1.13.0 version range:>=1.9.0)
| | +- tinytimer(install version:0.0.0 version range:*)
| | +- typechecks(install version:0.1.0 version range:>=0.0.2)

Thanks for your attention.
Best,
Neolith
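
A minimal sketch of the range check described above, using only the standard library. It handles plain numeric versions such as 1.18.0 and is not a full implementation of Python version specifiers (pre-release tags such as rc1 are not supported); the helper names are hypothetical.

```python
def parse_version(text):
    """Turn '1.18.0' into (1, 18, 0). Plain numeric versions only."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, minimum, below=None):
    """True if version >= minimum, and version < below when below is given."""
    parsed = parse_version(version)
    if parsed < parse_version(minimum):
        return False
    if below is not None and parsed >= parse_version(below):
        return False
    return True


# pyensembl constraint: numpy >=1.7
# gtfparse 1.2.0 constraint: numpy >=1.7,<2.0
for candidate in ("1.18.0", "2.0.0"):
    print(candidate, in_range(candidate, "1.7"), in_range(candidate, "1.7", "2.0"))
```

A version such as 2.0.0 passes the first constraint but fails the second, which is the situation the report warns about.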

Please note that I do not actively maintain this repo anymore

missing domain in output?

Hello, I noticed that there appears to be no domain information in the AGFusion database for some genes, e.g. SSX2. Is that correct, and is there a way to get that information into my AGFusion database?

Support RefSeq IDs

Support for arriba

This is a relatively recent and very nicely engineered fusion caller. Support for it would be great.

https://github.com/suhrig/arriba

Bio.Alphabet has been removed from Biopython.

Importing agfusion fails with an ImportError when it loads Bio.Alphabet under Biopython 1.78.

>>> import agfusion
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/app/software/Python/Python-3.8.6/lib/python3.8/site-packages/agfusion/__init__.py", line 2, in <module>
    from .model import *
  File "/app/software/Python/Python-3.8.6/lib/python3.8/site-packages/agfusion/model.py", line 9, in <module>
    from Bio.Alphabet import generic_dna, generic_protein
  File "/app/software/Biopython/1.78/Python-3.8.6/lib/python3.8/site-packages/Bio/Alphabet/__init__.py", line 20, in <module>
    raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
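
A minimal sketch of a compatibility shim for code that still passes alphabets to `Seq`, assuming the alphabet can simply be dropped on newer Biopython, as the error message says is true in many cases. The helper `seq_args` is hypothetical and this is not a patch taken from AGFusion.

```python
# Bio.Alphabet exists only in Biopython < 1.78.
try:
    from Bio.Alphabet import generic_dna, generic_protein
except ImportError:
    # Biopython >= 1.78 (or Biopython not installed): no alphabets available.
    generic_dna = generic_protein = None


def seq_args(sequence, alphabet):
    """Return the positional arguments to pass to Bio.Seq.Seq.

    The alphabet is included only when one is available, so the same
    call site works on old and new Biopython releases.
    """
    return (sequence,) if alphabet is None else (sequence, alphabet)


# Usage (requires Biopython):
#   from Bio.Seq import Seq
#   cds = Seq(*seq_args("ATGGCCTAA", generic_dna))
print(seq_args("ATGGCCTAA", None))
```

Pinning Biopython to a release older than 1.78 is another way to get the import working without changing code.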

"""Alphabets were previously used to declare sequence type and letters (OBSOLETE).

The design of Bio.Aphabet included a number of historic design choices
which, with the benefit of hindsight, were regretable. Bio.Alphabet was
therefore removed from Biopython in release 1.78. Instead, the molecule type is
included as an annotation on SeqRecords where appropriate.

Please see https://biopython.org/wiki/Alphabet for examples showing how to
transition from Bio.Alphabet to molecule type annotations.
"""

Not able to see protein domain information

Hi,

I am running the command below, but I do not see any protein domain information for GRCh37.

agfusion --gene5prime ENSG00000140464 --gene3prime ENSG00000131759 --junction5prime 74325755 --junction3prime 38504568 --genome GRCh37 --out PML-RARA

Could you please help me with this?