gamcil / clinker

Gene cluster comparison figure generator

License: MIT License

Python 74.39% JavaScript 10.52% HTML 12.50% CSS 2.44% Dockerfile 0.16%

bioinformatics d3js python visualization

clinker's Introduction

Hi there 👋

🔭 Currently a postdoc at the Steinegger Lab @ Seoul National University, South Korea (ex. Chooi Lab, The University of Western Australia)
🌱 Interested in developing bioinformatics tools and interactive visualisations
📫 Email: [email protected]

Some projects I've worked on:

clinker & clustermap.js, a pipeline for generating interactive visualisations of homologous gene clusters
cblaster, a search tool for identifying homologous gene clusters in NCBI sequence databases
synthaser, a tool for automatically annotating & classifying domain architectures of multi-domain proteins
Protein structure visualisation in Foldseek, a protein structure search tool

clinker's Issues

Support for PROKKA GenBank files?

Is there a way to make this work? Or do you plan to add support for these? Thanks!

Gene label type issue

To reduce clutter, I want to show labels for selected genes only. To this end, I created /note tags only for those genes and set the gene label type to "note". Clinker then showed the notes for those genes as intended; however, it also showed protein IDs for all other genes, which is not what I wanted.
I found a workaround by setting /note=" " for all other genes, but it might be better if clinker showed no label at all when the selected label type is absent.

Possible to add phylogeny to clinker output plot?

Hi,
I was wondering whether it would be possible to add phylogeny information to a clinker plot.

So for example let's say you have some gene clusters made with the GenBank flat files (gbff) of 37 birds

$ ls *.gbff|perl -pe "s/_/ /g"|perl -pe "s/.gbff//g" |perl -pe "s/Strigops habroptila/Strigops greyii/g" > names

$ cat names
Acanthisitta chloris
Anas platyrhynchos
Aptenodytes forsteri
Apteryx rowi
Calidris pugnax
Camarhynchus parvulus
Catharus ustulatus
Charadrius vociferus
Chiroxiphia lanceolata
Columba livia
Corapipo altera
Cyanistes caeruleus
Cygnus atratus
Dromaius novaehollandiae
Egretta garzetta
Falco peregrinus
Ficedula albicollis
Gallus gallus
Geospiza fortis
Haliaeetus leucocephalus
Manacus vitellinus
Meleagris gallopavo
Neopelma chrysocephalum
Nipponia nippon
Nothoprocta perdicaria
Oxyura jamaicensis
Parus major
Phasianus colchicus
Pipra filicauda
Pseudopodoces humilis
Pygoscelis adeliae
Serinus canaria
Strigops greyii
Sturnus vulgaris
Taeniopygia guttata
Tauraco erythrolophus
Zonotrichia albicollis

and then you use the R package rotl to extract one phylogenetic hypothesis for the 37 bird species

> library(rotl)
> taxa<-tnrs_match_names(names= c(scan("names",what="",sep='\n')))
> my_tree <- tol_induced_subtree(ott_ids = taxa$ott_id, label_format="name")
> png(filename = "birds.png",width=600,height=600)
> plot(my_tree, no.margin = TRUE)
> dev.off()
> tol_induced_subtree(ott_ids = ott_id(taxa), file="ghrl.newick.txt", label_format="name")

and you get the following birds.png

and you also get a Newick tree file ghrl.newick.txt that encapsulates the relationships in birds.png in text format

Might it be possible to pass ghrl.newick.txt to clinker to generate something similar to birds.png?

Groups are not restored during session loading

Hi!
First of all, really enjoying using clinker! However, I've encountered an issue with loading clinker JSON sessions: upon loading, seemingly everything except the groups described in the session file is restored.

The cause seems to lie within align.py, in the Globalalign.from_dict() class method, where restoration of Group instances is not implemented.
The fix seems straightforward: add something like this at the end of the method, before return ga:

for group in d["groups"]:  
    gr = Group.from_dict(group)
    ga.groups.append(gr)

This also worked on some minimal examples that I had on hand.

As such, I was wondering: were there any issues encountered with restoring group instances, or is this something that just wasn't implemented yet?
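
For readers following the proposal above, here is a minimal, self-contained sketch of the round trip it describes. The `Group` and `Session` classes are hypothetical stand-ins, not clinker's actual classes; only the `d["groups"]` key and the `Group.from_dict` call come from the post.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Group:
    """Hypothetical stand-in for a gene group: a label plus gene UIDs."""
    label: str
    genes: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict:
        return {"label": self.label, "genes": list(self.genes)}

    @classmethod
    def from_dict(cls, d: Dict) -> "Group":
        return cls(label=d["label"], genes=list(d["genes"]))


@dataclass
class Session:
    """Hypothetical stand-in for the aligner object saved in a session file."""
    groups: List[Group] = field(default_factory=list)

    def to_dict(self) -> Dict:
        return {"groups": [group.to_dict() for group in self.groups]}

    @classmethod
    def from_dict(cls, d: Dict) -> "Session":
        session = cls()
        # Restore groups; .get() tolerates older sessions that lack the key.
        for group in d.get("groups", []):
            session.groups.append(Group.from_dict(group))
        return session


original = Session(groups=[Group("Group 0", ["gene_a", "gene_b"])])
restored = Session.from_dict(original.to_dict())
```

Using `d.get("groups", [])` rather than `d["groups"]` is a design choice of this sketch: it keeps session files written before groups were serialised loadable.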

Issue with multi-exon genes

It seems clinker does not work as intended when multi-exon genes are annotated with separate CDS instead of with join(). When this is the case, CDS exons are considered separately leading to spurious cluster groups and missing links to genes that are annotated using join(). See example output below for analysis of three test files that each comprise a 14-exon uncharacterized gene and a 2-exon glycosyltransferase gene. In test1 and test2 exons are annotated separately, in test3 they are annotated using join(), see attached files.

In standard output showing genes to scale (below) you see that the number of cluster groups is 15 instead of the expected two and that there is no link to the first gene in test3.

When not showing genes to scale, you see that the spurious cluster groups coincide with separate exons rather than the entire genes.

Gene annotations with separate CDS are quite common in .gff3 files. It would therefore be great if clinker appropriately concatenated such annotations into the full-length CDS before analysis to avoid problems.

Thanks

test1.gb.txt
test2.gb.txt
test3.gb.txt
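
As a rough illustration of the concatenation requested above, the sketch below merges separately annotated CDS exons into one multi-part gene before any comparison. It assumes that exons of the same gene share a `locus_tag` and strand, which may not hold for every annotation source; the dictionary layout and the function name `merge_split_cds` are illustrative, not part of clinker.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end) of one CDS part


def merge_split_cds(features: List[Dict]) -> List[Dict]:
    """Combine CDS features that share a locus_tag and strand into one gene.

    Each input dict needs the keys "locus_tag", "start", "end" and "strand".
    """
    grouped: Dict[Tuple[str, int], List[Exon]] = defaultdict(list)
    for feature in features:
        key = (feature["locus_tag"], feature["strand"])
        grouped[key].append((feature["start"], feature["end"]))

    merged = []
    for (locus_tag, strand), parts in grouped.items():
        parts.sort()
        merged.append({
            "locus_tag": locus_tag,
            "strand": strand,
            "parts": parts,        # plays the role of a join(...) location
            "start": parts[0][0],  # outer bounds of the whole gene
            "end": parts[-1][1],
        })
    return sorted(merged, key=lambda gene: gene["start"])


features = [
    {"locus_tag": "g1", "start": 0, "end": 90, "strand": 1},
    {"locus_tag": "g1", "start": 150, "end": 300, "strand": 1},
    {"locus_tag": "g2", "start": 400, "end": 700, "strand": -1},
]
genes = merge_split_cds(features)
```

Here the three CDS records collapse into two genes, so a downstream comparison would see one entry per gene rather than one per exon.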

ValueError: Distance matrix 'X' must be symmetric

Hi,

I'm running Clinker on a server using .gbk files. I keep receiving the same error "ValueError: Distance matrix 'X' must be symmetric" and no output is created. I tried running it against just two of my files as well as all of them in a for loop. I haven't tried running it off the server yet, but if that could be the issue I'll definitely try.

I've attached an output file from the job submission (24299131.txt), my script (clinker_test.txt), and one of the files that it was run on (DSM_27508_154_fasta.phages_combined.gbk.txt) the rest of my files have the same format as well.

Thanks in advance!

24299131.txt
clinker_test.txt
DSM_27508_154_fasta.phages_combined.gbk.txt

Running Clinker on a Server?

Hi all! Just came across Clinker, great tool @gamcil! I am considering using Clinker in my classes, and it would be cool to run it directly on my server so that students don't have to install it locally. I'm not an expert in web development, but I know a bit, so I thought I'd give it a go.

Any suggestions on how you would do it?

Thanks!
Semidán

Align Error: seq contains letters not in the alphabet

Hello, thank you for developing a great tool!

I've been trying to figure out where the "ValueError: sequence contains letters not in the alphabet" error is coming from when I run my .gbf files/.gb files through Clinker. I went through issue #68 and I installed Clinker 0.0.21 through Conda again but to no avail. I have also tried the pip install but that didn't fix the problem. I double checked the align.py script on my local computer and it has the extend_matrix_alphabet addition, so I'm not sure what to do. You mentioned a quick fix would be to go through the sequence and delete anything not part of the extended IUPAC. Is there a particular way you recommend doing this? I have several sequences, so it seems like it would take a long time to identify anything wrong in the sequence (I would be looking for numbers, right?).

I attached an image with the traceback in case it's helpful.

Thank you so much!
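
One way to locate offending characters without reading every sequence by eye is sketched below. The allowed alphabet is an assumption (the 20 standard residues plus the codes B, J, O, U, X, Z and the stop symbol `*`) and should be adjusted to whatever the substitution matrix in use accepts; the function names are made up for this example.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumption: standard residues, ambiguity/rare codes and the stop symbol.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" "BJOUXZ" "*")


def find_bad_letters(
    translations: Iterable[Tuple[str, str]], allowed: Set[str] = ALLOWED
) -> Dict[str, Set[str]]:
    """Map each sequence name to the characters that fall outside `allowed`."""
    report = {}
    for name, sequence in translations:
        bad = set(sequence.upper()) - allowed
        if bad:
            report[name] = bad
    return report


def strip_bad_letters(sequence: str, allowed: Set[str] = ALLOWED) -> str:
    """Drop every character that is not in `allowed`."""
    return "".join(char for char in sequence if char.upper() in allowed)


proteins = [
    ("geneA", "MKT1LLV"),    # contains a digit
    ("geneB", "MSTNPKPQR"),  # clean
    ("geneC", "MK-TA LV"),   # contains a gap character and a space
]
report = find_bad_letters(proteins)
```

The `(name, sequence)` pairs could come from the /translation qualifiers of the CDS features; the report then points at the exact genes to inspect, so digits are not the only thing worth looking for.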

Feature Request: show gene label only for genes creating clusters

Great job with the tool, it's very nice and handy!

Would you consider adding an option to show gene labels just for genes that actually created clusters?

With a couple of short sequences this can be handled by removing labels while editing single genes. However, when you compare multiple sequences with ~50 genes each and want to see whether annotations differ among genes in clusters, the labels overlap and merge into one, which makes them hard to read. The only option then is to export the SVG and remove text elements in a graphics program, keeping those of interest.

Here's an example:

Additionally or alternatively, there could be an option to (i) show label for a single gene, (ii) show label for genes from a group/cluster when clicking on genes.

Thank you for considering this enhancement!

How can I get GTF files into this type of software

Hello,

Clinker looks awesome. However, all of my gene location information is in GFF/GTF format, which is rather common for genomic datasets.

Can Clinker take GFF/GTF format, or do you have a good method for converting GFF/GTF files to GenBank format?

thanks,
C

prohibitively slow performance

Testing for #9, the good news is using the --compliant switch for PROKKA apparently allows the script to continue beyond where it would previously crash, but then clinker engages in slow, one-thread, pairwise alignment clustering that does not scale well, making it too slow to use for more than a few genomes. A couple simple changes that would alleviate this:

multiprocessing - This alone would vastly improve the performance, though memory is a concern as I see a single thread eating over 8GB RAM when I run hundreds of genomes.
allow users to start with their own multi-alignments - I reckon most of us don't need (or want) clinker to do the alignments for us, because we either already have alignments, or have a faster way to run them. If you don't want to make that possible, I guess I might fork it myself.

Thanks!

RecursionError in `clinker.align.consolidate` (and a proposed fix)

Hi again @gamcil ,

I used clinker to align 60 clusters, each of them sharing between 1 and 4 homologous proteins. This caused clinker.align.consolidate to throw a RecursionError, as it seemed to have trouble merging everything.

As you pointed out in the documentation, you used the recursive implementation from Rosetta Code; there is an iterative implementation that would likely fix that issue. However, there is a data structure better suited to the kind of task that GlobalAligner.build_gene_groups is trying to achieve: a disjoint set.

If you are fine adding an external pure-Python dependency (disjoint-set), here is the replacement code:

import disjoint_set  # external pure-Python dependency (disjoint-set)


class GlobalAligner:

    # ...

    def build_gene_groups(self):
        """Builds gene groups based on currently stored gene-gene links."""

        ds = disjoint_set.DisjointSet()
        for link in self._links.values():
            ds.union(link.query.uid, link.target.uid)

        for genes in ds.itersets():
            group = Group(label=f"Group {len(self.groups)}", genes=list(genes))
            self.groups.append(group)
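
If adding a dependency is undesirable, the same idea can be sketched with a small iterative union-find in plain Python. This is an illustrative alternative, not the code from the proposal above: `group_linked_genes` is a made-up name, and links are reduced to `(query_uid, target_uid)` pairs.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def group_linked_genes(
    links: Iterable[Tuple[Hashable, Hashable]]
) -> List[Set[Hashable]]:
    """Group gene UIDs connected by links, using an iterative union-find."""
    parent: Dict[Hashable, Hashable] = {}

    def find(item: Hashable) -> Hashable:
        parent.setdefault(item, item)
        # Walk up to the root without recursion.
        root = item
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the walk at the root.
        while parent[item] != root:
            parent[item], item = root, parent[item]
        return root

    for query, target in links:
        root_query, root_target = find(query), find(target)
        if root_query != root_target:
            parent[root_target] = root_query

    groups: Dict[Hashable, Set[Hashable]] = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


groups = group_linked_genes([("a", "b"), ("b", "c"), ("d", "e")])
```

Because `find` loops instead of recursing, a long chain of links cannot exhaust the interpreter's recursion limit, which is the failure mode described in this issue.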

Possible to ignore coordinates and show gene order only?

Thanks so much for what after some initial trialing seems to be a very useful tool!
I am looking at plant genomic sequences that often have huge intergenic spaces due to accumulation of repetitive elements. Thus, genes appear rather sparsely in some parts of my plots.

Would it be possible to ignore coordinates and show the ordering of genes only? (e.g. by setting scale to 0)
Or maybe I have missed such an option in the documentation?

Thanks
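
To make the request concrete, the sketch below shows one way such an order-only layout could be computed: genes are ranked by start coordinate and placed at evenly spaced positions, so intergenic distance no longer affects the drawing. The function name and the `width` and `gap` parameters are hypothetical and do not correspond to any existing clinker option.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int]  # (name, start, end)


def order_only_layout(
    genes: List[Gene], width: float = 1.0, gap: float = 0.25
) -> List[Tuple[str, float, float]]:
    """Replace genomic coordinates with evenly spaced slots, keeping gene order."""
    ordered = sorted(genes, key=lambda gene: gene[1])
    layout = []
    for index, (name, _start, _end) in enumerate(ordered):
        new_start = index * (width + gap)
        layout.append((name, new_start, new_start + width))
    return layout


layout = order_only_layout(
    [("b", 90_000, 91_000), ("a", 100, 2_000)], width=1.0, gap=0.5
)
```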

different colour code for similar genes

First of all I want to thank you for the amazing tool.
I have a question.
During my comparison, very similar genes were given different colours, as you can see in the following picture

Any clue?

Error comparing .gbk files outputted from Artemis

Hi,

We're really interested in using your software, but a lot of our gbk files were created in Artemis, which often causes problems when moving to other software. We have been able to successfully compare gbk files downloaded from MiBIG, so the installation of clinker is working. When we try to compare Artemis-generated files we get the following error:

C:\Users\kw2990\University of Bristol\grp-LSB-NSP - Documents\Writing\Reviews\3. Maleidride review\Maleidride bioinformatics\Maleidride clusters>clinker Oidmal.gbk Cadophora.gbk -p
[15:13:51] INFO - Starting clinker
[15:13:51] INFO - Parsing GenBank files: ['Oidmal.gbk', 'Cadophora.gbk']
[15:13:51] INFO - Starting cluster alignments
[15:13:51] INFO - Oidmal vs Cadophora
[15:13:51] INFO - Generating results summary...
Oidmal vs Cadophora

Query Target Identity Similarity
[15:13:51] INFO - Building clustermap.js visualisation
C:\Users\kw2990\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\clinker\align.py:356: RuntimeWarning: invalid value encountered in true_divide
matrix /= matrix.max()
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\Users\kw2990\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\Scripts\clinker.exe_main.py", line 7, in
File "C:\Users\kw2990\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\clinker\main.py", line 144, in main
clinker(
File "C:\Users\kw2990\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\clinker\main.py", line 77, in clinker
plot_clusters(globaligner, output=None if plot is True else plot)
File "C:\Users\kw2990\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\clinker\plot.py", line 114, in plot_clusters
data = clusters.to_data()
File "C:\Users\kw2990\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\clinker\align.py", line 201, in to_data
for i in self.order(i=i, method=method)
File "C:\Users\kw2990\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\clinker\align.py", line 371, in order
linkage = hierarchy.linkage(squareform(matrix), method=method)
File "C:\Users\kw2990\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\scipy\spatial\distance.py", line 2184, in squareform
is_valid_dm(X, throw=True, name='X')
File "C:\Users\kw2990\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\scipy\spatial\distance.py", line 2259, in is_valid_dm
raise ValueError(('Distance matrix '%s' must be '
ValueError: Distance matrix 'X' must be symmetric.

I have attached the two files in question as a zipped file
Sequences.zip

Thanks in advance for any help provided!
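
The log above shows an empty alignment table followed by a RuntimeWarning at `matrix /= matrix.max()`, which is consistent with an all-zero matrix being divided by zero. The pure-Python sketch below mimics that situation under two assumptions: that the division yields NaN for 0/0, as NumPy does, and that the symmetry check compares cells by equality. It is an illustration of the likely mechanism, not clinker's or SciPy's code.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalise(matrix: Matrix) -> Matrix:
    """Divide every cell by the matrix maximum, yielding NaN when the maximum is 0."""
    peak = max(max(row) for row in matrix)
    return [[cell / peak if peak else math.nan for cell in row] for row in matrix]


def is_symmetric(matrix: Matrix) -> bool:
    """Equality-based symmetry check; NaN never equals NaN, so NaN cells fail it."""
    size = len(matrix)
    return all(
        matrix[i][j] == matrix[j][i] for i in range(size) for j in range(size)
    )


def safe_normalise(matrix: Matrix) -> Matrix:
    """Skip the division when there is nothing to scale."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [row[:] for row in matrix]
    return [[cell / peak for cell in row] for row in matrix]


no_hits = [[0.0, 0.0], [0.0, 0.0]]  # two clusters with no alignments between them
broken = is_symmetric(normalise(no_hits))       # False: the matrix is all NaN
guarded = is_symmetric(safe_normalise(no_hits))  # True: zeros are left untouched
```

If this reading is right, the underlying problem is that no gene pairs were aligned at all, so checking that the files contain translatable CDS features is a sensible first step.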

Error in align

Hi Cameron!

I installed clinker v0.0.20 using conda.
When I ran it on two genomes downloaded from GenBank, clinker gave an error.
Is there any solution to this problem?

Best wishes,
Marsel

[09:02:59] INFO - Starting clinker
[09:02:59] INFO - Parsing files:
[09:02:59] INFO - PB12_4term_CP048407.gbk
/home/kabilov/anaconda3/envs/clinker/lib/python3.9/site-packages/Bio/Seq.py:2334: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
warnings.warn(
[09:03:03] INFO - T.marianensis_NC_014831.gbk
[09:03:06] INFO - Starting cluster alignments
[09:03:07] INFO - PB12_4term_CP048407 vs T.marianensis_NC_014831
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/site-packages/clinker/align.py", line 377, in _align_clusters
aln = aligner.align(geneA.translation, geneB.translation)
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/site-packages/Bio/Align/init.py", line 1592, in align
score, paths = _aligners.PairwiseAligner.align(self, seqA, seqB)
ValueError: sequence contains letters not in the alphabet
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/kabilov/anaconda3/envs/clinker/bin/clinker", line 10, in
sys.exit(main())
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/site-packages/clinker/main.py", line 283, in main
clinker(
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/site-packages/clinker/main.py", line 135, in clinker
globaligner = align.align_clusters(*clusters, cutoff=identity, jobs=jobs)
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/site-packages/clinker/align.py", line 57, in align_clusters
aligner.align_stored_clusters(cutoff, jobs=jobs)
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/site-packages/clinker/align.py", line 404, in align_stored_clusters
alignments = pool.starmap(_align_clusters, pairs_to_align)
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/home/kabilov/anaconda3/envs/clinker/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
ValueError: sequence contains letters not in the alphabet

Addition to help text: Double click flips contig

One thing that I thought was missing so badly that I was going to raise an issue is that you can't flip contigs in the plot! Except you can! However, it isn't mentioned in the manual, the wiki, or the instructions in the HTML help box.

Suggestion: Add the line:
"To flip the orientation of a contig, double click it"

Changing color of a gene to a specific desired color

Really enjoying the clinker tool, but was wondering if there was a way to manually change the color of a particular gene in the cluster to a color of my choosing. For instance, if I have a cluster with 5 p450s, it often colors some in one color and others in another color, but it doesn't seem that I can manually change them to display as the same color. Is there a feature for this in the current version of clinker?

cluster similarity matrix

Hi Cameron, is there a way I can check the cluster similarity matrix raw data?

Feature request: Showing mobile genetic elements in the figure

Hey,

I was working on a linear comparison of plasmid sequences and I wonder if there is a way to show mobile genetic elements (IS, Tn) so that they don't overlap and hide the genes/CDS already depicted. I propose some kind of background bar, or something similar to what I did.

Thanks for your consideration.

Feature request(s) - On-gene label, autosource group names, legend titles editing, support for manually adding text boxes

First off, wow! Well done @gamcil, well done. I've gone over so many tools and packages lately to try to generate such figures, and clinker is by far the easiest and most visually appealing.
Here are some feature requests I would love to see added, sorry if any of them are already possible:

On-gene labeling - placing the label within the gene arrow (instead of above it).
Auto source group names - i.e. instead of "cluster_n" or "group_n" in the legend, automatically use the most frequent attribute (of the user's choice) among that group's members - e.g. "product". Alternatively, add an option in the gene label editor (that pops up when clicking a gene) to assign its attribute to the entire group.
Legend title editing, support for manually adding text boxes (and editing their font size via an added section)
Support for a non-grey color gradient for the identity (e.g. yellow to red).
Option to display gene labels only for selected genomes.
Add "reset" option.
Enable support for sequences that contain no CDS, and for segmented sequences on the same track.
Support for circularity - i.e. setting a new coordinate as the 0 position and moving what was before it to the end of the track.

Thanks again for this great tool, and please let me know if you need someone to beta test anything :-)

How to create subsets of GenBank flat files keeping features for input into clinker?

Hi,

This sounds super cool! Especially for the purposes of small-scale synteny analysis (I think). I really do not have much experience parsing GenBank flat files, and was curious if you might know of or have any scripts that could take an entire vertebrate genome's GenBank flat file (with annotations) and output, say, 100,000 bases upstream and downstream of a particular annotated gene [Genome Data Viewer from NCBI has the option, but I don't think it can be accessed programmatically outside of the GUI]. I found some tools and could experiment with GenBank parsing with Biopython (such as https://gist.github.com/jrjhealey/06a7fbdfe495bc5a8824ed152dff7919/), but that particular one does not keep the annotation features (although perhaps I could modify it to do so).

Best,
Jean Elbers
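
The coordinate arithmetic behind such a subset can be sketched in plain Python: compute a window of a given flank around the gene and clamp it to the sequence bounds. The function name `flanking_window` and its parameters are illustrative only.

```python
from typing import Tuple


def flanking_window(
    gene_start: int, gene_end: int, flank: int, sequence_length: int
) -> Tuple[int, int]:
    """Return (start, end) of a window `flank` bases either side of a gene.

    Coordinates are 0-based and half-open; the window is clamped so that it
    never extends past either end of the sequence.
    """
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    start = max(0, gene_start - flank)
    end = min(sequence_length, gene_end + flank)
    return start, end


# A gene in the middle of a 1 Mb contig, with 100 kb flanks.
window = flanking_window(250_000, 260_000, 100_000, 1_000_000)
# A gene near the edges of a short contig: the window is clamped.
clamped = flanking_window(50_000, 60_000, 100_000, 120_000)
```

With Biopython, slicing a parsed record as `record[start:end]` returns a sub-record that keeps the features lying entirely inside the window, with their positions shifted accordingly, so the result can be written back out in GenBank format with `SeqIO.write`. Features that straddle a window boundary are dropped by the slice, so it is worth checking that the gene of interest itself is fully contained.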

Possible errors when running clinker v0.0.11 with GenBank files from MiBIG

Hi, thank you for the great tool. I ran into an error when running the latest version of clinker (v0.0.11) with a GenBank file obtained from MiBIG included. The error is as follows:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/bin/clinker", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/clinker/main.py", line 206, in main
    clinker(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/clinker/main.py", line 81, in clinker
    globaligner = align.align_clusters(*clusters, cutoff=identity, jobs=jobs)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/clinker/align.py", line 51, in align_clusters
    aligner.add_clusters(*args)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/clinker/align.py", line 325, in add_clusters
    self._genes[gene.uid] = gene
AttributeError: 'NoneType' object has no attribute 'uid'

It did not happen with an older version (prior to v0.0.9) on another machine, or when the MiBIG GenBank file was not included. Could you check what the reason might be?
Thanks a lot

Error in plot Visualization

I am having issues visualising the plot in Internet Explorer after the run is completed. I am using GenBank files generated by the RAST server, with four plasmid sequences for comparison. At the start of the run, warnings (shown below) appear and then the cluster alignment starts. Once the run is finished, an Internet Explorer page opens showing only the clinker instructions but no plots. Please help me to solve this issue @gamcil
not find parent gene of . Using coding sequence coordinates instead.
[12:21:45] WARNING - Could not find parent gene of . Using coding sequence coordinates instead.
[12:21:45] WARNING - Could not find parent gene of . Using coding sequence coordinates instead.
[12:21:45] WARNING - Could not find parent gene of . Using coding sequence coordinates instead.
[12:21:45] WARNING - Could not find parent gene of . Using coding sequence coordinates instead.
[12:21:45] WARNING - Could not find parent gene of . Using coding sequence coordinates instead.
[12:21:45] WARNING - Could not find parent gene of . Using coding sequence coordinates instead.
[12:21:45] WARNING - Could not find parent gene of . Using coding sequence coordinates instead

How to adjust the distance between the sample name and the sequence in clinker?

Hello, I want to ask a question. How to adjust the distance between the sample name and the sequence in clinker? There is a big gap in the middle. As shown below:

Gene labels not extracting correctly

Hi Cameron, I just came across a small issue that took me a while to fix. Apparently, if clinker is given a GenBank file with only one field per CDS, say /product (example attached), the displayed gene labels are incorrect (see below). The problem was corrected when I added a second field, e.g. /codon_start. Just a heads-up for any other users of this excellent software, a new favorite of mine!

test2.txt

gene model colouring issue

Some gene models with high similarity scores remain grey. What could potentially cause this issue?

Examples in help

Finally got a chance to try this out, really nice tool! Thanks for all the work on it.
This isn't so much an issue as a suggestion:
The help could use some information about default settings used or examples (e.g. 0.3 for identity).

I could just be an idiot, but when I read identity I typed a whole number (i.e. 70) just to test out the tool, and it ended with no errors (it just gave an image with no alignments found... because it was looking for 7000% identity!!). As I say, perhaps I should have defaulted my brain to a decimal, but I wasn't sure what to input after looking at the help/usage output, and I imagine others will do the same thing.

I put a few suggested additions below, happy to submit a pull request for these if it helps, but thought I'd mention it here first.

Alignment options:
  -na, --no_align       Do not align clusters
  -i IDENTITY, --identity IDENTITY
                        Minimum alignment sequence identity [default: 0.3]
  -j JOBS, --jobs JOBS  Number of alignments to run in parallel (0 to use the
                        number of CPUs [default])

Output options:
  -s SESSION, --session SESSION
                        Path to clinker session
  -ji JSON_INDENT, --json_indent JSON_INDENT
                        Number of spaces to indent JSON [default: none]
  -f, --force           Overwrite previous output file
  -o OUTPUT, --output OUTPUT
                        Save alignments to file [example: alignments.txt]
  -p [PLOT], --plot [PLOT]
                        Plot cluster alignments using clustermap.js. If a path
                        is given, clinker will generate a portable HTML file
                        at that path. Otherwise, the plot will be served
                        dynamically using Python's HTTP server. [example: plot.html]
  -dl DELIMITER, --delimiter DELIMITER
                        Character to delimit output by [default: tab?]
  -dc DECIMALS, --decimals DECIMALS
                        Number of decimal places in output [default: 2]
  -hl, --hide_link_headers
                        Hide alignment column headers
  -ha, --hide_aln_headers
                        Hide alignment cluster name headers

Visualisation options:
  -ufo, --use_file_order
                        Display clusters in order of input files
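
Beyond documenting the default, the parser itself could reject out-of-range values. The sketch below shows one way to do that with an `argparse` type function; the `fraction` helper and the `example` program name are made up for illustration, and the 0.3 default is taken from the suggested help text above.

```python
import argparse


def fraction(value: str) -> float:
    """argparse type: accept only numbers between 0 and 1 inclusive."""
    number = float(value)  # non-numeric input raises ValueError, which argparse reports
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError(
            f"{value!r} is not between 0 and 1 (use 0.7 for 70%)"
        )
    return number


# Hypothetical parser that mirrors only the identity option.
parser = argparse.ArgumentParser(prog="example")
parser.add_argument(
    "-i",
    "--identity",
    type=fraction,
    default=0.3,
    help="Minimum alignment sequence identity, as a fraction [default: 0.3]",
)

args = parser.parse_args(["-i", "0.7"])
```

With this in place, passing `-i 70` would stop with a usage error instead of silently producing a plot with no alignments.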

Feature request: cluster granularity settings

Sometimes, it is useful to be able to differentiate between different clades or orthologous groups within a tandem array of homologous genes. See the example below, where a tandem array of homologous genes of interest (indicated in pink) comprises divergent clades / orthogroups.

It would be great if there was an option to ensure such divergent genes are clustered separately. I think the most elegant solution for this would be via fine-tuning the overall size of cluster groups (e.g. by altering some parameter in the clustering algorithm).

Thanks for your consideration

Error raised by low-homology references (ValueError: No alignments are stored in the aligner)

I have encountered an error message while processing some genome regions with clinker:

ValueError: No alignments are stored in the aligner

My understanding is that this error is raised because there is a GBK in the list of inputs which does not have any alignments (above the threshold of inclusion) with any other GBK in the inputs.

I can definitely understand why this behavior may have been set up, with the justification that if a user provides a GBK then it should be included in the output. However, in my case I have been using clinker as the final visualization component of a larger workflow (BOFFO), which is able to identify homologous genome regions from a collection of bacterial genomes based on a small set of query sequences. In this approach, the inputs to clinker can easily have such distantly-related genes that they do not align above the threshold. For this purpose, it would be really nice if clinker just ignored any GBK inputs which do not align.

Would it be possible to have clinker ignore these missing alignments? Either by default or with an optional flag?
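
Until such an option exists, a workflow could filter its inputs before plotting. The sketch below separates clusters that take part in at least one alignment from those that do not; it assumes alignments are available as `(query_cluster, target_cluster)` name pairs, and the function name is hypothetical rather than part of clinker.

```python
from typing import Iterable, List, Tuple


def split_by_alignment(
    cluster_names: List[str], aligned_pairs: Iterable[Tuple[str, str]]
) -> Tuple[List[str], List[str]]:
    """Return (kept, skipped): clusters with and without any stored alignment."""
    linked = set()
    for query, target in aligned_pairs:
        linked.add(query)
        linked.add(target)
    kept = [name for name in cluster_names if name in linked]
    skipped = [name for name in cluster_names if name not in linked]
    return kept, skipped


kept, skipped = split_by_alignment(
    ["regionA", "regionB", "regionC"],
    [("regionA", "regionC")],
)
```

Reporting the skipped names (for example in a log message) keeps the behaviour transparent for users who expect every input to appear in the output.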

GFF+FASTA files without line wrapping don't display properly

Hi Cameron, just came across another bug where a FASTA file without any line wrapping results in an empty display:

Using a fasta formatted file with line wrapping works:
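
As a workaround while the bug stands, an unwrapped FASTA file can be re-wrapped before it is passed to clinker. The helper below is a small sketch with a made-up name; the 60-character width is a common convention, not a clinker requirement.

```python
def wrap_fasta(text: str, width: int = 60) -> str:
    """Split every sequence line into chunks of at most `width` characters.

    Header lines (starting with ">") are kept as they are; blank lines are dropped.
    """
    wrapped = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            wrapped.append(line)
        else:
            wrapped.extend(
                line[i:i + width] for i in range(0, len(line), width)
            )
    return "\n".join(wrapped) + "\n"


unwrapped = ">contig1\n" + "A" * 130 + "\n"
rewrapped = wrap_fasta(unwrapped)
```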

Issue with anaconda3 on mac and Clinker

Hi,

Clinker looks fantastic but I'm having trouble getting it going. I'm using a mac, I used pip to install Clinker:

$ pip install Clinker
Requirement already satisfied: Clinker in /anaconda3/lib/python3.7/site-packages (0.0.4)
Requirement already satisfied: numpy in /anaconda3/lib/python3.7/site-packages (from Clinker) (1.15.1)
Requirement already satisfied: scipy in /anaconda3/lib/python3.7/site-packages (from Clinker) (1.1.0)
Requirement already satisfied: biopython in /anaconda3/lib/python3.7/site-packages (from Clinker) (1.73)

but calling Clinker leads to a confusing error message:

$ Clinker
Traceback (most recent call last):
File "/anaconda3/bin/Clinker", line 6, in
from clinker.main import main
File "/anaconda3/lib/python3.7/site-packages/clinker/main.py", line 14, in
from clinker import align
File "/anaconda3/lib/python3.7/site-packages/clinker/align.py", line 20, in
from Bio.Align import substitution_matrices
ImportError: cannot import name 'substitution_matrices' from 'Bio.Align' (/anaconda3/lib/python3.7/site-packages/Bio/Align/init.py)

Edit: I can open /anaconda3/lib/python3.7/site-packages/Bio/Align/init.py and see the file, but I can't find any "substitution_matrices" string in it.

Any help much appreciated.
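
The pip output above lists biopython 1.73. To my knowledge, `Bio.Align.substitution_matrices` first shipped in Biopython 1.75, so an outdated Biopython is a likely cause; treat that version number as an assumption and check the Biopython release notes. A small version comparison, with made-up helper names, might look like this:

```python
import re
from typing import Tuple

# Assumption: first Biopython release that ships Bio.Align.substitution_matrices.
MINIMUM = (1, 75)


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn a version string such as '1.73' into a comparable tuple (1, 73)."""
    return tuple(int(number) for number in re.findall(r"\d+", version)[:2])


def is_new_enough(version: str, minimum: Tuple[int, ...] = MINIMUM) -> bool:
    """True when `version` is at least `minimum`."""
    return version_tuple(version) >= minimum


installed_ok = is_new_enough("1.73")  # the version shown in the pip output
```

The installed version can be read from `Bio.__version__`, and `pip install --upgrade biopython` would bring it up to date if it turns out to be too old.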

ValueError: sequence contains letters not in the alphabet

Hei!
I wanted to try out clinker, which is something that I've been longing for. I downloaded some gbff files from NCBI and even renamed them to gbk, but I get an error (command and full output at the bottom). I haven't done anything to these files, so how come it doesn't work? Is there an upper limit on genome size? These are mostly 5.5 Mb. Or is it not suitable for prokaryotes?

(base) [annll@login ncbi-genomes-2020-11-16]$ ls
GCF_000003955.1_ASM395v1_genomic.gbk GCF_000497525.1_ASM49752v2_genomic.gbk GCF_002000005.1_ASM200000v1_genomic.gbk GCF_004119835.1_ASM411983v1_genomic.gbk
GCF_000008005.1_ASM800v1_genomic.gbk GCF_000832865.1_ASM83286v1_genomic.gbk GCF_002582095.1_ASM258209v1_genomic.gbk GCF_007682195.1_ASM768219v1_genomic.gbk
GCF_000161175.1_ASM16117v1_genomic.gbk GCF_001044825.1_ASM104482v1_genomic.gbk GCF_003426125.1_ASM342612v1_genomic.gbk md5checksums.txt
GCF_000161395.1_ASM16139v1_genomic.gbk GCF_001704095.1_ASM170409v1_genomic.gbk GCF_003612955.1_ASM361295v1_genomic.gbk README.txt
GCF_000186745.1_ASM18674v1_genomic.gbk GCF_001721165.1_ASM172116v1_genomic.gbk GCF_004101345.1_ASM410134v1_genomic.gbk
(base) [annll@login ncbi-genomes-2020-11-16]$ clinker *.gbk
[12:06:03] INFO - Starting clinker
[12:06:03] INFO - Parsing GenBank files: ['GCF_000003955.1_ASM395v1_genomic.gbk', 'GCF_000008005.1_ASM800v1_genomic.gbk', 'GCF_000161175.1_ASM16117v1_genomic.gbk', 'GCF_000161395.1_ASM16139v1_genomic.gbk', 'GCF_000186745.1_ASM18674v1_genomic.gbk', 'GCF_000497525.1_ASM49752v2_genomic.gbk', 'GCF_000832865.1_ASM83286v1_genomic.gbk', 'GCF_001044825.1_ASM104482v1_genomic.gbk', 'GCF_001704095.1_ASM170409v1_genomic.gbk', 'GCF_001721165.1_ASM172116v1_genomic.gbk', 'GCF_002000005.1_ASM200000v1_genomic.gbk', 'GCF_002582095.1_ASM258209v1_genomic.gbk', 'GCF_003426125.1_ASM342612v1_genomic.gbk', 'GCF_003612955.1_ASM361295v1_genomic.gbk', 'GCF_004101345.1_ASM410134v1_genomic.gbk', 'GCF_004119835.1_ASM411983v1_genomic.gbk', 'GCF_007682195.1_ASM768219v1_genomic.gbk']
/mnt/users/annll/.local/lib/python3.7/site-packages/Bio/Seq.py:2338: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning,
[12:06:29] INFO - Starting cluster alignments
[12:06:29] INFO - GCF_000003955.1_ASM395v1_genomic vs GCF_000008005.1_ASM800v1_genomic
Traceback (most recent call last):
File "/mnt/users/annll/.local/bin/clinker", line 11, in
sys.exit(main())
File "/mnt/users/annll/.local/lib/python3.7/site-packages/clinker/main.py", line 153, in main
hide_alignment_headers=args.hide_aln_headers,
File "/mnt/users/annll/.local/lib/python3.7/site-packages/clinker/main.py", line 56, in clinker
globaligner = align.align_clusters(*clusters, cutoff=identity)
File "/mnt/users/annll/.local/lib/python3.7/site-packages/clinker/align.py", line 51, in align_clusters
aligner.align_stored_clusters(cutoff)
File "/mnt/users/annll/.local/lib/python3.7/site-packages/clinker/align.py", line 240, in align_stored_clusters
alignment = self.align_clusters(one, two, cutoff)
File "/mnt/users/annll/.local/lib/python3.7/site-packages/clinker/align.py", line 227, in align_clusters
aln = self.aligner.align(geneA.translation, geneB.translation)
File "/mnt/users/annll/.local/lib/python3.7/site-packages/Bio/Align/init.py", line 1592, in align
score, paths = _aligners.PairwiseAligner.align(self, seqA, seqB)
ValueError: sequence contains letters not in the alphabet
(base) [annll@login ncbi-genomes-2020-11-16]$
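The traceback ends with Biopython's pairwise aligner rejecting a character in one of the two protein sequences it was handed. A possible first step is to look for translations containing characters outside the expected amino-acid letters. The sketch below is only an illustration: the 20-letter alphabet is an assumption (the set the aligner actually accepts depends on its substitution matrix), and `unexpected_letters` and `report` are hypothetical helper names, not part of clinker or Biopython.

```python
# Assumption: the 20 standard amino-acid letters. The set the aligner
# actually accepts depends on its substitution matrix and may be larger.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def unexpected_letters(translation, alphabet=STANDARD_AMINO_ACIDS):
    """Return the sorted characters in `translation` that are not in `alphabet`."""
    return sorted(set(translation.upper()) - alphabet)


def report(translations):
    """Map each gene name to its unexpected characters, skipping clean genes."""
    flagged = {}
    for name, sequence in translations.items():
        extra = unexpected_letters(sequence)
        if extra:
            flagged[name] = extra
    return flagged


if __name__ == "__main__":
    # Made-up sequences, purely for illustration.
    example = {
        "gene_ok": "MKTAYIAKQR",
        "gene_stop": "MKTAY*",  # contains a stop symbol
        "gene_odd": "MKJTA",    # J is an ambiguity code (Leu/Ile)
    }
    print(report(example))
```

Running a check like this over the translations in each input file would at least show which genes carry unusual characters, which may help narrow down the file that triggers the error.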

"This SVG is not valid, validate it before opening"

Hi,

I was able to successfully create a clinker HTML file and edit it in the web interface. I then clicked the 'save as an svg' button; however, every application I try to open the file with, including Adobe Illustrator 2021 and Inkscape, gives me the message 'This SVG is not valid, validate it before opening'.

I tried AI and Inkscape on different computers with the same error. Web validators such as W3C markup do not detect a problem with the SVG either. Has this issue arisen before?

Is there any way to fix it or to save in another format such as pdf? I have attached the SVG as a TXT file below for reference.

Thanks,

Kathryn

SVG as a text file
clinker_test.txt
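A quick local check that might help narrow this down is to see whether the exported file parses as XML and whether its root `<svg>` element declares the SVG namespace. Whether a missing namespace is what is happening here is purely an assumption; `check_svg` is a hypothetical helper that uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

SVG_NAMESPACE = "http://www.w3.org/2000/svg"


def check_svg(text):
    """Return (ok, message) describing whether `text` parses as a namespaced SVG."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as error:
        return False, f"not well-formed XML: {error}"
    if root.tag not in ("svg", f"{{{SVG_NAMESPACE}}}svg"):
        return False, f"root element is {root.tag!r}, expected svg"
    if not root.tag.startswith("{"):
        # ElementTree prefixes namespaced tags with "{namespace}".
        return False, "root <svg> element has no xmlns declaration"
    return True, "parses as namespaced SVG"


if __name__ == "__main__":
    print(check_svg('<svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>'))
    print(check_svg("<svg><rect/></svg>"))
```

If the file passes both checks, the problem probably lies elsewhere (for example in a specific attribute or embedded style), and this sketch will not find it.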

Clinker just stalls without producing output?

Hello,

I am running Clinker on a server (installed via conda) for five genomes (from the same genus); gbk files were downloaded from GenBank. I use the following command:
clinker ./genomes/*.gbk -p results.html
The problem is that after ~2.5 hours, it seems that the procedure stalls without producing any output. I found that during these 2.5 hours, python processes (one per CPU) were active, but then they stopped, and nothing happened after.
When I do not align the clusters (-na option), everything works, and the desired output file is produced (although without cluster alignments, it is not useful).

Any ideas what may cause this?

Thank you!
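For a rough sense of scale: tracebacks elsewhere in this thread show the aligner being called on pairs of gene translations (`geneA.translation`, `geneB.translation`). Assuming every gene in one input is compared with every gene in each other input (an assumption, not something confirmed here), the number of alignments grows with the product of the gene counts. The gene counts below are made up for illustration.

```python
from itertools import combinations


def count_gene_alignments(genes_per_input):
    """Count gene-vs-gene alignments under an all-vs-all assumption:
    every gene in one input is compared with every gene in each other input."""
    return sum(a * b for a, b in combinations(genes_per_input, 2))


if __name__ == "__main__":
    # Hypothetical gene counts: five whole genomes vs. five small clusters.
    print(count_gene_alignments([5000] * 5))  # 250000000
    print(count_gene_alignments([20] * 5))    # 4000
```

Under that assumption, whole genomes need several orders of magnitude more work than excised clusters, which may be relevant to the long runtime described above.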

Example GFF files

I am trying to get clinker to work with the output of Augustus. I see that the new GFF file support may make this much easier than trying to convert to GenBank format. Currently I cannot get Augustus gff files to work, and I have tried several converters with no luck yet. Can you add sample gff files to the examples folder?

Valid CDS identifier?

Hi Cameron,

Congratulations on this software! I find it very nice and would like to run it for my research. I have successfully installed it, but when I try to run it, I get the following error:

[07:26:04] INFO - Starting clinker
[07:26:04] INFO - Parsing GenBank files: ['S1_contig_41.region001.fixed.gbk', 'S2_contig_9.region001.fixed.gbk', 'S3_contig_26.region001.fixed.gbk', 'S4_c00019_NODE_19...region001.fixed.gbk', 'S5_c00001_NODE_1_...region001.fixed.gbk', 'S6_c00005_NODE_5_...region001.fixed.gbk', 'S7_scaffold2.region001.fixed.gbk']
Traceback (most recent call last):
File "/home/aberas2/miniconda3/bin/clinker", line 8, in
sys.exit(main())
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/main.py", line 153, in main
hide_alignment_headers=args.hide_aln_headers,
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/main.py", line 49, in clinker
clusters = parse_files(paths)
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 49, in parse_files
return [parse_genbank(path) for path in paths]
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 49, in
return [parse_genbank(path) for path in paths]
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 32, in parse_genbank
cluster = Cluster.from_seqrecords(*records, name=path.stem)
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 83, in from_seqrecords
loci = [Locus.from_seqrecord(record) for record in args]
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 83, in
loci = [Locus.from_seqrecord(record) for record in args]
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 124, in from_seqrecord
for feature in record.features
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 125, in
if feature.type == "CDS"
File "/home/aberas2/miniconda3/lib/python3.7/site-packages/clinker/classes.py", line 191, in from_seqfeature
"Could not determine a valid identifier"
ValueError: Could not determine a valid identifier from a CDS SeqFeature in c00019_NODE_19..

I tried removing the file that triggers the error and rerunning the program, but I get the same error with the next file in line, so I think it has to do with the formatting of all my files. I would very much appreciate it if you could advise me on how to fix this problem. I have uploaded one gbk file so that you can see what they look like.

[S4_c00019_NODE_19...region001.fixed.gbk.txt] (https://github.com/gamcil/clinker/files/5501587/S4_c00019_NODE_19.region001.fixed.gbk.txt)

Thank you so much!
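The error message says that no identifier could be determined for a CDS feature. One way to find the offending features is to list the CDS entries that carry none of the qualifiers an identifier could be taken from. The sketch below works on plain Python data rather than a real parser. The key list is an assumption: another post in this thread mentions `protein_id` and `locus_tag`, but the full set clinker checks is not confirmed here. `cds_without_identifier` is a hypothetical helper.

```python
# Assumption: qualifier keys an identifier might be taken from.
IDENTIFIER_KEYS = ("protein_id", "locus_tag")


def cds_without_identifier(features, keys=IDENTIFIER_KEYS):
    """Return indices of CDS features carrying none of the qualifier keys.

    `features` is a list of (feature_type, qualifiers) pairs, where
    `qualifiers` maps a qualifier name to a list of values.
    """
    missing = []
    for index, (feature_type, qualifiers) in enumerate(features):
        if feature_type != "CDS":
            continue
        if not any(qualifiers.get(key) for key in keys):
            missing.append(index)
    return missing


if __name__ == "__main__":
    # Made-up features for illustration.
    features = [
        ("gene", {"locus_tag": ["ctg1_1"]}),
        ("CDS", {"locus_tag": ["ctg1_1"], "translation": ["MKT"]}),
        ("CDS", {"translation": ["MAA"]}),  # no usable identifier
    ]
    print(cds_without_identifier(features))
```

Adding a `locus_tag` (or similar) qualifier to the features this reports would be one thing to try, again assuming the key list above reflects what the parser looks for.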

Elaborate on the input file contents

Hi, I'm trying to integrate clinker into a pipeline I'm working on. However I'm confused about a few things:

Should the input genbank files contain protein or DNA sequences?
Do we provide the entire gene cluster as a single sequence, and the entire file contains multiple clusters; or do we provide each gene as a single sequence, meaning that each file contains a single cluster?
If the entire cluster is a single sequence, how do you determine the position of the genes within that sequence?

pip3 install clinker instead of pip install - Ubuntu 20.04 LTS

Hello, just a piece of advice (because I had this "problem")

The last step of the instructions to install clinker says

pip install

however you should type instead (if you are using Ubuntu 20.04 LTS and the output says that pip was not found)

pip3 install clinker

I'm fairly new to Linux, so if anyone else runs into this issue, I hope this helps.

Juan
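A related way to sidestep the `pip` versus `pip3` naming difference is to run pip as a module of whichever Python interpreter you intend to use. The sketch below only builds and prints such a command; `pip_install_command` is a hypothetical helper name.

```python
import sys


def pip_install_command(package):
    """Build a pip command that runs under the current Python interpreter."""
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    # Print the command rather than running it, so nothing is installed here.
    print(" ".join(pip_install_command("clinker")))
```

On the command line, the equivalent is typing `python3 -m pip install clinker`, which installs into the same interpreter that `python3` starts.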

Feature request: categorical color scheme for cluster groups

Currently, colors indicating cluster groups represent a rainbow gradient across the loci. Consequently, when analyzing large gene clusters, neighboring genes have very similar colors. This can make it difficult to assess whether neighboring genes belong to the same group or not. See example below.

I would therefore suggest implementing an optional different color scheme optimized for categorical data such as, e.g. https://vega.github.io/vega/docs/schemes/#categorical

In addition, it would be great if genes and links could be labeled according to the group name for increased readability also for colorblind persons.
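To illustrate the request, here is a minimal sketch of assigning colors from a categorical palette instead of a gradient, so that groups encountered next to each other receive clearly different colors. The palette values are illustrative only and are not clinker's current colors; `assign_group_colors` is a hypothetical helper.

```python
from itertools import cycle

# Illustrative ten-color categorical palette (an assumption for this example).
PALETTE = [
    "#4e79a7", "#f28e2b", "#e15759", "#76b7b2", "#59a14f",
    "#edc948", "#b07aa1", "#ff9da7", "#9c755f", "#bab0ac",
]


def assign_group_colors(group_ids, palette=PALETTE):
    """Give each distinct group a palette color in first-seen order,
    cycling through the palette when there are more groups than colors."""
    colors = {}
    palette_cycle = cycle(palette)
    for group in group_ids:
        if group not in colors:
            colors[group] = next(palette_cycle)
    return colors


if __name__ == "__main__":
    # Group labels along one locus, in gene order (made up).
    print(assign_group_colors(["g1", "g2", "g1", "g3"]))
```

With more groups than palette entries the colors repeat, so a text label per group, as suggested above, would still be needed to tell them apart.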

Feature request: setting sequence window

For large genomes, one would often be interested in analyzing only a specific window of a chromosome while keeping the original coordinates. This would reduce runtime by excluding genes outside that window and simplify post-hoc manual adjustment of the plot.

It would therefore be great if the desired windows could be specified on the command line, for example by calling something like: clinker speciesA_chr01.gb:2500000-4000000 speciesB_chr08.gb:1500000-2000000 speciesC_contig003 -p

Cheers
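The `path:start-end` syntax proposed above could be parsed along these lines. This is a sketch of the requested behaviour, not an existing clinker option; `parse_window` and `in_window` are hypothetical names, and keeping only genes that lie fully inside the window is one possible design choice.

```python
import re

# Hypothetical syntax from the request: PATH[:START-END]
WINDOW_PATTERN = re.compile(r"^(?P<path>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_window(argument):
    """Split 'path:start-end' into (path, start, end).
    A bare path gives (path, None, None)."""
    match = WINDOW_PATTERN.match(argument)
    if match is None:
        return argument, None, None
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"window start {start} exceeds end {end} in {argument!r}")
    return match["path"], start, end


def in_window(gene_start, gene_end, start, end):
    """True if a gene lies entirely inside the window, or if no window is set."""
    if start is None:
        return True
    return start <= gene_start and gene_end <= end


if __name__ == "__main__":
    for arg in ("speciesA_chr01.gb:2500000-4000000", "speciesC_contig003"):
        print(parse_window(arg))
```

Because the genes keep their original coordinates and are only filtered, the plot would still show positions relative to the full chromosome.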

Feature request: match gene order

Hi,
The recent updates to clinker are great. The JavaScript GUI is a little slow when I have ~60 samples of 5 or 6 genes each, so flipping the orientation of each gene fragment takes a while. It would be nice to have an option to pick a gene as a reference and flip all other strains' contigs containing that gene to match its orientation. Right now the orientation is just taken from each original genome. Alternatively, an option to minimize inversions would be cool: basically flipping contigs so that there are as few "twists" in the final output as possible.
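The reference-gene idea could look roughly like the sketch below. It assumes each cluster is a list of `(gene_name, strand)` tuples in which homologous genes already share a name, and that strands are +1 or -1; both are assumptions made for illustration, and `flip` and `orient_to_reference` are hypothetical helpers.

```python
def flip(cluster):
    """Reverse gene order and invert every strand (+1 <-> -1)."""
    return [(name, -strand) for name, strand in reversed(cluster)]


def orient_to_reference(clusters, reference_gene):
    """Flip each cluster whose copy of `reference_gene` is on the reverse strand.

    Clusters that lack the reference gene are returned unchanged.
    """
    oriented = []
    for cluster in clusters:
        strands = [strand for name, strand in cluster if name == reference_gene]
        if strands and strands[0] == -1:
            cluster = flip(cluster)
        oriented.append(cluster)
    return oriented


if __name__ == "__main__":
    # Made-up clusters: the second is the reverse complement of the first.
    clusters = [
        [("a", 1), ("b", 1), ("c", -1)],
        [("c", 1), ("b", -1), ("a", -1)],
    ]
    print(orient_to_reference(clusters, "b"))
```

Minimizing the total number of "twists" without a reference gene is a different, harder optimisation and is not covered by this sketch.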

Support for GFF+FAA files?

Since GenBank format isn't particularly user-friendly, please consider adding support for alternate input using GFF + FAA files. Your work on this tool is much appreciated.

Error when using gff file as input

Thank you for developing such a great tool.
I have a fasta file and a gff file that I want to use as input for clinker.
However, I got the following error.

....
....
[11:27:29] WARNING - Could not find parent gene of NODE_84_length_13759_cov_460.151_phanotate_53_geneCall_cds. Using coding sequence coordinates instead.
[11:27:29] WARNING - Could not find parent gene of NODE_84_length_13759_cov_460.151_phanotate_54_geneCall_cds. Using coding sequence coordinates instead.
[11:27:29] WARNING - Could not find parent gene of NODE_84_length_13759_cov_460.151_phanotate_55_geneCall_cds. Using coding sequence coordinates instead.
[11:27:29] WARNING - Could not find parent gene of NODE_84_length_13759_cov_460.151_phanotate_56_geneCall_cds. Using coding sequence coordinates instead.
Traceback (most recent call last):
  File "/home/wagatsuma/miniconda3/envs/clinker/bin/clinker", line 10, in <module>
    sys.exit(main())
  File "/home/wagatsuma/miniconda3/envs/clinker/lib/python3.9/site-packages/clinker/main.py", line 208, in main
    clinker(
  File "/home/wagatsuma/miniconda3/envs/clinker/lib/python3.9/site-packages/clinker/main.py", line 71, in clinker
    clusters = parse_files(paths)
  File "/home/wagatsuma/miniconda3/envs/clinker/lib/python3.9/site-packages/clinker/classes.py", line 200, in parse_files
    cluster = parse_gff(path)
  File "/home/wagatsuma/miniconda3/envs/clinker/lib/python3.9/site-packages/clinker/classes.py", line 68, in parse_gff
    gff = gffutils.create_db(
  File "/home/wagatsuma/miniconda3/envs/clinker/lib/python3.9/site-packages/gffutils/create.py", line 1292, in create_db
    c.create()
  File "/home/wagatsuma/miniconda3/envs/clinker/lib/python3.9/site-packages/gffutils/create.py", line 507, in create
    self._populate_from_lines(self.iterator)
  File "/home/wagatsuma/miniconda3/envs/clinker/lib/python3.9/site-packages/gffutils/create.py", line 629, in _populate_from_lines
    raise ValueError("No lines parsed -- was an empty file provided?")
ValueError: No lines parsed -- was an empty file provided?

I would appreciate it if you could let me know if there is any solution.
I have attached a part of the gff file used for input below.

##gff-version 3
NODE_33_length_7462_cov_1658.36 PhATE   gene    2       163     .       +       .       ID=NODE_33_length_7462_cov_1658.36_phanotate_1_geneCall
NODE_33_length_7462_cov_1658.36 PhATE   CDS     2       163     .       +       .       ID=NODE_33_length_7462_cov_1658.36_phanotate_1_geneCall_cds
NODE_33_length_7462_cov_1658.36 PhATE   gene    362     577     .       -       .       ID=NODE_33_length_7462_cov_1658.36_phanotate_2_geneCall
NODE_33_length_7462_cov_1658.36 PhATE   CDS     362     577     .       -       .       ID=NODE_33_length_7462_cov_1658.36_phanotate_2_geneCall_cds
NODE_33_length_7462_cov_1658.36 PhATE   gene    902     1042    .       -       .       ID=NODE_33_length_7462_cov_1658.36_phanotate_3_geneCall
NODE_33_length_7462_cov_1658.36 PhATE   CDS     902     1042    .       -       .       ID=NODE_33_length_7462_cov_1658.36_phanotate_3_geneCall_cds

Best,
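One thing that may be worth checking is whether the feature lines in the file really are tab-separated, since GFF3 expects nine tab-separated columns per feature line. The excerpt above may have had its whitespace altered when it was pasted, so this is only a guess at the cause. `summarise_gff` is a hypothetical helper that uses the standard library only.

```python
def summarise_gff(text):
    """Count comment lines, well-formed feature lines (nine tab-separated
    columns) and everything else, ignoring blank lines."""
    counts = {"comment": 0, "feature": 0, "malformed": 0}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            counts["comment"] += 1
        elif len(line.split("\t")) == 9:
            counts["feature"] += 1
        else:
            counts["malformed"] += 1
    return counts


if __name__ == "__main__":
    # Made-up two-line example: one tab-separated line, one space-separated.
    tabbed = "\t".join(["ctg1", "src", "gene", "2", "163", ".", "+", ".", "ID=g1"])
    spaced = "ctg1 src CDS 2 163 . + . ID=g1_cds"
    print(summarise_gff("##gff-version 3\n" + tabbed + "\n" + spaced + "\n"))
```

If every feature line ends up in the "malformed" count, converting runs of spaces to tabs would be a reasonable thing to try before looking further.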

Maximum recursion depth exceeded

I recently ran into this error:

Traceback (most recent call last):
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/bin/clinker", line 11, in <module>
    load_entry_point('clinker==0.0.15', 'console_scripts', 'clinker')()
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/lib/python3.8/site-packages/clinker/main.py", line 218, in main
    clinker(
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/lib/python3.8/site-packages/clinker/main.py", line 81, in clinker
    globaligner = align.align_clusters(*clusters, cutoff=identity, jobs=jobs)
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/lib/python3.8/site-packages/clinker/align.py", line 55, in align_clusters
    aligner.align_stored_clusters(cutoff, jobs=jobs)
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/lib/python3.8/site-packages/clinker/align.py", line 399, in align_stored_clusters
    self.build_gene_groups()
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/lib/python3.8/site-packages/clinker/align.py", line 408, in build_gene_groups
    for genes in consolidate(links):
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/lib/python3.8/site-packages/clinker/align.py", line 68, in consolidate
    r, b = [arr[0]], consolidate(arr[1:])
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/lib/python3.8/site-packages/clinker/align.py", line 68, in consolidate
    r, b = [arr[0]], consolidate(arr[1:])
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/lib/python3.8/site-packages/clinker/align.py", line 68, in consolidate
    r, b = [arr[0]], consolidate(arr[1:])
  [Previous line repeated 990 more times]
  File "/Users/schanana/Documents/qw-umonospora+b482/antismash/clinker/lib/python3.8/site-packages/clinker/align.py", line 66, in consolidate
    if len(arr) < 2:
RecursionError: maximum recursion depth exceeded while calling a Python object

I was analyzing 11 gbk files. Once I removed 3 of them, it ran without problems. Is there a maximum limit to the number of files it can process at once?

my input command was:

$  clinker ./*.gbk -f -ufo -p ./result-f-ufo-p.html -o ./alignments-f-ufo-p.txt
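The traceback shows `consolidate` calling itself on `arr[1:]`, so the recursion depth grows by one per element of the list it receives. Python's default recursion limit is 1000, which suggests the limit is hit because of the number of links between genes rather than the number of files as such. Only two lines of `consolidate` are visible here, so the sketch below assumes its job is to merge sets of linked genes that share a member. `consolidate_iterative` is a hypothetical loop-based alternative, not the library's code.

```python
def consolidate_iterative(links):
    """Merge sets that share any member into disjoint groups, without recursion."""
    groups = []
    for link in links:
        merged = set(link)
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group  # absorb every existing group touching this link
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return groups


if __name__ == "__main__":
    # Made-up links between gene identifiers.
    print(consolidate_iterative([{1, 2}, {3, 4}, {2, 3}, {5}]))
    # A long chain of links that a one-call-per-element recursion would not survive.
    chain = [{i, i + 1} for i in range(5000)]
    print(len(consolidate_iterative(chain)))
```

Because the loop keeps the groups disjoint at every step, the result does not depend on Python's recursion limit, whatever the number of links.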

pip install clinker - ERROR: Command errored out with exit status 1

Hello,

I tried using pip install clinker as described in the installation instructions - but had a long error message, ending with:

ERROR: Command errored out with exit status 1: 'c:\users\trb18198\appdata\local\programs\python\python39\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\trb18198\AppData\Local\Temp\pip-install-_rw5tr0s\biopython\setup.py'"'"'; file='"'"'C:\Users\trb18198\AppData\Local\Temp\pip-install-_rw5tr0s\biopython\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\trb18198\AppData\Local\Temp\pip-record-oxf1pagw\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\trb18198\appdata\local\programs\python\python39\Include\biopython' Check the logs for full command output.

This bit is highlighted in white in the error message (so it may be important?)
ERROR: Failed building wheel for biopython
Running setup.py clean for biopython
Failed to build biopython
Installing collected packages: biopython, clinker
Running setup.py install for biopython ... error

I am not quite sure what I've done wrong....?

feature request: label options and save/load

Clinker is great and I've already found a great use for it in my research. It's very easy to use and produces really nice results.

I have two requests for features. One is that it would be very nice to be able to select the label that is displayed. Currently it seems to default to a hierarchy of protein_id and then locus_tag. It would be nice to be able to select which is displayed, especially if I could choose the "gene=" tag if present. Alternatively, a table linking locus_tags to a custom label as input would be nice.

The second request is to be able to save and load the visualization state. It would be nice to be able to run the analysis on a headless cluster and then load it in the browser to visualize. Likewise it would be nice to be able to save the visualization state to a file and reload it later to make further changes. Edit: I just realized this is possible by using the -p option to save to an html file. I should have tried that first!
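The first request (a selectable label, with an optional custom table keyed by locus_tag) could be sketched as below. The preference order and the `choose_label` helper are hypothetical, and the qualifiers are represented as a plain dictionary of strings for simplicity.

```python
def choose_label(qualifiers, preferred=("gene", "protein_id", "locus_tag"),
                 custom_labels=None):
    """Pick a display label for one gene.

    A custom table keyed by locus_tag wins. Otherwise the first qualifier in
    `preferred` that is present is used. None means "draw no label".
    """
    custom_labels = custom_labels or {}
    locus_tag = qualifiers.get("locus_tag")
    if locus_tag in custom_labels:
        return custom_labels[locus_tag]
    for key in preferred:
        value = qualifiers.get(key)
        if value:
            return value
    return None


if __name__ == "__main__":
    # Made-up qualifiers and custom table.
    gene = {"locus_tag": "ABC_0001", "protein_id": "WP_000000001.1", "gene": "traA"}
    print(choose_label(gene))
    print(choose_label(gene, custom_labels={"ABC_0001": "my label"}))
    print(choose_label({"locus_tag": "ABC_0002"}, preferred=("gene",)))
```

Returning `None` when nothing matches lets the caller decide whether to leave the gene unlabelled or fall back to something else.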

[Feature request] option to indicate subsection of Genbank file

Often we have a (prokka-generated) Genbank file, but only want to use a small subsection of it in Clinker. I don't know how easy it would be, but could an option be introduced to say xxxxx.gbk(133310-136309) so that only that region and the annotation are used?

Feature Request: Incomplete/disrupted genes in the figure

Hey, I have been assembling IncF plasmids lately. I wanted to make a figure of a region responsible for conjugative transfer, and these plasmids have multiple genes disrupted by ISs. It would be nice if there was a way to include incomplete genes/CDSs in the figure to highlight some traits, e.g. the loss of conjugative transfer (in my case).

Thanks for consideration.

Bioconda submission naming

Hello!

I'm going to submit a recipe to Bioconda, but there is already a recipe for another tool called Clinker (https://anaconda.org/bioconda/clinker).

Do you have a preferred alternative name for the Bioconda recipe (e.g. clinker-py, clinker-fig, python-clinker)?

Thank you!
Robert
