mmcguffi / plannotate


Webserver and command line tool for annotating engineered plasmids

License: GNU General Public License v3.0

Python 91.55% HTML 5.79% Jupyter Notebook 2.66%

plasmid-annotation webserver

plannotate's People


lzh93 koschink dr3y croots tangweijr bjornfjohansson infiniterik riccardosabatini ivanv87 khenesey nh13 barricklab tnrich fmaguire nitro-bio

plannotate's Issues

Conda/mamba installation appears to be broken

Installed via conda as per the instructions in the README:

mamba create -n plannotate -c conda-forge -c bioconda plannotate

which installed version 1.2.0. After failing with my own plasmids, I downloaded pUC19.fa from the data directory in this GitHub repo and tried annotating it. This is what I get:

(plannotate) fennell@x86_64 /tmp $ plannotate batch --input pUC19.fa --html --output /tmp
2023-04-07 09:15:03.772
  Warning: to view this Streamlit app on a browser, run it with the following
  command:

    streamlit run /Users/fennell/conda.x86/envs/plannotate/bin/plannotate [ARGUMENTS]
Traceback (most recent call last):
  File "/Users/fennell/conda.x86/envs/plannotate/bin/plannotate", line 10, in <module>
    sys.exit(main())
  File "/Users/fennell/conda.x86/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/fennell/conda.x86/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/fennell/conda.x86/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/fennell/conda.x86/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/fennell/conda.x86/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/fennell/conda.x86/envs/plannotate/lib/python3.10/site-packages/plannotate/pLannotate.py", line 116, in main_batch
    recordDf = annotate(inSeq, yaml_file, linear, detailed)
  File "/Users/fennell/conda.x86/envs/plannotate/lib/python3.10/site-packages/plannotate/annotate.py", line 355, in annotate
    blastDf = clean(blastDf)
  File "/Users/fennell/conda.x86/envs/plannotate/lib/python3.10/site-packages/plannotate/annotate.py", line 178, in clean
    rowSlice = (seqSpace[columnSlice] == kind).any(1) #only the rows that are in the columns of hit
TypeError: NDFrame._add_numeric_operations.<locals>.any() takes 1 positional argument but 2 were given
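
The TypeError at the end of this traceback is consistent with pandas 2.0 making the arguments of DataFrame.any keyword-only, so .any(1) is no longer read as axis=1. A minimal sketch of the keyword form follows; the toy frame stands in for seqSpace[columnSlice], and its values and the value of kind are made up for illustration. Whether this is the only incompatibility in version 1.2.0 is not shown here.

```python
import pandas as pd

# Toy stand-in for seqSpace[columnSlice]; the values are made up.
seq_space = pd.DataFrame({"a": [1, 0, 2], "b": [0, 0, 1]})
kind = 1

# Passing the axis by keyword works on older and newer pandas alike,
# whereas the positional form .any(1) raises TypeError on pandas >= 2.0.
row_slice = (seq_space == kind).any(axis=1)
print(row_slice.tolist())
```

Pinning pandas below 2.0 in the environment would be the other way to avoid the error without touching the code.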

use my own annotation file

Hi,
I am interested in using pLannotate with my own annotated file (e.g., in GenBank format). How can I add it to the database, and which formats are accepted in the database folder?
Thanks, appreciate it.

Location of blast databases

Hello,

I am not able to find the location of the blast databases. Can you please guide me on this?

Best wishes,

AttributeError: module 'streamlit' has no attribute 'cli'

Found this error while trying a local install. This seems to have been caused by an update to streamlit and so I found and used the solution here: streamlit/streamlit#5146 (comment)

The change required is to change all references to streamlit.cli to streamlit.web.cli
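
One way to tolerate both streamlit layouts is to try the new module path first and fall back to the old one. The helper below is a hypothetical sketch, not code from pLannotate; it only uses importlib from the standard library.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    raise ImportError(f"none of {module_names} could be imported")


# Intended use (assumes streamlit is installed):
#   stcli = import_first_available(["streamlit.web.cli", "streamlit.cli"])
```

Compared with replacing every reference outright, this keeps older streamlit releases working as well.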

Entry size is too large -- must be 50000 bases or less.

Hi,

Could the plasmid size limit be removed on the web server?
Thank you

Amber

Combined Annotations from pLannotate running locally

When running the pLannotate gui app, there is an option to download a gbk file with combined annotations.

When I do the annotation locally:

plannotate batch --input pTA1_FASIIb.gb --html --detailed

The features in the original file are lost.

Is there an option for this?

any() takes 1 positional argument but 2 were given error

Hi, thanks for the great software.
I was testing pLannotate on my plasmids when I hit this error. I also get it with the pUC19 FASTA file from the repository, so I presume the problem is not specific to my plasmids. pLannotate was installed through conda.


% plannotate --version
plannotate, version 1.2.0

% plannotate batch -i tmp/pUC19.fa --output tmp -d --file_name tmp_plannotate --html
2023-04-28 09:16:35.350
  Warning: to view this Streamlit app on a browser, run it with the following
  command:

    streamlit run /home/abs/anaconda3/envs/bio/bin/plannotate [ARGUMENTS]
Traceback (most recent call last):
  File "/home/abs/anaconda3/envs/bio/bin/plannotate", line 10, in <module>
    sys.exit(main())
  File "/home/abs/anaconda3/envs/bio/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/abs/anaconda3/envs/bio/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/abs/anaconda3/envs/bio/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/abs/anaconda3/envs/bio/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/abs/anaconda3/envs/bio/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/abs/anaconda3/envs/bio/lib/python3.8/site-packages/plannotate/pLannotate.py", line 116, in main_batch
    recordDf = annotate(inSeq, yaml_file, linear, detailed)
  File "/home/abs/anaconda3/envs/bio/lib/python3.8/site-packages/plannotate/annotate.py", line 355, in annotate
    blastDf = clean(blastDf)
  File "/home/abs/anaconda3/envs/bio/lib/python3.8/site-packages/plannotate/annotate.py", line 178, in clean
    rowSlice = (seqSpace[columnSlice] == kind).any(1) #only the rows that are in the columns of hit
TypeError: any() takes 1 positional argument but 2 were given

[Request] Add feature annotations in the .gbk map

Hello,

I was wondering if it would be possible to add the "feature description" to the "/note" section of each feature so that it is preserved in the GenBank file. It would be convenient to have these longer descriptions instead of just the feature name, especially when using SnapGene or other visualization programs.

custom database installation

Hey,
I am trying to set up a custom BLAST database and run pLannotate with a custom yaml_file, but I am running into some issues.
I have a FASTA file, mtcsb_parts.fasta, containing my custom nucleotide sequences:

>1
NNNNNNN
>2
NNNNNNN

I create the blast database using:
makeblastdb -in /Users/ruprec01/Documents/Faith_lab/Git/blastdb/mtcsb_parts/mtcsb_parts.fasta -title "mtcsb_parts" -dbtype nucl
I have a mtcsb_parts.csv file containing descriptions of the sequences in the same path:

sseqid,Feature,Type,Description
1,feature1,type1,descript1
2,feature2,type2,descript2

I create a custom_yaml file, that contains the entry

mtcsb_parts:
  details:
    compressed: false
    default_type: None
    location: /path-to-folder/mtcsb_parts
  location: /path-to-folder/mtcsb_parts
  method: blastn
  parameters:
  - -perc_identity 95
  priority: 1
  version: Downloaded 2021-07-23
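
Before running the annotation, it can help to confirm that every FASTA header has a row in the description CSV. The sketch below assumes that the sseqid column is matched against the FASTA ids (the column name is the BLAST field for the subject id, but the matching rule is an assumption here), and it does not diagnose the error reported further down. The helper names are made up.

```python
import csv
import io


def fasta_ids(fasta_text):
    """Return the first whitespace-delimited token of each FASTA header."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]


def missing_descriptions(fasta_text, csv_text, id_column="sseqid"):
    """Return FASTA ids that have no row in the description CSV."""
    rows = csv.DictReader(io.StringIO(csv_text))
    described = {row[id_column] for row in rows}
    return [seq_id for seq_id in fasta_ids(fasta_text) if seq_id not in described]


# The same toy content as the files described above.
fasta_text = ">1\nNNNNNNN\n>2\nNNNNNNN\n"
csv_text = (
    "sseqid,Feature,Type,Description\n"
    "1,feature1,type1,descript1\n"
    "2,feature2,type2,descript2\n"
)
print(missing_descriptions(fasta_text, csv_text))
```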

I run plannotate in conda using:

plannotate batch -i test.fasta \
--yaml_file plannotate_custom.yaml \
--output /output

I get the following error:

streamlit run /Users/ruprec01/opt/anaconda3/envs/plannotate/bin/plannotate [ARGUMENTS]
Traceback (most recent call last):
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/bin/plannotate", line 10, in <module>
  sys.exit(main())
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
  return self.main(*args, **kwargs)
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 1053, in main
  rv = self.invoke(ctx)
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
  return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
  return ctx.invoke(self.callback, **ctx.params)
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/click/core.py", line 754, in invoke
  return __callback(*args, **kwargs)
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/plannotate/pLannotate.py", line 180, in main_batch
  gbk = rsc.get_gbk(recordDf, inSeq, kwargs["linear"])
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/plannotate/resources.py", line 120, in get_gbk
  record = get_seq_record(inDf, inSeq, is_linear, record)
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/plannotate/resources.py", line 151, in get_seq_record
  inDf["feat loc"] = inDf.apply(FeatureLocation_smart, axis=1)
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/pandas/core/frame.py", line 3940, in __setitem__
  self._set_item_frame_value(key, value)
File "/Users/ruprec01/opt/anaconda3/envs/plannotate/lib/python3.10/site-packages/pandas/core/frame.py", line 4094, in _set_item_frame_value
  raise ValueError(
ValueError: Cannot set a DataFrame with multiple columns to the single column feat loc

I am wondering if you can help me work out how to create the BLAST database properly and add the correct entry to the custom yaml file. plannotate works as soon as I add, for example, the snapgene entry back into the custom yaml file.
Thanks for any help, really love pLannotate!
Greetings,
Constantin

batch mode

It would be great if the local installation could process files in batch mode. Some instructions on how to do it would be helpful.

It would also be good if users could create their own add-on database, since some of the parts we use in our lab are not picked up.

So far I have managed to run it within Docker, but a full Docker implementation of the tool would make it more portable. I am impressed with the tool and very thankful that you developed it.

manual install broken

error in manual install:

==> python setup.py install
['tabulate >=0.8.9', 'streamlit >=1.8.1', 'biopython>1.77', 'bokeh=2.4.1']
error in plannotate setup command: 'install_requires' must be a string or list of strings containing valid project/version requirement specifiers; Invalid requirement, parse error at "'=2.4.1'"

to fix changed:
https://github.com/barricklab/pLannotate/blob/03417a3991558fd2aef8cc68f9cf3d45853b0a6c/requirements.txt#L5

to be:
bokeh==2.4.1
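
The parse error comes from the single "=", which is not a valid comparison operator in a version specifier. The rough check below illustrates the rule; it is a simplification written for this note, not a full requirement parser, and the function name is made up.

```python
import re

# Comparison operators allowed in a version specifier (PEP 440).
# A single "=" is not among them, which is why "bokeh=2.4.1" is rejected.
VALID_OPERATORS = ("===", "==", "~=", "!=", "<=", ">=", "<", ">")


def has_valid_operator(requirement):
    """Rough check: a name, optionally followed by a valid operator and version."""
    match = re.fullmatch(
        r"\s*([A-Za-z0-9._-]+)\s*([=~!<>]*)\s*([\w.*+!-]*)\s*", requirement
    )
    if match is None:
        return False
    _name, operator, version = match.groups()
    if operator == "":
        # A bare package name is fine; a version without an operator is not.
        return version == ""
    return operator in VALID_OPERATORS and version != ""


requirements = ["tabulate >=0.8.9", "streamlit >=1.8.1", "biopython>1.77", "bokeh=2.4.1"]
for requirement in requirements:
    print(requirement, has_valid_operator(requirement))
```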

Plasmid name defaults to "plasmid"

Hi!
After running pLannotate on my plasmid sequence, the plasmid name in the LOCUS defline is "plasmid" by default. Would it be possible to pass the name of the sequence in the input FASTA file through to the plasmid name?
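
A minimal sketch of the requested behaviour, assuming the name is taken from the first token of the first FASTA header and that "plasmid" stays as the fallback. The function name is hypothetical, and any length or character limits on LOCUS names are not handled.

```python
def name_from_fasta(fasta_text, default="plasmid"):
    """Return the first token of the first FASTA header, or default."""
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            return tokens[0] if tokens else default
    return default


print(name_from_fasta(">pUC19 cloning vector\nACGT\n"))
```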

annotations start one base pair too late when reverse

Just installed this great software but noticed a weird problem! All my reverse annotations are starting one base pair later than they should be.
reverse:

forward:

here is an example sequence: https://gist.github.com/dr3y/b3eac9953cb4808fb875d2d273d2ebe3
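
The cause of the shift is not stated in this report. One common source of a one-base offset on the reverse strand is the conversion from BLAST-style coordinates (1-based, inclusive, with start greater than end for reverse hits) to 0-based half-open coordinates. A sketch of a conversion that treats both strands the same way, using a made-up function name:

```python
def blast_to_half_open(start, end):
    """Convert 1-based inclusive coordinates to (start, end, strand).

    The result is 0-based and half-open. The strand is -1 when the hit is
    reported with start > end.
    """
    strand = 1 if start <= end else -1
    low, high = min(start, end), max(start, end)
    # Subtract 1 from the lower coordinate only, whichever strand it is on.
    return low - 1, high, strand


print(blast_to_half_open(5, 10))
print(blast_to_half_open(10, 5))
```

Both calls describe the same six bases, so they should differ only in the strand.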

Annotating a large batch of plasmid reads

Hey @jeffreybarrick,

I wanted to get your thoughts before I make this PR, which will only take me a few hours. The idea would be to be able to annotate a large batch of plasmid sequences, for example tens of thousands. This would be useful when QC'ing individual long reads from sequencing a colony of plasmids, prior to assembly of the various populations.

I would modify the annotate and accompanying methods to accept a batch of queries, versus a single one now, as well as explicitly set threading on the various blast/diamond/cmscan tools. I would then modify the main "batch" method to accept a FASTA or FASTQ that could have more than one sequence (I would use pysam for parsing, so a new dependency). I would also add a bit of caching (opt-in of course). Finally, I would create HTML and GBK files (when the cli commands are set) on a per-read basis for QC. I'd probably add a bit of progress logging too.
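
To make the multi-sequence input concrete, here is a small standard-library sketch that yields one record at a time from a FASTA with many sequences. It is not the pysam-based parser proposed above and does not handle FASTQ; the function name is made up.

```python
def iter_fasta(lines):
    """Yield (name, sequence) for each record in an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


records = list(iter_fasta(">read1 extra\nAC\nGT\n>read2\nTT\n".splitlines()))
print(records)
```

Because it is a generator, tens of thousands of reads can be streamed without holding the whole file in memory.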

Thoughts?

Input file size

Hi,

I found plannotate to be very useful for my work, but I am struggling to get it to work for plasmids larger than 50,000 bp. Is there a workaround for this, or can you recommend another tool for such plasmids?

Thanks!

Let's get this on Bioconda!

Hi @mmcguffi, I'm creating this issue to work on getting this added to Bioconda. I think it would be straightforward.

One thing I was wondering about, would you be up for creating a plannotate download function to download the databases and put them in the expected location? I think this would make pLannotate easier to use for novice users, but also make the download process more standardized.

Curious on your thoughts about this

Instructions for adding custom database?

Hi, I want to use plannotate with my desired genome (like zebrafish, drosophila, etc.).

I checked the BLAST_dbs.tar.gz file and noticed that plannotate internally uses Infernal (for RNA) and DIAMOND (for DNA).
So, if I generate my own DIAMOND db from a genome of interest, can I use it with plannotate?

I also checked the plannotate_default.yaml file, but I couldn't find any instructions.
thanks!

ModuleNotFoundError: No module named 'streamlit.cli'

Trying to run pLannotate via conda and I get the error:

ModuleNotFoundError: No module named 'streamlit.cli'

After a little digging it looks like the streamlit.cli module was moved to streamlit.web.cli in the newer versions of streamlit.

I suggest changing the import or locking the streamlit version to an earlier version.

It worked for me when I used streamlit version 1.10.0

Custom databases

Adding functionality for users to specify custom databases, as well as increasing ease of modularly extending pLannotate in the future.

Currently the idea is to create a separate file, perhaps YAML or similar format, that specifies:

database location
type of database
optional location of matching descriptions
optional parameters

Any other ideas or desired functionality is much appreciated.

Installation instructions a bit unclear

Great tool, thanks for putting it together.

I have a few comments regarding the readme:

the 'releases' link leads to the first release which does not have a blast database
gunzipping the blast database results in a BLAST_dbs folder which also has another BLAST_dbs folder in it -- maybe it's helpful to clarify which needs to be in the pLannotate folder
a step missing after activating the Conda environment: we need to install the package
the command line batch example does not work without also specifying the path to the database with -b

AttributeError: module 'streamlit' has no attribute 'cli'

Hi,

I got an error when starting plannotate after installing it via mamba. I changed "import streamlit.cli" to "import streamlit.web.cli" and got it working again.

Could you perhaps tell me how I can start the application on a port other than 8501? I have some other streamlit applications running on other ports, and there I just specify the port by adding e.g. --server.port 7501. But this does not seem to work with plannotate.

Regards
Nicolas

Converting html map to image?

Hi @mmcguffi ,
Is it possible to convert the html annotation map file to an image?
I tried wkhtmltoimage but unfortunately it just gives a blank file.
It seems the image is being made dynamically.

Let me know if you have any insights into this.

Retaining ORFs?

Hi!
I've noticed that when an ORF can't be annotated with any of the databases currently in the set, pLannotate does not produce a CDS feature. Adding a custom database via the -y option did produce what I wanted, but would it be possible to add ORF prediction, say for ORFs longer than 100 bp?
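
A rough sketch of what such a prediction step could look like: scan all six reading frames for ATG-to-stop stretches of at least a minimum length. This is an illustration only, not pLannotate code. It treats the sequence as linear (so ORFs spanning the plasmid origin are missed) and only considers ATG start codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=100):
    """Return (start, end, strand) for ATG..stop ORFs of at least min_len bp.

    Coordinates are 0-based, half-open, and given on the forward strand.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in ((1, seq), (-1, reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_len:
                        if strand == 1:
                            orfs.append((start, end, 1))
                        else:
                            # Map reverse-strand positions back to the forward strand.
                            orfs.append((n - end, n - start, -1))
                    start = None
    return orfs


toy = "CC" + "ATG" + "GCA" * 4 + "TAA" + "G"
print(find_orfs(toy, min_len=18))
```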

Finding nested features

pLannotate currently does not report valid features that are completely nested within a larger feature. For example, the SV40 origin of replication is contained within the SV40 promoter, and pLannotate currently reports only the larger SV40 promoter.

I will address this by:

improving the filtering algorithm so it does not drop nested features that are of a separate "type" (as specified by DDBJ/ENA/GenBank)
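
The rule described above can be sketched as follows: a feature is dropped only when it sits inside a larger feature of the same type. The tuple layout, function name, and coordinates are made up for illustration, and features wrapping around the origin are not handled. This is not the filtering code used by pLannotate.

```python
def drop_redundant_nested(features):
    """Drop a feature only if it lies inside a larger feature of the same type.

    Each feature is a (start, end, type) tuple with start < end.
    """
    kept = []
    for i, (start, end, kind) in enumerate(features):
        redundant = any(
            j != i
            and o_kind == kind
            and o_start <= start
            and end <= o_end
            and (o_end - o_start) > (end - start)
            for j, (o_start, o_end, o_kind) in enumerate(features)
        )
        if not redundant:
            kept.append((start, end, kind))
    return kept


features = [
    (0, 300, "promoter"),
    (50, 150, "rep_origin"),  # nested, but a different type: kept
    (100, 200, "promoter"),   # nested and the same type: dropped
]
print(drop_redundant_nested(features))
```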

Issue while running plannotate inside singularity container

Hi! Thanks for the great tool.

I am trying to run plannotate batch in a singularity container and I'm facing the below error.
My input file is a fasta file and the command is as below
plannotate batch -h -i reference.fasta

reference
TCGCGCGTTTCGGTGATGACGGTGAAAACCTCTGACACATGCAGCTCCCGGAGACGGTCA
CAGCTTGTCTGTAAGCGGATGCCGGGAGCAGACAAGCCCGTCAGGGCGCGTCAGCGGGTG
TTGGCGGGTGTCGGGGCTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGC
ACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGCGCC
ATTCGCCATTCAGGCTGCGCAACTGTTGGGAAGGGCGATCGGTGCGGGCCTCTTCGCTAT
TACGCCAGCTGGCGAAAGGGGGATGTGCTGCAAGGCGATTAAGTTGGGTAACGCCAGGGT
TTTCCCAGTCACGACGTTGTAAAACGACGGCCAGTGGCGCGCCACATTGATTATTGACTA
GTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGTTCCGCG
TTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGCCCAACGACCCCCGCCCATTGA

This error is not seen when I run in docker container.
Hope you can help me with this issue.
Thanks

Entry size is too large -- must be 50000 bases or less

Hi @mmcguffi,
I am trying to annotate plasmids using this tool. So far it has worked well, but it throws an error for plasmids larger than 50 kb. I tried the method mentioned in one of the issues, which was building the tool from source, but it still throws the same error. I tried modifying the script a bit, but that didn't give the expected result.
Please help me out whenever possible.
Please let me know if you need any other information from my side.
Thank you in advance.

Issue with streamlit.cli from conda installation

Hi,

I have installed pLannotate from conda and I am getting the following error message:

File "(...)/miniconda3/envs/plann/lib/python3.11/site-packages/plannotate/pLannotate.py", line 5, in
import streamlit.cli
ModuleNotFoundError: No module named 'streamlit.cli'

According to this issue in github, it seems that:
"If anyone stumbles across this, the streamlit.cli module was moved to streamlit.web.cli."

Best,

Edit: if I install it with mamba, it gets a different version of streamlit that works. But sometimes I get an error with altair version 5 (if it gets version 4, it works). Maybe "fixing" the versions of streamlit and altair in the recipe should fix the problem?

DIAMOND does not find all potential hits

Currently, DIAMOND only reports the top 25 hits. This was an oversight in initial development and can lead to missing annotations.

This will be addressed by:

reporting all hits
upgrading the DIAMOND version to 2.0.x (algorithmic improvements that find more hits)
using the full Swiss-Prot database (currently the database is trimmed)
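
The limit of 25 matches the default of DIAMOND's -k/--max-target-seqs option; setting it to 0 reports all targets for which alignments were found. The sketch below only builds an argument list. The blastx mode, the tabular output format, and the file names are assumptions, since the actual invocation used by pLannotate is not shown here.

```python
def diamond_blastx_command(db, query, out, max_target_seqs=0):
    """Build a `diamond blastx` argument list.

    -k/--max-target-seqs defaults to 25 in DIAMOND; 0 reports all targets
    with alignments.
    """
    return [
        "diamond", "blastx",
        "-d", db,       # DIAMOND database
        "-q", query,    # query FASTA
        "-o", out,      # output file
        "-f", "6",      # BLAST tabular output
        "-k", str(max_target_seqs),
    ]


# Hypothetical file names, for illustration only.
cmd = diamond_blastx_command("swissprot", "plasmid.fa", "hits.tsv")
print(" ".join(cmd))
```

With DIAMOND installed, the list could be passed to subprocess.run(cmd, check=True).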