
chewbbaca's Introduction


chewBBACA

chewBBACA is a software suite for the creation and evaluation of core genome and whole genome MultiLocus Sequence Typing (cg/wgMLST) schemas and results. The "BBACA" stands for "BSR-Based Allele Calling Algorithm". BSR stands for BLAST Score Ratio, as proposed by Rasko DA et al. The "chew" part adds extra coolness to the name and could be thought of as "Comprehensive and Highly Efficient Workflow". chewBBACA lets you define the target loci in a schema based on multiple genomes (e.g. define target loci based on the distinct loci identified in a dataset of high-quality genomes for a species or lineage of interest) and performs allele calling to determine the allelic profiles of bacterial strains, easily scaling to thousands of genomes with modest computational resources. chewBBACA includes functionalities to annotate the schema loci, compute the set of loci that constitute the core genome for a given dataset, and generate interactive reports for evaluating schemas and allele calling results, enabling an intuitive analysis of the results in surveillance and outbreak detection settings or population studies. Pre-defined cg/wgMLST schemas can be downloaded from Chewie-NS or adapted from other cg/wgMLST platforms.

Check the documentation for implementation details and guidance on using chewBBACA.

News

3.3.5 - 2024-04-18

  • Added function to check if input files passed to the CreateSchema and AlleleCall modules have unique prefixes longer than 30 characters (the prefix includes everything in the basename before the first .). The process prints a message with the list of input files with a prefix longer than 30 characters and exits.

  • Fixed issue in the AlleleCall module when running in mode 1 (trying to write the file with the list of invalid CDSs, but the data is not available when running in mode 1).

  • Added more tests and improved test scripts.

  • Simplified the help message for all modules.
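The prefix check described in the first item above can be illustrated with a short sketch. This is not the code shipped in 3.3.5; the function names are hypothetical, and treating duplicated prefixes as a separate failure case is an assumption based on the phrase "unique prefixes". Only the prefix definition (everything in the basename before the first `.`) and the 30-character limit come from the release note.

```python
import os
from collections import Counter

MAX_PREFIX_LENGTH = 30  # limit quoted in the release note


def file_prefix(path):
    """Return everything in the basename before the first '.'."""
    return os.path.basename(path).split(".", 1)[0]


def check_prefixes(paths, max_length=MAX_PREFIX_LENGTH):
    """Return (too_long, duplicated) lists of offending input files.

    Hypothetical helper: the duplicate check is an assumption, not
    something the release note spells out.
    """
    prefixes = [file_prefix(p) for p in paths]
    counts = Counter(prefixes)
    too_long = [p for p, pre in zip(paths, prefixes) if len(pre) > max_length]
    duplicated = [p for p, pre in zip(paths, prefixes) if counts[pre] > 1]
    return too_long, duplicated
```

A caller could print both lists and exit before any heavy processing starts, which matches the behaviour the note describes for over-long prefixes.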

Check our Changelog to learn about the latest changes.

Citation

When using chewBBACA, please use the following citation:

Silva M, Machado MP, Silva DN, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço JA. 2018. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. Microb Genom 4:000166. doi:10.1099/mgen.0.000166

chewbbaca's People

Contributors

andersgs, cimendes, dependabot[bot], dorbarker, jacarrico, mickaelsilva, odiogosilva, pedrorvc, ramirma, rfm-targa, thebready


chewbbaca's Issues

pMLST schemes have 'no correct alleles' and are removed

Dear chewBBACA developers,

Today I wanted to run PrepExternalSchema on a set of pMLST files downloaded from the PubMLST database (https://pubmlst.org/bigsdb?db=pubmlst_plasmid_seqdef&page=downloadAlleles, renamed so that the files end in ".fasta"). I tried both IncI and IncF, but in both cases chewBBACA outputs:
"gene.fasta has no correct aleles, the file will be removed!!" for all genes in the scheme (and wipes all my files!).
I have previously used your software with a large cgMLST scheme downloaded from Enterobase and encountered no such problems.

(i) Do you know why these files are not being recognised properly by chewBBACA?
(ii) Why is the default behaviour to remove the content of all files without prior warning?

Consistency of novel allele nomenclature

Is the nomenclature for novel alleles kept consistent across different runs? If I am using the same scheme for each run, will for example, INF_100 identified in locus A in one run be equivalent to INF_100 identified in locus A in another run?

installation problem

Hello there! I ran into a problem when installing with pip3:

Collecting chewbbaca
  Using cached https://files.pythonhosted.org/packages/b1/79/a5422033716f970b3d4afaaafd9f8112da6604721a1819f8c7c6e87c6948/chewBBACA-2.1.0-py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): requests>=2.2.1 in /usr/lib/python3/dist-packages (from chewbbaca)
Collecting scipy>=0.13.3 (from chewbbaca)
  Using cached https://files.pythonhosted.org/packages/c1/60/8cbf00c0deb50a971e6e3a015fb32513960a92867df979870a454481817c/scipy-1.4.1-cp35-cp35m-manylinux1_x86_64.whl
Collecting biopython>=1.70 (from chewbbaca)
  Using cached https://files.pythonhosted.org/packages/59/8f/454d961e821d5f600eb59885dc32aa39e3f226357f5d18a839d7ae088722/biopython-1.76-cp35-cp35m-manylinux1_x86_64.whl
Collecting numpy>=1.14.0 (from chewbbaca)
  Using cached https://files.pythonhosted.org/packages/45/25/48e4ea892e93348d48a3a0d23ad94b176d6ab66084efcd881c78771d4abf/numpy-1.18.3-cp35-cp35m-manylinux1_x86_64.whl
Collecting pandas>=0.22.0 (from chewbbaca)
  Using cached https://files.pythonhosted.org/packages/2f/79/f236ab1cfde94bac03d7b58f3f2ab0b1cc71d6a8bda3b25ce370a9fe4ab1/pandas-1.0.3.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-tl3xfffe/pandas/setup.py", line 42
        f"numpy >= {min_numpy_ver}",
                                  ^
    SyntaxError: invalid syntax

Could you kindly suggest a fix for this?

Regards. Stan
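The traceback above points at an f-string in pandas' setup.py, and the `cp35` wheels in the log suggest the interpreter is Python 3.5. f-strings were added in Python 3.6, so an older interpreter reports them as a syntax error. A minimal sketch of a version guard follows; the helper name is hypothetical and this is not part of chewBBACA or pandas.

```python
import sys

MIN_FSTRING_VERSION = (3, 6)  # f-strings were introduced in Python 3.6


def supports_fstrings(version_info=None):
    """Return True if the given interpreter version can parse f-strings."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_FSTRING_VERSION
```

Running the installation with a newer interpreter, or pinning a pandas release that still supports the older one, are the two usual ways around this kind of failure.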

FileNotFound error

Hi,

I am running chewBBACA version 2.0.12 (installed through conda), and am getting an error with AlleleCall. I have downloaded the cgMLST scheme for Listeria monocytogenes from https://www.cgmlst.org/ncs and ran PrepExternalSchema on it, which completed successfully. I see that there are directories short and temp in the schema folder.

However when running the command:

chewBBACA.py AlleleCall -i listeria_assemblies -g listeria_cgmlst -o listeria_chewbbaca --cpu 1 --ptf Listeria_monocytogenes.trn -v --fr

It fails, with the following error:

Finished Allele Calling at : 21:06:44-17/09/2018
Wrapping up the results
[Errno 2] No such file or directory: '/lustre/scratch118/infgen/team81/jl11/pdoc/assigning/revisions/listeria_cgmlst/temp/lmo0001.fasta_result.txt'

        USAGE : chewBBACA.py [module] -h

Select one of the following functions :

CreateSchema : Create a gene by gene schema based on genomes
AlleleCall : Perform allele call for target genomes
SchemaEvaluator : Tool that builds an html output to better navigate/visualize your schema
TestGenomeQuality : Analyze your allele call output to refine schemas
ExtractCgMLST : Select a subset of loci without missing data (to be used as PHYLOViZ input)
RemoveGenes : Remove a provided list of loci from your allele call output
PrepExternalSchema : prepare an external schema to be used by chewBBACA
JoinProfiles : join two profiles in a single profile file
UniprotFinder : get info about a schema created with chewBBACA

The directory noted does exist, and contains three types of file (a single example of each, but looks like they are there for every gene and isolate):

lmo0901.fasta_argList.txt
19183_4#78.contigs_velvet.fa
19183_4#78.contigs_velvet.fa_ORF.txt

Any help much appreciated!

blast version detection broken

Hello,

while giving a try at chewBBACA I had the following message

Something went wrong. Your blast version is blastp: 2.10.0+
 Package: blast 2.10.0, build May 23 2020 11:08:43

Update your blast to 2.5.0 or above. Exited.

due to
blast version detection in CHEWBBACA/allelecall/BBACA.py
blast_version_pat = re.compile(r'2.[5-9]')
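The pattern quoted above only accepts a single digit from 5 to 9 after `2.`, so `2.10.0` is rejected even though it is newer than 2.5.0. One way to avoid this, sketched below, is to parse the version numbers and compare them as integer tuples. The function names are hypothetical and this is not the fix applied in chewBBACA.

```python
import re

MIN_BLAST_VERSION = (2, 5, 0)  # minimum quoted in the error message

VERSION_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_blast_version(version_output):
    """Extract (major, minor, patch) from `blastp -version` output."""
    if isinstance(version_output, bytes):
        # subprocess output may arrive as bytes
        version_output = version_output.decode("utf-8", errors="replace")
    match = VERSION_PATTERN.search(version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def blast_is_supported(version_output, minimum=MIN_BLAST_VERSION):
    """Return True if the reported version is at least `minimum`."""
    version = parse_blast_version(version_output)
    return version is not None and version >= minimum
```

Tuple comparison orders (2, 10, 0) after (2, 5, 0), which a character-class pattern cannot express without listing every future minor version.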

pip3 install doesn't work - could not find version

% pip3 install chewBBACA
Collecting chewBBACA
  Could not find a version that satisfies the requirement chewBBACA (from versions: )
No matching distribution found for chewBBACA

% pip3 search chewBBACA
chewBBACA (2.0.5)  - A complete suite for gene-by-gene schema creation and strain identification

Any ideas?

I notice the instructions say chewbbaca, the repo is chewBBACA and the module is CHEWBBACA - could that be it?

PrepExternalSchema not running for conda build

Hello,

I have installed the conda environment for chewbbaca and am trying to use an external Salmonella wgMLST schema from Enterobase. I am running the PrepExternalSchema function on the downloaded fasta files as follows:

chewbbaca PrepExternalSchema -i PATH-TO-DIRECTORY --cpu 4

Whenever I do so, I just get the help message for chewBBACA as output and the function doesn't run.

I can see that inside of the main function in chewBBACA.py, the exception is being caught when prep_schema() is run (which would be why the help message is printed). Inside of prep_schema(), it is line 360 proc = subprocess.Popen(args) that is not working. These lines seem to have been changed in the current version of chewBBACA on GitHub, so perhaps this has been fixed (just not in the conda build).

Please let me know how you recommend handling the situation and what I can do to get chewBBACA up and running for my analysis.

Best wishes,
Kristen

Cannot run multiple AlleleCall programs at the same time

When I run two command lines at the same time, like below:
python3 CHEWBBACA/chewBBACA.py AlleleCall -i M04-34.contigs.fa -g Clostridioides_difficile/genes -o test0 --cpu 4
python3 CHEWBBACA/chewBBACA.py AlleleCall -i contigs.fasta -g tempgenes/genes/ -o test3 --cpu 4

The -i, -g and -o values are different, but the first command does not run successfully and only the last one completes.

If they are executed sequentially, both complete successfully. Is there a problem?

results file for phyloviz

Hi, I am trying to draw phylogenetic trees with PHYLOViZ, which needs two files. I wonder if the file results_alleles.tsv can be used directly. I also compared the two files with results_alleles.tsv and found that the results lack information about ST types.

Could you show me how to process the results so that they can be used in PHYLOViZ?

Question about cgMLST analysis

Hi,
Before posting here I have been through the tutorial and the user guide but I think I missed the information which I need to perform cgMLST analysis for some isolates (targets). I want to be sure about how exactly the pipeline should be run in order to obtain cgMLST profile of target genomes. To create cgMLST schema we picked some complete reference genomes. Then I have executed following set of commands:

#Creating schema using Ref genomes
chewBBACA.py CreateSchema -i $ccoli_ref_genomes -o ${out_ccoli}/1_seed_schema --ptf ${training}/c_coli.trn --cpu 24

#Now allele calling from Ref genomes
chewBBACA.py AlleleCall -i $ccoli_ref_genomes -g ${out_ccoli}/1_seed_schema/ -o ${out_ccoli}/2_wg_ref_allele --cpu 24 --ptf ${training}/c_coli.trn

#Remove paralogous genes from the wgMLST schema
cd ${out_ccoli}/2_wg_ref_allele/results_*/
chewBBACA.py RemoveGenes -i results_alleles.tsv -g RepeatedLoci.txt -o alleleCallMatrix_wg_ref_coli

#Define cgMLST schema by selecting the loci that are present in the Ref genomes, 95 %, by the use of the TestGenomeQuality
chewBBACA.py TestGenomeQuality -i alleleCallMatrix_wg_ref_coli.tsv -n 13 -t 200 -s 5
chewBBACA.py ExtractCgMLST -i alleleCallMatrix_wg_ref_coli.tsv -g removedGenomes.txt -o cgMLST_refgenomes -p 0.95

#Allele calling for target genomes
chewBBACA.py AlleleCall -i $query_genomes_ccoli -g listgenes_core.txt -o cgMLST_querygenomes --cpu 24 --ptf ${training}/c_coli.trn

cd cgMLST_querygenomes/results_*/
chewBBACA.py ExtractCgMLST -i results_alleles.tsv -o cgMLST_95 -p 0.95

Is this the correct way to do it?

The last step produces files cgMLST.tsv and cgMLSTschema.txt. So I guess cgMLST.tsv is the file that goes to Phyloviz program, right?

/adnan
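The `-p 0.95` threshold used above selects loci that are present in at least 95% of the genomes. The idea can be sketched as follows. This is an illustration only, not the ExtractCgMLST implementation: the function names are hypothetical, and counting any non-numeric call (for example `LNF`) as missing is an assumption.

```python
def is_present(call):
    """Treat numeric calls, with or without an INF- prefix, as present.

    Assumption: every other value in the matrix marks a missing locus.
    """
    call = call.strip()
    if call.startswith("INF-"):
        call = call[len("INF-"):]
    return call.isdigit()


def core_loci(loci, profiles, threshold=0.95):
    """Return loci called in at least `threshold` of the genomes.

    `profiles` maps a genome name to its list of calls, ordered like `loci`.
    """
    if not profiles:
        return []
    required = threshold * len(profiles)
    core = []
    for index, locus in enumerate(loci):
        present = sum(is_present(calls[index]) for calls in profiles.values())
        if present >= required:
            core.append(locus)
    return core
```

With only a handful of genomes a 95% threshold behaves like 100%, since a single missing call already drops a locus below the cut-off.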

[Errno 2] No such file or directory: ' '

I'm getting this error after 'checking gene files exist...' with chewBBACA 2.0.17.2 (anaconda) in a parallel interactive SGE HCC node (2 cores).

ChewBBaca_issue

chewbbaca_testrun_inlist.txt is a single-line list with the full path to a genome (contigs) and ./schema/ is a directory with only the multi-fasta files from here (https://www.cgmlst.org/ncs/schema/1025099/locus/). I haven't checked all 1521 files in ./schema/ manually but can't find anything unusual with quick grep commands. All paths exist and I'm running anaconda 5.3.1, BLAST 2.9.0+ and Prodigal 2.6.3. I'm not sure what 436 refers to.

I haven't used chewBBACA or done cgMLST calling before, so I apologise if it's just me doing something really daft but can't seem to troubleshoot this one.

Minimum number of genomes?

Hi!

Does chewBBACA have a minimum number of genomes requirement? I was trying to use it to extract the cgMLST between two isolates, but I encountered the following error while using RemoveGenes: chewBBACA.py: error: the following arguments are required: -i, although -i was provided.

Format change request for results_alleles.tsv

We are working on clustering data from chewBBACA results, using the results_alleles.tsv file. We discovered that when a new allele is found, it is listed as INF-number. The next time this allele is seen, only the number is listed. Thus, inside this file the same allele effectively has two different names.

Our suggestion is that you instead give all instances of the new allele the INF-number name, which would make it easy to figure out for each isolate which ones are new and which ones are not new. In addition, it would mean that when using it for distance matrices, the fields would be regarded as "the same".
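Until the output format changes, one workaround is to normalize the calls before computing distances, so that `INF-100` and `100` at the same locus compare as equal. The sketch below assumes the prefix is exactly `INF-` as described above; the helper names are hypothetical.

```python
INFERRED_PREFIX = "INF-"  # prefix described in this issue


def is_novel(call, prefix=INFERRED_PREFIX):
    """Return True if the call marks an allele inferred in this run."""
    return call.startswith(prefix)


def normalize_call(call, prefix=INFERRED_PREFIX):
    """Strip the inferred-allele prefix, leaving other values untouched."""
    return call[len(prefix):] if call.startswith(prefix) else call


def normalize_profile(calls):
    """Normalize every call in one row of the allele matrix."""
    return [normalize_call(call) for call in calls]
```

Keeping `is_novel` separate lets a pipeline record which isolates introduced a new allele before the prefix is stripped for distance calculations.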

Allele id affects tree structure

Hi,

When I checked the result profile from cgMLST, I found that the allele ID assigned to each strain does not actually reflect sequence similarity but rather the input order of the strains. If a strain is the first input of the analysis, the allele IDs of its loci will be small numbers, e.g. 1 or 2. If another strain is the 1000th input and all of its loci differ from those in the schema, the allele IDs of its loci will be large numbers. If this is true, the sequence similarity between two alleles with IDs 1 and 2 is actually the same as between two alleles with IDs 1 and 1000. I am worried that this will affect the reproducibility of the results, so with a cgMLST profile of 57 Yersinia strains, I randomly shuffled the allele IDs for each strain in each locus and built a RapidNJ tree and a Minimum Spanning Tree. I did find that the tree structure changed, though not dramatically.

Am I misunderstanding cgMLST?
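Whether the numeric value of an allele ID matters depends on the distance used downstream. The sketch below assumes a Hamming-style distance, that is, the number of loci at which two profiles carry different identifiers; this is an assumption about the tree-building tool, not something stated in this thread. Under that assumption, renaming alleles consistently within a locus leaves every pairwise distance unchanged, whereas shuffling IDs independently per strain changes which strains share an allele and may therefore change the tree.

```python
def hamming(profile_a, profile_b):
    """Count loci at which two allelic profiles carry different identifiers."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def relabel(profiles, mappings):
    """Apply one identifier mapping per locus to every profile.

    Each mapping must be one-to-one so that distinct alleles stay distinct.
    """
    return [
        [mappings[i].get(allele, allele) for i, allele in enumerate(profile)]
        for profile in profiles
    ]


# Toy data: three strains, three loci (made-up identifiers).
profiles = [[1, 1, 2], [1, 1000, 2], [2, 1, 3]]
mappings = [{1: 7, 2: 5}, {1: 42, 1000: 3}, {2: 9, 3: 1}]
relabeled = relabel(profiles, mappings)
```

In this toy example the distance between the first two strains is 1 both before and after relabeling, even though allele 1000 became allele 3.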

Unable to prepare schema from PubMLST schema profile of Helicobacter pylori

I tried to prepare a schema from the PubMLST schema profile of Helicobacter pylori using
chewBBACA.py PrepExternalSchema -i hpylori_190410 -v
and got the error
ATTENTION!!!111 hpylori_190410/atpA.tfa.fasta has no correct aleles, the file will be removed!!
The problem seems to be that the loci in the schema are not intact CDSs, such as the locus in the atpA.tfa.fasta file.

atpA.tfa.fasta.zip

chewbbaca report a new allele on a perfect match of an existing allele

Dear support,

We are facing a very strange situation in using chewbbaca (we are using chewbbaca 2.0.13).
(files are in the zip archive here:chewbbaca_issue.zip)

We have a listeria sample (denovo file denovo_spades_scaffolds.fasta) where using the web app from Pasteur (https://bigsdb.pasteur.fr/cgi-bin/bigsdb/bigsdb.pl?db=pubmlst_listeria_seqdef&page=sequenceQuery), we get a perfect match (100%) with an allele (see bigsdb_pasteur_res.csv).

Using the same Pasteur schema with chewbbaca, we got a new allele. The score of the new allele is higher than that of the allele with the perfect match, which is a bit counterintuitive.
(see prodigal/denovo_spades_scaffolds.potential, search with the start position "220843").

Additionally, searching with BLAST produces the same result as the Pasteur web app (see blast dir).

We experienced the same problem for several other alleles and samples. This leads to a lot of uncertainty. Samples that can be clearly related to an outbreak using the Pasteur web app are more distant in chewbbaca, which makes the interpretation unclear.

Do you have any hint for us?

Kind Regards,
Adriano Di Pasquale
Antonio Rinaldi

numpy version issue

while using numpy version 1.13.3, chewBBACA.py retrieved:

Traceback (most recent call last):
  File "/usr/bin/chewBBACA.py", line 7, in <module>
    from CHEWBBACA.chewBBACA import main
  File "/usr/lib/python3.6/site-packages/CHEWBBACA/chewBBACA.py", line 18, in <module>
    from CHEWBBACA.allelecall import BBACA
  File "/usr/lib/python3.6/site-packages/CHEWBBACA/allelecall/BBACA.py", line 19, in <module>
    from CHEWBBACA.utils import ParalogPrunning
  File "/usr/lib/python3.6/site-packages/CHEWBBACA/utils/ParalogPrunning.py", line 4, in <module>
    from numpy import array
ImportError: cannot import name 'array'

I tested using numpy 1.14.0 and it all went well; maybe you can change it to numpy>=1.14.0 in requirements.txt.

New release

Hi @mickaelsilva.

Love chewBBACA. I see there have been a number of commits since the last release on PyPI. When are you planning a new release?

Thank you.

Anders.

Add a --version flag

Should print like this to stdout (not stderr)

% chewBBACA.py --version
chewBBACA 2.0.6
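A `--version` flag like the one requested can be sketched with argparse's built-in `version` action. The parser below is hypothetical and is not chewBBACA's actual command-line code; the version string is taken from the example above.

```python
import argparse

VERSION = "2.0.6"  # version string taken from the example above


def build_parser():
    """Build a parser that only knows about --version (illustration)."""
    parser = argparse.ArgumentParser(prog="chewBBACA")
    # The built-in "version" action prints the string and exits with
    # status 0; recent Python 3 releases write it to stdout.
    parser.add_argument(
        "--version", action="version", version="%(prog)s " + VERSION
    )
    return parser
```

Calling `build_parser().parse_args(["--version"])` prints `chewBBACA 2.0.6` and exits, so packaging tools can read the version without invoking any module.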

Clustalw2 Check failed

Hi all,
first of all, I want to thank you for your work; it has been extremely valuable in my strain profiling task.
I just want to let you know that SchemaEvaluator looks for a clustalw2 executable that doesn't actually exist, because
every installation produces a link named "clustalw".
I solved that by correcting its name in the relevant script located in
~/anaconda3/lib/python3.8/site-packages/CHEWBBACA/SchemaEvaluator
I hope that helps.

Best regards
Valentino Costabile
Bioinformaticians

FileNotFound error --- Again

Hi,

I am running chewBBACA version 2.0.17, and I am getting an error with AlleleCall. I trained C. jejuni genome using prodigal 2.6.0 (also tried latest v 2.6.3) by:

prodigal -i Campylobacter_jejuni.fasta -c -m -p single -t C_jejuni.trn

PRODIGAL v2.60 [October, 2011]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
Request: Single Genome, Phase: Training
Reading in the sequence(s) to train...1641481 bp seq created, 30.55 pct GC
Locating all potential starts and stops...46104 nodes
Looking for GC bias in different frames...frame bias scores: 2.69 0.18 0.13
Building initial set of genes to train from...done!
Creating coding model and scoring nodes...done!
Examining upstream regions and training starts...done!
Writing data to training file C_jejuni.trn...done!

Followed by schema creation as follows which apparently worked well:
chewBBACA.py CreateSchema -i ./Campy_jejuni_genomes/ -o C_jejuni_schema --ptf C_jejuni.trn --cpu 24

Total of 1704 loci that constitute the schema
Starting Script at : 16:22:15-03/07/2019
Finished Script at : 16:27:29-03/07/2019

However when running the command:
chewBBACA.py AlleleCall -i ./Campy_jejuni_genomes/ -g C_jejuni_schema/ -o results_ac --cpu 24 --ptf C_jejuni.trn

I get the following error:

Starting Allele Calling at : 16:31:30-03/07/2019
Processing 6-R-protein2671.fasta. Start 16:31:43-03/07/2019 Locus 1703 of 1704. Done 99%.Finished Allele Calling at : 16:31:43-03/07/2019
Wrapping up the results
[Errno 2] No such file or directory: '/export/proj/camp_b_wgs/NOBACKUP/chewbbaca_analysis/C_jejuni_schema/temp/1-R-protein1053.fasta_result.txt'

I get the same error while running tutorial data:

Processing GCA-001683515-protein3881.fasta. Start 17:06:37-03/07/2019 Locus 3129 of 3130. Done 99%.Finished Allele Calling at : 17:06:38-03/07/2019
Wrapping up the results
[Errno 2] No such file or directory: '/export/proj/camp_b_wgs/NOBACKUP/chewBBACA_tutorial/schema_seed/temp/GCA-000007265-protein1.fasta_result.txt'

The earlier post did not help resolve the problem. Any help will be much appreciated.

PrepExternalSchema enhancement request

When using the PrepExternalSchema command with an external database, it can sometimes take multiple hours or even days to run. It would be really nice to have a feature where, if the command is interrupted for some reason or if new loci are added later, PrepExternalSchema could pick up where it left off. I am currently having that issue and now have to wait another 32 hours after adding only two new loci.

Invalid start byte error

Hi,

I keep getting the following error whenever I run chewBBACA, it doesn't seem to interrupt the program but should I be concerned?

'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

System info:
Operating system: MacOSX 10.12.6
Python 3.7.2

Thanks.
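The message means that something was read as UTF-8 text but contains a byte (0x80) that cannot start a UTF-8 character. The thread does not say which file triggers it, so the sketch below only shows how to find files that do not decode cleanly; the helper names are hypothetical.

```python
def is_utf8_text(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_text_files(paths):
    """Return the paths whose content is not valid UTF-8."""
    offending = []
    for path in paths:
        with open(path, "rb") as handle:
            if not is_utf8_text(handle.read()):
                offending.append(path)
    return offending
```

Running `non_text_files` over the input and schema directories would show whether binary files are sitting next to the FASTA files.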

external scheme creation problem

I tried to create a Salmonella schema myself and entered the following command line: "chewBBACA.py CreateSchema -i cgMLST/ -o Salmonellascheme --cpu 8 --ptf training/Salmonella_enterica.trn", but halfway through it printed the error message "chewbbaca scheme creation error:BLAST Database error: No alias or index file found for protein database [/home/qifeng/temp/protogenome1/blastdbs/protogenome1_proteins_db] in search path [/home/qifeng::". How can I fix it?
P.S. I created an environment and installed chewbbaca via conda.

join() error (argument must be str or bytes, not 'bool') when running chewBBACA 2.0.15

@jacarrico and @mickaelsilva We have updated to version 2.0.15 and a previously fine command chewBBACA.py AlleleCall -i assemblies/ -g path/to/genomes -o outputdir --cpu 36 for calling alleles causes the following error join() argument must be str or bytes, not 'bool'. I have tried the installation on our server and also your docker container with the same result... I have version 2.0.12 in a singularity container and it works just fine. I have not been able to find the problem.... Any help would be greatly appreciated!! Thanks heaps in advance. Cheers Kristy

No usable gene files in listGenes2Call.txt

Hello, I'm a newbie and I’m encountering an error with ChewBBACA while doing allele calling.

Following are the command lines I used:

########Create Schema
chewBBACA.py CreateSchema -i /home/Consensus/ --cpu 4 -o /home/Chewbbaca-Results --ptf /home/chewBBACA/CHEWBBACA/prodigal_training_files/Escherichia_coli.trn

---> Outputs include: folder /home/temp/, with 374 sub-folders of protogenomes and ORF.txt files for all the genomes; listGenes2Call.txt (0 bytes) and listGenomes2Call.txt (26.7kb) in /home/ChewBBACA/CHEWBBACA/

####### Allele Call
chewBBACA.py AlleleCall -i /home/Consensus/ -g /home/temp/ --cpu 4 -o /home/Saturn-Chewbbaca-Results --ptf /home/chewBBACA/CHEWBBACA/prodigal_training_files/Escherichia_coli.trn

chewBBACA version 2.0.16 by Mickael Silva at https://github.com/B-UMMI/chewBBACA
email contact: [email protected]

'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
...
'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
will use this training file : /home/chewBBACA/CHEWBBACA/prodigal_training_files/Escherichia_coli.trn
blastp
Will use this number of cpus: 4
Checking all programs are installed
Checking Blast installed... True
Checking Prodigal installed... True
blast version is up to date, the program will continue

Starting Script at : 11:10:01-09/01/2019
checking if genome files exist..
checking if gene files exist..
ERROR! No usable gene files in listGenes2Call.txt

The file listGenes2Call.txt is indeed 0 bytes in size, but I don't know why this happened.

Any help is appreciated.

How to retrieve original prodigal gene sequence?

Hello friends, I have successfully created a cgMLST schema that has 2896 alleles out of 600 genomes, using 99% cutoff. I would like to fetch the original sequences predicted by prodigal for each genome so I could identify recombination sites on cgMLST alleles.

I started using the allele ID output in cgMLSTschema.txt, this has exactly 2896 lines. When I inspected these files, they do not have 600 sequences within but harbor a variable number of sequences according to the number of alleles.

Well, after some digging I realized that I just need a script that searches the allele matrix file and fetches the sequence for each allele number.

Best,
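The script described above could look roughly like this. It is a sketch under one loud assumption: each FASTA header in a locus file ends with `_<allele number>` (for example `>locusA_2`). The header format and the function names are not confirmed anywhere in this thread.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def allele_sequence(fasta_lines, allele_id):
    """Return the sequence whose header ends with `_<allele_id>`, else None."""
    suffix = "_" + str(allele_id)
    for header, sequence in read_fasta(fasta_lines):
        # Assumption: the first header token ends with "_<allele id>".
        if header.split(" ")[0].endswith(suffix):
            return sequence
    return None
```

For each genome, the allele number in the matrix column for a locus selects one record from that locus' FASTA file; calls carrying an `INF-` prefix would need the prefix stripped first.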

CreateSchema and AlleleCall fail with plasmid sequences or other type of data with shorter sequences

Greetings,

Reporting two issues that arise when short sequences (as plasmids) or data with less than 2 CDSs is given as input -i for the CreateSchema and AlleleCall processes.

Issue 1:
Using files with plasmid sequences as input for CreateSchema or AlleleCall generally leads to early process failure because Prodigal default mode does not process sequences below a certain length value (20000bp).

Possible fix:

  • Add a new argument that enables users to choose the running mode for Prodigal. Define single as default and add the option to change to meta so that Prodigal accepts shorter sequences. Users should be prompted when they did not choose the most suitable option according to sequences length. It might be possible to automatically choose the most suitable mode to run Prodigal.

After applying this fix to local installation files, Prodigal could extract CDSs from shorter sequences and schemas could be created from large collections of plasmids (some with less than 1000bp).
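The automatic mode selection suggested in the fix for Issue 1 can be sketched as below. The 20000 bp threshold is the value quoted in this issue, and `-i` and `-p` are Prodigal's input and procedure options; the function names are hypothetical and this is not the patch that was applied locally.

```python
SINGLE_MODE_MIN_LENGTH = 20000  # bp threshold quoted in this issue


def total_length(sequences):
    """Sum the lengths of all sequences in one input file."""
    return sum(len(sequence) for sequence in sequences)


def choose_prodigal_mode(sequences, min_length=SINGLE_MODE_MIN_LENGTH):
    """Pick 'single' for long inputs and 'meta' for short ones."""
    if total_length(sequences) >= min_length:
        return "single"
    return "meta"


def prodigal_command(fasta_path, mode):
    """Build the argument list for one Prodigal run (not executed here)."""
    return ["prodigal", "-i", fasta_path, "-p", mode]
```

A wrapper could also warn when the user explicitly asks for `single` on an input below the threshold, as the proposal suggests.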

Issue 2:
The AlleleCall process fails, somewhat silently, when input genomes/plasmids have 1 or 0 CDSs. The process does not create <genome_id>_ORF_Protein.txt and <genome_id>_Protein.fasta files for genomes that have less than 2 CDSs. While performing calling, the program attempts to open those files that were not created, leading to an exception that is silenced by Python's multiprocessing package and the standard output keeps showing as expected (advancing faster) until the calling ends and the process fails because <gene_id>_result.txt and <gene_id>_results2.txt files cannot be found.

Possible fix:

  • Redefine variable that controls this behavior to allow cases with just 1 CDS. Genomes/sequences from which Prodigal cannot extract any valid CDSs should be removed from the list of input data in initial steps to avoid any issues (conveniently warning users that those genomes/sequences were removed and why they were removed).

After applying this fix to local installation files, the AlleleCall process created all necessary files for plasmids with 1 or more CDSs and plasmids without valid CDSs were removed from the analysis. The process completed successfully, appropriately populating the schema.
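The filtering step proposed for Issue 2 amounts to partitioning the inputs by the number of CDSs Prodigal extracted. A minimal sketch, with hypothetical names and a made-up input mapping of genome identifier to CDS count:

```python
def partition_by_cds(cds_counts, min_cds=1):
    """Split inputs into (kept, removed) by number of predicted CDSs.

    `cds_counts` maps an input identifier to its CDS count.
    """
    kept = sorted(g for g, n in cds_counts.items() if n >= min_cds)
    removed = sorted(g for g, n in cds_counts.items() if n < min_cds)
    return kept, removed
```

Reporting the `removed` list to the user before allele calling starts would replace the silent failure described above with an explicit warning.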

Error parsing blast 2.7.1 ? Demands 2.5.0

The docs say it supports 2.5.0 or higher, but it seems to not parse newer BLAST versions?

% chewBBACA.py AlleleCall -i ./genomes/ -g ./wgMLST/short/ -o Allele_call_out --cpu 16

chewBBACA version 2.0.5 by Mickael Silva at https://github.com/B-UMMI/chewBBACA
email contact: [email protected]


blastp
Will use this number of cpus: 16
Checking all programs are installed
Checking Blast installed... True
Checking Prodigal installed... True
your blast version is b'blastp: 2.7.1+\n'
update your blast to 2.5.0 or above, will exit program

Here is our blast:

% blastp -version
blastp: 2.7.1+
 Package: blast 2.7.1, build Oct 18 2017 19:57:24

Shorter allele reported as new allele

Dear developers,
I'm a beginner with chewBBACA and I noticed that AlleleCall reports as new some alleles that are just 6 or 9 bases shorter than the ones in the scheme, out of 400-500 bp.
I run into this problem frequently, but I thought there was some kind of length tolerance when assigning alleles (I run chewBBACA with all default parameters).
An example of the amount of difference is attached here: allele_differences.txt
Is there a way to assign the existing allele instead of creating a new one?

Thank you for your help!

VR

Please tag a new release

Hello, PyPI is providing version 2.1.0 while GitHub still provides version 2.05.
Please tag a new release accordingly to keep PyPI and the git repo in sync.

regards

Eric

About using the external schema in the Schema Create process

Hi, while following the Wiki recipe, I noticed that it allows us to use external schemas, such as those from Ridom cgMLST, BIGSdb or Enterobase. I would like to confirm something with you: on the Ridom SeqSphere website I found a link that lets us download the cgMLST schema: https://www.cgmlst.org/ncs/schema/3953420/locus/
Could you kindly tell me whether the "Download alleles as fasta" link on that page gives us what we need for the chewBBACA PrepExternalSchema process, and what the difference is between a schema created with CreateSchema and an external schema such as the one from Ridom?
Thank you!

Tag a Release?

Hello,

I'd like to package chewBBACA up into a conda recipe for the bioconda channel. Do you have any plans to tag a release on this repository?

Cant find my files

Hi!

Stupid question, but I've been running chewbbaca (v2.0.17) for some time, and I've never had this issue before.

I have tried several things, including copying commands I have used successfully in the past, but I get the same error:
Starting Script at : 12:20:21-19/08/2020
checking if genome files exist..
checking if gene files exist..
436
[Errno 2] No such file or directory: ''
[Errno 2] No such file or directory: ''

    USAGE : chewBBACA.py [module] -h

My command is:
chewBBACA.py AlleleCall -i genomes_shovill_all/ -g ~/chewbacca_wrkdir/schema/Campylobacter_jejuni/Schema/schema_seed_campy_roary_V5/ -o test_shovill --cpu 3 --ptf ~/chewbacca_wrkdir/prodigal_training_files/Campylobacter_jejuni.trn

My system is:
CentOS Linux release 7.8.2003 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
CentOS Linux release 7.8.2003 (Core)
CentOS Linux release 7.8.2003 (Core)

Thanks ! :)

Training files

Hi, I am writing a suggestion that you add training files to the ChewBBACA repository. First, it will help users stay consistent between runs. Second, your centralized training files will help us stay consistent between laboratories. Third, these files seem to be well thought out and would help control for any user errors. https://github.com/mickaelsilva/prodigal_training_files

ImportError with Biopython 1.78

Biopython removed the Bio.Alphabet module in Biopython 1.78.
chewBBACA has several modules with the following import:

    from Bio.Alphabet import generic_dna

Importing from Bio.Alphabet or using the generic_dna leads to the following ImportError:

    ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be
    ignored and removed from script. In a few cases, you may need to specify the ``molecule_type`` as an
    annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet
    for more information.

This issue needs to be solved. Current version that can be installed through conda or PyPI will not work properly while this issue is not solved.
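One common way to stay compatible with Biopython releases on both sides of this change is to guard the import and fall back to constructing sequences without an alphabet. The sketch below is not the fix adopted in chewBBACA; `load_generic_dna` and `seq_args` are hypothetical helpers, and the snippet runs whether or not Biopython is installed.

```python
def load_generic_dna():
    """Return Bio.Alphabet.generic_dna when available, else None."""
    try:
        # Raises ImportError on Biopython >= 1.78 (module removed) and
        # also when Biopython is not installed at all.
        from Bio.Alphabet import generic_dna
    except ImportError:
        return None
    return generic_dna


def seq_args(sequence):
    """Build positional arguments for a Seq-like constructor."""
    alphabet = load_generic_dna()
    if alphabet is None:
        return (sequence,)
    return (sequence, alphabet)
```

With Biopython imported, `Seq(*seq_args("ATG"))` would then work on either side of the 1.78 change, because the alphabet argument is only passed when it still exists.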

Prodigal training files to use - README instructions

Hi,

In chewBBACA's README, in the IMPORTANT section, there's a warning that states: "As from 02/02/2018 prodigal training files to be used are now on a separate repository. You can find them at https://github.com/mickaelsilva/prodigal_training_files."

Currently, there are training files provided with chewBBACA's source code that aren't available at https://github.com/mickaelsilva/prodigal_training_files. I suggest either revising the README to state where these files can be found in the source code or, if the decision is to separate them from the source code, moving them to a new repository inside the B-UMMI organization and updating the README accordingly.

Best,

Inês

%i format: a number is required, not NoneType

I am trying to use the very large external scheme Enterobase Salmonella cgMLSTv2. After running chewBBACA.py PrepExternalSchema, I tried running chewBBACA.py AlleleCall, but I get the following errors. First, 'utf-8' codec can't decode byte is repeated over and over again and then I get the following repeated over and over again:

some error occurred
%i format: a number is required, not NoneType
Error on line 466

I did get output files, but there are many ERRORs in the results_alleles.tsv file. As far as I can tell, PrepExternalSchema ran correctly, minus the removal of two loci with no correct alleles (I'm assuming they're not complete CDS). If the input scheme is correct, do the errors I'm getting have to do with the contig FASTA files I'm using? Any help would be much appreciated. Thank you!
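One way to narrow this down is to check whether any schema or input FASTA file contains bytes that are not valid UTF-8, since that is what the repeated "'utf-8' codec can't decode byte" message points to. The sketch below is only a diagnostic aid, not part of chewBBACA. The function name and the list of file extensions are assumptions.

```python
import os


def find_non_utf8_files(directory, extensions=(".fasta", ".fa", ".fna")):
    """Return (file_name, byte_offset) for each FASTA file that is not valid UTF-8."""
    bad = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and files with other extensions.
        if not os.path.isfile(path) or not name.lower().endswith(extensions):
            continue
        with open(path, "rb") as handle:
            data = handle.read()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as error:
            # error.start is the offset of the first undecodable byte.
            bad.append((name, error.start))
    return bad
```

Running it on both the schema directory and the directory with the contig files shows which side the offending bytes come from.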

Persistent temp folder not getting removed

hi @jacarrico,
I've been trying to implement chewBBACA in a Nextflow pipeline, using a schema generated with PrepExternalSchema (outside the Nextflow pipeline). When I run AlleleCall, a temp folder is created in my schema directory, and this seems to lead to an error in subsequent runs with this schema. It raises a ValueError:
ValueError: '/listeria_db/lmo1074.fasta' is not in list
Note that lmo1074.fasta is not an actual file in the listeria_db directory. So I am unsure (a) where this file name even comes from and (b) why the temp folder is persisting. I am using chewBBACA version 2.0.8.
Thanks in advance for your time, I appreciate any help that you can give me.
Regards
Kristy Horan
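Until the cause is known, a pipeline step could clear the leftover folder before each AlleleCall run. The sketch below is only a workaround under stated assumptions: that the folder is named `temp`, that it sits directly inside the schema directory, and that no other run is using the schema at the same time. The function name is hypothetical.

```python
import os
import shutil


def remove_leftover_temp(schema_dir, temp_name="temp"):
    """Delete schema_dir/temp_name if it exists; return True when something was removed."""
    # temp_name="temp" is an assumption based on the folder described above.
    temp_path = os.path.join(schema_dir, temp_name)
    if os.path.isdir(temp_path):
        shutil.rmtree(temp_path)
        return True
    return False
```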

chewBBACA v2.1.0 Errno 21 error on input directory

Hello,

I have chewBBACA version 2.1.0 and used it without problems for several months, until yesterday, when I suddenly started getting this "Errno 21: Is a directory:..." error. I noticed this error has been discussed here before in relation to the short folder in the scheme directory, but that was already solved in my version of chewBBACA, and my problem concerns the input directory. I tried uninstalling and reinstalling the package, but this did not solve the problem.

Could it be related to installations and updates of other software that I did recently?

Example code:

$ chewBBACA.py AlleleCall -i assembly/ -g ~/Data/cgMLSTschemes/MLST-573/cgMLST/scheme-573/ -o TEST --cpu 2

chewBBACA version 2.1.0 by Mickael Silva at https://github.com/B-UMMI/chewBBACA
email contact: [email protected]
[Errno 21] Is a directory: 'assembly/'
USAGE : chewBBACA.py [module] -h

Thank you in advance

[Errno 21] Is a directory: 'listeria_db/short'

When running chewBBACA I consistently get this error. When running it by itself it will complete, however I am trying to incorporate it into a pipeline (snakemake) and this error is causing it to break (pipe fail). Is there something I can do to prevent this error?
Cheers
Kristy
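For context, Errno 21 is what Python raises on Linux when code tries to open a directory as if it were a regular file. The sketch below shows one general way to avoid that when collecting schema files: keep only regular files ending in `.fasta` and skip subdirectories such as `short`. It is an illustration of the pattern, not chewBBACA's actual code, and the function name is hypothetical.

```python
import os


def list_schema_fastas(schema_dir):
    """Return sorted paths of regular .fasta files directly inside schema_dir."""
    return sorted(
        os.path.join(schema_dir, name)
        for name in os.listdir(schema_dir)
        # isfile() filters out subdirectories such as "short".
        if name.endswith(".fasta") and os.path.isfile(os.path.join(schema_dir, name))
    )
```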

Include a useful symlink

chewBBACA.py is difficult to remember and type.

Can you additionally include a symlink, e.g. chewie, for the rest of us?
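In the meantime, users can create such an alias themselves. The following is a minimal sketch, assuming a POSIX system and write access to the directory that holds the script. The function name `add_alias` is hypothetical, and `chewie` is simply the name suggested above.

```python
import os


def add_alias(script_path, alias="chewie"):
    """Create a symlink named alias next to script_path and return the link path."""
    target = os.path.abspath(script_path)
    link_path = os.path.join(os.path.dirname(target), alias)
    # lexists() is also true for broken links, so an existing name is never overwritten.
    if not os.path.lexists(link_path):
        os.symlink(target, link_path)
    return link_path
```

The path of the installed script can be found with `shutil.which("chewBBACA.py")` and passed as `script_path`.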

100% blast hit with allele but different allele is called or other output(LNF/ASM/ALM/NIPH)

Hi!

I've just started using chewBBACA, so I could be missing something here, but I can't tell what at the moment. The issue I'm having is that I'm using an external scheme for C. difficile and it's calling the reference genome alleles wrong. All of the geneX_1 alleles are from the reference genome (https://jcm.asm.org/content/56/6/e01987-17/figures-only), so all the alleles called should be 1.

Instead, most are 1, but there are also 216 that are a mix of other allele numbers and even loci not found. I blasted all these alleles against the reference and they were all present at 100% identity.

Here's what I did:

  • Create training file

prodigal -i R00000003.fna -p train -t R00000003.fasta.trn
This training file is used later during allele calling

  • Prep external schema

chewBBACA.py PrepExternalSchema -i ~/scheme_fastas/ --cpu 16

With the C. difficile scheme from cgMLST.org (https://www.cgmlst.org/ncs/schema/3560802/) and fasta headers (already numbered) edited to include gene name (>genename_number)

  • Call alleles with reference genome (isolate 630/R00000003.fasta)

chewBBACA.py AlleleCall -i ~/paper_genomes_samples/ -g ~/scheme_fastas/ -o results_cg --cpu 12 --ptf ~/630_prodigal_training/R00000003.fasta.trn

Which provides this allele call output:

R00000003.fasta,1,1,1,1,1,1,1,1,34,1,1,55,1,1,1,1,1,1,1,1,1,1,41,1,1,1,1,1,1,1,83,1,1,1,1,40,1,1,1,1,1,1,1,1,1,1,1,1,69,1,40,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,45,1,1,1,1,1,1,1,1,18,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,60,1,1,1,1,1,LNF,1,1,1,1,1,1,1,1,1,104,1,1,1,1,1,43,1,1,1,1,197,1,1,1,1,1,1,1,1,1,128,1,1,60,1,1,1,1,1,1,82,1,1,1,75,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,97,1,1,1,1,1,1,1,1,1,1,53,1,1,1,1,1,96,1,1,1,1,1,1,1,1,1,75,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,193,1,1,1,1,1,178,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,225,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,ALM,1,1,1,1,1,1,1,1,1,1,115,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,NIPH,1,1,1,1,123,1,1,1,1,1,1,42,1,1,124,1,1,1,1,1,92,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,92,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,99,1,1,195,1,1,1,1,150,1,1,1,1,1,1,1,1,1,1,1,1,58,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,107,1,150,1,1,128,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,98,1,1,1,1,1,1,1,1,1,1,1,1,119,1,1,1,1,80,1,1,1,1,1,1,1,1,1,177,1,1,1,1,1,1,1,1,1,1,1,192,1,117,1,1,1,1,1,1,1,1,1,1,54,1,1,1,1,1,1,1,1,1,163,1,1,1,1,1,146,1,1,1,1,1,85,1,1,1,1,1,1,1,1,1,135,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,64,1,113,1,102,1,1,1,1,1,1,1,1,1,1,1,1,1,1,72,1,1,1,1,1,1,1,1,1,1,1,1,1,1,147,1,1,1,1,1,1,74,1,1,1,77,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,57,1,1,1,85,1,1,1,1,1,1,1,1,1,1,1,1,1,34,1,1,1,1,1,121,1,1,1,1,1,1,1,1,1,1,1,1,93,1,1,141,1,1,1,1,1,1,1,1,1,1,146,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,94,1,1,1,1,1,1,1,1,1,1,1,1,91,1,1,1,1,1,1,1,119,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,LNF,1,1,1,1,1,1,1,1,1,1,1,105,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,47,1,1,1,1,1,1,1,1,1,1,1,ALM,1,1,1,LNF,1,1,169,1,1,1,1,1,1,96,1,1,1,1,1,1,1,156,1,1,90,1,1,1,1,1,1,1,1,42,1,26,1,1,1,1,1,1,1,1,95,1,1,67,1,1,99,1,1,1,1,164,1,1,148,1,1,1,1,94,1,1,1,36,1,1,1,1,1,1,1,283,1,1,1,1,1,75,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
,1,1,1,1,1,1,1,116,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,205,1,1,1,1,1,1,1,1,1,1,1,150,1,1,1,1,1,1,1,1,1,1,1,1,1,119,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,68,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,84,1,1,1,1,1,1,1,1,1,1,96,1,1,1,1,1,1,54,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,LNF,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,103,1,1,1,1,1,1,86,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,LNF,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,58,1,116,1,1,1,1,1,1,1,1,1,99,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,145,1,1,1,1,1,1,1,107,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,52,1,68,1,1,1,1,1,1,62,1,1,1,1,68,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,ASM,1,1,1,37,1,1,1,1,72,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,120,1,1,1,1,1,1,1,1,1,103,1,1,1,1,1,1,1,1,1,1,1,1,1,128,102,1,1,108,1,177,243,1,1,1,1,1,1,1,99,1,1,1,1,1,1,1,45,1,1,1,135,1,1,1,1,1,100,92,186,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,95,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,242,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,132,1,1,1,1,1,1,1,ALM,1,1,LNF,1,1,1,1,1,1,28,1,93,1,1,160,1,1,1,99,1,1,1,1,1,170,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,41,1,59,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,89,1,1,1,1,97,20,1,1,1,1,1,1,1,1,1,1,1,LNF,1,1,1,1,1,1,150,1,1,1,1,1,108,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,160,1,1,1,138,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,144,1,1,1,1,1,1,1,1,150,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,103,1,1,1,1,190,1,1,1,1,76,1,1,1,1,1,116,1,1,118,1,1,1,1,1,90,1,1,1,1,1,1,1,1,1,1,81,99,1,1,65,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,83,176,68,1,1,1,1,1,1,1,1,1,1,1,1,1,1,207,1,1,1,1,1,86,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,53,1,1,1,221,1,1,1,1,1,1,1,1,LNF,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,89,1,1,1,1,1,1,99,1,1,1,1,1,1,1,1,NIPH,213,1,1,1,1,NIPH,1,1,43,1,1,1,1,1,1,1,1,1,187,1,1,1,1,1,LNF,1,1,1,107,1,1,1,NIPH,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
,1,1,1,1,NIPH,1,1,1,1,1,1,1,1,135,1,1,1,1,1,1,1,1,1,124,1,1,1,1,1,1,1,1,1,1,1,1,1,1,68,1,1,1,1,1,1,1,176,1,1,1,1,1,1,NIPH,1,1,70,1,1,1,1,1,1,123,165,1,1,1,1,1,1,1,1,1,1,1,1,LNF,1,1,1,1,1,173,1,1,1,174,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,158,180,1,1,1,1,1,1,1,1,1,1,LNF,1,1,1,1,1,1,1,1,LNF,1,1,1,1,1,1,1,1,1,1,15,1,1,1,1,1,1,1,1,1,1,1,1,90,1,1,1,1,1,1,1,1,1,60,1,1,1,1,1,1,1,1,1,1,1,122,1,ALM,100,1,1,1,1,1,1,1,1,1,1,LNF,1,1,1,1,1,66,1,131,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,127,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,125,1,1,1,1,1,1,1,1,1,1,1,1,1,92,1,1,98,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,56,1,1,1,1,1,1,1,1,1,1,ASM,1,1,1,1,1,1,114,1,1,1,1,1,51,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,62,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,113,1,50,1,1,1,61,1,1,105,1,108,1,1,1,1,1,1,1,142,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,ALM,1,1


The output should be 1 for every allele according to the original scheme development paper. I checked this with blast and all allele 1 sequences are 100% present in the reference (no gaps, no mismatches etc.)

Is this an issue with chewBBACA, the training file, the difference between chewBBACA and seqsphere (used to generate the scheme) or am I missing something?
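To quantify the discrepancy, the result row can be tallied by call type. The sketch below is only an illustration: it assumes the row is a single comma-separated line with the sample name first (the wrapped lines above would need to be joined beforehand), and the function name `summarize_calls` is hypothetical.

```python
from collections import Counter


def summarize_calls(row):
    """Split one comma-separated result row and count calls by type."""
    fields = row.strip().split(",")
    sample, calls = fields[0], fields[1:]
    counts = Counter()
    for call in calls:
        if call == "1":
            counts["allele 1"] += 1
        elif call.isdigit():
            # Any numeric allele identifier other than 1.
            counts["other allele"] += 1
        else:
            # Non-numeric classifications such as LNF, ASM, ALM or NIPH.
            counts[call] += 1
    return sample, dict(counts)
```

For example, `summarize_calls("R00000003.fasta,1,1,34,LNF,1,NIPH,55")` returns `("R00000003.fasta", {"allele 1": 3, "other allele": 2, "LNF": 1, "NIPH": 1})`.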

PrepExternalSchema problem

My version is 2.0.16, installed via conda. When I ran PrepExternalSchema, the usage "chewBBACA.py [-h] -i [I] [--cpu [CPU]] [-v] PrepExternalSchema [PrepExternalSchema ...]" didn't work, so I tried the command "chewBBACA.py PrepExternalSchema salschema -i cgMLST/ --cpu 8 -v", which was suggested in this forum. In the end it only generated a directory named "short" containing .fasta files and corresponding txt files that contained just gibberish. I don't know whether this can be used as a schema, or whether I can instead download the scheme from "https://zenodo.org/record/1323684#.XH6LY5NKhPY" and call alleles without preparing an external schema.
