neherlab / pan-genome-analysis

Processing pipeline for pan-genome visualization and exploration

License: GNU General Public License v3.0

Python 99.52% Shell 0.48%

pan-genome-analysis's Introduction

panX: microbial pan-genome analysis and exploration

Wei Ding, Franz Baumdicker, Richard A Neher; panX: pan-genome analysis and exploration, Nucleic Acids Research, Volume 46, Issue 1, 9 January 2018, Pages e5, https://doi.org/10.1093/nar/gkx977

Overview: panX is a software package for microbial pan-genome analysis, visualization and exploration. The analysis pipeline is based on DIAMOND, MCL and phylogeny-aware post-processing. It takes a set of annotated bacterial strains as input (e.g. NCBI RefSeq records or the user's own data in GenBank format). All genes from all strains are compared to each other with DIAMOND and then clustered into orthologous groups using MCL and adaptive phylogenetic post-processing, which splits distantly related genes and paralogs where necessary. For each gene cluster, the corresponding alignment and phylogeny are constructed. All core-gene SNPs are then used to build the strain/species phylogeny.

The results can be explored interactively using a web-based visualization application (hosted on a web server or run locally on a desktop). The web application integrates several interconnected components (pan-genome statistical charts, gene cluster table, alignment, comparative phylogenies, metadata table) and allows rapid searching and filtering of gene clusters by gene name, annotation, duplication, diversity, gene gain/loss events, etc. Strain-specific metadata are integrated into the strain phylogeny so that genes related to adaptation, antibiotic resistance or virulence can be readily identified.

Pipeline overview
Quick start
How to run
Directory structure and analysis output
Command line arguments

Pipeline overview

Quick start

git clone https://github.com/neherlab/pan-genome-analysis.git
cd pan-genome-analysis

Install dependencies easily via Conda and then run the test: sh run-TestSet.sh

The results can be explored using our interactive pan-genome-visualization application.

Installing dependencies

Conda

The required software and python packages can be readily installed using Conda.

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
export PATH=~/miniconda2/bin:$PATH
conda env create -f panX-environment.yml
source activate panX

Overview of dependencies:

How to run

To run the test set: sh run-TestSet.sh

In data/TestSet, you will find a small set of four Mycoplasma genitalium genomes that is used in this tutorial. Your own data should also reside in such a folder within data/; we refer to this folder as the run directory below. The name of the run directory is used as the species name in downstream analysis.

Omitting the -st option runs all steps in order, whereas -st 5 6 runs only the listed steps (here 5 and 6). When running only specific steps such as -st 5 6, all steps before 5 must already have finished.

-t sets the number of CPU cores.

./panX.py -fn data/TestSet -sl TestSet -t 32 > TestSet.log 2> TestSet.err

This calls panX.py, which runs each step using the scripts located in the ./scripts/ folder.

./panX.py [-h] -fn folder_name -sl species_name
                   [-st steps [steps ...]] [-rt raxml_max_time]
                   [-t threads] [-bp blast_file_path]

Mandatory parameters: -fn folder_name / -sl species_name
NOTICE: species_name e.g.: S_aureus
Example: ./panX.py -fn ./data/TestSet -sl TestSet -t 32 > TestSet.log 2> TestSet.err
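
When several runs are scripted, the same command line can be assembled programmatically. The following is a minimal sketch that uses only the flags listed above (-fn, -sl, -t, -st); the helper name build_panx_command is hypothetical and not part of panX.

```python
def build_panx_command(folder, species, threads=1, steps=None):
    """Return the panX command line as a list of arguments.

    Only flags documented in this README are used. `steps` is an optional
    list of step numbers; when it is omitted, no -st flag is added and
    panX runs all steps in order.
    """
    cmd = ["./panX.py", "-fn", folder, "-sl", species, "-t", str(threads)]
    if steps:
        cmd.append("-st")
        cmd.extend(str(step) for step in steps)
    return cmd


# Example: rerun only steps 5 and 6 of the test set on 32 cores.
example = build_panx_command("data/TestSet", "TestSet", threads=32, steps=[5, 6])
```

The resulting list can be passed to subprocess.call from the repository root, with stdout and stderr redirected to log files as in the example above.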

Directory structure and analysis output

The analysis generates the clustering result ./data/YourSpecies/allclusters_final.tsv and the files required for visualizing the pan-genome with pan-genome-visualization.

./data
    YourSpecies               # folder specific to your pan-genome
      - input_GenBank              # INPUT: genomes in GenBank format
        - strain1.gbk
        - strain2.gbk
        ...
      - vis
        - geneCluster.json       # for clusters table: gene clusters and their summary statistics
        - strainMetainfo.json    # for metadata table: strain-associated metadata
        - metaConfiguration.js   # metadata configuration file (a valid customized file is also accepted)
        - coreGenomeTree.json    # core genome SNP tree (json file)
        - strain_tree.nwk        # core genome SNP tree (newick file)

        - geneCluster/           # folder containing orthologous clusters
                                 # nucleotide and amino acid alignment in gzipped FASTA format
                                 # reduced alignment contains a consensus sequence and variable sites (identical sites shown as dots)
                                 # tree and presence/absence(gain/loss) pattern in json format
          - GC00000001_na_aln.fa.gz
          - GC00000001_aa_aln.fa.gz
          - GC00000001_na_aln_reduced.fa.gz
          - GC00000001_aa_aln_reduced.fa.gz
          - GC00000001_tree.json
          - GC00000001_patterns.json

The step in which each file and directory is produced is described in more detail in step-tutorials.md.
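
After a run, the layout listed above can be checked with a small script that reports which of the expected vis outputs are absent. This is a minimal sketch that assumes exactly the file names shown in this README; the helper name missing_vis_outputs is hypothetical and not part of panX.

```python
import os

# File names taken from the directory listing in this README.
EXPECTED_VIS_FILES = [
    "geneCluster.json",
    "strainMetainfo.json",
    "metaConfiguration.js",
    "coreGenomeTree.json",
    "strain_tree.nwk",
]


def missing_vis_outputs(run_dir):
    """Return the expected entries that are absent from <run_dir>/vis."""
    vis_dir = os.path.join(run_dir, "vis")
    missing = [name for name in EXPECTED_VIS_FILES
               if not os.path.isfile(os.path.join(vis_dir, name))]
    # The per-cluster alignments and trees live in a sub-folder.
    if not os.path.isdir(os.path.join(vis_dir, "geneCluster")):
        missing.append("geneCluster/")
    return missing
```

For example, missing_vis_outputs("data/TestSet") returns an empty list once all listed outputs exist. It only tests for presence, not for non-empty or valid content.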

Command line arguments

Soft core-gene:

-cg    core-genome threshold [e.g.: 0.7]: fraction of strains used to decide whether a gene is core
E.g.: ./panX.py -cg 0.7 -fn ...
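
To make the threshold concrete: with -cg 0.7, a gene found in 8 of 10 strains would count as core, while one found in 6 of 10 would not. The sketch below illustrates this reading only. Whether panX compares with ">=" or ">" and how it rounds is not stated here, so the comparison is an assumption, and is_core_gene is a hypothetical name.

```python
def is_core_gene(n_strains_with_gene, n_strains_total, threshold):
    """Return True if the gene is present in at least `threshold` of strains.

    `threshold` is a fraction between 0 and 1, as passed to -cg.
    The inclusive comparison (>=) is an assumption made for illustration.
    """
    if n_strains_total <= 0:
        raise ValueError("n_strains_total must be positive")
    fraction = float(n_strains_with_gene) / n_strains_total
    return fraction >= threshold
```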

Large datasets (use the divide-and-conquer (DC) strategy, which scales approximately linearly with the number of genomes):

-dmdc  apply DC strategy to run DIAMOND on subsets and then combine the results
-dcs   subset size used in DC strategy [default:50]
E.g.: ./panX.py -dmdc -dcs 50 -fn ...
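
The effect of -dcs can be pictured as cutting the strain list into chunks of at most that many genomes, each of which is then compared separately. The sketch below shows only this chunking. How panX itself assigns strains to subsets and combines the per-subset results is not described here, so this is an illustration under that assumption, and split_into_subsets is a hypothetical name.

```python
def split_into_subsets(strains, subset_size=50):
    """Split a list of strain names into consecutive chunks.

    Every chunk has `subset_size` entries except possibly the last one.
    The default of 50 mirrors the documented default of -dcs.
    """
    if subset_size < 1:
        raise ValueError("subset_size must be at least 1")
    return [strains[start:start + subset_size]
            for start in range(0, len(strains), subset_size)]
```

With 120 strains and the default subset size, this yields chunks of 50, 50 and 20 strains.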

Calculate branch associations with metadata (e.g. drug concentration):

-iba  infer_branch_association
-mtf  ./data/yourSpecies/meta_config.tsv
E.g.: ./panX.py -iba -mtf ./data/yourSpecies/meta_config.tsv -fn ...

Example: meta_config.tsv

To make the branch associations available in the visualization, add the generated file to the visualization repository as described in "Special feature: visualize branch association (BA) and presence/absence (PA) association".

pan-genome-analysis's Issues

ValueError

Hello,

I was recommended to use this program.

But when I ran it on my own files with this command:

./panX.py -fn data/burk_sample -sl burkholderia -t 32

I got this error:

Running panX in main folder: /home/star/pan-genome-analysis-master/data/burk_sample/
====== step01: strain list successfully loaded
====== starting step03: extract sequences from GenBank file
Traceback (most recent call last):
File "./panX.py", line 256, in <module>
myPangenome.extract_gbk_sequences()
File "/home/star/pan-genome-analysis-master/scripts/pangenome_computation.py", line 128, in extract_gbk_sequences
extract_sequences(self.path, self.strain_list, self.folders_dict, self.gbk_present, self.enable_RNA_clustering)
File "/home/star/pan-genome-analysis-master/scripts/sf_extract_sequences.py", line 156, in extract_sequences
gene_aa_dict, gene_na_dict, RNA_dict, enable_RNA_clustering)
File "/home/star/pan-genome-analysis-master/scripts/sf_extract_sequences.py", line 42, in gbk_translation
for contig in SeqIO.parse(gbk_fname,'genbank'):
File "/home/star/anaconda3/envs/panX/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 661, in parse
for r in i:
File "/home/star/anaconda3/envs/panX/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 493, in parse_records
record = self.parse(handle, do_features)
File "/home/star/anaconda3/envs/panX/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 477, in parse
if self.feed(handle, consumer, do_features):
File "/home/star/anaconda3/envs/panX/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 444, in feed
self._feed_first_line(consumer, self.line)
File "/home/star/anaconda3/envs/panX/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 1345, in _feed_first_line
'position 75 in date:\n' + line)
ValueError: LOCUS line does not contain - at position 75 in date:
LOCUS BX571965.1 4074542 bp DNA linear 19- 1월-2021

Additionally, I installed everything with Miniconda, following the documented steps.

How can I fix this?
Thank you for your time.
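
The LOCUS line quoted in the error ends in a locale-formatted date ("19- 1월-2021"), whereas GenBank flat files carry the date as DD-MMM-YYYY with an uppercase English month abbreviation (for example 19-JAN-2021). One way to find affected records is to scan the input files for LOCUS lines that do not end in such a date. This is a minimal diagnostic sketch, not part of panX; the function name is hypothetical.

```python
import re

# DD-MMM-YYYY at the end of the line, e.g. 19-JAN-2021.
LOCUS_DATE = re.compile(r"\d{2}-[A-Z]{3}-\d{4}\s*$")


def locus_lines_with_bad_date(lines):
    """Return (line_number, line) for LOCUS lines lacking a standard date."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("LOCUS") and not LOCUS_DATE.search(line):
            bad.append((number, line.rstrip("\n")))
    return bad
```

The function accepts any iterable of text lines, so an open file handle for a .gbk file can be passed directly. It also flags LOCUS lines that have no date at all.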

Roary output for downstream panX pipeline

Hello,

I have had an issue using my output from Roary with your tool. I have used both a 95-sample dataset and a smaller one (n=10, with compliant GenBank settings from Prokka-1.13.3) and had similar issues when running steps 1 - 11, omitting step 2. There appears to be a hardcoding issue with DIAMOND, at least based on how I'm interpreting the error message, which, because of my novice coding expertise, I have not been able to figure out. It seems that before the MAFFT step, the pipeline expects a temporary file from the DIAMOND output: tmp_core_diversity.txt. Here is the output from my error file, which is the same with either Roary clustered-protein output file:

Traceback (most recent call last):
File "./panX.py", line 287, in <module>
myPangenome.process_clusters()
File "/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/scripts/pangenome_computation.py", line 180, in process_clusters
myClusterCollector.estimate_raw_core_diversity()
File "/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
File "/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
File "/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: '/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/data/HTX_Kpn_Roary/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

Any feedback would be appreciated on how to resolve this issue! Thanks!

geneCluster & core_geeList.txt output files

Issue with step 8

Hello guys, how are you? I'm having a problem with step08. I get the following error message:

======  starting step08: run fasttree and raxml for tree construction
 fasttree time-cost:  1.45 minutes (87.06 seconds)
RAxML tree optimization within the timelimit of 30 minutes
RAxML branch length optimization and rooting
Traceback (most recent call last):
  File "./panX.py", line 303, in <module>
    myPangenome.build_core_tree()
  File "/home/julian/pan-genome-analysis/scripts/pangenome_computation.py", line 200, in build_core_tree
    aln_to_Newick(self.path, self.folders_dict, self.raxml_max_time, self.raxml_path, self.threads)
  File "/home/julian/pan-genome-analysis/scripts/sf_core_tree_build.py", line 75, in aln_to_Newick
    shutil.copy('RAxML_result.branches', out_fname)
  File "/home/julian/miniconda2/envs/panX/lib/python2.7/shutil.py", line 119, in copy
    copyfile(src, dst)
  File "/home/julian/miniconda2/envs/panX/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'RAxML_result.branches'

I checked the raxml log and I found this:

Option -T does not have any effect with the sequential or parallel MPI version.
It is used to specify the number of threads for the Pthreads-based parallelization

RAxML can't, parse the alignment file as phylip file
it will now try to parse it as FASTA file

ERROR: Sequence AF-673 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence CMC-MDR-Ab59 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence HRAB-85 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence KAB07 consists entirely of undetermined values which will be treated as missing data
ERROR: Found 4 sequences that consist entirely of undetermined values, exiting...

So, I figured there might be a problem with the FASTA files that get generated in the previous steps. Any ideas on how to fix this?
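
To see which strains are affected before RAxML is called, the alignment can be scanned for sequences made up only of gap or ambiguity characters. The sketch below is a plain-Python diagnostic, not part of panX. The set of characters treated as undetermined is an assumption, and the path of the alignment passed to RAxML is not shown in this report, so it is left to the caller.

```python
# Characters treated as "undetermined" here; this set is an assumption.
UNDETERMINED = set("-?NnXx")


def read_fasta(lines):
    """Parse FASTA text lines into a dict mapping name to sequence."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return dict((key, "".join(chunks)) for key, chunks in parts.items())


def undetermined_sequences(records):
    """Return sorted names of sequences with no determined character."""
    return sorted(name for name, seq in records.items()
                  if all(char in UNDETERMINED for char in seq))
```

Passing an open alignment file to read_fasta and the result to undetermined_sequences lists the same kind of entries that RAxML reports. Empty sequences are listed as well.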

Issue with Step-09

I ran into an issue in step 9 that I am not able to sort out. The attached screenshots show the .err and .log files. It would be a great help to know how to fix it.
Thank you.

long_split issue

Hi there, how are you? I'm getting an error while trying to run the script to compare three genomes. I'm running the following command:
./panX.py -cg 0.66 -fn data/AbSputum -sl "Acinetobacter baumannii Sputum Isolates"
(I'm using -cg 0.66 because apparently the genomes are a bit diverse)

#times of splitting long branches: -1
Traceback (most recent call last):
File "./panX.py", line 287, in <module>
myPangenome.process_clusters()
File "/home/julian/pan-genome-analysis/scripts/pangenome_computation.py", line 185, in process_clusters
myClusterCollector.postprocessing_split_long_branch()
File "/home/julian/pan-genome-analysis/scripts/cluster_collective_processing.py", line 25, in postprocessing_split_long_branch
postprocess_split_long_branch(self.threads, self.path, self.simple_tree, self.split_long_branch_cutoff)
File "/home/julian/pan-genome-analysis/scripts/sf_split_long_branch.py", line 294, in postprocess_split_long_branch
with open(file_path+'old_clusters_longSplit.txt', 'rb') as delete_cluster_file:
IOError: [Errno 2] No such file or directory: '/home/julian/pan-genome-analysis/data/AbSputum/geneCluster/old_clusters_longSplit.txt'

Any idea how I can fix this? Thanks in advance for any help you can provide!

Running workflow for new data

Do you have guidance on how to run your workflow on new genomes that are not available in RefSeq? I am interested in producing the all_genes_alignment file shown on the panX website.

Many thanks in advance.

step 11: ValueError: need more than 1 value to unpack

hello,

I am running panX with branch associations. I have a metadata.tsv file and a descriptive metainfo.tsv file.
I keep getting this error on step 11:

/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_association.py:188: RuntimeWarning: divide by zero encountered in divide
*((root_node.meta_sq_value - n.meta_sq_value)/n_non_child
/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_association.py:189: RuntimeWarning: invalid value encountered in double_scalars

n.meta_ancestral_average**2)
/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_association.py:200: RuntimeWarning: invalid value encountered in sqrt
np.sqrt(n.meta_ancestral_SSEM + n.meta_derived_SSEM)
/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_association.py:272: RuntimeWarning: invalid value encountered in absolute
association_dict[clusterID][d["meta_category"]] = np.abs(score)
Traceback (most recent call last):
File "./panX.py", line 329, in <module>
myPangenome.export_coreTree_json()
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/pangenome_computation.py", line 229, in export_coreTree_json
json_parser(self.path, self.folders_dict, self.fpaths_dict, self.metainfo_fpath, self.meta_data_config, self.clean_temporary_files)
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 264, in json_parser
process_metajson(path, meta_data_config, metajson_dict)
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 173, in process_metajson
meta_category, data_type, display, associate= iline.rstrip().split('\t')[:4]
ValueError: need more than 1 value to unpack

thank you in advance for any help
Nurper
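
The failing statement unpacks the first four tab-separated fields of each configuration line (meta_category, data_type, display, associate), so any line with fewer than four tab-separated fields, such as a space-separated or empty line, raises this error. A small check that mirrors the split in the traceback can locate such lines. This is a minimal sketch, not part of panX, and it assumes that every line of the file is parsed this way; the function name is hypothetical.

```python
def short_config_lines(lines, n_fields=4):
    """Return (line_number, field_count) for lines with too few fields.

    Uses the same `rstrip().split('\t')` as the statement in the traceback,
    so trailing tabs and whitespace are stripped before counting.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip().split("\t")
        if len(fields) < n_fields:
            problems.append((number, len(fields)))
    return problems
```

Running it on the open meta_config.tsv file returns an empty list when every line has at least four tab-separated fields.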

Failed to export myPangenome's geneCluster.json (step 10)

Hi
I encountered an error while running the following command:
./panX.py -fn ./data/Vibrio_complete -sl Vibrio_complete -dmdc -t 1 > Vibrio_complete.log

The error is as follows:

Traceback (most recent call last):
File "./panX.py", line 322, in <module>
myPangenome.export_geneCluster_json()
File "/home/jason/Documents/pan-genome-analysis/scripts/pangenome_computation.py", line 225, in export_geneCluster_json
geneCluster_to_json(self.path, self.enable_RNA_clustering, self.store_locus_tag, self.raw_locus_tag, self.optional_table_column)
File "/home/jason/Documents/pan-genome-analysis/scripts/sf_geneCluster_json.py", line 171, in geneCluster_to_json
'"divers":"'+gene_diversity_Dt[clusterID]+'"',
KeyError: 'GC00002056'

Segmentation fault

Hello,

I recently tried to run panX.py but I've been getting a Segmentation fault error. I installed all the dependencies via conda as suggested and then tried to run the test script. Instead, I got the error message below:

I'm not so sure what the problem is...maybe something with the python script? Really hope you can help.

Many thanks in advance

File input

Hi - Thanks for making this pipeline available! Quick question, are .gbk files the only accepted input file format?

Thanks for your time and have a great weekend,

Markus

Failed to run Test

Hi,
I used panX successfully a few months ago. Now that I have migrated to Ubuntu 18.04, I tried to reinstall it via git clone and then conda install, but I failed to run the test. Here is the error message:

mv: cannot stat 'GC00000142_aa_aln.fa': No such file or directory
mv: cannot stat 'GC00000142_na_aln.fa': No such file or directory
mv: cannot stat 'GC00000364_aa_aln.fa': No such file or directory
mv: cannot stat 'GC00000364_na_aln.fa': No such file or directory
mv: cannot stat 'GC00000322_aa_aln.fa': No such file or directory
mv: cannot stat 'GC00000322_na_aln.fa': No such file or directory
mv: cannot stat 'GC00000172_aa_aln.fa': No such file or directory
mv: cannot stat 'GC00000172_na_aln.fa': No such file or directory
mv: cannot stat 'GC00000217_aa_aln.fa': No such file or directory
mv: cannot stat 'GC00000217_na_aln.fa': No such file or directory
Traceback (most recent call last):
  File "./panX.py", line 303, in <module>
    myPangenome.build_core_tree()
  File "/home/stheil/soft/pan-genome-analysis/scripts/pangenome_computation.py", line 200, in build_core_tree
    aln_to_Newick(self.path, self.folders_dict, self.raxml_max_time, self.raxml_path, self.threads)
  File "/home/stheil/soft/pan-genome-analysis/scripts/sf_core_tree_build.py", line 44, in aln_to_Newick
    resolve_polytomies('initial_tree.newick0','initial_tree.newick')
  File "/home/stheil/soft/pan-genome-analysis/scripts/sf_core_tree_build.py", line 8, in resolve_polytomies
    tree = Tree(newickString);
  File "/home/stheil/anaconda3/envs/panX/lib/python2.7/site-packages/ete2/coretype/tree.py", line 218, in __init__
    read_newick(newick, root_node = self, format=format)
  File "/home/stheil/anaconda3/envs/panX/lib/python2.7/site-packages/ete2/parser/newick.py", line 231, in read_newick
    raise NewickError('Unexisting tree file or Malformed newick tree structure.')
ete2.parser.newick.NewickError: Unexisting tree file or Malformed newick tree structure.

If you have any idea where to start debugging, it would be much appreciated ;)
Thanks

Seb

Error in step06: align genes in geneCluster by mafft and build gene trees

I tried to compare Sulfurimonas genomes but the workflow didn't finish successfully.

./panX.py -fn data/BS_Sulfurimonas -sl Sulfurimonas -t 28

(...)

======  starting step06: align genes in geneCluster by mafft and build gene trees
Traceback (most recent call last):
  File "./panX.py", line 287, in <module>
    myPangenome.process_clusters()
  File "/data/tools/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
    myClusterCollector.estimate_raw_core_diversity()
  File "/data/tools/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
    self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
  File "/data/tools/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
    calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
  File "/data/tools/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
    with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: '/data/tools/pan-genome-analysis/data/BS_Sulfurimonas/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

Do you have any idea where the problem might originate and how I could solve it? If there is more information I can provide, please let me know.

Biopython Parser Warning

Hi,
I would kindly ask for some assistance. I tried panX with 22 genomes of a marine bacterium downloaded from the PATRIC server (files downloaded as gto and converted to gbk files), but the script did not complete the task. The most frequent error in the output was:

/home/kruno/miniconda2/envs/panX/lib/python2.7/site-packages/Bio/GenBank/Scanner.py:1310: BiopythonParserWarning: Truncated LOCUS line found - is this correct?
:'LOCUS BAUG01000064 15005 bp dna linear UNK \n'
"correct?\n:%r" % line, BiopythonParserWarning)

This was repeated multiple times, but the analysis continued until the script halted with the following traceback:

Traceback (most recent call last):
File "./panX.py", line 287, in <module>
myPangenome.process_clusters()
File "/home/kruno/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
myClusterCollector.estimate_raw_core_diversity()
File "/home/kruno/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
File "/home/kruno/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
File "/home/kruno/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: '/home/kruno/pan-genome-analysis/data/Tenacibaculum_maritimum/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

My problem in troubleshooting this is that the log folder contains neither RAxML nor FastTree logs, so I do not know how to proceed: I understand neither what went wrong nor how to fix it. I am using Ubuntu 18.04 LTS; panX was installed following the instructions and the panX environment was created successfully.

If more information is needed please let me know what to add.

Thank you in advance.
Kruno

python3

Dear developers,
I wonder if you are thinking of implementing a Python 3 version of this tool.

Thank you very much

segfault error

Hello,

I would like to try out panx but I am getting a generic segfault error when attempting to run it on the TestSet on my local cluster. I'm not an experienced programmer, so any guidance for how to troubleshoot would be greatly appreciated.

Arianna

panX output file created but it empty?

Hi, I am running the panX tool for pan-genome analysis. It creates the directories, but outputs such as the JSON files are missing. What does this mean? Please help if anyone knows anything about it.

Thank you!

No module Numpy

I am using Google cloud to run panX : Ubuntu 18.04.
The following error arises:
Traceback (most recent call last):
File "./panX.py", line 6, in <module>
from pangenome_computation import pangenome
File "/home/paul_hrab05/pan-genome-analysis/scripts/pangenome_computation.py", line 3, in <module>
from cluster_collective_processing import clusterCollector
File "/home/paul_hrab05/pan-genome-analysis/scripts/cluster_collective_processing.py", line 1, in <module>
from sf_geneCluster_align_makeTree import cluster_align_makeTree
File "/home/paul_hrab05/pan-genome-analysis/scripts/sf_geneCluster_align_makeTree.py", line 1, in <module>
import numpy as np
ImportError: No module named numpy

However, numpy is installed: I tried installing it via both conda and pip, and both report that the module is already installed. run_Test.sh hit the same error.
Can you help me?
Thanks!
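
One possible cause of an ImportError for a package that conda or pip reports as installed is that the script is executed by a different Python interpreter than the one the package was installed into. The sketch below reports which interpreter is running and whether the listed modules can be imported from it. It is a generic diagnostic, not part of panX, and import_report is a hypothetical name.

```python
import importlib
import sys


def import_report(module_names):
    """Return the interpreter path and the import status of each module."""
    report = {"python": sys.executable}
    for name in module_names:
        try:
            importlib.import_module(name)
            report[name] = "ok"
        except ImportError as error:
            report[name] = "missing: %s" % error
    return report
```

Calling import_report(["numpy", "ete2", "Bio"]) with the same interpreter that runs ./panX.py shows whether that interpreter belongs to the panX environment.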

Long running time or bug?

Hi,

I ran the command line ./panX.py -fn /.data/MyData -sl T_species

Everything seemed fine until step 6, I think. This step has now been running for 10 minutes, and I only have 5 gbk files to compare in MyData. Is this normal? Should I let it continue, or is it a bug?

I already checked and I've correctly installed all dependencies.

Also, I encountered the same problem when I launched the test.sh script...

Thank you!

docs still state to install miniconda2

The docs say to install miniconda2 instead of miniconda3. Python 2 is deprecated and should be discouraged. If Python 2 is still needed for panX, the user can create a Python 2 conda env in miniconda3.

Does panX lack py3 compatibility? If yes, are you planning on updating the code?

Fail to run TestSet using panX

Hello
I am teaching myself how to run panX using the TestSet. Here is the command I ran, exactly following the instructions:
./panX.py -fn data/TestSet/ -sl TestSet -t 32 > TestSet.log 2> TestSet.err

However, I did not get the expected results. Here is the error message:
Traceback (most recent call last):
File "./panX.py", line 287, in <module>
myPangenome.process_clusters()
File "/Users/dklabuser/limin/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
myClusterCollector.estimate_raw_core_diversity()
File "/Users/dklabuser/limin/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
File "/Users/dklabuser/limin/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
File "/Users/dklabuser/limin/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: '/Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

When I run my own bacterial strains, using *.gbk files produced by Prokka, I get exactly the same problem. Comparing with the step-by-step tutorial, the problem seems to start in either step 5 or step 6.

Could anyone help solve this issue? I would really appreciate it. It is hard for bioinformatics beginners to tackle all these problems. Thank you so much.

[Errno 2] No such file or directory: -> tmp_core_diversity.txt

Hello everyone,

I am trying to run a pan-genome analysis on 32 AOA species using the full GenBank files I obtained from NCBI and this command:

./panX.py -fn data/genebanks -sl pratice -t 6

However I ran into this error on Step 6

IOError: [Errno 2] No such file or directory: ./pan-genome-analysis/data/input_GenBank/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

Does anyone have any ideas how to overcome this? I should note that the TestSet worked without any problems.

Thank you very much

DIAMOND fails to produce pairwise alignments in Step 5

Hello,

I am attempting to run panX on a dataset of 82 genomes from a single family of Alphaproteobacteria. Using the divide and conquer strategy, my command reads as follows:

echo "source activate panX; /nas3/awalling/software/pan-genome-analysis/panX.py -fn /nas3/awalling/software/pan-genome-analysis/data/Erythrobacteraceae -sl Erythrobacteraceae -dmdc -dcs 41 -dmsi 90 -dmsqc 90 -dmssc 90 -cg 1.0 -mi /nas3/awalling/software/pan-genome-analysis/metadata/erythrobacter_panx_metadata.tsv -mtf /nas3/awalling/software/pan-genome-analysis/metadata/erythrobacter_meta_config.tsv -t 32 > /nas3/awalling/software/pan-genome-analysis/Erythrobacteraceae2.log 2> Erythrobacteraceae.err" | qsub -V -N panX_erythrobacteraceae -q batch -e nas3/awalling/software/pan-genome-analysis/panx.erythrobacteraceae.pbs.log -o /nas3/awalling/software/pan-genome-analysis/panx.erythrobacteraceae.pbs.log -l ncpus=64 -l mem=200gb -l walltime=96:00:00

However, I receive the following error:

Traceback (most recent call last):
  File "/nas3/awalling/software/pan-genome-analysis/panX.py", line 272, in <module>
    myPangenome.clustering_protein_divide_conquer()
  File "/nas3/awalling/software/pan-genome-analysis/scripts/pangenome_computation.py", line 153, in clustering_protein_divide_conquer
    self.diamond_subject_cover_subproblem, self.mcl_inflation, self.diamond_path, self.diamond_dc_subset_size)
  File "/nas3/awalling/software/pan-genome-analysis/scripts/sf_cluster_protein_divide_conquer.py", line 168, in clustering_divide_conquer
    integrate_clusters(clustering_path,cluster_fpath)
  File "/nas3/awalling/software/pan-genome-analysis/scripts/sf_cluster_protein_divide_conquer.py", line 103, in integrate_clusters
    with open('%s%s'%(clustering_path,'subproblem_finalRound_cluster.output'))\
IOError: [Errno 2] No such file or directory: '/nas3/awalling/software/pan-genome-analysis/data/Erythrobacteraceae/protein_faa/diamond_matches/subproblem_finalRound_cluster.output'

As far as I can tell, the hangup is that during the subproblem blastp stage, no pairwise alignments are generated. From the end of /protein_faa/diamond_matches/diamond_blastp_subproblem_1.log:

Loading query sequences... [0s]
Closing the input file... [0.005s]
Closing the output file... [0s]
Closing the database file... [0.005s]
Deallocating taxonomy... [0s]
Total time = 49.321s
Reported 0 pairwise alignments, 0 HSPs.
0 queries aligned.

The files subproblem_1_cluster.output, subproblem_1.m8, subproblem_2_cluster.output, subproblem_2.m8, and subproblem_finalRound.faa are all blank.

I have attempted to fix this error by relaxing the e-value threshold with the -dme flag, but even with an e-value cutoff of 10 and a relaxed -cg of 0.8 this error replicates.

Is there a way to fix this issue without running an all-against-all blast and providing that matrix separately?

Best,

Alexandra

How should the strain file be formatted?

I have prepared the file as:
GCF_000010725.1
GCF_000237365.1
GCF_000283655.1
...

Whenever I run steps 1 and 3, I get:
====== step01: strain list successfully loaded
====== starting step03: extract sequences from GenBank file
====== time for step03:
0.00 minutes (0.01 seconds)

But .data/strain_name/input_gbk remains empty.

I am running as follows:
./panX.py -fn ./data/Azospirillum_sp/ -sl Azospirillum_sp -st 1 2 3 -t 10

Am I missing something?

Failed to read gbff files

I have a question. Most genome assemblies from NCBI are now available as gbff files. I am running into an issue where 2,124 genomes of a bacterium (Strep pyogenes) fail to load. I think this has to do with the format.
The error I get is:

Traceback (most recent call last):
File "/opt/apps/panx/1.6.0/pan-genome-analysis-master/panX.py", line 256, in <module>
myPangenome.extract_gbk_sequences()
File "/opt/apps/panx/1.6.0/pan-genome-analysis-master/scripts/pangenome_computation.py", line 128, in extract_gbk_sequences
extract_sequences(self.path, self.strain_list, self.folders_dict, self.gbk_present, self.enable_RNA_clustering)
File "/opt/apps/panx/1.6.0/pan-genome-analysis-master/scripts/sf_extract_sequences.py", line 156, in extract_sequences
gene_aa_dict, gene_na_dict, RNA_dict, enable_RNA_clustering)
File "/opt/apps/panx/1.6.0/pan-genome-analysis-master/scripts/sf_extract_sequences.py", line 58, in gbk_translation
locus_tag=feature.qualifiers['db_xref'][0].split(':')[1]
KeyError: 'db_xref'
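
The KeyError means that at least one feature has no db_xref qualifier at all. A defensive lookup could fall back to other qualifiers instead of indexing directly. This is only a sketch, not the fix adopted by panX: the function name and the fallback order are assumptions, and `qualifiers` is a plain dict of lists, the shape Biopython uses for feature qualifiers.

```python
def pick_feature_id(qualifiers, keys=("db_xref", "locus_tag", "protein_id")):
    """Return the first usable identifier from a qualifiers mapping, or None.

    `qualifiers` maps qualifier names to lists of strings.
    The order of `keys` is an illustrative choice, not panX behaviour.
    """
    for key in keys:
        values = qualifiers.get(key)
        if not values:
            continue
        value = values[0]
        if key == "db_xref" and ":" in value:
            # 'GeneID:1234' becomes '1234', mirroring the failing line
            value = value.split(":", 1)[1]
        return value
    return None
```

Returning None rather than raising lets the caller decide whether to skip the feature or stop with a message that names the offending record.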

Roary output visualization using panX

Hello,
I can't visualize my Roary results with panX. Using the link-to-server.py script, I tried giving it my Roary results directory (as -s) and the absolute path to the visualization (among multiple other attempts). The problem is always the same: the Roary output doesn't have the vis directory needed for visualization. I tried creating it, but I don't have the files that belong in it (I looked at what a vis directory is supposed to contain).
I must have forgotten something, since the documentation states that Roary output can be used with the panX visualization tool. Can someone help?

ete2 missing?

Dear all,

I have been having the following issue when installing panX. I followed the steps as in the site and used miniconda as indicated.

Traceback (most recent call last):
  File "./panX.py", line 6, in <module>
    from pangenome_computation import pangenome
  File "/usr/local/bin/panX/pan-genome-analysis/scripts/pangenome_computation.py", line 3, in <module>
    from cluster_collective_processing import clusterCollector
  File "/usr/local/bin/panX/pan-genome-analysis/scripts/cluster_collective_processing.py", line 1, in <module>
    from sf_geneCluster_align_makeTree import cluster_align_makeTree
  File "/usr/local/bin/panX/pan-genome-analysis/scripts/sf_geneCluster_align_makeTree.py", line 6, in <module>
    from ete2 import Tree
ImportError: No module named ete2
(panX) shlomo@shlomo-HP-Z840-Workstation:/usr/local/bin/panX/pan-genome-analysis$ pip install ete2
Requirement already satisfied: ete2 in /home/shlomo/.conda/envs/panX/lib/python2.7/site-packages (2.3.10)
(panX) shlomo@shlomo-HP-Z840-Workstation:/usr/local/bin/panX/pan-genome-analysis$

This is the first time I use Conda but if I got it right, ete2 is supposed to be there somewhere...
I'd appreciate some help here :-)
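
Since pip reports ete2 as installed inside the panX environment, one possibility is that the interpreter running panX.py is not the one from that environment. The sketch below is a generic diagnostic, not something shipped with panX: it reports which interpreter is running and whether a module can be imported from it. The function name is hypothetical.

```python
import importlib
import sys


def module_status(name):
    """Return (True, file path) if `name` imports, else (False, error text)."""
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        return (False, str(exc))
    return (True, getattr(module, "__file__", "built-in"))


if __name__ == "__main__":
    # Compare this path with the environment that pip installed into.
    print("interpreter:", sys.executable)
    print("ete2:", module_status("ete2"))
```

Running the script with the same command that starts panX.py shows whether the interpreter path points into the panX environment or somewhere else.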

Support values

Is it possible to calculate support values in the core genome tree?

Something like this here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4810242/

Thank you very much.

Running sh run-TestSet.sh causes an error

(panX) anna@Anna-PC:~/work/pan-genome-analysis$ sudo sh run-TestSet.sh
Traceback (most recent call last):
File "./panX.py", line 6, in <module>
from pangenome_computation import pangenome
File "/home/anna/work/pan-genome-analysis/scripts/pangenome_computation.py", line 16, in <module>
from sf_gain_loss import process_gain_loss
File "/home/anna/work/pan-genome-analysis/scripts/sf_gain_loss.py", line 5, in <module>
from treetime import TreeAnc
File "build/bdist.linux-x86_64/egg/treetime/__init__.py", line 4, in <module>
ImportError: cannot import name item

Ubuntu 18.04
Treetime 2018-10

Extracting data for individual genomes

Hi - I have analyzed 30 bacterial genomes from environmental isolates and loaded them into the local visualization server (localhost:8000). Is there a way to extract all duplication events for individual genomes?

Thanks
Markus

IOError: [Errno 2] No such file or directory: '/home/a/Desktop/pan-genome-analysis/data/geneCluster/old_clusters_longSplit.txt'

If anyone has run it without this problem, please help me. Thanks in advance.

pan genome visualization component "Sequence Alignment" not working

Hello,

I have an issue, both with the panX Test Set and with my own test set, where the Sequence Alignment visualization component doesn't appear in Firefox. Do you know whether an updated npm module might be breaking one of your JavaScript functions and, if so, which module I need to pin to a specific version? Here is a screenshot of the issue:

And here are all of the NodeJS/npm dependencies currently downloaded on my Mac.

Thank you so much for your wonderful pan-genome visualization tool!

step 11 with metadata: IndexError: list index out of range

Hi,

I am running panX to calculate branch associations. I have a file to describe my metadata and another file of the numbers for the association. It works just fine until I get to step 11, then I get this Error:

Traceback (most recent call last):
File "./panX.py", line 329, in <module>
myPangenome.export_coreTree_json()
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/pangenome_computation.py", line 229, in export_coreTree_json
json_parser(self.path, self.folders_dict, self.fpaths_dict, self.metainfo_fpath, self.meta_data_config, self.clean_temporary_files)
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 258, in json_parser
coreTree_dict=core_tree_to_json(tree, path, metadata_process_result, strain_list)
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 124, in core_tree_to_json
core_tree_dict["children"].append(core_tree_to_json(child, path, metadata_process_result, strain_list))
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 124, in core_tree_to_json
core_tree_dict["children"].append(core_tree_to_json(child, path, metadata_process_result, strain_list))
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 124, in core_tree_to_json
core_tree_dict["children"].append(core_tree_to_json(child, path, metadata_process_result, strain_list))
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 124, in core_tree_to_json
core_tree_dict["children"].append(core_tree_to_json(child, path, metadata_process_result, strain_list))
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 124, in core_tree_to_json
core_tree_dict["children"].append(core_tree_to_json(child, path, metadata_process_result, strain_list))
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 124, in core_tree_to_json
core_tree_dict["children"].append(core_tree_to_json(child, path, metadata_process_result, strain_list))
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 124, in core_tree_to_json
core_tree_dict["children"].append(core_tree_to_json(child, path, metadata_process_result, strain_list))
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 124, in core_tree_to_json
core_tree_dict["children"].append(core_tree_to_json(child, path, metadata_process_result, strain_list))
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 124, in core_tree_to_json
core_tree_dict["children"].append(core_tree_to_json(child, path, metadata_process_result, strain_list))
File "/beegfs/work/workspace/ws/fr_na50-panx-0/pan-genome-analysis/scripts/sf_coreTree_json.py", line 132, in core_tree_to_json
core_tree_dict['attr'][head]=strain_meta_dict[accession][index_head]
IndexError: list index out of range

What does this mean? I would appreciate any help!
Nurper
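
The failing line indexes a metadata row by a header position, so one plausible cause is a row with fewer fields than the header. The check below is a sketch that assumes a tab-separated metadata file with a header row; that layout is an assumption on my part and not taken from the panX documentation. The function name is hypothetical.

```python
import csv


def rows_with_wrong_width(path, delimiter="\t"):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            return problems  # empty file, nothing to compare
        for line_number, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append((line_number, len(row)))
    return problems
```

Any line reported here has missing or extra columns. Trailing empty cells that an editor dropped on export are a common source of such short rows.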

How to prepare the file "strain_list" (step 01)?

Traceback (most recent call last):
File "./panX.py", line 287, in <module>
myPangenome.process_clusters()
File "/home/Project2/pan-genome-analysis/scripts/pangenome_computation.py", line 163, in process_clusters
strain_list=self.strain_list,
AttributeError: pangenome instance has no attribute 'strain_list'

I want to use strain information in my own format (not GenBank format), so I only provided the extracted sequences of each strain and ran steps 5, 6, 7, 8, 9 and 10, but I don't understand what step 1 does. How should I prepare "strain_list", and how is the file "geneID_to_description.cpk" generated? Or should I prepare it myself?

Thank you!

run-TestSet.sh: 25: ./panX.py: Permission denied

I installed it as described in the README: https://github.com/neherlab/pan-genome-analysis/blob/master/README.md

Step 1
git clone https://github.com/neherlab/pan-genome-analysis.git
unzip
Step 2
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
export PATH=~/miniconda2/bin:$PATH
conda env create -f panX-environment.yml
source activate panX

Step 3
dimple@DESKTOP-6JV15RL:~/pan-genome-analysis-master/pan-genome-analysis-master$ sudo sh run-TestSet.sh

This gives the error below:
run-TestSet.sh: 25: ./panX.py: Permission denied
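
"Permission denied" on ./panX.py usually means the file lacks the execute bit. The sketch below inspects and sets that bit from Python; it is a generic illustration with hypothetical helper names, equivalent in effect to running chmod u+x on the file.

```python
import os
import stat


def is_executable(path):
    """True if the current user may execute the file."""
    return os.access(path, os.X_OK)


def add_user_execute_bit(path):
    """Equivalent of `chmod u+x path`; returns the resulting permission bits."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)
    return stat.S_IMODE(os.stat(path).st_mode)
```

If `is_executable("panX.py")` is False, calling `add_user_execute_bit("panX.py")` once should let run-TestSet.sh invoke the script directly.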

Error step 10

Hello again - Is there any explanation for what may cause an issue with certain files in the geneCluster folder?
Thanks,
Markus

Starting step10: create json file for geneDataTable visualization
Traceback (most recent call last):
File "./panX.py", line 322, in <module>
myPangenome.export_geneCluster_json()
File "/software/pangenomeanalysis/pan-genome-analysis/scripts/pangenome_computation.py", line 225, in export_geneCluster_json
geneCluster_to_json(self.path, self.enable_RNA_clustering, self.store_locus_tag, self.raw_locus_tag, self.optional_table_column)
File "/software/pangenomeanalysis/pan-genome-analysis/scripts/sf_geneCluster_json.py", line 171, in geneCluster_to_json
'"divers":"'+gene_diversity_Dt[clusterID]+'"',
KeyError: 'GC00000193_3'

Gene gain/loss, duplication

Hi Richard - I hope this is my last question for you. I uploaded my output files to the server. Is there a way to extract the information in the 'gene cluster table' and 'sequence alignment window' to obtain gene duplication, gain and loss events for each strain? I'm interested in making pie charts showing the percentage of ancestral genes, gene gain, gene loss and duplication events.
How was the geneGainLossEvent.json file in the TestSet folder generated? Would a similar file help me retrieve the information I am looking for?

Markus

Could this software be used for animals?

Hi!
I saw from the introduction that this software is designed for prokaryotes. We now want to do pan-genomics research on a simple diploid mammal, and I don't know whether your software can achieve good results on animal pan-genomes.
Looking forward to hearing from you!

db_xref error

Hi,
The issue I am reporting has been ongoing since I first noticed it a few months ago.
I have attached the system log below.

Traceback (most recent call last):
File "/home/Program/pan-genome-analysis/panX.py", line 256, in <module>
myPangenome.extract_gbk_sequences()
File "/home/Program/pan-genome-analysis/scripts/pangenome_computation.py", line 128, in extract_gbk_sequences
extract_sequences(self.path, self.strain_list, self.folders_dict, self.gbk_present, self.enable_RNA_clustering)
File "/home/Program/pan-genome-analysis/scripts/sf_extract_sequences.py", line 156, in extract_sequences
gene_aa_dict, gene_na_dict, RNA_dict, enable_RNA_clustering)
File "/home/Program/pan-genome-analysis/scripts/sf_extract_sequences.py", line 58, in gbk_translation
locus_tag=feature.qualifiers['db_xref'][0].split(':')[1]
KeyError: 'db_xref'

After analyzing the log and the arguments, I think the line has to be changed to something like:

locus_tag = contig.features[0].qualifiers['db_xref'][0].split(":")[1]

Please check it out and update it.
Thanks.

Newick

Hi - I'm running into an error during step 8. Please see output below. Any suggestions how to fix this error?
Best wishes,

Markus

====== starting step08: run fasttree and raxml for tree construction
fasttree time-cost: 0.00 minutes (0.01 seconds)
Traceback (most recent call last):
File "./panX.py", line 303, in <module>
myPangenome.build_core_tree()
File "/media/LargeStorage/Markus_cold_adaptation-master/pangenomeanalysis/pan-genome-analysis/scripts/pangenome_computation.py", line 200, in build_core_tree
aln_to_Newick(self.path, self.folders_dict, self.raxml_max_time, self.raxml_path, self.threads)
File "/media/LargeStorage/Markus_cold_adaptation-master/pangenomeanalysis/pan-genome-analysis/scripts/sf_core_tree_build.py", line 44, in aln_to_Newick
resolve_polytomies('initial_tree.newick0','initial_tree.newick')
File "/media/LargeStorage/Markus_cold_adaptation-master/pangenomeanalysis/pan-genome-analysis/scripts/sf_core_tree_build.py", line 8, in resolve_polytomies
tree = Tree(newickString);
File "/home/markusd/.local/lib/python2.7/site-packages/ete2/coretype/tree.py", line 218, in __init__
read_newick(newick, root_node = self, format=format)
File "/home/markusd/.local/lib/python2.7/site-packages/ete2/parser/newick.py", line 231, in read_newick
raise NewickError('Unexisting tree file or Malformed newick tree structure.')
ete2.parser.newick.NewickError: Unexisting tree file or Malformed newick tree structure.
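
Because fasttree returned after 0.01 seconds, the tree file it was supposed to write may be missing or empty, which would produce exactly this NewickError. A rough pre-check is sketched below; it is not part of panX, the function name is hypothetical, and it only tests basic structure. Quoted labels that contain parentheses would be misreported.

```python
import os


def newick_problems(path):
    """Return a list of human-readable problems found in a Newick file."""
    if not os.path.exists(path):
        return ["file does not exist"]
    with open(path) as handle:
        text = handle.read().strip()
    if not text:
        return ["file is empty"]
    problems = []
    if not text.endswith(";"):
        problems.append("missing terminating ';'")
    if text.count("(") != text.count(")"):
        problems.append("unbalanced parentheses")
    return problems
```

Running it on initial_tree.newick0, the file named in the traceback, would show whether the failure lies in the tree-building step rather than in the ete2 parser.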

single copy core genes

Hello Prof. Richard Neher @rneher

If I want to obtain all single-copy genes that are present in all the genomes under study, which alignment files should I use? (I use the default 1.0 for the -cg parameter.)

Are all the GC*_na_aln.fa files under the /vis/geneCluster directory single-copy genes? I noticed that some alignment filenames include a serial number, like GC*_x_na_aln.fa, where x=1,2,3, etc. Does this mean that the gene has more than one copy in the genome?

I ask because I would like to obtain only the single-copy gene alignments and ultimately either concatenate them into a single super-alignment or infer a supertree from the individual gene trees.

Thank you.
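
Independently of how the files are named, single-copy core clusters can be selected by counting genes per strain within each cluster. The sketch below shows that selection logic only. It assumes that each gene ID starts with the strain name followed by a separator, which is an assumption about the ID format and should be checked against the actual alignment headers; the function name is hypothetical.

```python
from collections import Counter


def single_copy_core(clusters, strains, separator="|"):
    """Return IDs of clusters with exactly one gene from every strain.

    clusters: dict mapping cluster ID to a list of gene IDs
    strains:  iterable of all strain names
    Gene IDs are assumed to look like '<strain><separator><gene>'.
    """
    wanted = set(strains)
    selected = []
    for cluster_id, genes in sorted(clusters.items()):
        counts = Counter(gene.split(separator, 1)[0] for gene in genes)
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected.append(cluster_id)
    return selected
```

A cluster with a duplicated gene in any strain, or with a strain missing, is excluded, so the result is suitable for concatenation into a super-alignment.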

error running the test set in the py3 branch

Hi.
I am testing panX on my laptop. I already had miniconda3 installed, so I tried the py3 branch after failing to install the main branch and reading #27. It seems to install properly, but I got an error message when running the test set:

mv: cannot stat 'GC00000143_2_aa_aln.fa': No such file or directory
mv: cannot stat 'GC00000143_2_na_aln.fa': No such file or directory
mv: cannot stat 'GC00000143_1_aa_aln.fa': No such file or directory
mv: cannot stat 'GC00000143_1_na_aln.fa': No such file or directory
Traceback (most recent call last):
  File "/.../pan-genome-analysis/./panX.py", line 312, in <module>
    myPangenome.infer_gene_gain_loss_pattern()
  File "/.../pan-genome-analysis/pan-genome-analysis/scripts/pangenome_computation.py", line 214, in infer_gene_gain_loss_pattern
    process_gain_loss(self.path, self.merged_gain_loss_output)
  File "/.../pan-genome-analysis/pan-genome-analysis/scripts/sf_gain_loss.py", line 104, in process_gain_loss
    set_seq_to_patternseq(tree)
  File "/.../pan-genome-analysis/pan-genome-analysis/scripts/sf_gain_loss.py", line 393, in set_seq_to_patternseq
    node.sequence = node.patternseq
AttributeError: can't set attribute

Any clue why this happened? Is it because I tried to install the main branch first?

Very diverse pangenome?

Hello,

I wanted to look at protein presence-absence of a broader group of genomes (several different orders from alpha-proteobacteria). Would you say it's possible with panX, or should I just run OrthoMCL or something like this?

Thank you!

PanX visualization: No gene tree

Hi,

Since yesterday, the panX visualization won't show the gene tree (for individual genes) for data I had already analyzed and viewed in the visualization panel, and which used to work fine. Has there been an update, or did I mess up some file?

I also have a question about the output files, for example for the core genome:
In "core_geneList.txt" I have 2363 GC.. entries, in "tmp_core" 1227, and 2965 are displayed in the visualization window. What is the difference between them? The same goes for the total number of genes.

Can you help me with this issue?
Thank you
@rneher

The meaning of file name in geneCluster directory

I used panX (v1.5.1) for pan-genome analysis on Ubuntu (18.04.1 LTS).

I would like to know the meaning of the file names in the geneCluster directory of my results. In this directory, I found many faa and fsa files for each gene cluster. Some file names have an additional number or character after the gene cluster name (GCxxxxxxxx_x, px or rx, where x is a number). I assume that the number is a serial number for distantly related genes split in the post-processing step, and that "p + number" marks paralogous genes. But what does "r + number" mean?

Kind regards,

Error in step09: infer presence/absence and gain/loss patterns of all genes

I am running panX on 63 genomes (6 complete, 57 draft genomes).

The raxml.log shows that RAxML finishes successfully, but 8 identical sequences have been found. I attached the terminal output for step09 below.

Any ideas? Thank you very much.

======  starting step09: infer presence/absence and gain/loss patterns of all genes

0.00    -TreeAnc: set-up

0.06    -TreeAnc: loading alignment failed... 

0.06    -TreeAnc.infer_ancestral_sequences with method: ml, joint

0.06    TreeAnc.infer_ancestral_sequences: ERROR, alignment or tree are missing
Traceback (most recent call last):
  File "./panX.py", line 312, in <module>
    myPangenome.infer_gene_gain_loss_pattern()
  File "/data/tools/pan-genome-analysis/scripts/pangenome_computation.py", line 214, in infer_gene_gain_loss_pattern
    process_gain_loss(self.path, self.merged_gain_loss_output)
  File "/data/tools/pan-genome-analysis/scripts/sf_gain_loss.py", line 102, in process_gain_loss
    tree = infer_gene_gain_loss(path)
  File "/data/tools/pan-genome-analysis/scripts/sf_gain_loss.py", line 40, in infer_gene_gain_loss
    n.genepresence = n.sequence
AttributeError: 'Clade' object has no attribute 'sequence'

Core genome not found even with low soft core parameter in similar genera. Error in step06 & step08

Hello all,

I have successfully run panX analyses on three different individual genera using,
./panX.py -fn data/myGenus -sl myGenus -t 2

I then tried to run all three genera together, with a total of 71 genomes (the majority of which are draft genomes). It returned this error:
====== starting step06: align genes in geneCluster by mafft and build gene trees
Traceback (most recent call last):
  File "./panX.py", line 287, in <module>
    myPangenome.process_clusters()
  File "/disk3/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
    myClusterCollector.estimate_raw_core_diversity()
  File "/disk3/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
    self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
  File "/disk3/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
    calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
  File "/disk3/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
    with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: 'pan-genome-analysis/data/Chloro_Cocco_Prasino/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

Following the thread in #8, I ran these genomes with -cg 0.7, 0.5, 0.3 and 0.1, without success.

Error messages were all similar to:
====== starting step08: run fasttree and raxml for tree construction
fasttree time-cost: 0.26 minutes (15.88 seconds)
RAxML tree optimization within the timelimit of 30 minutes
RAxML branch length optimization and rooting
Traceback (most recent call last):
  File "./panX.py", line 303, in <module>
    myPangenome.build_core_tree()
  File "/data/tools/pan-genome-analysis/scripts/pangenome_computation.py", line 200, in build_core_tree
    aln_to_Newick(self.path, self.folders_dict, self.raxml_max_time, self.raxml_path, self.threads)
  File "/data/tools/pan-genome-analysis/scripts/sf_core_tree_build.py", line 75, in aln_to_Newick
    shutil.copy('RAxML_result.branches', out_fname)
  File "/anaconda2/envs/panX/lib/python2.7/shutil.py", line 139, in copy
    copyfile(src, dst)
  File "/anaconda2/envs/panX/lib/python2.7/shutil.py", line 96, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'RAxML_result.branches'

The raxml.log reads:
`Option -T does not have any effect with the sequential or parallel MPI version.
It is used to specify the number of threads for the Pthreads-based parallelization

RAxML can't, parse the alignment file as phylip file
it will now try to parse it as FASTA file

ERROR: Sequence EhV145 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV156 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV164 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV18 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV201 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV202 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV203 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV207 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV208 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV84 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV86 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV88 consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence EhV99B1 consists entirely of undetermined values which will be treated as missing data
ERROR: Found 13 sequences that consist entirely of undetermined values, exiting...`

Unlike johannes in #8, I can't just delete these sequences, because I am working with a fairly small number of genomes and the raxml.log is actually reporting all sequences of one genus.

I then did a three way pairwise comparison for the different genera using
./panX.py -fn data/myGenus -sl myGenus -t 2
By this I mean, analyses with the genomes from genera
A and B together were successful,
A and C together were successful,
B and C together were successful, but
A, B, C together were unsuccessful.

Since the different analyses I compared generated core genomes that are present in 100% of the strains, there should be a core genome between all 3 genera. Any thoughts on what else I could do to try and fix this? Any help would be appreciated. Please let me know if there is more info I could provide. Thank you.
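
The RAxML log above lists sequences that consist entirely of undetermined characters. The same condition can be checked on any FASTA alignment before RAxML is started. The sketch below is not panX code; the function names are hypothetical, and the set of characters treated as undetermined is my assumption.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def undetermined_sequences(records, missing="-?NnXx"):
    """Return names whose sequence has no character outside `missing`."""
    return sorted(name for name, seq in records.items()
                  if not seq.strip(missing))
```

If every strain of one genus appears in the result, the alignment holds no informative column for that genus, and this check can confirm that before the 30 minute RAxML step runs.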

Issue with visualization

So I know this is not the GitHub repository for the pan-genome-visualization package but I am super lost regarding how to make the visualization work.

What exactly are the arguments for the link-to-server.py script? I understand that the -s argument needs to be the folder containing the output files of pan-genome-analysis. However, I do not know what the -v argument should be. I have tried the absolute path of the pan-genome-visualization package and the relative path of the local host server 800, but to no avail.

Could you please help me out with this one? Thanks in advance.

Cheers,
Pablo

Step for generation of archive all_protein_seq.cpk

My question is at which step the file all_protein_seq.cpk is generated. If I run the steps one by one, this file is not generated, and step 6 then naturally doesn't work.
When I run all the steps at once, step 5 breaks because the .faa files disappear, and it reports that there are no .faa files.
Thanks in advance.
