turtletools / caretta

A software-suite to perform multiple protein structure alignment and structure feature extraction.

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

alignment protein protein-structure

caretta's Issues

Error when writing aligned pdb

Hi.

Interested in using this tool, but I'm receiving the following error:

  File "/root/miniconda3/bin/caretta-cli", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/caretta/bin/caretta-cli", line 103, in <module>
    app()
  File "/root/miniconda3/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/caretta/bin/caretta-cli", line 87, in align
    multiple_alignment.StructureMultiple.align_from_pdb_files(
  File "/caretta/caretta/multiple_alignment.py", line 221, in align_from_pdb_files
    msa_class.write_files(
  File "/caretta/caretta/multiple_alignment.py", line 624, in write_files
    self.write_superposed_pdbs(pdb_folder)
  File "/caretta/caretta/multiple_alignment.py", line 688, in write_superposed_pdbs
    reference_pdb[helper.get_alpha_indices(reference_pdb)]
IndexError: arrays used as indices must be of integer (or boolean) type

There's no error if I use the --no-pdb flag, but those alignments are what I'm after.
I'm using the CLI and running it as a Docker image. Not sure if that might have anything to do with it.

Any ideas?
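
For context, NumPy raises this exact message whenever the array used as an index has a non-integer dtype. One common way to end up there is an empty selection, because np.array([]) defaults to float64. The sketch below reproduces the message and shows a defensive cast. Note that take_rows is a hypothetical helper, not part of caretta, and whether an empty alpha-carbon selection is the actual cause in this report is an assumption.

```python
import numpy as np


def take_rows(values: np.ndarray, indices) -> np.ndarray:
    """Index `values` with `indices`, casting the indices to integers first.

    Hypothetical helper for illustration; not part of caretta.
    """
    index_array = np.asarray(indices)
    if index_array.size == 0:
        # An empty list becomes a float64 array, which NumPy refuses to
        # use as an index. Return an empty selection explicitly instead.
        return values[:0]
    return values[index_array.astype(np.int64)]


coordinates = np.arange(12.0).reshape(4, 3)

message = ""
try:
    coordinates[np.array([])]  # float64 index array triggers the IndexError
except IndexError as error:
    message = str(error)

selected = take_rows(coordinates, [0.0, 2.0])  # float indices, cast to int
empty = take_rows(coordinates, [])
```

If the selection really is empty, the more useful fix is probably upstream: checking why no alpha carbons were found in the reference structure.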

Add flag to output distance matrix

GUI - add a select all button for PDBs

and a line that suggests spamming the Enter key instead of mouse clicks for selection

A type error occurs

Hi,
How can I solve the type error below?

Python version 3.8.13

(caretta) [user1@localhost hw]$ caretta-cli ldh/
Traceback (most recent call last):
  File "/home/use1r/.conda/envs/caretta/bin/caretta-cli", line 2, in <module>
    from caretta import multiple_alignment
  File "/home/user1/.conda/envs/caretta/lib/python3.8/site-packages/caretta/multiple_alignment.py", line 177, in <module>
    class StructureMultiple:
  File "/home/user1/.conda/envs/caretta/lib/python3.8/site-packages/caretta/multiple_alignment.py", line 1028, in StructureMultiple
    ) -> tuple[list[str], dict[str, ndarray]]:
TypeError: 'type' object is not subscriptable
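
Subscripting built-in types in annotations, as in tuple[list[str], ...], is only supported from Python 3.9 onward (PEP 585). On Python 3.8 the annotation is evaluated when the function is defined and fails with this TypeError. Running under Python 3.9 or later avoids it. Alternatively, the typing generics work on both versions, as in this minimal sketch (the function is a made-up example, not caretta code):

```python
from typing import Dict, List, Tuple


def split_names_and_lengths(sequences: Dict[str, str]) -> Tuple[List[str], Dict[str, int]]:
    """Return the sequence names and a name -> length mapping.

    typing.Tuple/List/Dict are used so the annotation also evaluates on Python 3.8.
    """
    names = list(sequences)
    lengths = {name: len(sequence) for name, sequence in sequences.items()}
    return names, lengths


names, lengths = split_names_and_lengths({"1abc": "MKV", "2xyz": "MKVLA"})
```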

Numba version causing issues in alignment

TypeError while aligning

I am getting the following error while using caretta from the command line (I can provide the list of pdbs to reproduce the error if needed):

caretta-cli list_IPR000072_pdbs.dat 
Found 66 structure files
Found 66 protein structures 100%|
Computed invariants in 13.59 seconds
Found 66 structures with valid invariants 100%|
Aligning:   3%|███   
Traceback (most recent call last):

  File "/home/disat/amuntoni/miniconda3/bin/caretta-cli", line 127, in <module>
    app()

  File "/home/disat/amuntoni/miniconda3/bin/caretta-cli", line 108, in align
    multiple_alignment.align_from_structure_files(

  File "/home/disat/amuntoni/miniconda3/lib/python3.9/site-packages/caretta/multiple_alignment.py", line 519, in align_from_structure_files
    alignment = msa_class.multiple_align(

  File "/home/disat/amuntoni/miniconda3/lib/python3.9/site-packages/caretta/multiple_alignment.py", line 278, in multiple_align
    self.alignment = self.progressive_align(self.tree,

  File "/home/disat/amuntoni/miniconda3/lib/python3.9/site-packages/caretta/multiple_alignment.py", line 243, in progressive_align
    make_intermediate_node(node_1, node_2, node_int)

  File "/home/disat/amuntoni/miniconda3/lib/python3.9/site-packages/caretta/multiple_alignment.py", line 204, in make_intermediate_node
    score_matrix = final_sequences[n1].score_function(

  File "/home/disat/amuntoni/miniconda3/lib/python3.9/site-packages/caretta/multiple_alignment.py", line 332, in score_function
    aln_1, aln_2, score = dtw.smith_waterman(np.arange(score_matrix.shape[0]),

TypeError: expected UniTuple(int64 x 2), got None

Is it related to a "bad" PDB file? How can I spot it?
Thank you for your help!
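
UniTuple in the message is a Numba type name, so the installed Numba and NumPy versions are worth including in a report like this one. That the versions matter here is an assumption, not a diagnosis. A small sketch for collecting them, using only the standard library:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = None
    return versions


report = installed_versions(["caretta", "numba", "numpy"])
for package, version in report.items():
    print(f"{package}: {version or 'not installed'}")
```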

Would it be possible to include an option to apply the rotation + translation matrix to the original PDB as opposed to the cleaned PDB?

This would be useful if you are trying to align structures which contain cofactors, such as ligands.
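
To illustrate the request: once a rotation matrix and translation vector are known, they can be applied to every atom in the original file, ligands included, not only to the cleaned protein atoms. The sketch below assumes the convention x' = R x + t with coordinates stored as rows. How caretta stores or exposes its matrices is not shown in this issue, so the names and convention here are assumptions.

```python
import numpy as np


def apply_transform(coordinates: np.ndarray, rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Apply x' = R @ x + t to every row of an (N, 3) coordinate array."""
    return coordinates @ rotation.T + translation


# 90-degree rotation about the z axis, followed by a shift along x.
rotation = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
translation = np.array([10.0, 0.0, 0.0])

# Protein and ligand atoms alike: every atom gets the same transform.
all_atoms = np.array([[1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0]])
moved = apply_transform(all_atoms, rotation, translation)
```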

The third aligned output structure goes missing for no reason

This is a super useful tool that helped me a lot!

However, when I used more than three structures as input, the outputs in superposed_pdbs were always missing the third input structure, so I guess it might be a bug.

caretta needs to be run outside of the input folder

Hi,
I'm trying caretta for the first time and I ran into the error "AssertionError: Could not understand input caretta_results", both with PDB files obtained from ColabFold and with ones downloaded from the PDB. The command was: caretta-cli . -t 8
The whole traceback is:

Traceback (most recent call last):

  File "/home/gsn/mambaforge/envs/caretta/bin/caretta-cli", line 127, in <module>
    app()

  File "/home/gsn/mambaforge/envs/caretta/bin/caretta-cli", line 108, in align
    multiple_alignment.align_from_structure_files(

  File "/home/gsn/mambaforge/envs/caretta/lib/python3.9/site-packages/caretta/multiple_alignment.py", line 476, in align_from_structure_files
    pdb_files = helper.parse_protein_files_and_clean(input_files, output_files.cleaned_pdb_folder)

  File "/home/gsn/mambaforge/envs/caretta/lib/python3.9/site-packages/caretta/helper.py", line 166, in parse_protein_files_and_clean
    protein_files = get_structure_files(input_value)

  File "/home/gsn/mambaforge/envs/caretta/lib/python3.9/site-packages/geometricus/protein_utility.py", line 134, in get_structure_files
    assert type(protein_file) == str, f"Could not understand input {protein_file}"

AssertionError: Could not understand input caretta_results

Then I realized that caretta needs to be run from outside the input folder; otherwise it will take the caretta_results folder as input. Maybe this should be mentioned in the documentation?
Thanks.
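
Until this is documented or handled, one workaround is to collect only the structure files and skip subfolders such as caretta_results. This is a minimal sketch. The set of file extensions is an assumption, and the resulting paths could be written to a list file like the one passed in the "caretta-cli list_IPR000072_pdbs.dat" call from an earlier issue.

```python
from pathlib import Path
import tempfile

STRUCTURE_SUFFIXES = {".pdb", ".cif", ".ent"}  # assumed set of extensions


def list_structure_files(folder):
    """Return structure files directly inside `folder`, ignoring subfolders."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in STRUCTURE_SUFFIXES
    )


with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "a.pdb").write_text("END\n")
    (Path(folder) / "b.pdb").write_text("END\n")
    (Path(folder) / "caretta_results").mkdir()  # output folder from an earlier run
    found = [path.name for path in list_structure_files(folder)]
```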

Regarding features and alignment

This is a great software and it is very helpful :)
I'm currently trying to do the alignment for 66 proteins and managed to get the alignment file and the features in pkl file.
I would like to ask how we can link the alignment and the features together, as they seem to be ordered arbitrarily. Thank you so much!
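
One way to do this is to match entries by protein name rather than by position, then spread each protein's per-residue features over the alignment columns. The sketch below assumes the alignment is available as a name-to-gapped-sequence mapping (for example, parsed from the FASTA output) and the features as a name-to-array mapping. The actual layout of the pkl file may differ, so both structures are assumptions.

```python
import numpy as np


def align_features(aligned_sequence: str, features: np.ndarray, gap: str = "-") -> np.ndarray:
    """Spread per-residue features over alignment columns, NaN at gaps."""
    n_residues = sum(character != gap for character in aligned_sequence)
    if n_residues != len(features):
        raise ValueError("feature count does not match the number of residues")
    aligned = np.full(len(aligned_sequence), np.nan)
    residue_index = 0
    for column, character in enumerate(aligned_sequence):
        if character != gap:
            aligned[column] = features[residue_index]
            residue_index += 1
    return aligned


# The two mappings deliberately list the proteins in different orders.
alignment = {"protein_b": "M-KV", "protein_a": "MAK-"}
features = {"protein_a": np.array([0.1, 0.2, 0.3]),
            "protein_b": np.array([1.0, 2.0, 3.0])}

# Look entries up by name, so the ordering of either mapping does not matter.
aligned_features = {name: align_features(alignment[name], features[name])
                    for name in alignment}
```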

Error with input

Dear all,

Thank you for the software. After downloading it, I tested it on the sample PDB files that ship with it. Running the command "caretta-cli align" produced an error: assert type(protein_file) == str, f"Could not understand input {protein_file}". Do you know what is happening and what I need to do?
Thanks,
Lee

Add option to change superposition parameters from CLI/GUI

This could be a JSON/TOML/any kind of key value file, with the default parameters written in it that you pass in to the CLI via --superpose-parameters.
Each superposition function needs documentation to make clear what parameters it exposes.
The GUI needs a dropdown to select the superposition function which should trigger a set of textareas for the corresponding parameters
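
A minimal sketch of the file-loading side of this proposal, using JSON. The parameter names and default values below are placeholders rather than caretta's real ones; only the --superpose-parameters flag name comes from the proposal above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults; the real names depend on the superposition function.
DEFAULT_PARAMETERS = {"gap_open_penalty": 1.0, "gap_extend_penalty": 0.01, "gamma": 0.03}


def load_superpose_parameters(path=None):
    """Return the defaults, overridden by any values found in a JSON file."""
    parameters = dict(DEFAULT_PARAMETERS)
    if path is None:
        return parameters
    overrides = json.loads(Path(path).read_text())
    unknown = set(overrides) - set(DEFAULT_PARAMETERS)
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise KeyError(f"unknown superposition parameters: {sorted(unknown)}")
    parameters.update(overrides)
    return parameters


with tempfile.TemporaryDirectory() as folder:
    parameter_file = Path(folder) / "superpose.json"
    parameter_file.write_text(json.dumps({"gamma": 0.1}))
    parameters = load_superpose_parameters(parameter_file)
```

Rejecting unknown keys keeps a misspelled parameter from being silently replaced by its default.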

Test web app file loading

Is it possible to output TM-scores?

Hi there,

I was wondering if it's possible to output the TM-scores between proteins. Also, what exactly is the distance matrix that is output?

All the best
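
For reference, the TM-score is commonly defined as (1 / L) * sum over aligned residue pairs of 1 / (1 + (d_i / d0)^2), with d0 = 1.24 * (L - 15)^(1/3) - 1.8 and L the length of the target protein. The sketch below evaluates that formula for coordinates that are already superposed and paired residue by residue. It does not search for the optimal superposition the way TM-align does, it is not caretta's implementation, and the 0.5 lower bound on d0 for short chains is an assumption borrowed from common practice.

```python
import numpy as np


def tm_d0(length: int) -> float:
    """Length-dependent distance scale, clamped to 0.5 for short chains."""
    if length <= 15:
        return 0.5
    return max(1.24 * (length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(coords_1: np.ndarray, coords_2: np.ndarray, target_length: int) -> float:
    """TM-score for superposed, residue-paired (N, 3) coordinate arrays."""
    distances = np.linalg.norm(coords_1 - coords_2, axis=1)
    terms = 1.0 / (1.0 + (distances / tm_d0(target_length)) ** 2)
    return float(terms.sum() / target_length)


rng = np.random.default_rng(0)
reference = rng.normal(size=(100, 3))

identical = tm_score(reference, reference, target_length=100)
# Shifting every residue by exactly d0 makes each term 1 / (1 + 1).
shifted = tm_score(reference, reference + [tm_d0(100), 0.0, 0.0], target_length=100)
```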

Python 2.7 problem? macOS 11.1

I just installed caretta manually on my Mac (macOS 11.1). Can you suggest where the problem might be? Is my machine calling python2.7 by default instead of python3?

thank you

60 % caretta-cli test
@> 2465 atoms and 1 coordinate set(s) were parsed in 0.02s.
@> 2739 atoms and 1 coordinate set(s) were parsed in 0.02s.
@> 2647 atoms and 1 coordinate set(s) were parsed in 0.02s.
Found 3 PDB files
@> 2465 atoms and 1 coordinate set(s) were parsed in 0.02s.
@> 2739 atoms and 1 coordinate set(s) were parsed in 0.02s.
@> 2647 atoms and 1 coordinate set(s) were parsed in 0.02s.
Calculating pairwise distances...
Constructing neighbor joining tree...
Aligning [####################################] 100%
Traceback (most recent call last):
  File "/Applications/Darwin/miniconda3/bin/caretta-cli", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/Applications/Darwin/caretta/bin/caretta-cli", line 115, in <module>
    app()
  File "/Applications/Darwin/miniconda3/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/Applications/Darwin/miniconda3/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Applications/Darwin/miniconda3/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Applications/Darwin/miniconda3/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Applications/Darwin/miniconda3/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Applications/Darwin/miniconda3/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Applications/Darwin/caretta/bin/caretta-cli", line 97, in align
    multiple_alignment.StructureMultiple.align_from_pdb_files(
  File "/Applications/Darwin/caretta/caretta/multiple_alignment.py", line 295, in align_from_pdb_files
    msa_class.write_files(
TypeError: write_files() got multiple values for argument 'only_dssp'
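
The paths in the traceback point at python3.8, so Python 2.7 does not appear to be involved. Python raises this particular TypeError when a call supplies the same parameter both positionally and by keyword. The stand-in below reproduces the message. Its signature is invented for illustration and is not caretta's real write_files signature.

```python
def write_files(write_fasta, write_pdb, only_dssp=True):
    """Stand-in with a hypothetical signature; not caretta's real one."""
    return write_fasta, write_pdb, only_dssp


message = ""
try:
    # Three positional values fill all three parameters, so the keyword
    # argument tries to bind `only_dssp` a second time.
    write_files(True, True, False, only_dssp=False)
except TypeError as error:
    message = str(error)
```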

codec utf-8 error preventing align

Hi there, I'm using caretta-cli to align all .pdb files in a folder, but I get an error message after the files are parsed. I can see that cleaned PDB files are created for all my input PDB files, but the alignment fails to run. Here is the error message I get:

File "/Users/s1427471/anaconda3/envs/snakes/bin/caretta-cli", line 127, in <module>
app()

File "/Users/s1427471/anaconda3/envs/snakes/bin/caretta-cli", line 108, in align
multiple_alignment.align_from_structure_files(

File "/Users/s1427471/anaconda3/envs/snakes/lib/python3.9/site-packages/caretta/multiple_alignment.py", line 476, in align_from_structure_files
pdb_files = helper.parse_protein_files_and_clean(input_files, output_files.cleaned_pdb_folder)

File "/Users/s1427471/anaconda3/envs/snakes/lib/python3.9/site-packages/caretta/helper.py", line 169, in parse_protein_files_and_clean
protein = parse_structure_file(str(protein_file)).select("protein")

File "/Users/s1427471/anaconda3/envs/snakes/lib/python3.9/site-packages/geometricus/protein_utility.py", line 82, in parse_structure_file
protein = pd.parsePDBStream(f)

File "/Users/s1427471/anaconda3/envs/snakes/lib/python3.9/site-packages/prody/proteins/pdbfile.py", line 313, in parsePDBStream
lines = stream.readlines()

File "/Users/s1427471/anaconda3/envs/snakes/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

I'm using macOS. Any ideas how to fix this would be much appreciated :)
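
The error means that one of the files being read contains a byte sequence that is not valid UTF-8. To find out which file, the folder can be scanned as in the sketch below. That the culprit is a hidden or non-text file sitting next to the PDB files is only a guess, and the function name is made up for this example.

```python
from pathlib import Path
import tempfile


def find_non_utf8_files(folder):
    """Return names of files in `folder` whose bytes are not valid UTF-8."""
    offenders = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            offenders.append(path.name)
    return offenders


with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "good.pdb").write_text("ATOM\n", encoding="utf-8")
    (Path(folder) / "bad.pdb").write_bytes(b"REMARK \x80\n")  # 0x80 is an invalid start byte
    offenders = find_non_utf8_files(folder)
```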

No superimposed PDB

Hi, I am very much interested in using this program in my project. I am trying to align 831 predicted structures. They have been well analysed with sequence alignments in previous studies, but I am having problems aligning them with Caretta. The run finishes without an error message, but I can't find any superimposed PDB files, only cleaned_pdb and result_pdb, and result_pdb is virtually identical to cleaned_pdb. I also see very high RMSD values and very low TM-scores, which is not the case if I align some of the structures in PyMOL. Could you please help me find what I am missing? Thank you in advance!

make_rmsd_coverage_tm_matrix usage

Hello, may I have an example for the usage of make_rmsd_coverage_tm_matrix? Thank you :)

Originally posted by @lingnus1 in #14 (comment)

Create pip package and start versioning

Change superpose functionality to use central protein as reference

Currently, superpose and write_superposed_pdbs have different behaviors - superpose uses the first protein as the reference to rotate others against, and write_superposed_pdbs uses the set of core indices to rotate. Both are problematic - the first protein may be distant from the rest, and the core indices may be empty for divergent datasets.

Solution:

Use the "central" protein (selected using the Geometricus similarity matrix as the protein with a reasonable level of similarity to the rest) as the reference to superpose
Print a warning message/error message when no core indices are found as this could represent a dataset with too divergent proteins for meaningful alignment or with multiple groups of proteins which should ideally be aligned separately.

Related to Issue #8
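
A minimal sketch of the reference selection proposed above. It interprets "central" as the protein with the highest mean similarity to all the others, which is one possible reading of the proposal. The similarity matrix here is made-up data standing in for the Geometricus similarity matrix.

```python
import numpy as np


def central_index(similarity: np.ndarray) -> int:
    """Index of the protein with the highest mean similarity to all others."""
    similarity = np.asarray(similarity, dtype=float)
    n_proteins = similarity.shape[0]
    if n_proteins < 2:
        return 0
    # Exclude self-similarity so the diagonal does not influence the choice.
    off_diagonal_sum = similarity.sum(axis=1) - np.diag(similarity)
    return int(np.argmax(off_diagonal_sum / (n_proteins - 1)))


similarity = np.array([[1.0, 0.8, 0.2],
                       [0.8, 1.0, 0.7],
                       [0.2, 0.7, 1.0]])
reference = central_index(similarity)
```

In this toy matrix the first and third proteins are dissimilar to each other but both resemble the second, so the second is the one picked as the reference.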