
Comments (20)

douweschulte commented on June 11, 2024

Interesting point. The (very limited) comparison I did between cctbx (C++-based, used through its Python API) and pdbtbx came out at similar speeds, but with much smaller PDBs. There is always room for improvement, but right now I do not know where to find it. I would start by looking at the places where memory is allocated unnecessarily and at the way new atoms are added to the list. I assume the really fast libraries have smarter ways of finding the right conformer to add the atom to. Also, I see now that 1HTQ has 10 models of nearly 100,000 atoms each; I think that in that case the validation also takes a long while.

So yes, nice point. I will do some measuring and see if there is something obvious I can improve. It would be a shame to use an extremely fast language and end up with a slower library than one written in a (normally) slower language ;-).

douweschulte commented on June 11, 2024

Side note: this is a great resource if you ever want to do performance work in Rust: https://nnethercote.github.io/perf-book/introduction.html. Another option is this book: https://www.packtpub.com/product/rust-high-performance/9781788399487.

DocKDE commented on June 11, 2024

I agree, it would be a shame; that's why I opened this issue. I did a few optimizations in my own crate and I'm mostly happy with its performance. The step taking the longest by far is the actual parsing :)

DocKDE commented on June 11, 2024

I ran a quick flamegraph analysis on a simple executable that does nothing but open a PDB file. Apparently, the vast majority of the time is spent within the chain::add_atom function. Why exactly that is, I don't know (yet).

douweschulte commented on June 11, 2024

Nice! I would assume (this definitely needs some measurements) that it takes a long time to find the correct residue + conformer to add the atom to. If that is the case, I think starting the search for the correct residue + conformer from the end of the list would be faster; a sketch of that idea follows below.
Also, I think 1HTQ stresses the validation a lot, because pdbtbx thinks all atoms from all models are wrong.
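To make the suggestion concrete, a minimal sketch of the reversed search; Residue here is a stripped-down stand-in, not the real pdbtbx type:

// A stripped-down stand-in for pdbtbx's Residue, just for this sketch.
struct Residue {
    serial_number: isize,
    insertion_code: Option<String>,
}

impl Residue {
    fn id(&self) -> (isize, Option<&str>) {
        (self.serial_number, self.insertion_code.as_deref())
    }
}

// Search from the back: in a sequential PDB file the atom being added almost
// always belongs to the most recently added residue, so the common case is
// O(1) and only a brand-new residue walks the whole list.
fn find_residue_mut<'a>(
    residues: &'a mut [Residue],
    id: (isize, Option<&str>),
) -> Option<&'a mut Residue> {
    residues.iter_mut().rev().find(|r| r.id() == id)
}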

DocKDE commented on June 11, 2024

I investigated some more.
The binary I was looking at contains just this code:

fn main() {
    let _a = pdbtbx::open_pdb("large.pdb", pdbtbx::StrictnessLevel::Strict);
}

The PDB file is the same one from the pdbtbx example files that I used to test files with more than 100k atoms. I've never profiled anything before, so I don't really know much about what I'm doing, but I used perf to take recordings of this binary running and determined (as with the flamegraph before) that most of the time is spent within the chain::add_atom method.
Looking at the generated assembly code, I got this:

  0,19 │ c0:   add    rbp,0x38                                                                                   
  0,05 │       add    r12,0xffffffffffffffc8                                                                   
       │     ↓ je     114                                                                                       
  1,00 │ ca:   cmp    QWORD PTR [rbp+0x0],rbx                                                                      
 98,60 │     ↑ jne    c0                                                                                         
  0,01 │       mov    rdi,QWORD PTR [rbp+0x8]                                                                     
       │       test   r13,r13                                                                                    
       │       setne  al                                                                                          

I have to confess that I don't know anything about assembly, but the hottest instruction is jne, which is "jump if not equal". If you happen to know more, I'd love to hear your thoughts.
From this I reasoned that the bottleneck was the evaluation of one of the if statements in the method. I commented out some code, recompiled, ran the binary again and found that the part taking most of the time (by a large margin) is the two lines within the second if statement:

if !found {
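    // The residue was not found in this chain: append it and take a mutable
    // reference to the freshly pushed entry.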
    self.residues.push(new_residue);
    current_residue = self.residues.last_mut().unwrap();
}

Also, the bottleneck is not the evaluation of found but rather the instructions within the body. Since these two depend on one another and/or other things, I couldn't narrow it down further for the time being.

This shows, however, that the for loop determining whether the current residue is already present is, in fact, not the problem, unless I'm much mistaken here.

Eliminating the if statement (just for testing purposes) cuts the run time of the binary by more than 80%, and the next most expensive calculations come from lex_atom_basics and parse_number.

DocKDE commented on June 11, 2024

Okay, I dug a little more into the profiling and I think I've gotten to the bottom of the issue now. What I wrote earlier seems to have been an incorrect assessment.
I performed a couple of runs with perf on a binary compiled from this:

fn main() {
    let _a = pdbtbx::open_pdb("large.pdb", pdbtbx::StrictnessLevel::Strict);
}

Then I recorded a profile with perf record --call-graph=dwarf ./target/release/pdbtest and looked at it with perf report. Looking at the call stack, the conclusion remains that the chain::add_atom method is the culprit. However, it appears that it in turn calls core::tuple::<impl core::cmp::PartialEq for (A,B)>::eq (inlined), which makes up some 88% of the run time.
So it seems clear that the bottleneck is the comparison residue.id() == residue_id within the add_atom method.
Since the comparison has already been inlined by the compiler, I'm not sure how to improve the situation just yet, but I wanted to share the information.
If you have any criticisms or suggestions, I'm all ears!

P.S. Removing everything in the loop body except current_residue = residue cuts the run time by some 80% or so, which seems to support my assessment.
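For anyone following along: tuple equality in Rust is element-wise, so each candidate residue costs an integer comparison plus, for the insertion code, an optional string comparison. A trivial illustration, assuming the id is a (serial number, insertion code) tuple:

fn main() {
    // The inlined function perf reports: core::tuple's PartialEq compares
    // the elements one by one, short-circuiting on the first mismatch.
    let a: (isize, Option<&str>) = (42, Some("A"));
    let b: (isize, Option<&str>) = (42, Some("A"));
    assert!(a == b);
}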

DocKDE commented on June 11, 2024

I created a PR that adds some clippy lints and the reversal of iteration that you mentioned. Turns out your assessment of the bottleneck was exactly right. :)
With this change the tuple comparison mentioned above is still the major bottleneck, but it's less pronounced. If you have any ideas on how to increase performance even more, I'm interested...
#82

douweschulte commented on June 11, 2024

The comparison is done when searching for the right residue to insert a new atom into. By going from the back, in most cases ((n-1)/n, where n is the average number of atoms per residue) the residue is found immediately; but if the residue has not been added yet, the search still has to go through the whole list. If we assume that no residue is ever defined with another residue in between, the best way forward would be the same trick the parser already uses for chains: keep the new residue local in the parser and add all atoms to it until a new residue is found, at which point the finished residue is added to the current chain and a new one is made. A sketch of that trick follows below.
That would be a huge improvement as well, I assume, but I am not entirely sure the assumption holds :/.
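A rough sketch of the local-residue trick, with made-up minimal types (the real parser also deals with conformers and chains), assuming atoms of one residue are never interleaved with another:

struct Atom { residue_serial: isize }
struct Residue { serial: isize, atoms: Vec<Atom> }
struct Chain { residues: Vec<Residue> }

fn build_chain(atoms: Vec<Atom>) -> Chain {
    let mut chain = Chain { residues: Vec::new() };
    let mut current: Option<Residue> = None;
    for atom in atoms {
        let same_residue = matches!(&current, Some(res) if res.serial == atom.residue_serial);
        if !same_residue {
            // A new residue starts: flush the finished one into the chain.
            if let Some(done) = current.take() {
                chain.residues.push(done);
            }
            current = Some(Residue { serial: atom.residue_serial, atoms: Vec::new() });
        }
        // The common case touches only the local residue: no list search at all.
        current.as_mut().unwrap().atoms.push(atom);
    }
    // Flush the final residue.
    if let Some(done) = current {
        chain.residues.push(done);
    }
    chain
}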

If it does not hold, some kind of set built for constant-time lookups on just the residue ids could be beneficial, removing the need for the full comparison loop.

DocKDE commented on June 11, 2024

I was thinking along the same lines just now (I think). If I understand correctly, each time an atom is to be added (for large.pdb that would be more than 100k times), the entire current chain is traversed in search of the residue to add the atom to.
So each time an atom of a new residue is added, the whole chain is searched in vain, which amounts to a lot of pointless searching. I hope I got that right, I tried to use my own words for it :D

My suggestion would be to introduce either a set holding the already present residue IDs or some sort of memoization via a HashMap storing the residues that have already been added; see the sketch below.
Either way, this should reduce lookup times for residue IDs rather drastically. The set approach would probably be easier; I have already used sets successfully in my own pdbman project to reduce the traversal of Vecs.
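A sketch of the memoization variant, with stand-in types and a map that only lives while parsing (a real residue id would also include the insertion code):

use std::collections::HashMap;

struct Atom { residue_serial: isize }
struct Residue { serial: isize, atoms: Vec<Atom> }

// Map from residue serial to its index in `residues`: the "does this residue
// exist yet?" check becomes a hash lookup instead of a scan over the chain.
fn add_atom(residues: &mut Vec<Residue>, index_of: &mut HashMap<isize, usize>, atom: Atom) {
    let idx = *index_of.entry(atom.residue_serial).or_insert_with(|| {
        residues.push(Residue { serial: atom.residue_serial, atoms: Vec::new() });
        residues.len() - 1
    });
    residues[idx].atoms.push(atom);
}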

douweschulte commented on June 11, 2024

Yes, that is exactly what I was thinking about. I think changing the internal storage from Vec to some other structure is justified for this cause. It would probably also improve performance in some other cases.

DocKDE commented on June 11, 2024

I would suggest using an IndexSet from the indexmap crate (https://docs.rs/indexmap/latest/indexmap/set/struct.IndexSet.html). It's an ordered set that offers many of the standard Vec methods but features constant-time lookups. I think this should be a good fit because uniqueness is a desired property.
Do you think the whole hierarchy of structs, Model, Chain, Residue, Conformer, Atom, which is now stored in Vecs on each level, should be converted to sets?
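To give a feel for the API (this needs indexmap as a dependency; the id tuple is just an example):

use indexmap::IndexSet;

fn main() {
    // An IndexSet keeps insertion order like a Vec, but `contains` is a
    // hash lookup instead of a linear scan, and duplicates are rejected.
    let mut ids: IndexSet<(isize, Option<&str>)> = IndexSet::new();
    ids.insert((1, None));
    ids.insert((2, Some("A")));
    ids.insert((1, None)); // already present: the set is unchanged

    assert_eq!(ids.len(), 2);
    assert!(ids.contains(&(2, Some("A"))));
    assert_eq!(ids.get_index(0), Some(&(1, None))); // order preserved
}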

DocKDE commented on June 11, 2024

A possible pitfall (which I've come across before) is that storing Residues, Atoms and the like in HashSets or IndexSets won't work out of the box, because they all contain f64 fields, which don't implement Hash. A possible solution would be to use OrderedFloat from the ordered-float crate (https://docs.rs/ordered-float/latest/ordered_float/struct.OrderedFloat.html).
I suppose all of this would require some major refactoring, but I think it should lead to significant speedups during parsing.
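A minimal sketch of that workaround, assuming we are free to change the field types (needs the ordered-float crate; Position is a stand-in type):

use ordered_float::OrderedFloat;
use std::collections::HashSet;

// f64 implements neither Eq nor Hash, so a struct holding raw floats cannot
// go into a HashSet. OrderedFloat<f64> defines a total order plus Hash,
// which makes the derives below possible.
#[derive(PartialEq, Eq, Hash)]
struct Position {
    x: OrderedFloat<f64>,
    y: OrderedFloat<f64>,
    z: OrderedFloat<f64>,
}

fn main() {
    let mut seen = HashSet::new();
    seen.insert(Position { x: OrderedFloat(1.0), y: OrderedFloat(2.0), z: OrderedFloat(3.0) });
    assert!(seen.contains(&Position { x: OrderedFloat(1.0), y: OrderedFloat(2.0), z: OrderedFloat(3.0) }));
}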

douweschulte commented on June 11, 2024

Implementing Hash is actually doable. Equality is defined by the serial_numbers/names on each level, so no floats are involved in that. I'm creating a commit implementing Hash for everything.
Edit: not so sure anymore whether this is correct.

DocKDE commented on June 11, 2024

I had a look at this myself and came to the conclusion that implementing Hash on Atom (which is the only one that can't just be derived) only works if you exclude all of the float fields.
For example, one could hash a tuple of name, serial number and so on, but not the position, occupancy, b_factor etc.
It's only important that k1 == k2 -> hash(k1) == hash(k2) holds, so we won't get logic errors. So if the Hash implementation is less strict than the Eq one (because it doesn't consider a few fields), it should be fine. The only question is whether this is a good idea; a sketch of what it could look like is below.
Other than that we could, in fact, use OrderedFloat from that crate, but I'm not sure that's feasible, because we would have to implement the Serde and rstar traits by hand, which, I think, is not possible since neither the trait nor the type would be local to this crate.
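A sketch of a hand-written Hash on a stripped-down Atom (field names are illustrative; the point is that Hash only uses fields that Eq also compares):

use std::hash::{Hash, Hasher};

struct Atom {
    serial_number: usize,
    name: String,
    // Float fields, deliberately excluded from Eq and Hash below.
    x: f64,
    y: f64,
    z: f64,
    occupancy: f64,
    b_factor: f64,
}

// Identity is the serial number plus the name, so no floats are involved.
impl PartialEq for Atom {
    fn eq(&self, other: &Self) -> bool {
        self.serial_number == other.serial_number && self.name == other.name
    }
}
impl Eq for Atom {}

// Hashing the same fields that Eq compares keeps the required invariant:
// a == b implies hash(a) == hash(b). Hashing a strict subset of Eq's fields
// would also be sound, just with more hash collisions.
impl Hash for Atom {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.serial_number.hash(state);
        self.name.hash(state);
    }
}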

DocKDE commented on June 11, 2024

Additional note: I just realized that using sets as a drop-in replacement for the Vecs currently in use needs some more thought, because creating a mutable reference to a set item is not possible. A workaround should be possible, but it may come at additional cost, especially if the order of the set is to be preserved.

douweschulte commented on June 11, 2024

Okay, I tried some things out myself, including the very nice IndexMap you pointed out earlier. I think it will prove very hard to keep a set of some sort in permanent use, as it will be nearly impossible to keep all data in sync, especially if users should keep the ability to change the id of residues. So I think the approach you took in #86 has the most potential: there you create a set for single use, only while parsing, which removes many possibilities for out-of-sync data. The implementation could then take the form you chose, an extra function which takes the HashSet of already added Residues, or there could be an intermediate HashMap during parsing which is turned into a Chain as soon as a new chain is detected.

DocKDE commented on June 11, 2024

What do you think of the approach with a global struct holding the HashSets, as in PR #87?

douweschulte commented on June 11, 2024

I think it has the same pitfalls as storing it within the struct, namely that the information is retained while the user can modify the whole structure. I think keeping the set/map alive can only be done if there is a way to automatically update the key once the user has had mutable access to the inner layer. This would entail designing a new collection type with this behaviour: one which, as soon as a mutable borrow of an inner item is dropped, automatically updates the key derived from that item. I think this is possible, but it is a lot of work which may not be worth our time. I will push my own attempt at solving this issue to a separate branch shortly and would be very interested in what you think of that approach.

DocKDE commented on June 11, 2024

Sure, looking forward to it!
