Comments (20)
Interesting point. The (very limited) comparison I did between cctbx (C++-based, used through its Python API) and pdbtbx gave similar speeds, but with much smaller PDBs. There is always room for improvement, but right now I do not know how to get there, although I would take a look at the places where unnecessary new memory is allocated and at the way new atoms are added to the list. I assume the really fast libraries have fancier ways of finding the right conformer to add the atom to. Also, I see now that 1HTQ has 10 models of nearly 100,000 atoms; I think that in that case the verification also takes a long while.
So yes, nice point. I will do some measuring and see if there is something obvious I can improve. It would be a shame to use an extremely fast language and end up with a slower library than one written in a (normally) slower language ;-).
from pdbtbx.
Side note: I think these are great resources if you ever want to do some performance work in Rust: https://nnethercote.github.io/perf-book/introduction.html and https://www.packtpub.com/product/rust-high-performance/9781788399487.
I agree, it would be a shame; that's why I opened this issue. I did a few optimizations in my own crate and I'm mostly happy with its performance. The step taking the longest by far is the actual parsing :)
I ran a quick flamegraph analysis on a simple executable which does nothing but open a PDB file. Apparently, the vast majority of the time is spent within the chain::add_atom function. Why exactly that is, I don't know (yet).
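For context, here is a simplified sketch of the pattern such an add_atom method typically follows; all names and types are made up for illustration, not pdbtbx's actual API:

```rust
// Simplified model: residues live in a Vec, and every incoming atom
// triggers a linear scan for its residue. Hypothetical types, not pdbtbx.
struct Residue {
    serial_number: isize,
    atoms: Vec<String>,
}

struct Chain {
    residues: Vec<Residue>,
}

impl Chain {
    fn add_atom(&mut self, residue_id: isize, atom: String) {
        // O(n) scan over all residues for every single atom: with 100k+
        // atoms per model this scan dominates the parse time.
        for residue in self.residues.iter_mut() {
            if residue.serial_number == residue_id {
                residue.atoms.push(atom);
                return;
            }
        }
        // Residue not seen before: append a new one.
        self.residues.push(Residue { serial_number: residue_id, atoms: vec![atom] });
    }
}

fn main() {
    let mut chain = Chain { residues: Vec::new() };
    chain.add_atom(1, "N".to_string());
    chain.add_atom(1, "CA".to_string());
    chain.add_atom(2, "N".to_string());
    assert_eq!(chain.residues.len(), 2);
    assert_eq!(chain.residues[0].atoms.len(), 2);
}
```

The quadratic behaviour of this pattern (one full scan per atom) would explain why a flamegraph lights up inside add_atom rather than in the lexing code.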
Nice! I would assume (this definitely needs some measurements) that it takes a long time to find the correct residue + conformer to add to. If this is the case, I think searching for the correct residue + conformer starting from the end of the list would be faster. Also, I think that 1HTQ stresses the validation a lot because pdbtbx thinks all atoms from all models are wrong.
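The reverse-search idea can be sketched like this (hypothetical types; the point is that PDB atoms arrive grouped by residue, so the last residue in the list is almost always the right one):

```rust
struct Residue {
    serial_number: isize,
    atoms: Vec<String>,
}

// Search from the back: for sequential PDB input the matching residue is
// usually the most recently added one, so the reversed scan finds it on
// the first comparison instead of walking the whole list from the front.
fn find_residue_mut(residues: &mut [Residue], id: isize) -> Option<&mut Residue> {
    residues.iter_mut().rev().find(|r| r.serial_number == id)
}

fn main() {
    let mut residues = vec![
        Residue { serial_number: 1, atoms: vec![] },
        Residue { serial_number: 2, atoms: vec![] },
    ];
    // An atom for residue 2 (the current one) is found after one comparison.
    let r = find_residue_mut(&mut residues, 2).unwrap();
    r.atoms.push("CA".to_string());
    assert_eq!(residues[1].atoms.len(), 1);
}
```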
I investigated some more.
The binary I was looking at contains just this code:
fn main() {
    let _a = pdbtbx::open_pdb("large.pdb", pdbtbx::StrictnessLevel::Strict);
}
The PDB file is the same as in the pdbtbx example files, which I used to test files with more than 100k atoms. I've never profiled anything before, so I don't really know much about what I'm doing, but I used perf to take recordings of this binary running and determined (as with the flamegraph before) that most of the time is spent within the chain::add_atom method.
Looking at the generated assembly code, I got this:
0,19 │ c0: add rbp,0x38
0,05 │ add r12,0xffffffffffffffc8
│ ↓ je 114
1,00 │ ca: cmp QWORD PTR [rbp+0x0],rbx
98,60 │ ↑ jne c0
0,01 │ mov rdi,QWORD PTR [rbp+0x8]
│ test r13,r13
│ setne al
I have to confess that I don't know anything about assembly, but the hottest instruction is jne, which is a "jump if not equal". If you happen to know more, I'd love to hear your thoughts.
I reasoned that, according to this finding, the bottleneck was the evaluation of one of the if statements in the method. I commented out some code, recompiled, ran the binary again, and found that the evaluation taking most of the time (by a large margin) is the two lines within the second if statement:
if !found {
    self.residues.push(new_residue);
    current_residue = self.residues.last_mut().unwrap();
}
Also, the bottleneck is not the evaluation of found but rather the instructions within the body. Since these two depend on one another and/or other things, I couldn't narrow it down further for the time being. This shows, however, that the for loop determining whether the current residue is already present is, in fact, not the problem, unless I'm much mistaken here.
Eliminating the if statement (just for testing purposes) cuts the run time of the binary by more than 80%, and the next most expensive calculations come from lex_atom_basics and parse_number.
Okay, I dug a little more into profiling and I think I've gotten to the bottom of the issue now. What I wrote earlier seems to have been an incorrect assessment of the problem.
I performed a couple of runs with perf on a binary compiled from this:
fn main() {
    let _a = pdbtbx::open_pdb("large.pdb", pdbtbx::StrictnessLevel::Strict);
}
Then I recorded a profile with perf record --call-graph=dwarf ./target/release/pdbtest and looked at it with perf report. Looking at the call stack, the conclusion remains that the chain::add_atom method is the culprit. However, it appears that it, in turn, calls core::tuple::<impl core::cmp::PartialEq for (A,B)>::eq (inlined), which makes up some 88% of the run time.
So it seems clear that the bottleneck is the comparison if residue.id() == residue_id within the add_atom method. Since the comparison has already been inlined by the compiler, I'm not sure how to improve the situation just yet, but I wanted to share the information.
If you have any criticisms or suggestions, I'm all ears!
P.S. Removing everything in the loop body except current_residue = residue cuts the run time by some 80%, which seems to support my assessment.
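To picture what the profiler is pointing at: if the residue id is a tuple (say, a serial number plus an optional insertion code; a hypothetical layout, not necessarily pdbtbx's exact one), the derived tuple PartialEq runs once per residue per atom in the linear scan, and counting those comparisons makes the per-atom cost visible:

```rust
fn main() {
    // Hypothetical residue id: (serial number, optional insertion code).
    type ResId = (isize, Option<String>);

    let chain: Vec<ResId> = (0..1000).map(|i| (i, None)).collect();
    let wanted: ResId = (999, None);

    // A forward linear scan evaluates the (inlined) tuple comparison
    // once per residue until it hits the match.
    let mut comparisons = 0usize;
    let mut found = false;
    for id in &chain {
        comparisons += 1;
        if *id == wanted {
            found = true;
            break;
        }
    }
    assert!(found);
    assert_eq!(comparisons, 1000); // the whole chain scanned for the last residue
}
```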
I created a PR that adds some clippy lints and the reversal of iteration that you mentioned. It turns out your assessment of the bottleneck was exactly right. :)
With this change the tuple comparison mentioned above is still the major bottleneck, but it's less pronounced. If you have any ideas how to increase performance even more, I'm interested...
#82
The comparison is done when searching for the right residue while inserting a new atom. By going from the back, in most cases ((n-1)/n, where n is the average number of atoms in a residue) the residue is found immediately, but if the residue has not been added yet the search still has to go through the whole list. If we assume that no residue is ever defined with another residue in between, the best way forward would be the same trick used for chains in the parser: keep the new residue local in the parser and add all atoms to it until a new residue is found. At that point the residue has to be added to the current chain and a new one has to be made.
So that would be a huge improvement as well, I assume, but I am not entirely sure this assumption holds :/.
If it does not hold, some kind of set with constant lookup time over just the residue ids could be beneficial, removing the need for the full comparison loop.
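The constant-lookup idea could be sketched like this: keep a HashSet of residue ids next to the Vec, purely as parser-local state (illustrative types only, not the real pdbtbx structs):

```rust
use std::collections::HashSet;

struct Residue {
    serial_number: isize,
    atoms: Vec<String>,
}

// The Vec stays the storage; the HashSet only answers "has this residue
// been added yet?" in O(1), replacing the full scan when the answer is no.
struct ChainBuilder {
    residues: Vec<Residue>,
    seen: HashSet<isize>,
}

impl ChainBuilder {
    fn add_atom(&mut self, residue_id: isize, atom: String) {
        // `insert` returns true only if the id was not present yet.
        if self.seen.insert(residue_id) {
            // First atom of this residue: no scan needed at all.
            self.residues.push(Residue { serial_number: residue_id, atoms: vec![atom] });
        } else {
            // Known residue: with sequential input, searching from the
            // back finds it almost immediately.
            let residue = self
                .residues
                .iter_mut()
                .rev()
                .find(|r| r.serial_number == residue_id)
                .expect("id in `seen` implies it is in `residues`");
            residue.atoms.push(atom);
        }
    }
}

fn main() {
    let mut b = ChainBuilder { residues: Vec::new(), seen: HashSet::new() };
    b.add_atom(1, "N".into());
    b.add_atom(1, "CA".into());
    b.add_atom(2, "N".into());
    assert_eq!(b.residues.len(), 2);
}
```

Because the set never leaves the parser, there is no stale-data risk when users later mutate the resulting structures.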
I was thinking along the same lines just now (I think). If I understand correctly, each time an atom is to be added (for large.pdb that would be more than 100k times) the entire current chain is traversed in search of a residue to add the atom to.
So each time an atom of a new residue is added, the whole chain is searched in vain, which amounts to a lot of pointless searching. I hope I got that right; I tried to use my own words for it :D
My suggestion would be to introduce either a set holding the already-present residue IDs, or some sort of memoization via a HashMap storing the residues that have already been added.
Either way, this should reduce lookup times for residue IDs rather drastically. The set approach is probably easier to do; I have already used sets with success in my own pdbman project to reduce the traversal of Vecs.
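The memoization variant could map a residue id to its index in the Vec, making both the "already present" check and the residue lookup O(1). A minimal sketch with invented names (not pdbtbx's actual structs):

```rust
use std::collections::HashMap;

struct Residue {
    serial_number: isize,
    atoms: Vec<String>,
}

struct Chain {
    residues: Vec<Residue>,
    // Memo: residue id -> position in `residues`. Easy to keep in sync
    // while parsing, because the Vec is append-only at that stage.
    index: HashMap<isize, usize>,
}

impl Chain {
    fn add_atom(&mut self, residue_id: isize, atom: String) {
        match self.index.get(&residue_id) {
            // Known residue: jump straight to it, no scan.
            Some(&pos) => self.residues[pos].atoms.push(atom),
            // New residue: record its future position, then append.
            None => {
                self.index.insert(residue_id, self.residues.len());
                self.residues.push(Residue { serial_number: residue_id, atoms: vec![atom] });
            }
        }
    }
}

fn main() {
    let mut chain = Chain { residues: Vec::new(), index: HashMap::new() };
    for (id, name) in [(1, "N"), (1, "CA"), (2, "N"), (1, "C")] {
        chain.add_atom(id, name.to_string());
    }
    assert_eq!(chain.residues.len(), 2);
    assert_eq!(chain.residues[0].atoms.len(), 3);
}
```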
Yes, that is exactly what I was thinking about. I think that changing the internal storage from Vec to some other structure is justified for this cause. That would probably also improve performance in some other cases.
I would suggest using an IndexSet from the indexmap crate (https://docs.rs/indexmap/latest/indexmap/set/struct.IndexSet.html). It's an ordered set which implements many of the standard Vec methods but features constant-time lookups. I think this should be a good fit because uniqueness is a desired property.
Do you think the whole hierarchy of structs (Model, Chain, Residue, Conformer, Atom), which is now stored in Vecs on each level, should be converted to sets?
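IndexSet is essentially a Vec plus a hash index. The behaviour it would buy here can be mimicked with std types as a toy model (the real crate is far more complete and memory-efficient; integer keys are used below just for illustration):

```rust
use std::collections::HashMap;

// Toy model of what indexmap's IndexSet provides: insertion order like a
// Vec, constant-time membership like a HashSet.
struct OrderedSet {
    items: Vec<isize>,
    positions: HashMap<isize, usize>,
}

impl OrderedSet {
    fn new() -> Self {
        OrderedSet { items: Vec::new(), positions: HashMap::new() }
    }

    // Returns true if the value was newly inserted.
    fn insert(&mut self, value: isize) -> bool {
        if self.positions.contains_key(&value) {
            return false;
        }
        self.positions.insert(value, self.items.len());
        self.items.push(value);
        true
    }

    fn contains(&self, value: isize) -> bool {
        self.positions.contains_key(&value) // O(1), no scan
    }
}

fn main() {
    let mut set = OrderedSet::new();
    assert!(set.insert(3));
    assert!(set.insert(1));
    assert!(!set.insert(3)); // duplicate rejected
    assert!(set.contains(1));
    assert_eq!(set.items, vec![3, 1]); // insertion order preserved
}
```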
A possible pitfall (which I've previously come across) is that storing Residues, Atoms and the like in HashSets or IndexSets won't work out of the box, because they all contain f64 fields, which don't implement Hash. A possible solution would be to use OrderedFloats (https://docs.rs/ordered-float/latest/ordered_float/struct.OrderedFloat.html).
I suppose all of this would require some major refactoring, but I think it should lead to significant speedups during parsing.
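The f64 problem and the wrapper fix can be sketched with a minimal home-grown ordered float; the ordered-float crate does this properly, including NaN ordering, which the sketch below glosses over:

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// f64 implements neither Eq nor Hash (because of NaN), so a struct
// containing one cannot derive them. A newtype hashing the bit pattern
// restores both; this is the core trick behind OrderedFloat.
#[derive(Debug, Clone, Copy, PartialEq)]
struct OrdF64(f64);

impl Eq for OrdF64 {} // sketch: assumes no NaN values are stored

impl Hash for OrdF64 {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.to_bits().hash(state);
    }
}

// With every field Hash, the derives work again (hypothetical Atom).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Atom {
    serial_number: usize,
    name: String,
    x: OrdF64,
}

fn main() {
    let mut set = HashSet::new();
    set.insert(Atom { serial_number: 1, name: "CA".into(), x: OrdF64(1.5) });
    set.insert(Atom { serial_number: 1, name: "CA".into(), x: OrdF64(1.5) });
    assert_eq!(set.len(), 1); // duplicates collapse as expected
}
```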
Implementing Hash is doable, actually. The equality is defined by the serial numbers/names for each level, so no floats are involved in that. I am creating a commit implementing Hash for everything.
Edit: not so sure anymore if this is correct.
I had a look at this myself and came to the conclusion that implementing Hash on Atom (which is the only one that can't just be derived) only works if you exclude all of the float fields. For example, one could hash a tuple of name, serial number and so on, but not the position, occupancy, b_factor, etc.
It's only important that this holds: k1 == k2 -> hash(k1) == hash(k2), so we won't get logic errors. So if the Hash implementation is less strict (because it doesn't consider a few fields) than the Eq one, it should be fine. The only question would then be whether this is a good idea.
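The "less strict Hash" idea in code: hash only the non-float fields while equality still compares everything. The required property k1 == k2 -> hash(k1) == hash(k2) holds because equal atoms agree on every field, including the hashed ones. Field names are invented for illustration:

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
struct Atom {
    serial_number: usize,
    name: String,
    x: f64, // cannot be hashed directly
}

// Asserting Eq is sound here only under the assumption that no NaN
// values occur; PartialEq still compares all fields, x included.
impl Eq for Atom {}

impl Hash for Atom {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Deliberately skip the float fields. Equal atoms agree on
        // serial_number and name too, so equal keys hash equally.
        self.serial_number.hash(state);
        self.name.hash(state);
    }
}

fn main() {
    let a = Atom { serial_number: 1, name: "CA".into(), x: 1.0 };
    let b = Atom { serial_number: 1, name: "CA".into(), x: 2.0 };
    let mut set = HashSet::new();
    assert!(set.insert(a));
    // b hashes identically but is not equal to a, so it is stored too:
    // a weaker Hash only costs bucket collisions, never correctness.
    assert!(set.insert(b));
    assert_eq!(set.len(), 2);
}
```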
Other than that, we could, in fact, use OrderedFloats from that crate, but I'm not sure that's feasible, because we would have to implement the Serde and rstar traits by hand, which, I think, is not possible since neither the trait nor the type is local to this crate.
Additional note: I just realized that using sets as a drop-in replacement for the Vecs that are currently used needs some more thought, because creating a mutable reference to a set item is not possible. It should be possible to create a workaround, but that may come at additional cost, especially if the order of the set is to be preserved.
Okay, I tried some things out myself, including the very nice IndexMap you pointed out earlier. I think it will prove very hard to keep a set of some sort in permanent use, as it will be nearly impossible to keep all data in sync, especially if you want to keep the opportunity for people to change the id of residues. So I think the approach you took in #86 has the most potential: there you create a set for single usage, only while parsing, removing many possibilities for out-of-sync data. The implementation could then take the form you used, an extra function which takes the HashSet of already-added Residues. Or there could be an intermediate HashMap while parsing, which is turned into a Chain as soon as a new chain is detected.
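The intermediate-map idea, sketched: accumulate the current chain's residues in a HashMap while reading, and flatten it into a Chain the moment the chain id changes. All types are invented for the sketch, and it assumes residue serial numbers can be sorted to restore their order:

```rust
use std::collections::HashMap;

struct Residue { serial_number: isize, atoms: Vec<String> }
struct Chain { id: char, residues: Vec<Residue> }

// Consume (chain id, residue id, atom name) records. Only the current
// chain lives in the HashMap; on a chain switch it is drained into a
// finished Chain, so the map never grows beyond one chain.
fn parse(records: &[(char, isize, &str)]) -> Vec<Chain> {
    let mut chains = Vec::new();
    let mut current: Option<(char, HashMap<isize, Vec<String>>)> = None;

    for &(chain_id, res_id, atom) in records {
        let switch = match &current {
            Some((id, _)) => *id != chain_id,
            None => true,
        };
        if switch {
            if let Some((id, map)) = current.take() {
                chains.push(finish(id, map));
            }
            current = Some((chain_id, HashMap::new()));
        }
        if let Some((_, map)) = current.as_mut() {
            // O(1) residue lookup instead of a scan over a Vec.
            map.entry(res_id).or_default().push(atom.to_string());
        }
    }
    if let Some((id, map)) = current {
        chains.push(finish(id, map));
    }
    chains
}

fn finish(id: char, map: HashMap<isize, Vec<String>>) -> Chain {
    let mut residues: Vec<Residue> = map
        .into_iter()
        .map(|(serial_number, atoms)| Residue { serial_number, atoms })
        .collect();
    residues.sort_by_key(|r| r.serial_number); // restore residue order
    Chain { id, residues }
}

fn main() {
    let records = [('A', 1, "N"), ('A', 1, "CA"), ('A', 2, "N"), ('B', 1, "N")];
    let chains = parse(&records);
    assert_eq!(chains.len(), 2);
    assert_eq!(chains[0].residues.len(), 2);
    assert_eq!(chains[0].residues[0].atoms.len(), 2);
}
```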
What do you think of the approach with a global struct holding the HashSets, as in PR #87?
I think it has the same pitfalls as storing it within the struct, namely that the information is retained while the user can modify the whole structure. I think that keeping the set/map alive can only be done if there is a way to automatically update the key after the user has had mutable access to the inner layer. This would entail designing a new collection type with this behaviour: as soon as a mutable borrow of an inner item is dropped, the key used for that item is automatically updated. I think this is possible, but it is a lot of work which may not be worth our time. I will push my own attempt to solve this issue in a separate branch shortly and would be very interested in what you think about that approach.
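Such a self-updating collection could look roughly like this: mutable access goes through a guard, and when the guard is dropped it re-syncs the id-to-position map. This is a sketch only (invented types, ignoring removal, duplicates, and error handling):

```rust
use std::collections::HashMap;
use std::ops::{Deref, DerefMut};

struct Residue { serial_number: isize, atoms: Vec<String> }

// Vec storage plus an id -> position map that must never go stale.
struct SyncedResidues {
    items: Vec<Residue>,
    index: HashMap<isize, usize>,
}

// Guard handed out for mutable access; on Drop it checks whether the
// user changed the id and repairs the map accordingly.
struct ResidueGuard<'a> {
    owner: &'a mut SyncedResidues,
    pos: usize,
    old_id: isize,
}

impl Deref for ResidueGuard<'_> {
    type Target = Residue;
    fn deref(&self) -> &Residue { &self.owner.items[self.pos] }
}

impl DerefMut for ResidueGuard<'_> {
    fn deref_mut(&mut self) -> &mut Residue { &mut self.owner.items[self.pos] }
}

impl Drop for ResidueGuard<'_> {
    fn drop(&mut self) {
        let new_id = self.owner.items[self.pos].serial_number;
        if new_id != self.old_id {
            self.owner.index.remove(&self.old_id);
            self.owner.index.insert(new_id, self.pos);
        }
    }
}

impl SyncedResidues {
    fn push(&mut self, residue: Residue) {
        self.index.insert(residue.serial_number, self.items.len());
        self.items.push(residue);
    }

    fn get_mut(&mut self, id: isize) -> Option<ResidueGuard<'_>> {
        let pos = *self.index.get(&id)?;
        Some(ResidueGuard { owner: self, pos, old_id: id })
    }
}

fn main() {
    let mut set = SyncedResidues { items: Vec::new(), index: HashMap::new() };
    set.push(Residue { serial_number: 1, atoms: Vec::new() });
    set.get_mut(1).unwrap().serial_number = 7; // guard drops, map updates
    assert!(set.get_mut(7).is_some());
    assert!(set.get_mut(1).is_none());
}
```

The cost is that every mutable access pays for the Drop check, and the guard pattern makes the API a little heavier, which matches the "a lot of work which may not be worth our time" concern above.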
Sure, looking forward to it!