Comments (12)
Support for AminoAcid or some kind of Residue class is something I have been thinking about for a long time.
I'd like to have it, but I see two issues to resolve:
- how do we know which atoms are in which residue? For well-formed PDB that is easy, but for other formats, it is harder. Some kind of heuristic could be used, but might be complex to implement.
- how do we define the API so that it stays easy to use for non-biochemistry files?
I am on the phone right now, so I'll expand on that next week, when I get back home.
from chemfiles.
Yes. I think both issues imply the same requisite:
- define a non-mandatory Residue class. That is, a class that connects the Frame and Atom classes, but leaving enough room for these classes to interact independently, in case there's no residue info.
There should also be functions to 'pass' the Residue objects from a frame to other frames, so one can read the trajectory of a biomolecule in any of the available formats and also its PDB to get the residue info. I know it seems a bit of a 'hack' but I think is the easiest and safest way to get residue info for trajectories.
I'm not sure of any of this, so please tell me what you think (when you get back). I really want to use chemfiles, so I'll start writing some code. I'm far from being a good programmer but maybe I can write something useful.
from chemfiles.
There should also be functions to 'pass' the Residue objects from a frame to other frames, so one can read the trajectory of a biomolecule in any of the available formats and also its PDB to get the residue info. I know it seems a bit of a 'hack' but I think is the easiest and safest way to get residue info for trajectories.
This is already possible, using the Trajectory::set_topology
function.
define a non-mandatory Residue class. That is, a class that connects the Frame and Atom classes, but leaving enough room for these classes to interact independently, in case there's no residue info.
Yes, I think the Residue
class should live in the Topology
class, which is already the way to connect atomic informations to positions. I started to write a simple version of the Residue class, like this:
class Residue {
public:
/// Constructor
Residue(std::vector<size_t> atoms, size_t id, std::string name = "");
/// Get the number of atoms contained in the residue
size_t size() const {
return atoms_.size();
}
/// Get the indexes of the atoms
const std::vector<size_t>& atoms() const {
return atoms_;
}
/// Get the name of the residue, if any.
const std::string& name() const {
return name_;
}
/// Get the id of this residue
size_t id() const {
return id_;
}
private:
std::vector<size_t> atoms_;
std::string name_;
size_t id_ = static_cast<size_t>(-1);
};
Then, Residue
could be added to the Topology like this:
class Topology {
public:
// ... Existing methods
const std::vector<Residue>& residues() const {
return residues_;
}
bool are_bonded(const Residue& res1, const Residue& res2) const;
private:
// ... Existing members
std::vector<Residue> residues_;
};
Here we have name and id for each residue, list of atoms in the residue, and a way to check if two residues are bonded together. As I do not work with residues at all, what other methods do you think would be useful here?
from chemfiles.
Sorry for taking so long to answer, working on too many projects right now.
Great solution. When I started coding I thought there was no way to integrate the Residue concept without disturbing the library.
I was actually thinking of a template class called "Residue" and 2 derived classes called "AminoAcid" and "Nucleotide". But maybe this is not necessary.
Anyway, I think the Atom class should have a variable indicating the id and the name of residue to which they belong to:
/// Get the id of the corresponding residue
size_t resid() const {
return resid_;
}
/// Set the id of the corresponding residue
void set_resid(size_t resid) {
resid_ = resid;
}
/// Get the name of the corresponding residue
std::string resname() const {
return name_;
}
/// Set the name of the corresponding residue
void set_resname(std::string resid) {
name_ = name;
}
private:
size_t resid_;
std::string resname_;
Both of them could be read from the PDB:
void PDBFormat::read_ATOM(Frame& frame, const std::string& line) {
...
atom.set_resname( trim(line.substr(18, 3)) );
atom.set_resid( std::stoi(line.substr(23, 4)); ); // maybe cast to size_t previously?
...
}
Then one could make a Topology function to itereate over each atom, checking those next to each other that have the same resid and "build" the residue sequence. Such function would only be used once for each trajectory and then the resulting Topology applied to the rest of the frames.
size_t CurrentResid, PrevResid = atoms_.begin()->resid();
std::string CurrentResname, PrevResname=atoms_.begin()->resname();
std::vector<Atom> NewResidueAtoms;
iterate over all atoms {
CurrentResid = AtomIterator->resid();
CurrentResname = AtomIterator->resname();
if ( PrevResid == CurrentResid) { NewResidueAtoms.push_back(*AtomIterator); }
else { Residue(NewResidueAtoms, PrevResid, PrevResname); }
PrevResid = CurrentResid;
PrevResname = CurrentResname;
}
Residue(NewResidueAtoms, PrevResid, PrevResname);
I'm not sure about this. It's probably just nonsense, haha. Perhaps you can come up with something better.
Other additions to Topology would be:
iterator begin() {return residues_.begin();}
const_iterator begin() const {return residues_.begin();}
const_iterator cbegin() const {return residues_.cbegin();}
iterator end() {return residues_.end();}
const_iterator end() const {return residues_.end();}
const_iterator cend() const {return residues_.cend();}
private:
size_t nresidues_; // number of residues in the topology
Iterators over Residue atoms are also useful. This way one could iterate over the atoms of specific residues, instead of iterating over every atom of the Frame.
Chain id could also be read. An AminoAcidType variable for distinguish between regular and coarse grained amino acids could also be useful (coarse grain MD of macromolecules is becoming popular). But these could be added later, if you so desire.
That's all I can think of. I don't think there's much more to add for an IN/OUT library. For example, the OpenBabel OBResidue class doesn't go far from here.
Again, I'm not sure the way to construct the Residue objects that I'm proposing is the most appropiate.
Hope some of this is useful to you.
from chemfiles.
Anyway, I think the Atom class should have a variable indicating the id and the name of residue to which they belong to:
I do not like reversing the knowledge: an Atom should not have knowledge of which Residue it lives in, a Residue should know which Atom it contains.
Looking for the Residue containing a specific atom can be provided by the topology, with a const Residue& Topology::residue(size_t atom_id) const
function.
Such function would only be used once for each trajectory and then the resulting Topology applied to the rest of the frames.
This can not be done, because some kind of trajectory (for example the GROMACS new format: TNG) have support for grand-cannonical simulation, where the topology changes during the simulation.
Other additions to Topology would be:
iterator begin() {return residues_.begin();} const_iterator begin() const {return residues_.begin();} const_iterator cbegin() const {return residues_.cbegin();} iterator end() {return residues_.end();} const_iterator end() const {return residues_.end();} const_iterator cend() const {return residues_.cend();}
All these iterators could be accessed using the Topology::residues
function, make thinks like this possible:
for (auto& residue: topology.residues()) {
// do stuff
}
size_t nresidues_; // number of residues in the topology
This is Topology::residues_.size()
.
Iterators over Residue atoms are also useful. This way one could iterate over the atoms of specific residues, instead of iterating over every atom of the Frame.
Yes, this would be nice!
Again, I'm not sure the way to construct the Residue objects that I'm proposing is the most appropriate.
Your implementation would work for PDB, using a temporary map atom id
<=> res id
, and then adding the residues to the topology. Something like:
void PDBFormat::read_ATOM(Frame& frame, const std::string& line) {
std::map<size_t, Residue> residues;
...
resname = trim(line.substr(18, 3));
resid = std::stoi(line.substr(23, 4));
res = residues[resid] || Residue(); // Get or create the residue with this id
res.add_atom(current_atom);
...
for (auto& it: residues) {
// transfer residues from the map to the topology
topology.add_residue(it.second);
}
...
}
from chemfiles.
So an initial version with PDB support is now merged in master and will make it to the next release!
See #38 for improvements to this feature, and please report any enhancement you can think of!
from chemfiles.
I've been using the residue implementation and just realized I overlooked something when you showed me the initial implementation.
One can iterate over residues and then iterate over each residue's atoms, but you can't get the positions from the atoms. Neither can you get the indices of the atoms you are iterating over.
A intuitive way to solve this would be to modify the Atom class, to include this information.
private:
size_t id_ = static_cast<size_t>(-1);
Vector3D position_; // to diferentiate position_ from positions_
Add functions to access this variables as const reference
, and then modify Atom
constructor to include id_
and position_
. Then, modify the PDB format to construct each atom with this info. And probably more stuff that I can't think of right now.
But maybe you never intended to add position info in the Atom
class (since you made the positions_
variable in the Frame
class); if that's the case, the id_
will do. Once I have the id, I can plug it into the positions()
function of the Frame
class. Although, I must say, this seems a bit cumbersome.
from chemfiles.
You already have the indices, as the number returned by iteration over a residue are the indices in the corresponding Frame/Topology. So you can do something like this:
auto frame = file.read();
auto positions = frame.positions();
for (auto& residue: frame.topology().residues()) {
for (size_t i: residue) {
auto& position = positions[i];
// use the position
}
}
Although, I must say, this seems a bit cumbersome.
I agree. The two advantage of having the positions/velocities separated from the Atom are:
- The C API (and thus all languages bindings) is easier to use, as there would be no way to access all positions in a single call, and one would have to create a copy of every atom at each step to extract the data;
- The code can be more efficient, only reading the positions/velocities and leaving the whole Topology unchanged.
But there is a reason why chemfiles is still in beta! If this design is really too cumbersome to use, we can change it to something better, while keeping in mind the other constraints.
from chemfiles.
You already have the indices, as the number returned by iteration over a residue are the indices in the corresponding Frame/Topology.
Great. My mistake.
As for the advantages of not including positions/velocities in the Atom
class, they seem quite convincing. id_
for the Atom
class could still be included though. It won't change during the run and would allow a more intuitive way to access atoms coordinates.
from chemfiles.
I have another annoying remark to make.
I think the current name_
variable should be though more as element_
and a new name_
variable added. Atom names in PDB format not only indicate the element, but also the position in the residue. So, for example, the name_
of the last hydrogen in glutamine would be "HE22", and the element_
would be "H". The name_
would be readed from columns 13-16, and the element_
from columns 78-80 in the PDB format.
In case the format does not hold any element_
information, it could be assigned the same contents of the name_
variable.
from chemfiles.
I have another annoying remark to make.
Go on, feedback is very appreciated!
I think the current name_ variable should be though more as element_ and a new name_ variable added. Atom names in PDB format not only indicate the element, but also the position in the residue. So, for example, the name_ of the last hydrogen in glutamine would be "HE22", and the element_ would be "H". The name_ would be readed from columns 13-16, and the element_ from columns 78-80 in the PDB format.
Yes, this could be nice to have. I would use label_
and element_
but I agree on the idea. Would you have the time to implement this? In any case, feel free to open an issue about this to have a record of the work to do.
from chemfiles.
id_ for the Atom class could still be included though. It won't change during the run and would allow a more intuitive way to access atoms coordinates.
I am more reticent with this: what would be the use case where you have the atom but not the id? My problem is that an Atom should not know about how it is being used, and where it is being stored. More generally, what would be the id of an atom created outside of a Frame? This is perfectly legal now:
auto atom = Atom("Zn");
auto mass = atom.mass();
auto radius = atom.vdw_radius();
// Use this data for an analysis algorithm
from chemfiles.
Related Issues (20)
- N-terminal hydrogens not listed in PDB connectivity HOT 2
- Update codecove uploader in CI
- Tracking issue for release 0.10.2 HOT 1
- Unit change during GROMACS reading. HOT 2
- Add a frame property indicating if coordinates are unwrapped HOT 6
- Multi-threaded trajectory reading fails occasionally HOT 2
- Incorrect unit cell while writing to xyz file HOT 5
- Mechanism to pass arbitrary metadata to readers/writers HOT 1
- Tracking issue for release 0.10.3 HOT 1
- Add support for symbolic Atom types in LAMMPS
- "mmap failed" for binary formats on WSL1 HOT 9
- In Chemfiles.jl output trajectory file has a compact format and thus loses precision
- TPR file read error HOT 3
- Tracking issue for release 0.10.4 HOT 1
- [feature request] Feasibility of reading angle dihedral and improper HOT 10
- the version of chemfiles does not match chemfiles-lib and chemfiles-python in conda-forge HOT 3
- XTC format change for very large systems HOT 1
- Fail on loading LAMMPS Data file with non-empty blank lines HOT 7
- LAMMPS Data format exporter might require a blank line HOT 2
- Can not load QM9's xyz file HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from chemfiles.