I'm wondering if there's some space for an AminoAcid class. Objects

Yes. I think both issues imply the same requisite: define a no

Support for biological molecules,about chemfiles/chemfiles

Comments (12)

Luthaf commented on June 18, 2024

Support for AminoAcid or some kind of Residue class is something I have been thinking about for a long time.

I'd like to have it, but I see two issues to resolve:

how do we know which atoms are in which residue? For well-formed PDB that is easy, but for other formats, it is harder. Some kind of heuristic could be used, but might be complex to implement.
how do we define the API so that it stays easy to use for non-biochemistry files?

I am on the phone right now, so I'll expand on that next week, when I get back home.

from chemfiles.

pgbarletta commented on June 18, 2024

Yes. I think both issues imply the same requisite:

define a non-mandatory Residue class. That is, a class that connects the Frame and Atom classes, but leaving enough room for these classes to interact independently, in case there's no residue info.

There should also be functions to 'pass' the Residue objects from a frame to other frames, so one can read the trajectory of a biomolecule in any of the available formats and also its PDB to get the residue info. I know it seems a bit of a 'hack' but I think is the easiest and safest way to get residue info for trajectories.
I'm not sure of any of this, so please tell me what you think (when you get back). I really want to use chemfiles, so I'll start writing some code. I'm far from being a good programmer but maybe I can write something useful.

from chemfiles.

Luthaf commented on June 18, 2024

There should also be functions to 'pass' the Residue objects from a frame to other frames, so one can read the trajectory of a biomolecule in any of the available formats and also its PDB to get the residue info. I know it seems a bit of a 'hack' but I think is the easiest and safest way to get residue info for trajectories.

This is already possible, using the Trajectory::set_topology function.

define a non-mandatory Residue class. That is, a class that connects the Frame and Atom classes, but leaving enough room for these classes to interact independently, in case there's no residue info.

Yes, I think the Residue class should live in the Topology class, which is already the way to connect atomic informations to positions. I started to write a simple version of the Residue class, like this:

class Residue {
public:
    /// Constructor
    Residue(std::vector<size_t> atoms, size_t id, std::string name = "");

    /// Get the number of atoms contained in the residue
    size_t size() const {
        return atoms_.size();
    }

    /// Get the indexes of the atoms
    const std::vector<size_t>& atoms() const {
        return atoms_;
    }

    /// Get the name of the residue, if any.
    const std::string& name() const {
        return name_;
    }

    /// Get the id of this residue
    size_t id() const {
        return id_;
    }

private:
    std::vector<size_t> atoms_;
    std::string name_;
    size_t id_ = static_cast<size_t>(-1);
};

Then, Residue could be added to the Topology like this:

class Topology {
public:
    // ... Existing methods
    const std::vector<Residue>& residues() const {
        return residues_;
    }

    bool are_bonded(const Residue& res1, const Residue& res2) const;

private:
    // ... Existing members

    std::vector<Residue> residues_;
};

Here we have name and id for each residue, list of atoms in the residue, and a way to check if two residues are bonded together. As I do not work with residues at all, what other methods do you think would be useful here?

from chemfiles.

pgbarletta commented on June 18, 2024

Sorry for taking so long to answer, working on too many projects right now.

Great solution. When I started coding I thought there was no way to integrate the Residue concept without disturbing the library.
I was actually thinking of a template class called "Residue" and 2 derived classes called "AminoAcid" and "Nucleotide". But maybe this is not necessary.

Anyway, I think the Atom class should have a variable indicating the id and the name of residue to which they belong to:

/// Get the id of the corresponding residue
    size_t resid() const {
        return resid_;
    }
/// Set the id of the corresponding residue
    void set_resid(size_t resid)  {
        resid_ = resid;
    }
/// Get the name of the corresponding residue
    std::string resname() const {
        return name_;
    }
/// Set the name of the corresponding residue
    void set_resname(std::string resid)  {
        name_ = name;
    }
private:
size_t resid_;
std::string resname_;

Both of them could be read from the PDB:

void PDBFormat::read_ATOM(Frame& frame, const std::string& line) {
...
atom.set_resname( trim(line.substr(18, 3)) );
atom.set_resid( std::stoi(line.substr(23, 4)); ); // maybe cast to size_t previously? 
...
}

Then one could make a Topology function to itereate over each atom, checking those next to each other that have the same resid and "build" the residue sequence. Such function would only be used once for each trajectory and then the resulting Topology applied to the rest of the frames.

size_t CurrentResid, PrevResid = atoms_.begin()->resid();
std::string CurrentResname, PrevResname=atoms_.begin()->resname();
std::vector<Atom> NewResidueAtoms;

iterate over all atoms                     {
      CurrentResid = AtomIterator->resid();
      CurrentResname = AtomIterator->resname();
      if ( PrevResid == CurrentResid)  { NewResidueAtoms.push_back(*AtomIterator); }
      else { Residue(NewResidueAtoms, PrevResid, PrevResname); }
      PrevResid = CurrentResid;
      PrevResname = CurrentResname;
}
Residue(NewResidueAtoms, PrevResid, PrevResname);

I'm not sure about this. It's probably just nonsense, haha. Perhaps you can come up with something better.

Other additions to Topology would be:

iterator begin() {return residues_.begin();}
const_iterator begin() const {return residues_.begin();}
const_iterator cbegin() const {return residues_.cbegin();}
iterator end() {return residues_.end();}
const_iterator end() const {return residues_.end();}
const_iterator cend() const {return residues_.cend();}

private:
size_t nresidues_; // number of residues in the topology

Iterators over Residue atoms are also useful. This way one could iterate over the atoms of specific residues, instead of iterating over every atom of the Frame.

Chain id could also be read. An AminoAcidType variable for distinguish between regular and coarse grained amino acids could also be useful (coarse grain MD of macromolecules is becoming popular). But these could be added later, if you so desire.

That's all I can think of. I don't think there's much more to add for an IN/OUT library. For example, the OpenBabel OBResidue class doesn't go far from here.
Again, I'm not sure the way to construct the Residue objects that I'm proposing is the most appropiate.
Hope some of this is useful to you.

from chemfiles.

Luthaf commented on June 18, 2024

Anyway, I think the Atom class should have a variable indicating the id and the name of residue to which they belong to:

I do not like reversing the knowledge: an Atom should not have knowledge of which Residue it lives in, a Residue should know which Atom it contains.

Looking for the Residue containing a specific atom can be provided by the topology, with a const Residue& Topology::residue(size_t atom_id) const function.

Such function would only be used once for each trajectory and then the resulting Topology applied to the rest of the frames.

This can not be done, because some kind of trajectory (for example the GROMACS new format: TNG) have support for grand-cannonical simulation, where the topology changes during the simulation.

Other additions to Topology would be:

iterator begin() {return residues_.begin();}
const_iterator begin() const {return residues_.begin();}
const_iterator cbegin() const {return residues_.cbegin();}
iterator end() {return residues_.end();}
const_iterator end() const {return residues_.end();}
const_iterator cend() const {return residues_.cend();}

All these iterators could be accessed using the Topology::residues function, make thinks like this possible:

for (auto& residue: topology.residues()) {
    // do stuff
}

size_t nresidues_; // number of residues in the topology

This is Topology::residues_.size().

Iterators over Residue atoms are also useful. This way one could iterate over the atoms of specific residues, instead of iterating over every atom of the Frame.

Yes, this would be nice!

Again, I'm not sure the way to construct the Residue objects that I'm proposing is the most appropriate.

Your implementation would work for PDB, using a temporary map atom id <=> res id, and then adding the residues to the topology. Something like:

void PDBFormat::read_ATOM(Frame& frame, const std::string& line) {
std::map<size_t, Residue> residues;
...

resname = trim(line.substr(18, 3));
resid = std::stoi(line.substr(23, 4));

res = residues[resid] || Residue(); // Get or create the residue with this id
res.add_atom(current_atom);

...

for (auto& it: residues) {
    // transfer residues from the map to the topology
    topology.add_residue(it.second);
}

...

}

from chemfiles.

Luthaf commented on June 18, 2024

So an initial version with PDB support is now merged in master and will make it to the next release!

See #38 for improvements to this feature, and please report any enhancement you can think of!

from chemfiles.

pgbarletta commented on June 18, 2024

I've been using the residue implementation and just realized I overlooked something when you showed me the initial implementation.
One can iterate over residues and then iterate over each residue's atoms, but you can't get the positions from the atoms. Neither can you get the indices of the atoms you are iterating over.

A intuitive way to solve this would be to modify the Atom class, to include this information.

private:
size_t id_ = static_cast<size_t>(-1);
Vector3D position_; // to diferentiate position_ from positions_

Add functions to access this variables as const reference, and then modify Atom constructor to include id_ and position_. Then, modify the PDB format to construct each atom with this info. And probably more stuff that I can't think of right now.

But maybe you never intended to add position info in the Atom class (since you made the positions_ variable in the Frame class); if that's the case, the id_ will do. Once I have the id, I can plug it into the positions() function of the Frame class. Although, I must say, this seems a bit cumbersome.

from chemfiles.

Luthaf commented on June 18, 2024

You already have the indices, as the number returned by iteration over a residue are the indices in the corresponding Frame/Topology. So you can do something like this:

auto frame = file.read();
auto positions = frame.positions();
for (auto& residue: frame.topology().residues()) {
    for (size_t i: residue) {
        auto& position = positions[i];
        // use the position
    }
}

Although, I must say, this seems a bit cumbersome.

I agree. The two advantage of having the positions/velocities separated from the Atom are:

The C API (and thus all languages bindings) is easier to use, as there would be no way to access all positions in a single call, and one would have to create a copy of every atom at each step to extract the data;
The code can be more efficient, only reading the positions/velocities and leaving the whole Topology unchanged.

But there is a reason why chemfiles is still in beta! If this design is really too cumbersome to use, we can change it to something better, while keeping in mind the other constraints.

from chemfiles.

pgbarletta commented on June 18, 2024

You already have the indices, as the number returned by iteration over a residue are the indices in the corresponding Frame/Topology.

Great. My mistake.

As for the advantages of not including positions/velocities in the Atom class, they seem quite convincing. id_ for the Atom class could still be included though. It won't change during the run and would allow a more intuitive way to access atoms coordinates.

from chemfiles.

pgbarletta commented on June 18, 2024

I have another annoying remark to make.

I think the current name_ variable should be though more as element_ and a new name_ variable added. Atom names in PDB format not only indicate the element, but also the position in the residue. So, for example, the name_ of the last hydrogen in glutamine would be "HE22", and the element_ would be "H". The name_ would be readed from columns 13-16, and the element_ from columns 78-80 in the PDB format.
In case the format does not hold any element_ information, it could be assigned the same contents of the name_ variable.

from chemfiles.

Luthaf commented on June 18, 2024

I have another annoying remark to make.

Go on, feedback is very appreciated!

I think the current name_ variable should be though more as element_ and a new name_ variable added. Atom names in PDB format not only indicate the element, but also the position in the residue. So, for example, the name_ of the last hydrogen in glutamine would be "HE22", and the element_ would be "H". The name_ would be readed from columns 13-16, and the element_ from columns 78-80 in the PDB format.

Yes, this could be nice to have. I would use label_ and element_ but I agree on the idea. Would you have the time to implement this? In any case, feel free to open an issue about this to have a record of the work to do.

from chemfiles.

Luthaf commented on June 18, 2024

id_ for the Atom class could still be included though. It won't change during the run and would allow a more intuitive way to access atoms coordinates.

I am more reticent with this: what would be the use case where you have the atom but not the id? My problem is that an Atom should not know about how it is being used, and where it is being stored. More generally, what would be the id of an atom created outside of a Frame? This is perfectly legal now:

auto atom = Atom("Zn");

auto mass = atom.mass();
auto radius = atom.vdw_radius();

// Use this data for an analysis algorithm

from chemfiles.

Support for biological molecules about chemfiles HOT 12 CLOSED

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent