When calculating get_z_surface undefined values in bi

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Thank you for your detailed and clear explanation, <a class="user-mention notranslate"

Sorry <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-ur

I'm very impressed by your detailed analysis <a class="user-mention notranslate" data-

Thanks, <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-

Undefined values in bins occupied by insertion in membrane. about membrane-curvature HOT 7 OPEN

mdanalysis commented on May 27, 2024

Undefined values in bins occupied by insertion in membrane.

from membrane-curvature.

Comments (7)

orbeckst commented on May 27, 2024

pandas is slow, use numpy/scipy!

In GridDataFormats we have the option to interpolate on a grid https://github.com/MDAnalysis/GridDataFormats/blob/be6132ac13041390a880061e4e873044b6c29573/gridData/core.py#L311 ; this is not the cleanest code but perhaps useful to initially look at.

from membrane-curvature.

lilyminium commented on May 27, 2024

Commenting here instead of the PR so that discussion is in one place: I'm not sure that this is a good idea. There is no membrane where the protein is, so there shouldn't be any membrane curvature there either.

from membrane-curvature.

ojeda-e commented on May 27, 2024

Thanks @lilyminium .

I was considering having the interpolation optional and adding a warning message when np.nans are simply too much. I didn't add anything in base.py yet, this is the function only. Here are three reasons to consider.

If there are np.nans in the region of the protein, the adjacent values will be also np.nans. It means we may lose information around the protein. Yes, in the region of the protein we already lost data. But keeping the nans means the resulting array will add np.nans to all the adjacent cells. To make this statement visual, here is an example of what happens when that region is not interpolated, and how the calculation of the gradients spread nans. See for instance how a single undefined value ends up in an extended region of np.nan in the bottom left corner. Same happens in all the cells that delineates the protein.

With no interpolation, contours won't happen. plt.contours doesn't work if the data has np.nans. It will limit the user to use plt.imshow.
Even in the case when we say, ok, fine, let's go with plt.imshow instead, the np.nans remain problematic unless you opt for a None interpolation. Interpolations are not only more pleasant to read, but also make results clearer. Almost all the interpolation methods are based on gradients. And again, gradient of np.nan is more np.nan (annoying np.nans). To visualize this limitation, let's see the plt.imshow plot of Gaussian curvature looks like using with different types of interpolation:

Even with the best-behaved interpolations (marked with cyan rectangles), there is a potential loss of information, more or less significant, depending on the interpolation method implemented by plt.imshow.

In summary, untreated undefined values may result in extended regions of np.nans. Which ultimately results in a considerable loss of information.

The user will have an option to interpolate the np.nans. A warning of " You have too many nans and your results will be driven by undefined values, check your input" kind of thing can pop when:

the user overestimates the number of n_x_bins, n_y_bins in the grid,
or
the protein (or any object inserted) occupies too many n_x_bins, n_y_bins in the cell (because either the section of the protein in the membrane is huge, or either because the user underestimated the number of n_x_bins, n_y_bins in the grid).

But in any case, and in the way I see the solution to calculate curvature, the option of interpolation has to be provided. I was considering an option to mask the protein, but I won't elaborate on it (unless I am asked to) and will leave it as an enahcement.

from membrane-curvature.

lilyminium commented on May 27, 2024

Thank you for your detailed and clear explanation, @ojeda-e. However, I think I must be missing something, because I don't understand how the protein causes that nan value on the edge. How big is the membrane relative to grid you've illustrated?

I would think that the more flexible solution is to return all the data and let users make their own choices about how they want to interpolate and visualize the grid.

In summary, untreated undefined values may result in extended regions of np.nans. Which ultimately results in a considerable loss of information.

Is this still true with your current code that uses nanmean to calculate the mean curvatures?

from membrane-curvature.

ojeda-e commented on May 27, 2024

Sorry @lilyminium, my bad for not clarifying.

My intention with the illustration was to show the same effect (spread of undefined values) due to separate effects:

The effect of an np.nan when it is at at the edge (the ones in the squares), and
the effect of a group of np.nans (coming from the embedded protein).

In the first case is easier to identify how a single np.nan spreads over the calculation of curvature. But 1 and 2 are not correlated in any fashion with those because of the insertion in the membrane (protein). :)

Is this still true with your current code that uses nanmean to calculate the mean curvatures?

You mean, if even by using nanmean we still may have a considerable loss of information?

My answer is: I think yes, it can happen, but it depends on several factors. For example, in systems with extremely short simulation time (low n_frames == bad idea); (and/or) if the user overestimates bins and during the n_frames not even one lipid populates. It can even depend on the lipid composition. For example. If I have a membrane with (let's be extra) 90:10 CHOL:PC. We would have only 1/10 of the total number of lipids in the system to use as a reference to derive the surface. Then it will be hard to fill all the cells in the grid (but again, it also depends on how big the unit cells are).
In these cases, there may be space for a case in which a full array is nans. It's unlikely but I think it can happen.

Edit: Maybe we can have some opinions from @MDAnalysis/gsoc-mentors ?

from membrane-curvature.

hmacdope commented on May 27, 2024

I'm very impressed by your detailed analysis @ojeda-e! In general my approach is that a np.nan is a np.nan and so be it. You can give the user the option of using some kind of method to get rid of them with something like a new_data = old_data.interpolate(kernel='gaussian'). I (try) to view np.nan as a friend rather than an enemy as it stops you getting absolute rubbish and not realising it as you might under some other circumstances.

from membrane-curvature.

ojeda-e commented on May 27, 2024

Thanks, @hmacdope.
I agree with the general approach of np.nan is np.nan. However, in the calculation of curvature, there are two serious limitations, that when combined, the problem gets more complicated.
One is sampling. We know that. We know that we need i) a high amount of points to derive a surface, and ii) a high number of frames.
The second is the definition of curvature itself. Curvature is defined by second partial derivatives. As it is in curvature.py.
The equations are

Now, the fact that we calculate second partial derivatives, means that we calculate gradient twice. In this example, we can see the effect of only one np.nan in the calculation of first and second gradients.

These are some of the reasons why calculating the curvature of biological membranes can get extremely hard. The formalism asks to have second partial derivatives. Gradients are calculated as differences between adjacent values. Now, add to this rigid formalism that we depend on the lipid composition of the membrane. So, if have a bilayer with 1:1:1 phospholipid_1:phospholipid_2:other_lipid, where other_lipid is any type of lipid that has a flip-flop rate high enough that we have several flip-fllop events in the interval we are averaging, well, we lose 1/3 of the elements used to derive the surface. Now, imagine that, on top of all the above you had a simulation setup small enough that your protein covers lots of unit cells.
You'll get np.nans. If you don't treat them, the calculated curvature, according to the equations shown, will be meaningless.

In this context, and how I see the problem, having the option to interpolate is necessary. The possibility of abusers of interpolation methods, that's another story. That the user is calculating curvature in a simulation setup that is not suitable to calculate curvature, that's another story too. The user will be more than welcome to turn interpolation off. But it should be offered.

from membrane-curvature.

Undefined values in bins occupied by insertion in membrane. about membrane-curvature HOT 7 OPEN

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent