
GriSPy (Grid Search in Python)



ascl:1912.013 arXiv https://github.com/leliel12/diseno_sci_sfw

GriSPy is a regular grid search algorithm for quick nearest-neighbor lookup.

GriSPy indexes a set of k-dimensional points in a regular grid, providing a fast approach for nearest-neighbor queries. Optional periodic boundary conditions can be provided for each axis individually.

GriSPy has the following queries implemented:

  • bubble_neighbors: find neighbors within a given radius. A different radius can be provided for each centre.
  • shell_neighbors: find neighbors within given lower and upper radii. Different lower and upper radii can be provided for each centre.
  • nearest_neighbors: find the n nearest neighbors for each centre.

Usage example

Let's create a 2D random distribution of points as an example:

import numpy as np
import grispy as gsp

data = np.random.uniform(size=(1000, 2))
grid = gsp.GriSPy(data)

The grid object now has all the data points indexed in a grid. Now let's search for neighbors around new points:

centres = np.random.uniform(size=(10, 2))
dist, ind = grid.bubble_neighbors(centres, distance_upper_bound=0.1)

And that's it! The dist and ind lists contain the distances and indices to data neighbors within a 0.1 search radius.


Requirements

You will need Python 3.6 or later to run GriSPy.

Standard Installation

GriSPy is available on PyPI. You can install it with pip:

$ pip install grispy

Development Install

Clone this repo and then inside the local directory execute

$ pip install -e .

Citation

If you use GriSPy in a scientific publication, we would appreciate citations to the following paper:

Chalela, M., Sillero, E., Pereyra, L., García, M. A., Cabral, J. B., Lares, M., & Merchán, M. (2021). GriSPy: A Python package for fixed-radius nearest neighbors search. Astronomy and Computing, 34, 100443. https://doi.org/10.1016/j.ascom.2020.100443

Bibtex

@ARTICLE{Chalela2021,
       author = {{Chalela}, M. and {Sillero}, E. and {Pereyra}, L. and {Garcia}, M.~A. and {Cabral}, J.~B. and {Lares}, M. and {Merch{\'a}n}, M.},
        title = "{GriSPy: A Python package for fixed-radius nearest neighbors search}",
      journal = {Astronomy and Computing},
     keywords = {Data mining, Nearest-neighbor search, Methods, Data analysis, Astroinformatics, Python package},
         year = 2021,
        month = jan,
       volume = {34},
          eid = {100443},
        pages = {100443},
          doi = {10.1016/j.ascom.2020.100443},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2021A&C....3400443C},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

Full-text: https://arxiv.org/abs/1912.09585

Authors

Martin Chalela (E-mail: [email protected]), Emanuel Sillero, Luis Pereyra, Alejandro Garcia, Juan B. Cabral, Marcelo Lares, Manuel Merchán.


grispy's Issues

Integer type for data indexing

k_digit = np.zeros(data.shape, dtype=int)

Right now the default int type for the data indices is int64 (in most cases). This could be optimized by choosing the int size according to the length of the input data; for example, int32 indices are sufficient for input data of length 10**9. This can save some memory for large datasets.
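A minimal sketch of the idea, with a hypothetical helper (not part of GriSPy) that picks the smallest sufficient integer dtype:

```python
import numpy as np

def index_dtype(n_points):
    """Pick the smallest integer dtype able to index n_points items.

    Hypothetical helper, not part of GriSPy's API.
    """
    for dtype in (np.int16, np.int32):
        if n_points - 1 <= np.iinfo(dtype).max:
            return dtype
    return np.int64

# int16 suffices for small inputs; int32 covers up to ~2.1e9 points.
data = np.random.uniform(size=(1000, 3))
k_digit = np.zeros(data.shape, dtype=index_dtype(len(data)))
```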

Advice for applying to high-dimensional data

Hello,

I am experimenting with your module on high-dimensional data, so I modified the tutorial code, but the kernel keeps crashing when I try to use k=8 or more. Here's the code I am running (it's almost identical to the tutorial code):

import numpy as np
import matplotlib.pyplot as plt

from grispy import GriSPy as G

Npoints = 10 ** 1
Ncentres = 2
dim = 8
Lbox = 100.0

np.random.seed(0)
data = np.random.uniform(0, Lbox, size=(Npoints, dim))
centres = np.random.uniform(0, Lbox, size=(Ncentres, dim))

periodic = {n: (0, Lbox) for n in range(dim)}
gsp = G(data, periodic=periodic)
upper_radii = 10.0

# the kernel crashes when executing the following line
bubble_dist, bubble_ind = gsp.bubble_neighbors(
    centres, distance_upper_bound=upper_radii
)

Perhaps I am doing something wrong, so could you advise how (or whether) I can apply this to high-dimensional data, where k is on the scale of 10,000?

Unbounded memory allocation in bubble_neighbors

First off - thanks so much for GriSPy: it's working very well for my use case (mass manipulations of 3D reconstructions)!
Anyhow, I wanted to let you know that there is some unbounded memory allocation in the bubble_neighbors function that can cause an unexpected hard crash, which I suspect can probably be fixed without too much trouble.

Crashing use case:
I have a 3D reconstruction file with ~20M 3D points, and I want to find all points within {x} distance of a smaller set of ~1K points (so that I can cut out all other points). Grid search is a great performance improvement over other solutions I'd tried. But if I run the following, where gsp is the grid initialized with ~20M points and nvm_df is my positions pandas dataframe with the 1K centers to check:
distances, indices = gsp.bubble_neighbors(np.vstack(nvm_df['positions']), distance_upper_bound=8)

I can watch my memory usage climb until a crash (64GB workstation, so this caught me off guard!)

There is a very simple work-around: I can run the same grid search in a loop (50-100 centers at a time) and keep a running tally of the found indices myself. In that case memory use grows and is released with each finished gsp.bubble_neighbors invocation. So there isn't a memory leak proper, since the loop version works; but the naive version crashing is a surprise, given that bubble_neighbors already accepts an array of centers.

Please let me know if I can further help diagnose! Alternately, if you'd like a PR, I could try to take a look.
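The work-around described above can be sketched generically; query_fn stands in for gsp.bubble_neighbors with fixed keyword arguments (the helper and its names are illustrative only, not part of GriSPy):

```python
import numpy as np

def query_in_batches(query_fn, centres, batch_size=100):
    """Run a per-centre query in fixed-size batches and collect results.

    query_fn: callable taking an (n, dim) array of centres and returning
    (distances, indices) as two lists of arrays, one entry per centre.
    """
    all_dist, all_ind = [], []
    for start in range(0, len(centres), batch_size):
        # Each call's temporaries are freed before the next batch runs.
        dist, ind = query_fn(centres[start:start + batch_size])
        all_dist.extend(dist)
        all_ind.extend(ind)
    return all_dist, all_ind

# e.g.: query_in_batches(
#     lambda c: gsp.bubble_neighbors(c, distance_upper_bound=8), centres
# )
```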

indexing large data

Hello, thank you for this interesting work!

I wonder if there is a way to index large data in advance and store it in a file? (I guess pickling is one option)

nearest_neighbors search hanging on particular coordinate

Having a strange issue where the nearest neighbor search gets stuck on a specific point (index 147 in this dataset), but only if that point is included in a set. If I run the nearest neighbor search on only that point, it successfully runs. I've converted the coordinates to float64 as that posed an issue in the past.

The point giving the issue is circled in the bottom image; I noticed it is much farther from the other points than any other data point is. Not sure if that's related or a coincidence.

Index 147 runs by itself but does not run when included with index 146:
[screenshot omitted]
Index 147 is far from the other points:
[plot omitted]

Add a benchmark report functionality

Provide a function that computes time-benchmark stats for GriSPy methods, similar to the plots shown in the paper. This will let us easily compare the improvements introduced by new implementations. Time benchmarks are the first priority; after that, a memory benchmark report can be added.

API idea

The input should be the arrays of the parameter space over which the time stats should be computed.
For example:

import numpy as np
import grispy

# similar to Figure 4, second row.
Npoints = 10**np.arange(4, 8)
Ncells = 2**np.arange(3, 9)

# Compute the benchmark assuming a default configuration for the rest of the parameters
# Returns a custom pandas data frame with the stats
df = grispy.benchmark(Npoints=Npoints, Ncells=Ncells)  
df.plot(...)  # provide a custom plot method. This plot can be similar to those in the paper 

Optimize the index computation

GriSPy/grispy/core.py

Lines 350 to 354 in e70b75c

def _digitize(self, data, bins):
    """Return data bin index."""
    N = len(bins) - 1
    d = (N * (data - bins[0]) / (bins[-1] - bins[0])).astype(np.int)
    return d

This is a faster way of computing the index:

def _digitize(self, data, bins):
    """Return data bin index."""
    # allowed indices with int16: (-32768 to 32767)
    d = ((data - bins[0]) / (bins[1] - bins[0])).astype(np.int16)
    return d
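For uniform bins, the proposed arithmetic form agrees with NumPy's np.digitize up to the latter's 1-based offset; a quick self-contained check of that equivalence (a sketch, not GriSPy code):

```python
import numpy as np

bins = np.linspace(0.0, 10.0, 11)  # 10 uniform cells on [0, 10]
data = np.random.uniform(0.0, 10.0, size=1000)

# Proposed arithmetic indexing (valid for uniform bins only).
d_fast = ((data - bins[0]) / (bins[1] - bins[0])).astype(np.int16)

# np.digitize counts from 1 for values inside the first cell.
d_ref = np.digitize(data, bins) - 1

assert np.array_equal(d_fast, d_ref)
```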

Shell query distance bounds have inverted symbols

The distance validation should be:
lower_bound <= distance < upper_bound
and not:
lower_bound < distance <= upper_bound

This causes a point that queries itself to miss itself, since its distance to itself is exactly zero.

GriSPy/grispy/core.py

Lines 1112 to 1119 in 4dbdeea

mask_distances_upper = (
    neighbors_distances[i] <= distance_upper_bound[i]
)
mask_distances_lower = neighbors_distances[i][mask_distances_upper]
mask_distances_lower = (
    mask_distances_lower > distance_lower_bound[i]
)
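A pure-NumPy sketch of the two conventions, showing that the proposed lower_bound <= distance < upper_bound keeps the zero self-distance while the current one drops it:

```python
import numpy as np

distances = np.array([0.0, 0.5, 1.0, 1.5])
lower, upper = 0.0, 1.0

# Proposed: lower_bound <= distance < upper_bound (keeps the self-match).
mask_proposed = (distances >= lower) & (distances < upper)
kept = distances[mask_proposed]  # 0.0 and 0.5

# Current: lower_bound < distance <= upper_bound (drops the self-match).
mask_current = (distances > lower) & (distances <= upper)
```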

Passing grispy to threads/concurrent.futures

I've tested your library and it's brilliant. I've been trying to speed up a process that looks at many points and gets their nearest neighbours. I tried passing the grispy object to each thread/process, but I've had a couple of issues; namely, the script doesn't run when I do that.

Just as a quick check, is there a specific method I should use for parallelizing calls to a grispy object?

thanks again!
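Not a maintainer answer, but one common pattern is to treat the grid as read-only and split the centres across workers instead of sharing per-call state; a sketch with a stand-in query function (whether GriSPy queries are thread-safe, and whether the object pickles for process pools, are assumptions to verify):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def query(batch):
    # Stand-in for e.g. grid.bubble_neighbors(batch, distance_upper_bound=0.1):
    # returns one distances array and one indices array per centre.
    return [np.zeros(1) for _ in batch], [np.zeros(1, dtype=int) for _ in batch]

centres = np.random.uniform(size=(1000, 2))
batches = np.array_split(centres, 8)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(query, batches))

# Re-assemble per-centre results in the original order.
dist = [d for batch_d, _ in results for d in batch_d]
ind = [i for _, batch_i in results for i in batch_i]
```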

'invalid entry in coordinates array' when including more than 400 coordinate entries

Having a strange issue with a particular dataset, where if I try to run 'gsp = GriSPy(coords)' I get the following error:

ValueError                                Traceback (most recent call last)
in <module>
----> 1 gsp = GriSPy(coords)
      2
      3 shell_neighbors = {}
      4 for index, coord in enumerate(coords):
      5

in __init__(self, data, N_cells, periodic, metric, copy_data)
     14     __attr_validator_metric(self, __attr_metric, self.metric)
     15     __attr_validator_copy_data(self, __attr_copy_data, self.copy_data)
---> 16     self.__attrs_post_init__()

~/tensor_env/lib/python3.8/site-packages/grispy/core.py in __attrs_post_init__(self)
    185             periodic=self.periodic, dim=self.dim_)
    186
--> 187         self.grid_, self.k_bins_ = self._build_grid(
    188             data=self.data,
    189             N_cells=self.N_cells,

~/tensor_env/lib/python3.8/site-packages/grispy/core.py in _build_grid(self, data, N_cells, dim, epsilon)
    379         grid = {}
    380         if N_cells ** dim < len(data):
--> 381             compact_ind = np.ravel_multi_index(
    382                 k_digit.T, (N_cells,) * dim, order="F")
    383

<__array_function__ internals> in ravel_multi_index(*args, **kwargs)

ValueError: invalid entry in coordinates array

I thought maybe a particular entry in my array was causing the issue, so I attempted to narrow it down by running gsp = GriSPy(coords[:100]), gsp = GriSPy(coords[100:200]), etc., and found that any selection of 400 entries works, but as soon as I add a 401st entry I get the above error.

I've used this package on a different dataset with no issues.

Any thoughts?
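One thing worth checking (an educated guess, since np.ravel_multi_index rejects out-of-range indices, which NaN/inf coordinates would produce): whether the array contains non-finite entries, e.g.:

```python
import numpy as np

# Hypothetical coords array with one bad row, for illustration.
coords = np.array([[0.1, 0.2], [np.nan, 0.4], [0.5, 0.6]])

# Locate rows with NaN/inf values, which break the grid index computation.
bad_rows = np.where(~np.isfinite(coords).all(axis=1))[0]
print(bad_rows)
```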

Problem with the data range

If the line in example.py that generates the data is changed to
data = np.random.uniform(-10, 10, size=(Npoints, dim)),
Python emits the warning /home/ag/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:378: RuntimeWarning: divide by zero encountered in true_divide and the program enters an infinite loop. In the original example the range is -50, 50 and it works fine. Down to -20, 20 there is no problem either.
