
parallel_statistics

Overview

This package provides tools that compute weighted statistics on parallel, incremental data, i.e. data being read a chunk at a time by multiple processes.

The tools available are:

  • ParallelSum
  • ParallelMean
  • ParallelMeanVariance
  • ParallelHistogram
  • SparseArray

All assume that mpi4py is being used among the processes, and are passed a communicator object (often mpi4py.MPI.COMM_WORLD).
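The reason chunked, multi-process accumulation can work at all is that weighted partial sums computed independently per chunk combine exactly. The following is a plain-NumPy illustration of that principle (simulating the processes with a list of chunks), not the library's internals:

```python
import numpy as np

# Simulate three processes, each holding a chunk of (value, weight) pairs.
rng = np.random.default_rng(42)
chunks = [(rng.normal(size=n), rng.uniform(0.5, 1.5, size=n))
          for n in (100, 250, 37)]

# Each "process" accumulates only its own weighted partial sums.
partial = [(np.sum(w * v), np.sum(w)) for v, w in chunks]

# Combining the partials (what a collect/reduce step does conceptually)
# reproduces the single-pass weighted mean exactly.
total_wv = sum(p[0] for p in partial)
total_w = sum(p[1] for p in partial)
combined_mean = total_wv / total_w

# Compare against computing the mean over all the data at once.
all_v = np.concatenate([v for v, w in chunks])
all_w = np.concatenate([w for v, w in chunks])
direct_mean = np.sum(all_w * all_v) / np.sum(all_w)

assert np.isclose(combined_mean, direct_mean)
```

The same argument applies per bin, which is why the binned calculators below give identical results for any number of processes.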

Installation

For now you can install this package using:

pip install parallel_statistics

Documentation

Documentation can be found at https://parallel-statistics.readthedocs.io/

Example

The three tools ParallelSum, ParallelMean, and ParallelMeanVariance compute statistics in bins, and you add data to them per bin.

The usage pattern for them, and ParallelHistogram, is:

  • Create a parallel calculator object in each MPI process
  • Have each process read in its own chunks of data and add them using the add_data methods
  • Once complete, call the collect method to get the combined results.
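For mean and variance, the combine step in collect relies on the standard formula for merging partial (count, mean, M2) accumulators, as in Chan et al.'s parallel variance algorithm. The sketch below illustrates that principle; it is not necessarily the package's exact implementation:

```python
import numpy as np

def merge(a, b):
    """Merge two (count, mean, M2) accumulators into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

rng = np.random.default_rng(1)
data = rng.normal(size=1000)

# Each simulated process summarizes only its own chunk...
accs = [(len(c), c.mean(), ((c - c.mean()) ** 2).sum())
        for c in np.array_split(data, 4)]

# ...and merging the summaries reproduces the single-pass answer.
n, mean, m2 = accs[0]
for acc in accs[1:]:
    n, mean, m2 = merge((n, mean, m2), acc)

assert np.isclose(mean, data.mean())
assert np.isclose(m2 / n, data.var())
```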

Here's an example of splitting up data from an HDF5 file, using an example from the DESC tomographic challenge. You can run it either on its own, or under MPI with different numbers of processors, and the results should be the same:

import mpi4py.MPI
import h5py
import parallel_statistics
import numpy as np

# This data file is available at
# https://portal.nersc.gov/project/lsst/txpipe/tomo_challenge_data/ugrizy/mini_training.hdf5
f = h5py.File("mini_training.hdf5", "r")
comm = mpi4py.MPI.COMM_WORLD

# We must divide up the data between the processes
# Choose the chunk sizes to use here
chunk_size = 1000
total_size = f['redshift_true'].size
nchunk = total_size // chunk_size
if nchunk * chunk_size < total_size:
    nchunk += 1

# Choose the binning in which to put values
nbin = 20
dz = 0.2

# Make our calculator
calc = parallel_statistics.ParallelMeanVariance(size=nbin)

# Loop through the data
for i in range(nchunk):
    # Each process reads only its own assigned chunks
    # and skips the rest
    if i % comm.size != comm.rank:
        continue
    # work out the data range to read
    start = i * chunk_size
    end = start + chunk_size

    # read in the input data
    z = f['redshift_true'][start:end]
    r = f['r_mag'][start:end]

    # Work out which bins to use for it
    b = (z / dz).astype(int)

    # Add each value to its bin
    for j in range(z.size):
        # skip inf, nan, and sentinel values
        if not r[j] < 30:
            continue
        # add each data point
        calc.add_datum(b[j], r[j])

# Finally, collect the results together
weight, mean, variance = calc.collect(comm)

# Print out results - only the root process gets the data, unless you
# pass mode=allreduce to collect.  Bins with no objects print NaN.
if comm.rank == 0:
    for i in range(nbin):
        # Convert the variance to a standard deviation for display
        sigma = variance[i] ** 0.5
        print(f"z = [{ dz * i :.1f} .. { dz * (i+1) :.1f}]    r = { mean[i] :.2f} ± { sigma :.2f}")
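The round-robin assignment in the loop above (i % comm.size == comm.rank) guarantees that every chunk is read by exactly one process, whatever the number of processes, which is why the results match a serial run. A quick check of that invariant in plain Python:

```python
def chunk_assignment(nchunk, nproc):
    """Chunks each rank would read under the rule i % size == rank."""
    return [[i for i in range(nchunk) if i % nproc == rank]
            for rank in range(nproc)]

# Together the ranks cover every chunk exactly once, including when
# the chunk count is not divisible by the process count.
for nproc in (1, 2, 3, 8):
    flat = sorted(i for chunks in chunk_assignment(17, nproc) for i in chunks)
    assert flat == list(range(17))
```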


Issues

Update ParallelMeanVariance results to include the sum of squared weights

We would like to edit TXPipe's computation of the error, which uses the variance and the sum of the weights (or total counts) per bin collected from the ParallelMeanVariance class.

We hope to change the value of sigma from this calculation to a more correct one that includes a factor of N_eff, the effective sample size: the square of the sum of the weights divided by the sum of the squared weights (also expressed in equation 4 of the linked reference).

This issue is being submitted because, while TXPipe can compute the square of the sum of the weights itself, the sum of the squared weights needs to be returned as part of the result from ParallelMeanVariance.
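For reference, the effective sample size described above is N_eff = (sum of weights)² / (sum of squared weights). A small worked example with hypothetical weights:

```python
import numpy as np

w = np.array([1.0, 1.0, 0.5, 2.0])

# Effective sample size: (sum of weights)^2 / (sum of squared weights)
n_eff = w.sum() ** 2 / (w ** 2).sum()

# Equal weights recover the raw count exactly...
equal = np.ones(7)
assert np.isclose(equal.sum() ** 2 / (equal ** 2).sum(), 7)

# ...while unequal weights give fewer effective samples than the count.
assert n_eff < len(w)
```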
