
abacusutils


abacusutils is a package for reading and manipulating data products from the Abacus N-body project. In particular, these utilities are intended for use with the AbacusSummit suite of simulations. The package focuses on the Python 3 API, but there is also a language-agnostic Unix pipe interface to some of the functionality.

These interfaces are documented here: https://abacusutils.readthedocs.io

Press the GitHub "Watch" button in the top right and select "Custom->Releases" to be notified about bug fixes and new features!

Installation

The Python abacusutils package is hosted on PyPI and can be installed with pip:

pip install abacusutils

or, to include all optional dependencies:

pip install abacusutils[all]

For more information, see https://abacusutils.readthedocs.io/en/latest/installation.html.

Usage

abacusutils has multiple interfaces, summarized here and at https://abacusutils.readthedocs.io/en/latest/usage.html.

Specific examples of how to use abacusutils to work with AbacusSummit data will soon be given at the AbacusSummit website: https://abacussummit.readthedocs.io

Python

The abacusutils PyPI package contains a Python package called abacusnbody. This is the name to import (not abacusutils, which is just the name of the PyPI package). For example, to import the compaso_halo_catalog module, use

import abacusnbody.data.compaso_halo_catalog
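
For instance, a minimal sketch of loading a halo catalog with that module (the path and field names here are illustrative):

from abacusnbody.data.compaso_halo_catalog import CompaSOHaloCatalog

# Load two columns from one snapshot (illustrative path)
cat = CompaSOHaloCatalog('AbacusSummit_base_c000_ph000/halos/z0.100',
                         fields=['N', 'x_com'])
print(cat.halos[:5])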

Unix Pipes

The pipe_asdf Python script reads columns from ASDF files and pipes them to stdout. For example:

    $ pipe_asdf halo_info_000.asdf -f N -f x_com | ./client


abacusutils's Issues

Problem while trying to run the short example of AbacusHOD

Hi,

I'm running AbacusHOD through the new BinderHub.

First, I tried to run the first part of the process, running the prepare_sim code for z=0.500.

The first time, it took a few hours to reach slab number 33, producing two output files:
halos_xcom_32_seed600_abacushod_oldfenv_new.h5
particles_xcom_32_seed600_abacushod_oldfenv_new.h5

The next time, it reached slab 31 and produced:
halos_xcom_30_seed600_abacushod_oldfenv_new.h5
particles_xcom_30_seed600_abacushod_oldfenv_new.h5

I also repeated for z = 0.200 and 0.100.

Now, when I run the short example, I receive this error:

FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = '.../output/subsamples/AbacusSummit_base_c000_ph000/z0.100/halos_xcom_0_seed600_abacushod_oldfenv_new.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Also, it creates empty folders in the output directory for galaxies.
.../output/galalxies/AbacusSummit_base_c000_ph000/z0.500

Add tutorials and reformat docs

To support tutorials and more readable docs, we should probably migrate to an Executable Book template and start writing notebook-style tutorials there.

Parallel_Numpy_Rng not found

Hello,

I am trying to use AbacusHOD but keep running into problems. I am following the examples provided here:

https://abacusutils.readthedocs.io/en/latest/hod.html

but when I run the command:

from abacusutils.abacusnbody.hod.abacus_hod import AbacusHOD

it tells me there is no module named 'parallel_numpy_rng'. I've tried looking online for resources about this package but haven't been able to find any. Where is this module originally from (i.e., where is it getting called from), and what are some ways to fix this issue? I understand what I am writing is pretty vague, so I will be happy to elaborate further if necessary.

Best,
Weston

missing info in metadata module

  1. c021 and c022: They are missing because they only exist as highbase boxes. Should be easy to fix.
  2. power spectra: would need to compress, limit the k-range, or reduce the k-resolution
  3. A_s: missing from the abacus params, so need to read from CLASS.ini

Thanks to @adematti for the report on item 1!

Sample AbacusHOD: index out of range

Hello again,

I am still working on getting AbacusHOD to work and I am stuck on the example where we construct an AbacusHOD object. When I run the provided cell, I get an error when the code tries to run the 'newBall' line. When I run this, the error message it gives me is:

~/.local/lib/python3.6/site-packages/abacusnbody/hod/abacus_hod.py in staging(self)
    351 halo_info_fns =
    352     list((sim_dir / simname / 'halos' / ('z%4.3f'%self.z_mock) / 'halo_info').glob('*.asdf'))
--> 353 f = asdf.open(halo_info_fns[0], lazy_load=True, copy_arrays=False)
    354 header = f['header']
    355

IndexError: list index out of range

I opened abacus_hod.py to check the variables being called and that led me to abacus.yaml. Currently the path to my asdf files is:

AbacusSummit_base_c000_ph000/halos/z0.000/halo_info

And the beginning of my abacus_hod.yaml file looks like:

# Simulation parameters
sim_params:
    sim_name: 'AbacusSummit_base_c000_ph000'
    sim_dir: 'AbacusSummit_base_c000_ph000/halos/z0.100/halo_info/'
    output_dir: 'AbacusSummit_base_c000_ph000/mock_summit/'
    subsample_dir: 'AbacusSummit_base_c000_ph000/cleaned_summit/'
    z_mock: 0.1
    cleaned_halos: True

I've tried other variations of this, but I still get the same error message. At this point I am unsure whether the error is due to defining these variables incorrectly or if it's due to something different. As always, please let me know what further information you need in order to help.

Thank you in advance,
Weston

More descriptive error message when missing files

If the user requests subsample files that are not on disk, provide a more informative error message than "File not found". E.g., load_subsamples=True but no field particles on disk, a common situation!
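
A minimal sketch of the kind of check this could become (the helper name and message wording are hypothetical):

from pathlib import Path

def open_subsample_file(fn, kind='field'):
    # Fail early, describing what was requested rather than just "File not found"
    fn = Path(fn)
    if not fn.is_file():
        raise FileNotFoundError(
            f'{kind} particle subsample file {fn} is not on disk, but '
            f'load_subsamples requested it. Were {kind} particles written '
            'for this simulation?')
    return open(fn, 'rb')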

Unpack cleaned subsamples directly into subsample table

Currently, we build a concatenated table of all cleaned particles, reindex it, merge it with the original subsamples, then do the RVint unpacking on the whole table. We may be able to achieve better performance by never constructing the concatenated cleaned particle table and instead do the RVint unpacking directly into the final location in the combined particle table.
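
A minimal sketch of the idea, assuming the posout/velout out-arguments of abacusnbody.data.bitpacked.unpack_rvint; the block list, row slices, box size, and total row count are hypothetical inputs:

import numpy as np

from abacusnbody.data.bitpacked import unpack_rvint

def unpack_cleaned_into(rvint_blocks, row_slices, boxsize, ntot):
    # Preallocate the final columns once, then decode each cleaned-particle
    # RVint block directly into its slice, skipping the concatenated table
    pos = np.empty((ntot, 3), dtype=np.float32)
    vel = np.empty((ntot, 3), dtype=np.float32)
    for rvint, (start, stop) in zip(rvint_blocks, row_slices):
        unpack_rvint(rvint, boxsize, posout=pos[start:stop], velout=vel[start:stop])
    return pos, vel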

add function to read multiple particle files

Currently, we expose a function to read a single particle file (read_abacus.read_asdf()). We should add a higher-level function that can read multiple files into a single table.

Basically, we want a smarter version of this snippet:

from pathlib import Path
from abacusnbody.data import read_abacus
import astropy.table

allp = []
for fn in Path('AbacusSummit_small_c000_ph3000/halos/z1.100/').glob('*_rv_*/*.asdf'):
    allp += [read_abacus.read_asdf(fn, load=['pos'])]
allp = astropy.table.vstack(allp)
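
A minimal sketch of what that higher-level function might look like (the name read_asdf_multi and its signature are hypothetical):

from pathlib import Path

import astropy.table

from abacusnbody.data import read_abacus

def read_asdf_multi(fns, load=None):
    # Read each file, then stack the per-file tables into one
    tables = [read_abacus.read_asdf(fn, load=load) for fn in sorted(fns)]
    return astropy.table.vstack(tables)

# Usage, mirroring the snippet above
allp = read_asdf_multi(
    Path('AbacusSummit_small_c000_ph3000/halos/z1.100/').glob('*_rv_*/*.asdf'),
    load=['pos'])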

Several features and bug fixes for AbacusHOD and light cones

Features:

  • output in the HOD light cone galaxy files the RA, DEC and CZ (also Zobs and Zreal) for each galaxy (or make that an option?)
  • when doing redshift space distortions on the light cone, it currently takes the redshift of the entire halo catalog (more or less the median redshift of all halos in it), but it would be more accurate to use the redshift_interp field (not sure if it's worth it or whether it would make AbacusHOD more cumbersome)
  • output in all the HOD galaxy files whether they were produced using light cones or cubic box
  • add scripts for applying arbitrary mangle masks to fits and dat AbacusHOD outputs and generating randoms
  • add scripts for computing convergence and shear maps.

Possible bug fix:

  • the particle subsampling in AbacusHOD is (I believe) resolution-dependent; we need to fix that before using the high and huge boxes. (Also, note that currently only subsampleA particles are used; does this raise any issues with lower-mass samples?)

Documentation:

  • add documentation and reference to the abacus_lc_cat repo

load_subsamples syntax

The "command string" syntax for the CompaSO load_subsamples is not intuitive. Should just let that argument accept a dict like dict(A=True, B=False, halo=True, field=False, pos=True, vel=False).

implement power spectrum tests against nbodykit

@boryanah added power spectrum computation routines in #65 (thanks!). We should add tests against nbodykit to the repo. Since #65 is closed, we can discuss and track progress here.

Responding to #65 (comment):

The code already looks fairly modular to me, but I agree the Pk stuff could go in its own file. And we could probably even make a new abacusnbody/clustering/ directory for the CF and Pk files, since they're in the hod dir right now but are useful beyond just HOD.

For the tests, you can just make test_power.py where the rest of the tests are. The mini sim has halos, particle subsamples, and particle slices; you can use any of these as the input data set.

test_power.py will look like:

from pathlib import Path

import pytest
from astropy.table import Table
import numpy as np

from common import check_close

curdir = Path(__file__).parent
refdir = curdir / 'ref_power'
EXAMPLE_SIM = curdir / 'Mini_N64_L32'
NBODYKIT_POWER = refdir / 'nbodykit_power.csv'

def test_power():
    '''Compare the power spectrum against the saved nbodykit result'''
    # TODO: this is all pseudo-ish code
    ref_power = Table.read(NBODYKIT_POWER)

    from abacusnbody.hod.power import compute_power

    pos = ...  # TODO: load the input data set (halos, subsamples, or slices)
    pk_params = {}  # TODO: delta-k, etc.; must match the reference binning

    abacus_power = compute_power(pos, **pk_params)

    assert check_close(abacus_power['k'], ref_power['k'])
    n_mubins = len(ref_power)  # assuming one reference row per mu bin
    for i in range(n_mubins):
        assert check_close(ref_power[i]['power'], abacus_power[i]['power'])

You can just run your nbodykit script locally and then save the result in a CSV file in the repo. That way the CI doesn't have to install/run nbodykit. But you can save it as tests/generate_nbodykit_power.py so someone else can run it in the future if needed.
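
A minimal sketch of what tests/generate_nbodykit_power.py could contain (nbodykit usage from memory; the input file and binning parameters are placeholders and must be matched to what compute_power uses):

import numpy as np
from nbodykit.lab import ArrayCatalog, FFTPower

# Hypothetical: (N, 3) particle positions extracted from the mini sim
pos = np.load('mini_sim_pos.npy')

cat = ArrayCatalog({'Position': pos})
mesh = cat.to_mesh(Nmesh=64, BoxSize=32.0, resampler='tsc', compensated=True)
result = FFTPower(mesh, mode='2d', Nmu=4, kmin=0.0)

# Flatten the (k, mu) grid and save as the CSV reference for test_power.py
p = result.power
np.savetxt('ref_power/nbodykit_power.csv',
           np.column_stack([p['k'].ravel(), p['power'].real.ravel(), p['modes'].ravel()]),
           delimiter=',', header='k,power,modes')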

I agree, it's probably worth matching the binnings between the codes. I usually find that checking the number of modes in each bin is the best way to confirm that the binnings are the same. In theory, the results should match very closely!

As you point out in the comments, there are opportunities for optimization/parallelization in the future, so it's important to get the tests in place before that work!

prepare_sim.main issues

Hi all,

I am trying to use AbacusHOD, and to do so I must first run:

python -m abacusnbody.hod.prepare_sim --path2config /script/hod/config/abacus_hod.yaml

However when I do so, I get an error saying:

PermissionError: [Errno 13] Permission denied: '/mnt/marvin1'

I don't understand where this error is coming from or how to fix it. Do I have the correct .yaml file, or is there a different one I am supposed to use? I've also tried uninstalling and reinstalling abacusutils, but that doesn't seem to help. As always, if more info is needed to address this issue, I will be happy to provide it.

Best,
Weston

Fix numba parallel segfault in `compute_Menv`

In #45, we found that using Numba parallel was causing a segfault in compute_Menv and calc_fenv in prepare_sim. It doesn't appear to be a problem with the functions themselves, but rather something at the library level. Probably related to using Numba parallel inside multiprocessing (even though we are using the nominally fork-safe workqueue backend).

Debugging this is probably a heavy lift, so for now we've just disabled Numba parallel for those two functions.

Add changelog checker

We should add a changelog bot/GitHub Action that gives us a green checkmark when we add a changelog entry with the right PR number.

Multi-threaded file IO for loading halo catalogs

On network file systems, we can probably get better IO performance in CompaSOHaloCatalog by reading multiple files in parallel. Hopefully the time-intensive parts of the IO release the GIL, so we could just spin up Python-level threads and use a queue to pass the loaded arrays back to the main thread.

Note that each IO thread will additionally spin off 1 to 4 Blosc threads to do the decompression.
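
A minimal sketch of that pattern using a thread pool instead of an explicit queue (the column names are illustrative, and it assumes asdf/Blosc release the GIL during reads):

from concurrent.futures import ThreadPoolExecutor

import asdf

def _read_one(fn):
    # If decompression releases the GIL, several of these can run concurrently
    with asdf.open(fn, lazy_load=True) as af:
        return {col: af['data'][col][:] for col in ('N', 'x_com')}

def read_halo_info_parallel(fns, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_read_one, fns))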

Memory requirement for prepare_sim

Hi all,

I was wondering if any of you have successfully generated light cone subsamples with prepare_sim on Perlmutter for z >= 0.8. The lower redshifts work fine, but for z = 0.8 the memory requirement hits the ~500 GB limit of a Perlmutter CPU node and the code chokes at this step:

compiling compaso halo catalogs into subsampled catalogs
processing slab  0
loading halo catalog 
total number of halos,  40906121 keeping  9132292
masked randoms =  12.467233937923373
Building and querying trees for mass env calculation

My configuration file uses

prepare_sim:
    Nparallel_load: 1

(but I don't think this helps since we are processing a single slab anyway).

Is there any workaround for this? I guess one could try decreasing the number of randoms for the environment calculation, but this is already low relative to the number of haloes, so I don't know how safe that would be...

Cheers,
Enrique

Missing key in halo_data.

I'm trying to run AbacusHOD on the new "DESI2" simulation that Lehman has generated (Abacus_DESI2_c000_ph300). During construction of the AbacusHOD object it throws a missing key error:

newBall = AbacusHOD(sim_params, HOD_params, clustering_params)
  File "/global/homes/m/mwhite/.conda/envs/abacus/lib/python3.9/site-packages/abacusnbody/hod/abacus_hod.py", line 153, in __init__
    np.vstack((np.log10(self.halo_data['hmass']), self.halo_data['hdeltac'], self.halo_data['hfenv'])).T,
KeyError: 'hdeltac'

where

sim_params: {'sim_name': 'Abacus_DESI2_c000_ph300', 'sim_dir': '/global/cfs/cdirs/desi/cosmosim/Abacus/', 'output_dir': './', 'subsample_dir': './', 'z_mock': 2.5, 'cleaned_halos': True}

Editing logM_cut in abacus_hod.yaml

Hi All,

I am trying to run the HOD module with an edited abacus_hod.yaml file, specifically with an edited LRG logM_cut value. So far I've tried directly editing the logM_cut value in abacus_hod.yaml, but when I generate correlation functions from the output .hdf5 files, the plot is identical to the one made with the default logM_cut value. Are there more files I need to edit to generate modified .hdf5 files, or is this an issue with my correlation function pipeline?

Some issues from NERSC

A few issues and suggestions from my tests of AbacusHOD at NERSC. This was with v0.4.0, so possibly slightly out of date.

  • Nthread_load of 7 runs out of memory on Haswell nodes (128G). 4 works (maybe 5 or 6 would, too). Should lower the default; most users will be at NERSC.
  • The recommended value of mem/20 in the docs may not be conservative enough
  • Maybe put Nthread_load in a different YAML section since it only applies to the prep stage? Like prepare_sim: nthreads: 4?
  • Maybe nparallel instead of nthreads, trying to emphasize that it's not just threads but also memory that will go up?
  • The distinction between subsample_dir and scratch_dir is not super clear. Maybe rename them to prep_dir and output_dir, or something like that? Or prep_dir could live under prepare_sim: output_dir: '...', nthreads: ...
  • Is prepare_sim a clear enough name? Maybe preprocess_sim or sim_staging?
  • There's a lot of output that flies by when running these scripts; it's hard for an end user to tell if all is well. Maybe show a limited amount of progress information to the user, and hide the rest behind a --verbose flag?
  • Do we need a config directory if it's just going to hold one file?
  • prepare_sims.py config_file would be more idiomatic than prepare_sims.py --path2config config_file, just because config_file is the primary argument. It can still be optional.
  • Can remove the >>> in front of the scripts in the ReadTheDocs, it makes it hard to copy-paste. It's usually only needed to interleave code and output.

Some functions in ZCV broken after power spectrum code changes

line 1145, in apply_zcv_xi
    r_binc, binned_poles_zcv, Npoles = pk_to_xi(asdf.open(power_cv_tr_fn)['data']['P_k3D_tr_tr_zcv'], self.lbox, r_bins, poles=config['power_params']['poles'])
  File "/global/homes/m/mwhite/.conda/envs/abacus/lib/python3.9/site-packages/abacusnbody/analysis/power_spectrum.py", line 595, in pk_to_xi
    _, _, binned_poles, Npoles, r_avg = bin_kmu(nmesh, Lbox, r_bins, muedges=muedges, weights=Xi, poles=poles, space='real')
TypeError: some keyword arguments unexpected

adopt superslab terminology

We want to refer to the CompaSO files as "superslabs", describing the concatenation of multiple slabs. Right now the code uses "chunk" in a few places; let's change that to "superslab" to be consistent with our papers.

Metadata in mock catalog files?

From #28 (comment):

One small thing I noticed while looking at the tests is that the ELG.dat and LRG.dat files don't flag that they came from halo light cones. In fact, they don't contain any simulation or cosmology information, just the HOD parameter information (Acen, Asat, etc). I've kind of lost track of what file format we're using to distribute AbacusSummit mocks, but we might want to check that we are echoing the simulation/cosmology information to those files. Maybe we're already doing that though, just not in the tests; @SandyYuan, do you know?

Improvements to the power spectrum and zcv module

The power spectrum code can be optimized to run faster. While we do not want to enable MPI on it for now, it is worth implementing multithreading/parallelization improvements. Several quick wins include (but we need not limit ourselves to these):

  • Swapping scipy.fft for pyfftw, which is multithreaded.
  • Painting the TSC/CIC grid can be done using multithreading straightforwardly.
  • For more robust performance (i.e. avoiding numerical roundoff errors), we can do the power spectrum binning in integer wavemode indices rather than floats (see the sketch after this list).
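
A minimal sketch of the integer-binning idea (a standalone illustration, not the module's actual code): every wavevector on an nmesh^3 grid is (2*pi/Lbox) times an integer triple (nx, ny, nz), so modes can be grouped exactly by the integer nx^2 + ny^2 + nz^2, with no floating-point edge ambiguity.

import numpy as np

def integer_n2(nmesh):
    # Integer mode indices along each axis, in FFT layout (rfft on the last axis)
    n = np.fft.fftfreq(nmesh, d=1.0 / nmesh).astype(np.int64)
    nz = np.arange(nmesh // 2 + 1, dtype=np.int64)
    # Exact squared wavenumber index; physically, k = 2*pi*sqrt(n2)/Lbox
    return n[:, None, None]**2 + n[None, :, None]**2 + nz[None, None, :]**2

# e.g., count the modes falling in each exact n^2 shell
counts = np.bincount(integer_n2(32).ravel())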

On the side of the ZCV module:

  • Currently the zenbu calls are very slow, so there is room to improve those.
  • Also, it would be great to rewrite the parts that use Joe DeRose's code and add comments to them (i.e. tools_jdr.py)

Finally, it would be helpful to add notebooks with examples of how to use the power spectrum and the ZCV module.

Factorize dependencies

From cosmodesi/cosmodesiconda#1: if abacusutils is being folded into the DESI software stack, we ought to think about our dependencies and make sure they're all required. Furthermore, we might separate them into multiple categories: the base dependencies one needs to import the code, and "extras" needed to run the examples/tests/scripts.

Trouble importing abacusnbody

Hi there,

I am trying to import abacusnbody into jupyter-notebook but am having troubles. When I run the command:

from abacusnbody.data.compaso_halo_catalog import CompaSOHaloCatalog

I get an error message saying:

No module named 'abacusnbody.version'

In the error message I see it's trying to go to the abacusnbody directory where __init__.py is (which I've checked I have), yet the import still fails. I am very much not tech savvy, so this may be an easy fix. If so, I apologize for the waste of time but will still appreciate the help.

Best,
Weston
