
dingo's Introduction

Dingo

Dingo (Deep Inference for Gravitational-wave Observations) is a Python program for analyzing gravitational wave data using neural posterior estimation. It dramatically speeds up inference of astrophysical source parameters from data measured at gravitational-wave observatories. Dingo aims to enable the routine use of the most advanced theoretical models in analyzing data, to make rapid predictions for multi-messenger counterparts, and to do so in the context of sensitive detectors with high event rates.

The basic approach of Dingo is to train a neural network to represent the Bayesian posterior conditioned on data. This enables amortized inference: when new data are observed, they can be plugged in and results obtained in a small amount of time. Tasks handled by Dingo include

  • building training datasets;
  • training normalizing flows to estimate the posterior density;
  • performing inference on real or simulated data; and
  • verifying and correcting model results using importance sampling.

Installation

Pip

To install using pip, run the following within a suitable virtual environment:

pip install dingo-gw

This will install Dingo as well as all of its requirements, which are listed in pyproject.toml.

Conda

Dingo is also available from the conda-forge repository. To install using conda, first activate a conda environment, and then run

conda install -c conda-forge dingo-gw

Development install

If you would like to make changes to Dingo, or to contribute to its development, you should install Dingo from source. To do so, first clone this repository:

git clone git@github.com:dingo-gw/dingo.git

Next create a virtual environment for Dingo, e.g.,

python3 -m venv dingo-venv
source dingo-venv/bin/activate

This creates and activates a venv for Dingo called dingo-venv. In this virtual environment, install Dingo:

cd dingo
pip install -e ."[dev]"

This command installs an editable version of Dingo, meaning that any changes to the Dingo source are reflected immediately in the installation. The inclusion of dev installs extra packages needed for development (code formatting, compiling documentation, etc.)

Usage

For instructions on using Dingo, please refer to the documentation.

References

Dingo is based on the following series of papers:

  1. https://arxiv.org/abs/2002.07656: 5D toy model
  2. https://arxiv.org/abs/2008.03312: 15D binary black hole inference
  3. https://arxiv.org/abs/2106.12594: Amortized inference and group-equivariant neural posterior estimation
  4. https://arxiv.org/abs/2111.13139: Group-equivariant neural posterior estimation
  5. https://arxiv.org/abs/2210.05686: Importance sampling
  6. https://arxiv.org/abs/2211.08801: Noise forecasting

If you use Dingo in your work, we ask that you please cite at least https://arxiv.org/abs/2106.12594.

Contributors to the code are listed in AUTHORS.md. We thank Vivien Raymond and Rory Smith for acting as LIGO-Virgo-KAGRA (LVK) review chairs. Dingo makes use of many LVK software tools, including Bilby, bilby_pipe, and LALSimulation, as well as third party tools such as PyTorch and nflows.

Contact

For questions or comments please contact Maximilian Dax or Stephen Green.

dingo's People

Contributors

annalena-k, hectorestelles, jonaswildberger, max-dax, mpuerrer, nihargupte-ph, stephengreen, transientlunatic, virginiademi


dingo's Issues

Domains

Just a few suggestions for how we implement domains, the first one being the
most significant:

  • Implement a time_translate method within each domain. That way the calling
    function does not have to know how to implement this for each domain. (It
    could get complicated for TimeDomain, for instance.) This method should take
    an array of strain data and an amount to time translate by. Time shifting
    gets called at several points throughout the code (waveform generation,
    detector projection, GNPE inference), so it would be good to
    implement it just once (see the sketch after this list).
  • Rename UniformFrequencyDomain to FrequencyDomain and change the string
    "uFD" to "FD". A uniform frequency domain is standard, so no sense confusing
    people. We can later introduce NonuniformFrequencyDomain.
  • I think noise_std does belong with the domains, because it depends only on
    the domain. Maybe it should be renamed white_noise_std, since it is the
    standard deviation for white noise in each bin. For
    NonuniformFrequencyDomain this could get more complicated, as it would be
    frequency-dependent, so we need to think carefully about how to implement
    that.
  • window_factor should maybe be moved out of FrequencyDomain, since this
    depends on how we take our FFTs when we estimate the noise PSD, and this is
    not necessarily known when we use the domain to build a dataset of waveforms.
    However the best place for this is not obvious; maybe it belongs with the
    noise, which we haven't really dealt with yet. Any thoughts?
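
A rough sketch of what the first point could look like for the frequency domain (class and attribute names here are illustrative, not the final API):

import numpy as np

class FrequencyDomain:
    """Uniform frequency grid [0, f_max] with spacing delta_f (sketch)."""

    def __init__(self, f_min, f_max, delta_f):
        self.f_min, self.f_max, self.delta_f = f_min, f_max, delta_f
        self.sample_frequencies = np.arange(0.0, f_max + delta_f, delta_f)

    def time_translate(self, strain, dt):
        """Translate frequency-domain strain data by dt seconds.

        In the frequency domain a time shift is a multiplication by
        exp(-2*pi*i*f*dt), so each domain can encapsulate how the shift
        is actually applied.
        """
        return strain * np.exp(-2j * np.pi * self.sample_frequencies * dt)

A TimeDomain implementation would instead roll the strain by the corresponding number of samples, which is exactly the kind of detail the calling code should not need to know.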

In general the only domain we care about at this point is FrequencyDomain,
and later we almost certainly also want NonuniformFrequencyDomain. It's okay to leave
TimeDomain not fully implemented throughout for now.

Profile synthetic phase

Why is sampling synthetic phase so much slower than a single likelihood evaluation? Profiling the code with print statements

        print(
            f"Modes: {t_modes:.3f} ({t_modes/t_total * 100:.1f}%)\t\t"
            f"Inner products: {t_inner:.3f} ({t_inner/t_total * 100:.1f}%)\t\t"
            f"Phase loop: {t_phase:.3f} ({t_phase/t_total * 100:.1f}%)"
        )

we get:

Modes: 0.189 (50.2%)		Inner products: 0.002 (0.7%)		Phase loop: 0.184 (49.1%)
Modes: 0.189 (47.1%)		Inner products: 0.003 (0.6%)		Phase loop: 0.209 (52.3%)
Modes: 0.195 (48.0%)		Inner products: 0.002 (0.6%)		Phase loop: 0.209 (51.4%)
Modes: 0.186 (48.1%)		Inner products: 0.002 (0.6%)		Phase loop: 0.198 (51.3%)
Modes: 0.191 (46.9%)		Inner products: 0.002 (0.6%)		Phase loop: 0.213 (52.5%)
Modes: 0.190 (45.1%)		Inner products: 0.003 (0.6%)		Phase loop: 0.229 (54.3%)
Modes: 0.189 (44.6%)		Inner products: 0.007 (1.6%)		Phase loop: 0.229 (53.8%)
Modes: 0.212 (49.5%)		Inner products: 0.002 (0.6%)		Phase loop: 0.214 (49.9%)
Modes: 0.207 (48.4%)		Inner products: 0.002 (0.6%)		Phase loop: 0.218 (51.0%)
Modes: 0.219 (51.3%)		Inner products: 0.002 (0.6%)		Phase loop: 0.206 (48.1%)
Modes: 0.191 (51.5%)		Inner products: 0.002 (0.7%)		Phase loop: 0.177 (47.9%)
Modes: 0.194 (51.9%)		Inner products: 0.002 (0.6%)		Phase loop: 0.177 (47.5%)
Modes: 0.357 (64.9%)		Inner products: 0.003 (0.5%)		Phase loop: 0.190 (34.6%)
Modes: 0.422 (68.9%)		Inner products: 0.002 (0.4%)		Phase loop: 0.188 (30.7%)

The phase loop can likely be vectorized, but is there a way to speed up mode computation? This seems to take much longer than a simple waveform evaluation, although in the lalsimulation backend both should compute the same things.

Creating Documentation

๐Ÿž What bugs you?
Some documentation would be helpful for new users who want to utilize dingo.

Goal
Create basic HTML documentation along with an Installation and Getting Started page.

Possible solutions
Use Sphinx's autodoc extension to automatically import docstrings into the documentation. It looks like we can use NumPy-style docstrings instead of the :: field-list docstrings if the sphinx.ext.napoleon extension is enabled.
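
For reference, a minimal conf.py excerpt with the kind of setup this would need (the exact extension list for Dingo's documentation is still to be decided):

# docs/conf.py (excerpt) -- enable autodoc and NumPy-style docstrings
extensions = [
    "sphinx.ext.autodoc",    # pull docstrings into the documentation automatically
    "sphinx.ext.napoleon",   # parse NumPy/Google style docstrings
]
napoleon_numpy_docstring = True
napoleon_google_docstring = False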

GNPE kernels to bilby format

The current format of the GNPE kernel in the settings file

  gnpe_time_shifts:
    kernel_kwargs: {type: uniform, low: -0.001, high: 0.001}
    exact_equiv: True

requires a wrapper to extract the kernel. We can circumvent this by using the Bilby distributions, which also allow for efficient sampling:

  gnpe_time_shifts:
    kernel: bilby.core.prior.Uniform(minimum=-0.001, maximum=0.001)
    exact_equiv: True
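
A rough sketch of how such a kernel string could be turned into a prior object on the Dingo side (just one possible approach, using eval restricted to the bilby namespace):

import bilby

def kernel_from_settings(kernel_str):
    """Convert a kernel string from the settings file into a bilby prior.

    Example input: "bilby.core.prior.Uniform(minimum=-0.001, maximum=0.001)"
    """
    # Only expose the bilby module to eval, so arbitrary code cannot run.
    kernel = eval(kernel_str, {"__builtins__": {}}, {"bilby": bilby})
    if not isinstance(kernel, bilby.core.prior.Prior):
        raise ValueError(f"Not a valid prior specification: {kernel_str}")
    return kernel

# kernel.sample(1000) then draws efficiently from the kernel.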

Parameter marginalization

It may be useful to give the option in training to marginalize over uninteresting parameters such as the phase of coalescence, or some of the spin angles. This would reduce the complexity of the posterior and therefore likely improve performance. We could do this by just having a list of marginalized parameters, for which we drop labels during training.

Training stages

I am trying to think of a way to enable several stages of training, e.g., pre-training with a fixed PSD, and main training with variable PSD. One way I can think of is to specify in train_settings.yaml something like this:

# Usual stuff up here. This specifies the initial training settings ("stage 0")
noise: fixed_asd.txt
freeze_rb_layer: True

training_stage_1:
  start_epoch: 300  # Start second stage at 300 epochs
  noise: asd_dataset_O3.hdf5  # Only specify the settings that change from the initial settings.
  freeze_rb_layer: False
  scheduler:  ... # whatever
  
training_stage_2:
  start_epoch: 500  # If we wanted yet another stage

The code would then use the initial settings until epoch 300, at which point it rebuilds the transforms (and possibly the scheduler) based on the second-stage settings. It would obviously have to check for compatibility of the stages (e.g., so that we don't change the neural architecture, etc.). We could also alter the settings mid-run, so long as we only change stages that begin after the current epoch.
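
A sketch of how the active settings could be resolved for a given epoch (the stage keys follow the example above; everything else is illustrative):

def settings_for_epoch(settings, epoch):
    """Return the training settings active at a given epoch.

    `settings` is the parsed train_settings.yaml dictionary; stage blocks are
    expected under keys 'training_stage_1', 'training_stage_2', ... and only
    contain the entries that change relative to the initial settings.
    """
    active = {k: v for k, v in settings.items()
              if not k.startswith("training_stage_")}
    stage = 1
    while f"training_stage_{stage}" in settings:
        overrides = settings[f"training_stage_{stage}"]
        if epoch >= overrides["start_epoch"]:
            active.update({k: v for k, v in overrides.items()
                           if k != "start_epoch"})
        stage += 1
    return active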

What do you think? I'm also of course open to other suggestions.

dingo inference discussion

My suggestion for the inference script is as follows:

dingo_inference 
  --model /path/to/model.pt
  --GPS_times 1126259462.4 1128678900.4
  --dataset_file /path/to/event_dataset.hdf5
  • The model contains data settings. These can be used to extract the format for the strain data and PSDs (domain, window, ...)
  • The script then downloads the strain data + PSDs with settings consistent with the model
  • Since this is costly, we should (optionally) cache the data: if a dataset_file is specified, the script
    1. Checks whether the settings in dataset_file are consistent with the model settings -> if not, aborts
    2. Checks whether the GPS times are in the dataset -> if not, downloads the data and saves it to the file
  • The dataset would inherit from DingoDataset, and as usual, save all the settings in f.attrs["settings"]
  • The script then performs inference for all GPS times

Extensions

  • I am not entirely happy to specify GPS times instead of events (i.e., --events GW150914 GW151012); but the GPS times are required. Maybe we could pass --events GW150914 1126259462.4 GW151012 1128678900.4 instead. The dataset_file then saves a mapping between events and GPS times, such that for cached events it suffices to specify the event name when calling the inference script.
  • The script should automatically detect whether we do GNPE or not. It will be a bit of work implementing that, but it should be straightforward and similar to what we do for the transformations in train setup. Maybe we could use a function shared between training and inference to build GNPE transforms. It would return the identity transformation if no GNPE arguments are specified. It might also need a flag training/inference.
  • For GNPE we also need a second model for the pose initialisation. This would need to be passed as another argument.

With these considerations, the interface would look as follows.

dingo_inference 
  --model /path/to/model.pt 
  --model_gnpe_init /path/to/pose_model.pt
  --events GW150914 GW151012
  --dataset_file /path/to/event_dataset.hdf5
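
A sketch of the caching logic for the event dataset (the file layout and the download_fn helper are hypothetical; the real implementation would go through DingoDataset as described above):

import ast
import h5py

def get_event_data(dataset_file, gps_time, model_settings, download_fn):
    """Return cached strain/PSD data for a GPS time, downloading if necessary.

    download_fn(gps_time, model_settings) is a placeholder for whatever routine
    fetches and conditions the data consistently with the model settings.
    """
    with h5py.File(dataset_file, "a") as f:
        stored = ast.literal_eval(f.attrs.get("settings", "{}"))
        if stored and stored != model_settings:
            raise ValueError("Event dataset settings are inconsistent with the model.")
        key = str(gps_time)
        if key not in f:
            f.create_dataset(key, data=download_fn(gps_time, model_settings))
            f.attrs["settings"] = str(model_settings)
        return f[key][()]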

Memory Problems During Inference with GNPE

For some reason when calling the inference code, CUDA runs out of memory.

Eg

dingo_posterier = dingo.gw.inference.sampling_functions.sample_posterier_of_injection(strain_data, model=main_pm, model_init=time_pm, device='cuda', num_samples=50_000, batch_size=10_000)

This will run out of memory on a 40 GB A100 GPU.

Temporary workaround:

import numpy as np
import dingo.gw.inference.sampling_functions

def sample_posterier(strain_data, main_pm, time_pm):
    def memory_wrapper(strain_data, main_pm, time_pm):
        dingo_posterier = dingo.gw.inference.sampling_functions.sample_posterier_of_injection(strain_data, model=main_pm, model_init=time_pm, device='cuda', num_samples=10_000)
        return dingo_posterier

    # Draw 50,000 total samples by sampling 10,000 from the network 5 times (should be deterministic?). Done for memory reasons.
    for i in range(5):
        dingo_posterier_temp = memory_wrapper(strain_data, main_pm, time_pm)
        if not 'dingo_posterier' in locals():
            dingo_posterier = {k:v.cpu().numpy() for k,v in dingo_posterier_temp.items()}
        else:
            dingo_posterier = {k:np.concatenate((dingo_posterier[k], dingo_posterier_temp[k].cpu().numpy()), axis=None) for k,v in dingo_posterier.items()}

    return dingo_posterier

dingo_posterier = sample_posterier(strain_data, main_pm, time_pm)

This samples 50,000 points from the posterior by sampling 10,000 points 5 times and concatenating. It does not run out of CUDA memory. I found this odd at first, since I'd expect this is exactly what batching should do, but I now see why it is slightly different (see below).

I debugged the program, profiling it with a real-time GPU memory reporter; here's what I found:

In the ResetSample transform in the inference_transforms.py the waveform gets cloned with sample["waveform"] = sample["waveform_"].clone() to make sure the WF doesn't get edited through the transforms. I think later down the line this may get deleted but freeing up the memory may be useful for the rest of the transform pipeline.

In the gnpe_pre_transforms eventually FrequencyDomain.time_translate_data gets called. Here large tensors like: cos_txf = torch.empty((batch_size, Nd, Nf), device=data.device) are created. I found that this batch_size is not the 10_000 passed to the inference api function but rather the 50_000 of the number of samples requested. This is effectively what is causing the memory shutdown. It also explains why my memory wrapper works when using a size of 10_000 but a batch_size of 10_000 does not. I guess this makes sense since the batching only applies to sampling from the main network posterior and not the time network posterior. So it is passing 50_000 samples as context to the main network and causing a memory shutdown. Could also use pytorch dataparallel but I think that is not a great solution here.

I think the reason I am the first to run into this is my tensor sizes are torch.Size([50000, 2, 3, 16225]) the 16225 is double 8193 since I am looking at frequencies from 20 to 2048 instead of 20 to 1024.

TLDR: I don't think it's actually a memory leak per se, but sampling a large number of points does run into memory issues. Batching the FrequencyDomain.time_translate_data function may work, but it may be tricky, so I figure just using the workaround should be OK for now.
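
As a possible direction, a rough sketch of chunking the time translation so that the (batch_size, Nd, Nf) tensors are never materialized for all 50,000 samples at once (the time_translate_data signature is assumed from the description above and may differ from the actual code):

import torch

def time_translate_in_chunks(domain, data, dt, chunk_size=10_000):
    """Apply domain.time_translate_data to data in chunks along the batch axis."""
    out = []
    for start in range(0, data.shape[0], chunk_size):
        chunk = data[start:start + chunk_size]
        out.append(domain.time_translate_data(chunk, dt[start:start + chunk_size]))
    return torch.cat(out, dim=0)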

Memory leaks during training

Using nvidia-smi -l 1 to monitor GPU during training, I notice that the GPU RAM usage slowly increases during each epoch (at least during the first training run). This adds up to ~ 100 MB / epoch.

Also, when loading for the first time from a checkpoint, the GPU RAM usage is higher (by ~ 500 MB) than before.

Prior split into intrinsic/extrinsic

It might make more sense to change the GWPriorDict to have separate
instances for intrinsic/extrinsic priors:

  • This simplifies the code because this class would not need to distinguish
    intrinsic and extrinsic parameters (we would not need methods such as
    sample_extrinsic and sample_intrinsic, just sample).
  • It allows the user to choose the partitioning of parameters into
    intrinsic/extrinsic sets.
  • Another reason for this change is that at data generation time, only intrinsic
    parameters need to be known. At train time, the user may want to experiment
    with different extrinsic priors. Intrinsic priors could then be specified in
    a data_config.yaml file, extrinsic in training_config.yaml.
  • Reference luminosity distance could be implemented with a delta distribution
    in the intrinsic prior. This would mean that the luminosity distance appears
    in both the intrinsic and extrinsic prior.
  • Reference frequency should probably be associated with the waveform
    generation, not necessarily the prior. It would then be in data_config.yaml.
  • The reference geocenter time is used to specify the overall detector
    position, right? This could potentially be part of the extrinsic prior,
    although perhaps it should be instead associated with the detectors, and
    specified in the training_config.yaml file.

I'm not sure if this would require separate intrinsic/extrinsic prior
classes, or if they could just be separate instances of a more general prior
class.
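
For illustration, a sketch of what the split could look like using two separate bilby PriorDict instances (the parameter values here are placeholders):

from bilby.core.prior import PriorDict, Uniform, DeltaFunction

# Intrinsic prior, known at data generation time (data_config.yaml).
intrinsic_prior = PriorDict({
    "chirp_mass": Uniform(minimum=10.0, maximum=100.0),
    "mass_ratio": Uniform(minimum=0.125, maximum=1.0),
    # Reference luminosity distance as a delta distribution:
    "luminosity_distance": DeltaFunction(100.0),
})

# Extrinsic prior, which the user may want to vary at train time (training_config.yaml).
extrinsic_prior = PriorDict({
    "luminosity_distance": Uniform(minimum=100.0, maximum=2000.0),
    "geocent_time": Uniform(minimum=-0.1, maximum=0.1),
})

intrinsic_samples = intrinsic_prior.sample(1000)
extrinsic_samples = extrinsic_prior.sample(1000)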

Minor issues:

  • Are default values needed for functions? This could cause errors if they are
    relied on at some point.
  • dict format, with class_name entry is a bit opaque, since the user
    shouldn't have to know about classes. Is this standard for Bilby?

Condor output files being overwritten

When generating a waveform dataset with condor (see dingo_generate_dataset_dag), stdout is redirected to files in condor/output/. For each job, a separate output file is created. However, each job may contain several processes (e.g., when splitting waveform generation across nodes). The output of these separate processes is all directed to the same file, which causes output from earlier processes to be overwritten by later ones.

To fix this, we should specify the output file names to include the process numbers. I am not sure how to do this, however, since the pycondor Job constructor only takes the output directory name, not the file name.

Clean up output of train scripts

The print output of the train script is currently extremely long. When first building the model, it prints >300 lines (see comment below).

  • Some of this is from testing the reduced basis and should be fixed with issue #45.
  • There are hundreds of lines of model settings. I feel that they should not all be printed. Instead we could maybe print the most important settings only (just a couple of lines), and then a statement along the lines of for more information run dingo_info model.pt. This would require the implementation of dingo_info, which for a model should be pretty straightforward. (In general, this function should also work for datasets later).
  • Currently I believe the loss is printed in each iteration. We should set a more sensible default, such as 10 prints per epoch.

We could also think about adding a --verbose flag. If this is set, the train script should automatically print the dingo_info stuff after initialisation of the model, and the loss would be printed in each iteration (the latter would e.g. help with the analysis of loading times of the data loader).

Memory usage during dataset generation

When generating a waveform dataset, the code seems to use too much memory.

The single-node code first produces a dataset for training the SVD. These waveforms are uncompressed, and therefore take up a lot of space. For a 50,000-element training set, this should be roughly

50,000 elements * 8,000 frequency bins * 8 bytes * 2 polarizations = 6.4 GB

This is stored in the polarizations dictionary and again in the train_data array.

When generating the SVD the memory usage jumps up to ~ 40 GB. I then run

del parameters, polarizations, train_data

to try to free up the memory from the training data, since this is no longer needed.

Next, the main dataset is generated over multiple processes. Each process uses ~ 13 GB of memory, though, and I cannot see why this should be the case if the large waveforms are no longer being stored.

Model parameter names bug?

Names of model parameters on my laptop and remote machines do not agree, which interferes with freezing of embedding net layers. Is this because of different pytorch versions?

Matching results to models

For reproducibility of results it would be good to store a unique identifier of the neural network models used to produce them. Many groups may be training models, and we need a way to match samples to trained models.

One idea would be to calculate a hash of a model (or its weights?) which could be saved in the metadata of a samples file (and the model file?). It would be good to have a way to go from this identifier to the model itself, though, so maybe we need to have a table of models and hashes. But then who maintains this table? How long does it take to calculate such a hash?

I am open to suggestions on the best way to solve these problems. Any thoughts?
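
A quick sketch of how such an identifier could be computed directly from the trained weights; hashing a few hundred MB of weights with SHA-256 should take on the order of a second:

import hashlib
import torch

def model_hash(model):
    """Return a hex digest identifying the model's weights."""
    hasher = hashlib.sha256()
    state_dict = model.state_dict()
    # Iterate in a fixed (sorted) key order so the hash is reproducible.
    for key in sorted(state_dict.keys()):
        tensor = state_dict[key]
        hasher.update(key.encode())
        hasher.update(tensor.detach().cpu().numpy().tobytes())
    return hasher.hexdigest()

The digest could be written into the metadata of a samples file; the open question of who maintains the table mapping hashes back to models remains.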

Raise high-level warning when using GNPE model for GNPE initialization

For GNPE, we need an NPE network to initialise the GNPE proxies. When a GNPE network is accidentally set instead, no warning is raised; inference simply fails at runtime with the low-level error

File "/dingo/core/nn/enets.py", line 262, in forward
    raise ValueError('Invalid number of input tensors provided.')
ValueError: Invalid number of input tensors provided.

since the network expects (but does not get) GNPE proxies.

There should be a check that raises a higher level warning when this error occurs, since this would be quite a common mistake, and the above error is hard to interpret.
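
A sketch of the kind of wrapper that could catch this and re-raise something interpretable (where exactly the check should live in the inference code is an open question):

def sample_with_init_check(model, *args, **kwargs):
    """Re-raise the low-level enets error with an interpretable message."""
    try:
        return model.sample(*args, **kwargs)
    except ValueError as e:
        if "Invalid number of input tensors" in str(e):
            raise ValueError(
                "The network received an unexpected number of inputs. This usually "
                "means a GNPE model was passed where an NPE model is required for "
                "initializing the GNPE proxies (or vice versa)."
            ) from e
        raise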

Dingo use cases

Let's assemble all of the planned use cases of the dingo code below,
separated into training and inference. This should be useful input for
designing the classes and their interaction.

The current sketch is quite high level and more details are needed
for this to properly describe what the code should do.

Train NDE

  1. Read parameters - from some agreed upon configuration file
  • waveform model
  • prior
  • PSDs
  • parameter list (intrinsic / extrinsic)
  • network hyperparameters and data set specific settings
  2. Build waveform data set (5 million waveform samples from intrinsic prior)
  3. Add noise realizations and extrinsic parameters to waveform
    (now using stationary Gaussian noise; support real noise in future)
  • embedding network: strain, PSD -> SVD coefficients
  • support nice way of treating intrinsic / extrinsic parameters -- needs to know some transformations
  4. Train network given hyperparameters
  5. Save network and parameters (metadata, e.g. how the PSD was determined)

Perform inference with NDE on single GW event

  1. Collect information about the event and analysis settings in an ini file (similar to bilby_pipe)
  • includes: IFOs, channels, PSDs, waveform model
  • Get stretch of strain data - handle in a way that is consistent with how the PSD was computed
    (get data from GraceDB, or use datafind, or use frame files)
  2. Select appropriate trained network
  3. Set up condor job / DAG for inference, or run directly
  4. Run inference job
    4.1. Draw posterior samples with iterative scheme
    4.2. Save posterior in standard format (JSON, HDF5)
    4.3. Produce output page with pesummary

Extensions

Support non-uniform frequency grids

  • Useful for long segment lengths

Add flexibility about which parameters to sample in

Support prior specifications similar to bilby?

  • importance sampling for reweighting
  • it is convenient to sample in uniform distance

Output additional quantities

  • SNR
  • evidence
  • logL - calculated from samples

Support subdomains in parameter space

  • This is already done in distance.

Marginalization over detector calibration

(a) Could compute logL for each sample and then run the reweighting code
(make sure that you have enough samples in the tail)
(b) Put realizations of det cal uncertainty into waveform data set and rebuild network

Non-Gaussian noise

Extending to arbitrary noise is straightforward since no Gaussianity assumptions are made

SEOBNRv4PHM

  • Could train on SEOBNRv4PHM and on SEOBNRv4PHM_surrogate separately
  • Can then compare how close the posteriors are (mass-ratio not too large because of different frame)

Flexibility to support ground based detectors and LISA

Flexibility to change the flows

  • Maybe there is something better than a spline flow

Have sane default parameters

Look into PSD reweighting

  • Could reweighting be used for changing from an initial PSD estimate to a more accurate one?

Improve load balancing between GPU and CPU

  • This is not a continuously adjustable thing: decide which pieces should go on the GPU

Support multiple GPUs

In the future will maybe want to make use of multiple GPUs

  • currently, single DGX, training time = 1 month
  • LIGO has multiple DGXs for testing and we'll get them as well (fall?)
  • not very efficient on multiple GPUs; change in pytorch code should be easy if needed

Test loss not decreasing for some datasets

When I train a network for a chirp mass prior starting from 10 M_\odot, the test loss does not decrease properly. It stays ~ 24, while the train loss decreases. This bad loss persists even when I take train data and evaluate the network in eval mode, so I suspect it has something to do with the BatchNorm1D layers. Probably the low-Mc waveforms have large fluctuations and therefore the batch-norm running averages fail. However, it's hard to see why this would be the case.

settings.yaml:
# settings for domain of waveforms
domain:
  type: FrequencyDomain
  f_min: 20.0
  f_max: 1024.0
  delta_f: 0.125  # Expressions like 1.0/8.0 would require eval and are not supported
  window_factor: 1.0 # This should maybe be in noise settings? It is not used for the generation of waveform polarizations

# settings for waveform generator
waveform_generator:
  approximant: IMRPhenomXPHM  # SEOBNRv4PHM
  f_ref: 20.0  # Hz

# settings for intrinsic prior over parameters
intrinsic_prior:
  # prior for non-fixed parameters
  mass_1: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0)
  mass_2: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0)
  mass_ratio: bilby.core.prior.Uniform(minimum=0.125, maximum=1.0)
  chirp_mass: bilby.core.prior.Uniform(minimum=10.0, maximum=100.0)
  phase: default
  a_1: bilby.core.prior.Uniform(minimum=0.0, maximum=0.88)
  a_2: bilby.core.prior.Uniform(minimum=0.0, maximum=0.88)
  tilt_1: default
  tilt_2: default
  phi_12: default
  phi_jl: default
  theta_jn: default
  # reference values for fixed (extrinsic) parameters
  luminosity_distance: 100.0 # Mpc
  geocent_time: 0.0 # s

# Number of samples in the dataset
num_samples: 5000000

# Save a compressed representation of the dataset
compression:
  svd:
    num_training_samples: 50000
    # Truncate the SVD basis at this size. No truncation if zero.
    size: 200

train_settings.yaml:
# Settings for data generation
data:
  waveform_dataset_path: ./waveform_dataset.hdf5  # Contains intrinsic waveforms
  train_fraction: 0.95
  # data conditioning for inference
  window:
    type: tukey
    f_s: 4096
    T: 8.0
    roll_off: 0.4
  detectors:
    - H1
    - L1
  extrinsic_prior:
    dec: default
    ra: default
    geocent_time: bilby.core.prior.Uniform(minimum=-0.10, maximum=0.10)
    psi: default
    luminosity_distance: bilby.core.prior.Uniform(minimum=100.0, maximum=2000.0)
  ref_time: 1126259462.391
  gnpe_time_shifts:
    kernel_kwargs: {type: uniform, low: -0.001, high: 0.001}
    exact_equiv: True
  selected_parameters: default # [chirp_mass, mass_ratio,  luminosity_distance, dec]

# Model architecture
model:
  type: nsf+embedding
  # kwargs for neural spline flow
  nsf_kwargs:
    num_flow_steps: 30
    base_transform_kwargs:
      hidden_dim: 512
      num_transform_blocks: 5
      activation: elu
      dropout_probability: 0.0
      batch_norm: True
      num_bins: 8
      base_transform_type: rq-coupling
  # kwargs for embedding net
  embedding_net_kwargs:
    output_dim: 128
    hidden_dims: [1024, 1024, 1024, 1024, 1024, 1024,
                  512, 512, 512, 512, 512, 512,
                  256, 256, 256, 256, 256, 256,
                  128, 128, 128, 128, 128, 128]
    activation: elu
    dropout: 0.0
    batch_norm: True
    svd:
      num_training_samples: 50000
      num_validation_samples: 5000
      size: 200

# Training is divided in stages. They each require all settings as indicated below.
training:
  stage_0:
    epochs: 300
    asd_dataset_path: ../asds_O3_fiducial.hdf5 #/home/jonas/Desktop/dingo-devel/tutorials/02_gwpe/datasets/ASDs/asds_O2.hdf5
    freeze_rb_layer: True
    optimizer:
      type: adam
      lr: 0.0002
    scheduler:
      type: cosine
      T_max: 300
    batch_size: 8192

  stage_1:
    epochs: 150
    asd_dataset_path: ../asds_O3.hdf5 #/home/jonas/Desktop/dingo-devel/tutorials/02_gwpe/datasets/ASDs/asds_O2.hdf5
    freeze_rb_layer: False
    optimizer:
      type: adam
      lr: 0.00002
    scheduler:
      type: cosine
      T_max: 150
    batch_size: 8192

# Local settings for training that have no impact on the final trained network.
local:
  device: cuda
  num_workers: 32 # num_workers >0 does not work on Mac, see https://stackoverflow.com/questions/64772335/pytorch-w-parallelnative-cpp206
  runtime_limits:
    max_time_per_run: 3600000
    max_epochs_per_run: 500
  checkpoint_epochs: 10

ASD standardization

During training, we currently standardize the ASD by scaling it by 1e23 before passing it as input to the embedding network. However, even with this, it still varies by several orders of magnitude from frequency bin to frequency bin. It would be good to have a better way of standardizing the ASD, so that each element is ~ unit normal.

Perhaps a simple approach would be to scale by a smooth fiducial ASD. Using a smooth fiducial ASD is important because we want to avoid any dividing by zero at spectral lines.

I realize this is different from what we did in the PRL, and we still want to be able to compare.
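
A minimal sketch of the proposed standardization (the fiducial ASD is just a smooth reference curve; how it is chosen and stored is open):

def standardize_asd(asd, fiducial_asd):
    """Scale an ASD by a smooth fiducial ASD so that each bin is of order unity.

    The fiducial curve must be smooth (no spectral lines) so that we never
    divide by values close to zero at line frequencies. The inverse operation
    is simply multiplication by the fiducial ASD.
    """
    return asd / fiducial_asd

To keep the comparison with the PRL setup possible, the old 1e23 scaling could remain available behind a settings flag.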

Adding Transforms for Injections

I have been using the waveform transform pipeline for creating injection datasets. It could be useful to have a few more transforms that allow us to easily manipulate injection datasets. This could be used for reproducibility or for searching intrinsic parameter space. For example, I have been using:

An edited AddWhiteNoiseComplex transform which allows you to pass a custom noise array when appending the transform to the waveform dataset transform list.

A new CustomExtrinsicParameters transform which allows you to pass a specified list of extrinsic parameters which will be used for the entire dataset.

A new UnWhitenAndScaleStrain transform which allows you to whiten/scale, then sample noise, then unwhiten and unscale, thereby generating a waveform on which we can use the inference pipeline.

I have currently implemented this locally, but if it could be useful for the code I can submit a PR. I wanted to open an issue first to see what everyone's thoughts were. The only transform I haven't updated is a custom H1_time / L1_time transform which would use the same H1_time and L1_time every time.

Give gw related scripts more specific names

The console scripts in setup.py (see below) related to GW's should not have a generic name, instead the name should explicitly include 'gw'. E.g., instead of dingo_generate_dataset it could be dingo_gw_generate_dataset.

      entry_points={'console_scripts':
          ['dingo_generate_dataset=dingo.gw.dataset.generate_dataset:main',
           'dingo_generate_dataset_dag=dingo.gw.dataset.generate_dataset_dag:main',
           'dingo_merge_datasets=dingo.gw.dataset.utils:merge_datasets_cli',
           'dingo_build_svd=dingo.gw.dataset.utils:build_svd_cli',
           'dingo_generate_ASD_dataset=dingo.gw.ASD_dataset.generate_dataset:main',
           'dingo_train=dingo.gw.training:train_local',
           'dingo_train_condor=dingo.gw.training.train_pipeline_condor:train_condor',
           'dingo_append_training_stage=dingo.gw.training:append_stage']
      },

Saving of metadata for `DingoDataset`

When settings / metadata are saved to HDF5 in a DingoDataset, the dictionary is converted to a string. When loading, the string is converted back to a dictionary using ast.literal_eval(). This method was used to ensure that we could handle any data types that might appear in the settings dictionary.

It would be good to instead treat settings / metadata as any other DingoDataset saved attribute (specified by data_keys). In other words, the dictionary would be saved in HDF5 as a set of nested groups and datasets / attributes. It will be important to thoroughly check this to be sure all the values can be properly saved and loaded, and a conversion script should be written to convert the old file format to the new one.
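
A sketch of the recursive conversion this would involve (special cases such as None values or mixed-type lists are exactly what the thorough checking and the conversion script would need to handle):

import h5py
import numpy as np

def save_dict_to_hdf5(group, dictionary):
    """Recursively save a settings dictionary as nested HDF5 groups and attributes."""
    for key, value in dictionary.items():
        if isinstance(value, dict):
            save_dict_to_hdf5(group.create_group(key), value)
        elif isinstance(value, (np.ndarray, list)):
            group.create_dataset(key, data=np.asarray(value))
        else:
            group.attrs[key] = value

def load_dict_from_hdf5(group):
    """Inverse of save_dict_to_hdf5."""
    result = dict(group.attrs)
    for key, item in group.items():
        if isinstance(item, h5py.Group):
            result[key] = load_dict_from_hdf5(item)
        else:
            result[key] = item[()]
    return result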

Review code for LinearProjectionRB

I'm looking at the commit with git hash f1445ef

This looks pretty nice, I have some comments and suggestions below.

LinearProjectionRB

  • remove unused imports:
   import numpy as np
   from torch.nn import functional as F
   from nflows.nn.nets.resnet import ResidualBlock
  • main docstring:

    • "Each block has num_channels >= 2 channels. Channel 0 and 1 represent the real and imaginary part, respectively."
      • What happens if num_channels > 2? What is the interpretation? The code in init_layers assumes num_channels = 2.
    • Give an interpretation for "num_bins": i.e. time or frequency bins of a vector.
    • "the output dimension is 2 * n_rb * num_blocks"
      • Put that after the Parameter description. Again there is an assumption that
        num_channels = 2 in "2 * n_rb * num_blocks", i.e. complex input vectors [Re, Im]^T in the
        flattened channels and bins dimensions.
    • "V_rb : list of np.arrays list with v matrices of the reduced basis SVD projection"
      • Say what a "v" matrix is. It assumes an SVD decomposition of a matrix into U @ S @ V^h
    • Say what the dimensionality of the compressed output is.
  • init():

    • Maybe put line "self.num_blocks, self.num_channels, self.num_bins = self.input_dims" before
      call to "self.test_dimensions()"? Then could refer to the self.num_blocks etc in the asserts
      to make it easier to understand.
    • "self.n_rb * 2": again assuming num_channels = 2
  • output_dim(): again assuming num_channels = 2

  • test_dimensions():

    • Consider using self.num_blocks, etc as suggested above for clarity.
    • I would argue that raising exceptions would be better than using asserts
      for checking user input. E.g.
     if len(input_dims) != 3:
         raise ValueError("Exactly 3 axes required: blocks, channels, bins")
      See e.g. https://stefan.sofa-rockers.org/2018/04/16/assertions-and-exceptions/ for a
      discussion of when to use assertions vs exceptions.
  • init_layers():
    • "# loop through layers and initialize them individually" -> turn this comment into a doc string!
    • Say in docstring what shape and its elements V_rb have
    • Perhaps use capital letters for matrices, i.e. "v" -> "V"?
    • Perhaps use shorthands for improved readability:
       k = self.n_rb
       n = self.num_bins
       w[:k, :n] = ...
    • Use `.transpose` instead of `permute(1, 0)` since there are only two indices involved
  • forward():
    • Perhaps use a list comprehension.
    • x = torch.cat(out, dim=1): I wouldn't call the output x when the input is called x.
      A new name would just be a reference, right?

testutils

  • generate_1d_datasets_and_reduced_basis()

    • The code in plot would fail if plot=True, since n_rb is not defined.
  • project_onto_reduced_basis_and_concat():

    • break long lines:
        projections = [[(y[:] @ V_rb[:, :n_rb]).real, (y[:] @ V_rb[:, :n_rb]).imag]
            for y, V_rb in zip(y_list, V_rb_list)]
    • list comprehensions with multiple iterations are hard to read, because the order of the
      loops is kind of backward.
       projections = [item for sublist in projections for item in sublist]
  Breaking it over a few lines makes it a bit more readable
       projections = [item 
           for sublist in projections  # Loop 1
           for item in sublist         # Loop 2
       ]
  Or consider using itertools:
       from itertools import chain
       list(chain.from_iterable(projections))
  • Is this going to be used to test other code? testutils is a pretty generic name,
    so perhaps rename it to make this more precise.

test_enets

  • Is this the only test possible?
    • Make sure to check that the code breaks when you pass in nonsensical input.
    • What happens when num_blocks > 2?
    • If you have multiple tests with a common setup / teardown you could arrange them in a class.

Defining the standard deviation of standard normal noise

I am still a bit confused by the definition of noise_std quantities in the code.

I understand that we want to generate noise that is standard normal in the time domain.
If we work in the Fourier domain then noise will still be normal with zero mean, but with a different standard deviation which has to take into account conventions for the inverse DFT.

Here is a piece of code to look at this.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def generate_TD_noise_in_the_FD(fs=4096.0, duration=16.0,
                                noise_std_option='A'):
    """
    Generate time-domain noise in the Fourier domain.

    fs:
        sampling rate [Hz]
    duration:
        waveform duration in the time domain [s]
    noise_std_option:
        option for defining the stdev of the noise for the
        real and imaginary components in the Fourier domain
    """
    delta_f = 1.0/duration
    f_max = fs / 2.0
    n_samples = int(f_max / delta_f) + 1
    sample_frequencies = np.linspace(0.0, f_max, num=n_samples,
                                     endpoint=True, dtype=np.float32)
    
    if noise_std_option == 'A':
        noise_std = np.sqrt(n_samples)
    elif noise_std_option == 'B':
        noise_std = 1.0 / np.sqrt(4.0 * delta_f)
    else:
        raise ValueError('Undefined noise_std option.')

    n_white = np.random.normal(loc=0, scale=noise_std, size=n_samples) + \
         1j * np.random.normal(loc=0, scale=noise_std, size=n_samples)
    
    frequency_domain_strain = n_white
    time_domain_strain = np.fft.irfft(frequency_domain_strain) # real TD noise
    # `np.fft`: The default normalization has the direct transforms unscaled 
    # and the inverse transforms are scaled by :math:`1/n`.
    
    plt.hist(time_domain_strain, 50, density=True, label=r'FD: $\sigma = %.4f$' % 
             np.std(time_domain_strain))
    xi = np.linspace(-5, 5, 100)
    pdf = stats.norm.pdf(xi)
    plt.plot(xi, pdf, label='TD PDF')
    plt.legend()
    
    return np.std(time_domain_strain)

The function has two options for the standard deviation of the FD noise.

  • option A: sqrt(n_samples) -- this is the correct number for the above code (see the output below) and can also be derived analytically.
  • option B: 1.0 / np.sqrt(4.0 * delta_f) -- this seems to be missing a factor sqrt(2 * fs) compared to option A

The output of the function when called with these options is shown below.

option A

option B

Question: If option B is correct, what am I missing?

ASD dataset config file

I am working on documentation, and had a couple comments / questions regarding the ASD dataset generation, particularly the config file:

  • It would be good to remove the outermost dataset_settings level in the config file. Since there is only one key at the outermost level, it does not need to be there.
  • The f_s option is a bit unclear. Does this specify the data channel (4k vs 16k) to use? Or is it implicitly specifying the Nyquist frequency at which to truncate the ASDs? Looking at the code, I'm not sure it would be consistent if we chose anything other than 4k here.

Suggestions: linter, code analyzers, CI, sphinx

General comments:

README

  • Should python setup.py bdist_wheel be added to the setup instructions in the README?
  • Also need source venv/bin/activate

Code quality checking

  • Add a linter, formatter:
    - pylint is the most commonly used tool for linting in python.
    - https://pylint.org/
    - pip install pylint -- should add it in the setup.py file
    - e.g. pylint dingo/core/nn/enets.py
    - can be a bit pedantic, e.g. snake-case variables
    - could be configured to shut up when it is too distracting
    - PyCharm also has a builtin linter which flags things, e. g. unused imports

  • Find unused code by using a code analyzer:

    • https://coverage.readthedocs.io/en/coverage-5.5/
    • Coverage.py is a tool for measuring code coverage of Python programs. It monitors your program, noting which parts of the code have been executed, then analyzes the source to identify code that could have been executed but was not.
    • Coverage measurement is typically used to gauge the effectiveness of tests. It can show which parts of your code are being exercised by tests, and which are not.
     # 1. Run tests and coverage
      coverage run --source dingo/  -m pytest tests
      # 2. Generate report
      coverage report -m  # ASCII
      coverage html && open htmlcov/index.html  # HTML

Continuous integration

  • Will want to have the linter, pytest and coverage run automatically for each pull_request. To do that use continuous integration, e.g. circleci on github.

Documentation

Pythonic interface to dataset generation

It would be good to have a Python interface to dataset generation, so that one can interactively call a function and be returned the matrix V for the reduced basis and the arrays of compressed waveforms. This would require disentangling the disk IO from the actual dataset generation.
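
Sketch of the kind of interface this could provide once the disk IO is disentangled (the callables, argument names, and return values are placeholders; the SVD call follows the sklearn routine the current script already uses):

from sklearn.utils.extmath import randomized_svd

def generate_compressed_waveforms(sample_parameters, generate_polarizations,
                                  num_samples, num_svd_samples, svd_size):
    """Generate a compressed waveform dataset fully in memory (sketch).

    sample_parameters(n) and generate_polarizations(parameters) stand in for
    the prior sampling and waveform generation steps; the latter returns a
    complex array of shape (n, num_bins). Returns the parameters, the
    compressed waveforms, and the basis matrix V.
    """
    # Build the reduced basis from a smaller training set.
    train_params = sample_parameters(num_svd_samples)
    train_waveforms = generate_polarizations(train_params)
    _, _, Vh = randomized_svd(train_waveforms, n_components=svd_size)
    V = Vh.conj().T

    # Generate the main dataset and project it onto the basis.
    parameters = sample_parameters(num_samples)
    waveforms = generate_polarizations(parameters)
    compressed = waveforms @ V
    return parameters, compressed, V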

Use `lalsimulation` frame conversion for IMRPhenomXPHM

IMRPhenomXPHM modes are given in the J frame and must be converted to the L0 frame in order to sample the synthetic phase. This requires knowing the Euler angles that specify the transformation. At present, Dingo implements the calculation of these angles, but due to minor discrepancies in PN orders, this leads to mismatches up to 1e-3 (but typically much smaller). It would therefore be better to use the implementation in lalsimulation in SimIMRPhenomXPMSAAngles.

Zero in last index of generated waveforms

LS.SimInspiralFD() seems to output 0 for h(f=f_max). The effect on parameter estimation should be tiny, but it would be good to figure out (a) why this is the case, and (b) whether other parameter estimation codes account for it.

Sample arguments for testing:
(5.598355378838895e+31,
 2.9423587234023285e+31,
 0.2641853402567754,
 0.18856617361887903,
 -0.06428798209990938,
 0.34909791843015114,
 0.11681092714767329,
 0.35296842762971853,
 3.085677581491367e+24,
 0.8778096525449693,
 2.881322091505041,
 0.0,
 0.0,
 0.0,
 0.125,
 20.0,
 512.0,
 20.0,
 None,
 79)

Fixing random seeds

It would be good to provide functionality to fix the random seeds used for generating training datasets, training networks, and sampling. This will help to ensure reproducibility of results.

One way might be to specify the seed in the various config files (or on the command-line for inference).
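
A minimal sketch of such a helper (the config key or command-line flag used to pass the seed would still need to be agreed on):

import random
import numpy as np
import torch

def fix_random_seeds(seed):
    """Seed all random number generators used for dataset generation, training, and sampling."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)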

Threading during training

When training with num_workers>0, in addition to the worker processes (each at 100% CPU, 1 thread) the main process has a huge number of threads, and uses roughly 900% CPU. I wonder if this multithreading is contributing to the irregular iteration times.

Fix test output of reduced basis

When generating the SVD for (a) compression of the dataset and (b) initialisation of the first network layer, we need to make sure that the basis is sufficiently accurate. The method SVDBasis.test_basis() is intended for this purpose; however, at the moment it is pretty messy, with print statements and saved data. We need to clean this up.

The test_basis method should maybe have two modes:

  • The minimal test, testing only with the maximal number of basis elements. It could either print the mismatches, or just check whether they are below a particular threshold. This should be run whenever generating and using a reduced basis, i.e. for (a) and for (b).
  • A detailed test that allows for better analysis. E.g., when generating a dataset with a new waveform model, or a new prior, one does not know how many basis elements are necessary. The detailed test should scan the mismatches with a variety of n_svd (default could e.g. be n_svd in [2**i for i in range(5, i_max)]). In addition, there should be an option to save the array of mismatches along with the parameters. This would help to analyse in which part of the parameter space the reduced basis fails. It would also be necessary to determine what prior on particular parameters (e.g., chirp mass) is feasible for a specific n_svd.

Profiling

Profiling the duration of one __getitem__ call of the WaveformDataset class.
Times in [s] averaged over multiple runs

Total time: ~ 3 x 10^(-3)

  • Matrix multiplication time: ~ 7 x 10^(-4)
  • Time for all transformations: ~ 2 x 10^(-3)

Individual time of each transformation:

  • SampleExtrinsicParameters: ~1.1 x 10^(-4)
  • GetDetectorTimes: ~1.1 x 10^(-4)
  • GNPEDetectorTimes ~ 2 x10^(-5)
  • ProjectOntoDetectors: ~ 7.5 x 10^(-4)
  • SampleNoiseASD: ~ 5 x10^(-5)
  • WhitenAndScaleStrain: ~ 2 x10^(-5)
  • AddWhiteNoiseComplex: ~ 7.1 x10^(-4)
  • SelectStandardizeRepackageParameters: ~ 4 x10^(-5)
  • RepackageStrainsAndASDS: ~ 4 x10^(-5)
  • UnpackDict: ~ 1 x10^(-6)

Expensive calls in ProjectOntoDetectors:

  • ifo.antenna_response (called 2 x num_detectors times with both polarizations): 1.3 x 10^(-4) for both polarizations, but the runtime is different depending on the polarization:
    Plus polarization Antenna response time for H1: 0.00010442733764648438
    Cross polarization Antenna response time for H1: 3.361701965332031e-05
    Plus polarization Antenna response time for L1: 8.630752563476562e-05
    Cross polarization Antenna response time for L1: 3.24249267578125e-05
  • time_translate_data: ~ 1.5 x 10^(-4) (called num_detector times)

Expensive calls in AddWhiteNoiseComplex:

  • np.random.normal: 1.4 x 10^(-4) (called 2 x num_detectors times with size 8033)

Rename PosteriorModel class to something more generic

We currently use the PosteriorModel class in dingo.core.models.posterior_model.py for more than just the posterior model. It is a generic density model, which can be used for conditional density estimation (which includes NPE), but also for unconditional density estimation, as done e.g. for importance sampling. It should thus have a more generic name. Maybe DensityModel?

Todos:

  • Rename PosteriorModel Class
  • Update Docstring
  • Functionality: I took care of most of the implementation already, to use the model for unconditional density estimation for importance sampling. The DataLoader should return the data in the form (parameters, context_1, context_2, ..., context_n). For NPE n=1, for GNPE n=2 and for unconditional models n=0 (as e.g. implemented by dingo.core.density.density_estimation.SampleDataset). Training and testing is already compatible with this, but not yet the PosteriorModel.sample() method.

In general, the PosteriorModel class should take care of saving and loading, of training and testing, and of sampling. This should be pretty generic, and it does not need to know whether it is a posterior model or an unconditional model. This is why the above changes make sense.

Detector Calibration Uncertainty

Including detector calibration uncertainty is a requirement for PE review. This could be implemented using a setup similar to RIFT, which does importance sampling on the calibration-weighted data. The code is found here (LIGO credentials required).

The idea right now is to add an optional transform to dingo.gw.inference.injection.GWSignal._initialize_transform so that the multiplicative calibration envelope can be applied:

https://git.ligo.org/ethan.payne/generic-calibration-reweighting

Streamlining WaveformDataset

WaveformDataset

I am still not entirely happy with the way the WaveformDataset works, basically because it is used for several unrelated tasks depending on context. This is reflected in the fact that the __init__ function has several optional arguments (a dataset file, a prior, a waveform generator, and a transform). So sometimes one uses a waveform generator to build a dataset, and other times loads a waveform dataset from file and applies transforms to the waveforms. The class methods also seem to fall into several groups depending on the use case of the class, and they do not really interact with each other across use case. This suggests that WaveformDataset should either (1) be split into two classes (one for generating a dataset, one for loading), or (2) slimmed down into one smaller class (loading) and a set of functions for generating the dataset.

Based on how the "tutorial" for generating waveforms is progressing, option (2) seems most natural. Many of the tutorial functions merely wrap WaveformDataset methods, which in turn are only called once by these functions. So it would make sense to fully move this method functionality into the tutorial functions. One would then be left with a much smaller WaveformDataset class, which would only be used for loading data and feeding it to the dataloader. I think this would really streamline these aspects of the code.

Example

The method generate_dataset is called twice in the code: in tutorials/02_gwpe/generate_waveforms.py: generate_waveforms and in gw.transforms.get_parameter_list. The latter is only used to obtain a list of parameters, not to actually produce a dataset of waveforms. So this functionality could easily be moved directly into generate_waveforms in the tutorial (which already works directly with the waveform generator and prior).

Minor comments:

  • We might want to provide a way for a user to interactively generate a waveform dataset, i.e., consolidate all of the various scripts in the tutorial via a single function to be run on one machine. (This would be a third way of running it.)

Memory usage of merge_datasets

The merge_datasets() function is used to combine datasets generated over separate condor jobs. It first loads the datasets, then combines them using np.vstack(). This approach requires double the memory of the total dataset, so it would be better to redesign this to be more efficient.
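
One option would be to preallocate the merged datasets on disk and copy the inputs chunk by chunk, roughly like this (assuming a flat layout of arrays sharing the first dimension; the real DingoDataset layout may differ):

import h5py

def merge_datasets_lowmem(input_files, output_file):
    """Merge HDF5 waveform datasets without loading everything into memory."""
    with h5py.File(output_file, "w") as out:
        # First pass: determine dataset shapes and total lengths from the inputs.
        with h5py.File(input_files[0], "r") as f0:
            keys = list(f0.keys())
            shapes = {k: f0[k].shape[1:] for k in keys}
            dtypes = {k: f0[k].dtype for k in keys}
        totals = {k: 0 for k in keys}
        for fname in input_files:
            with h5py.File(fname, "r") as f:
                for k in keys:
                    totals[k] += f[k].shape[0]
        for k in keys:
            out.create_dataset(k, shape=(totals[k],) + shapes[k], dtype=dtypes[k])

        # Second pass: copy each input into the preallocated output, one file at a time.
        offsets = {k: 0 for k in keys}
        for fname in input_files:
            with h5py.File(fname, "r") as f:
                for k in keys:
                    n = f[k].shape[0]
                    out[k][offsets[k]:offsets[k] + n] = f[k][...]
                    offsets[k] += n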

Comparison with research code

Investigate small differences in waveforms produced by research code and dingo code, particularly with regard to projections onto detectors.

Automate number of gnpe iterations/ monitor convergence

Dingo should have the option to set num_gnpe_iterations = None, in which case the number of iterations is determined by the convergence of the proxies. A useful criterion could be convergence of the mean and std of the GNPE proxies. Ideally, one would use the JSD, but this is likely too slow.

Alternatively, when setting num_gnpe_iterations to a fixed number, dingo should at least provide some convergence statistics.
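
A rough sketch of such a stopping criterion based on the proxy means and standard deviations (the threshold is an arbitrary placeholder):

import numpy as np

def proxies_converged(old_proxies, new_proxies, rtol=1e-3):
    """Check convergence of GNPE proxies via their mean and standard deviation.

    old_proxies / new_proxies are dicts of proxy samples from consecutive
    GNPE iterations.
    """
    for key in new_proxies:
        old = np.asarray(old_proxies[key])
        new = np.asarray(new_proxies[key])
        scale = np.std(new) + 1e-12
        if abs(np.mean(new) - np.mean(old)) > rtol * scale:
            return False
        if abs(np.std(new) - np.std(old)) > rtol * scale:
            return False
    return True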

standardisations for detector times

For GNPE, we need an initialisation model for the detector times. I.e., we need to train a model with

data:
  selected_parameters: [H1_time, L1_time]

This does not work at the moment, since the standardisations for these parameters are not computed. They need to be added.

Comparison of dingo detector projection to research code

The unit test test_detector_projection_agains_research_code in test_transforms.py compares the detector projection of dingo against that of the research code (non-whitened and non-noisy). We find a relative deviation of 1e-3 between these. I suspect that this is due to (i) using bilby instead of pycbc routines and (ii) using float64 instead of float32 for detector time computation and antenna projection. Please double check that this makes sense.

Add extra polarizations

This is not urgent, but we should allow for breathing, longitudinal, x, and y modes in addition to plus and cross. The Bilby Interferometers can provide antenna patterns for all of these, and they would be straightforward to incorporate.

Create cloud storage for data

We should store some crucial data (e.g., ASD datasets, trained models) in a cloud, and add the corresponding URLs + download functions to dingo. One would access the data via something like dingo.gw.download_data(type='ASDDataset', observing_run='O1'). This would also make it easier for us to share the data, since it is more convenient than manually using a dropbox.

For dingo-enets (https://pypi.org/project/dingo-enets/) I used google drive, since it is free (15GB) and there are libraries for downloading data from it. Stephen said he'd prefer something more professional, so I am open to suggestions for alternatives.

Downloading and generating PSD datasets

Currently, the PSD segment lists need to be downloaded manually, and a sequence of scripts needs to be called in order to produce the final dataset that can be used.

This should be simplified. Furthermore, this will make it easier to add functionality to generate custom PSD datasets as used in the PSD project.

Multiprocessing in waveform dataset generation

The script dingo_generate_dataset does three things:

  1. Generates a dataset of waveforms for building an SVD basis.
  2. Builds the SVD basis using sklearn.randomized_svd().
  3. Generates a dataset of waveforms compressed to the SVD basis.

Steps 1 and 3 are supposed to be parallelized using a multiprocessing Pool. However, this only seems to work in step 1. (Running top shows the specified number of processes, each around 100% CPU.) During step 3 (on Linux), only one process appears for a long time; then multiprocessing starts, but with >> 100% CPU per process.

I think this is a problem with step 2: The randomized_svd() process uses >> 100% CPU (which it really shouldn't since we set num_threads=1), and somehow multiprocessing doesn't work properly after that. This only seems to affect Linux, not Mac.
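
If the extra threads come from the BLAS/OpenMP pools used inside randomized_svd, one thing worth trying (just a guess at the cause) is limiting them explicitly with threadpoolctl around the SVD call; train_data and n_rb here stand for the names used elsewhere in these issues:

from sklearn.utils.extmath import randomized_svd
from threadpoolctl import threadpool_limits

# Limit BLAS/OpenMP thread pools for the duration of the SVD, in case these
# are what pushes the process above 100% CPU and then interferes with the
# multiprocessing Pool in step 3.
with threadpool_limits(limits=1):
    U, s, Vh = randomized_svd(train_data, n_components=n_rb)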

Progress bar for dataset generation

It would be nice to have some sort of progress bar for waveform dataset generation, since this can take some time.

I think there might be a version of tqdm that works with multiprocessing. This could work for single-machine dataset generation. I'm not sure of a best approach for the DAG version.
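
tqdm does work together with a multiprocessing Pool when jobs are submitted via imap; a rough sketch for the single-machine case (generate_one is a placeholder for the per-waveform generation function and must be defined at module level so it can be pickled):

from multiprocessing import Pool
from tqdm import tqdm

def generate_with_progress(generate_one, parameter_list, num_processes):
    """Generate waveforms in parallel while displaying a progress bar."""
    with Pool(num_processes) as pool:
        # imap yields results as they complete, so tqdm can update incrementally.
        results = list(tqdm(pool.imap(generate_one, parameter_list),
                            total=len(parameter_list)))
    return results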

Making waveform generation handle LAL failures in a robust way

Introduce a try / except statement when the LAL generator is called and in the except clause return NaNs. This implies that we'll have to deal with these failed cases at the stage where we compute the basis and / or where we collect all the data to make sure it is sane.
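
A sketch of the try/except wrapper (the generator callable, its return format, and num_bins are placeholders, not the actual Dingo interface):

import numpy as np

def generate_waveform_safe(generate_hplus_hcross, parameters, num_bins):
    """Call the LAL-based generator; return NaN polarizations if it fails.

    Failed waveforms can then be detected and discarded at the stage where
    the SVD basis is built or the final dataset is assembled.
    """
    try:
        return generate_hplus_hcross(parameters)
    except Exception as e:
        print(f"Waveform generation failed for parameters {parameters}: {e}")
        nan_array = np.full(num_bins, np.nan, dtype=complex)
        return {"h_plus": nan_array, "h_cross": nan_array}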
