
ml-mock-data-challenge-1's Introduction

MLGWSC-1 - Machine Learning Gravitational-Wave Search (Mock Data) Challenge

Introduction

Welcome to the first machine learning gravitational-wave search mock data challenge, hosted by the Albert-Einstein-Institut Hannover and the Friedrich-Schiller-Universität Jena. In this challenge, participants are tasked with finding gravitational-wave signals of varying complexity in a noisy background. Entries are evaluated on metrics that are used to assess the performance of real-world, state-of-the-art search algorithms.

The goal of this challenge is to create a collaborative publication that collects state-of-the-art machine learning based gravitational-wave search algorithms and enables a comparison to classical approaches such as matched filtering or coherent burst searches. Through this, we strive to highlight the advantages of different entries for specific tasks and to pinpoint areas where further research seems fruitful.

Because this is a collaborative work, all teams that submit an algorithm and choose not to retract it before final publication will gain co-authorship. We nonetheless encourage publications on the individual algorithms to describe details of pre-processing, post-processing, training, etc. We furthermore encourage the publication of the source code used for training and evaluation to foster reproducibility. However, open source code is not required for submission.

Although this challenge is focused on machine learning approaches, we also accept submissions that do not make use of this relatively new area of research.

If you want to participate in this mock data challenge, please get in contact with us by sending an email to [email protected]. We accept registrations up to a maximum of 30 participating groups until December 31st, 2021 (we have remaining capacity; please get in touch if you would still like to participate). The deadline for the final submission of the algorithm is March 31st, 2022.

On submission, we will evaluate your algorithm on a validation set. The performance on this validation set will then be reported back to you so you can check that the algorithm behaves as expected. Once we have confirmation from the group that the algorithm performs within the expected margins of error, we will evaluate the submission on a secret test set that is the same for all entries. The performance on this set will only be reported back to the groups on the first circulation of the publication draft. Submissions may be retracted at any point prior to final publication of the manuscript. For more information please refer to this page.

Contents of this Repository

This repository contains the source code to generate data of the kind that will be used for the final evaluation, as well as the source code that will carry out that evaluation. It also contains a few configuration files that are required for data generation.

Submissions must be able to process an HDF5 file that contains the raw strain data for 2 detectors. Any required pre-processing is expected to be performed by the submitted code. The output is expected to be another HDF5 file which contains times, ranking-statistic-like values, and timing accuracies for candidate events. The ranking-statistic-like values are numbers where a larger value is supposed to correspond to a larger probability that an astrophysical event is present. For details on the input and output formats please refer to the Wiki of this repository.
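As an illustration, writing such an output file with h5py might look like the following minimal sketch. The dataset names used here ('time', 'stat', 'var') are assumptions for illustration only; the authoritative format specification is in the Wiki.

import h5py
import numpy as np

# Hypothetical candidate events: GPS times, ranking-statistic-like values
# (larger means more likely astrophysical), and timing accuracies in seconds.
times = np.array([1238205100.2, 1238209876.5])
stats = np.array([12.3, 8.1])
accuracies = np.array([0.2, 0.3])

# Dataset names are assumed for illustration; see the Wiki for the real format.
with h5py.File('candidates.hdf', 'w') as f:
    f.create_dataset('time', data=times)
    f.create_dataset('stat', data=stats)
    f.create_dataset('var', data=accuracies)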

Requirements

To run the code you need to have a working installation of Python 3.7 or higher. You will then need to install dependencies using

pip install -r requirements.txt

This installs a version of PyCBC from GitHub that was tested and confirmed to be working. Older versions may be missing required functions.
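To confirm which PyCBC version was picked up, one can for example run

python -c "import pycbc; print(pycbc.__version__)"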

For more detailed installation instructions please refer to this page.

Citation

If you make use of the code in this repository, please cite it as follows:

@misc{https://doi.org/10.48550/arxiv.2209.11146,
    doi = {10.48550/ARXIV.2209.11146},
    url = {https://arxiv.org/abs/2209.11146},
    author = {Schäfer, Marlin B. and Zelenka, Ondřej and Nitz, Alexander H. and Wang, He and Wu, Shichao and Guo, Zong-Kuan and Cao, Zhoujian and Ren, Zhixiang and Nousi, Paraskevi and Stergioulas, Nikolaos and Iosif, Panagiotis and Koloniari, Alexandra E. and Tefas, Anastasios and Passalis, Nikolaos and Salemi, Francesco and Vedovato, Gabriele and Klimenko, Sergey and Mishra, Tanmaya and Brügmann, Bernd and Cuoco, Elena and Huerta, E. A. and Messenger, Chris and Ohme, Frank},
    keywords = {Instrumentation and Methods for Astrophysics (astro-ph.IM), High Energy Astrophysical Phenomena (astro-ph.HE), Machine Learning (cs.LG), General Relativity and Quantum Cosmology (gr-qc), FOS: Physical sciences, FOS: Physical sciences, FOS: Computer and information sciences, FOS: Computer and information sciences},
    title = {MLGWSC-1: The first Machine Learning Gravitational-Wave Search Mock Data Challenge},
    publisher = {arXiv},
    year = {2022},
    copyright = {arXiv.org perpetual, non-exclusive license}
}


ml-mock-data-challenge-1's People

Contributors

ahnitz, marlinschaefer, ondrzel, pascal-mueller


ml-mock-data-challenge-1's Issues

injections.hdf contains events that are very close to each other

I took the "tc" values from injections.hdf, sorted them and found the difference between adjacent times. The minimum time duration between two adjacent "tc" is ~0.00019 seconds, and only 145 out of 583,847 injections have a difference of fewer than 6 seconds. All others are 24 seconds or above. Is this intentional?

Code snippet used to check the time differences (np is numpy, and sorted_tc is the sorted array of "tc" values):
diff = np.sort(sorted_tc[1:] - sorted_tc[:-1])
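A fuller, self-contained version of that check might look like the following sketch; the file name and the 'tc' dataset follow the issue text, everything else is illustrative:

import h5py
import numpy as np

# Load the merger times 'tc' from the injection file and sort them.
with h5py.File('injections.hdf', 'r') as f:
    tc = f['tc'][()]
sorted_tc = np.sort(tc)

# Differences between adjacent merger times, sorted ascending.
diff = np.sort(sorted_tc[1:] - sorted_tc[:-1])
print(diff.min())           # ~0.00019 s in the run reported above
print(np.sum(diff < 6.))    # 145 of 583,847 in the run reported above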

Memory Leak in Approximant used in Example

Hello,

This issue is just a notification.

It seems that the approximant IMRPhenomD has a memory leak, i.e., it will use more memory the longer it runs. If anyone is a LIGO member, feel free to test it and open an issue on their repository; I can't.
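One way to observe such growth is to generate the same waveform repeatedly and watch the peak memory usage, as in this sketch (the waveform parameters are arbitrary, and the resource module is Unix-only):

import resource
from pycbc.waveform import get_td_waveform

for i in range(1000):
    # Repeatedly generate an IMRPhenomD waveform with fixed, arbitrary parameters.
    hp, hc = get_td_waveform(approximant='IMRPhenomD',
                             mass1=30., mass2=30.,
                             delta_t=1. / 2048, f_lower=20.)
    if i % 100 == 0:
        # Peak resident set size so far; steady growth across iterations hints at a leak.
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)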

Whitening fails for specific dataset durations

Running the PyTorch example script over a dataset generated by generate_data.py with a duration of 600 seconds yields the following traceback:

Traceback (most recent call last):
  File "/data/ondzel/miniconda3/envs/mlgwsc/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/ondzel/miniconda3/envs/mlgwsc/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/ondzel/repos/ml-mock-data-challenge-1/examples/example_torch.py", line 624, in <module>
    main()
  File "/data/ondzel/repos/ml-mock-data-challenge-1/examples/example_torch.py", line 609, in main
    verbose=args.verbose)
  File "/data/ondzel/repos/ml-mock-data-challenge-1/examples/example_torch.py", line 533, in get_triggers
    for slice_batch, slice_times in iterable:
  File "/data/ondzel/miniconda3/envs/mlgwsc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/data/ondzel/miniconda3/envs/mlgwsc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/data/ondzel/miniconda3/envs/mlgwsc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/ondzel/miniconda3/envs/mlgwsc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/ondzel/repos/ml-mock-data-challenge-1/examples/example_torch.py", line 323, in __getitem__
    next_slice, next_time = Slicer.__getitem__(self, index)
  File "/data/ondzel/repos/ml-mock-data-challenge-1/examples/example_torch.py", line 116, in __getitem__
    dat, t = self.generate_data(key, idxs)
  File "/data/ondzel/repos/ml-mock-data-challenge-1/examples/example_torch.py", line 100, in generate_data
    ts = ts.whiten(0.5, 0.25, low_frequency_cutoff=18.)
  File "/data/ondzel/miniconda3/envs/mlgwsc/src/pycbc/pycbc/types/timeseries.py", line 622, in whiten
    white = (self.to_frequencyseries() / psd**0.5).to_timeseries()
  File "/data/ondzel/miniconda3/envs/mlgwsc/lib/python3.7/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/data/ondzel/miniconda3/envs/mlgwsc/src/pycbc/pycbc/types/array.py", line 66, in _convert
    return fn(self, *args)
  File "/data/ondzel/miniconda3/envs/mlgwsc/lib/python3.7/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/data/ondzel/miniconda3/envs/mlgwsc/src/pycbc/pycbc/types/array.py", line 265, in _checkother
    check_same_len_precision(self, other)
  File "/data/ondzel/miniconda3/envs/mlgwsc/src/pycbc/pycbc/types/array.py", line 128, in check_same_len_precision
    raise ValueError(msg)
ValueError: lengths do not match (1281 vs 1280)

Some other dataset durations, namely 660, 86400 and 2592000 seconds, do not share this issue.

ImportError when running generate_data.py after installing the current version of PyCBC

I created a new venv and installed PyCBC again using the provided pip install command.
I encountered the following error when I tried to run the generate_data.py file.
This error does not occur if I run the same code using my old venv with the previous version of PyCBC.

Traceback (most recent call last):
  File "./generate_data.py", line 22, in <module>
    from pycbc.inject import InjectionSet
  File "/home/nnarenraju/venv/pycbc_experiment_check/src/pycbc/pycbc/inject/__init__.py", line 2, in <module>
    from pycbc.inject.inject import *
  File "/home/nnarenraju/venv/pycbc_experiment_check/src/pycbc/pycbc/inject/inject.py", line 37, in <module>
    from pycbc import waveform
  File "/home/nnarenraju/venv/pycbc_experiment_check/src/pycbc/pycbc/waveform/__init__.py", line 3, in <module>
    from pycbc.waveform.bank import *
  File "/home/nnarenraju/venv/pycbc_experiment_check/src/pycbc/pycbc/waveform/bank.py", line 40, in <module>
    import pycbc.io
  File "/home/nnarenraju/venv/pycbc_experiment_check/src/pycbc/pycbc/io/__init__.py", line 3, in <module>
    from .hdf import *
  File "/home/nnarenraju/venv/pycbc_experiment_check/src/pycbc/pycbc/io/hdf.py", line 23, in <module>
    from pycbc.io.ligolw import return_search_summary, return_empty_sngl
  File "/home/nnarenraju/venv/pycbc_experiment_check/src/pycbc/pycbc/io/ligolw.py", line 23, in <module>
    from ligo.lw.types import FormatFunc, FromPyType, IDTypes, ToPyType
ImportError: cannot import name 'IDTypes' from 'ligo.lw.types' (/home/nnarenraju/venv/pycbc_experiment_check/lib/python3.7/site-packages/ligo/lw/types.py)

I apologise if this is due to an error on my end.

OOM crash of generate_data.py

When running generate_data.py --duration 2592000 on the ARA cluster, the script crashes due to insufficient memory once 743,300 seconds of data have been saved to the background output file, even with 128 GB of memory allocated and irrespective of the -d option. Adding some explicit garbage collection seems to help, which suggests a memory leak. A sketch of this workaround follows.
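The workaround might look like this sketch; the chunked loop and the generate_chunk stand-in are hypothetical, since generate_data.py's actual structure may differ:

import gc

def generate_chunk(start, size):
    # Hypothetical stand-in for one chunk of background generation;
    # in generate_data.py this would be a PyCBC noise segment instead.
    return bytearray(size)

for start in range(0, 10 * 2**20, 2**20):
    chunk = generate_chunk(start, 2**20)
    # ... write the chunk to the output file here ...
    del chunk
    gc.collect()  # reclaim leaked references after each chunk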

Wrong sample rate for certain generated durations

This issue was pointed out to me by @pascal-mueller. When generating data with the generate_data.py script and choosing a duration of 600 seconds, the sample rate of the data is incorrect.

To reproduce, one can run

./generate_data.py -d 1 -b tmpbg.hdf --duration 600

The sample rate of the resulting data is slightly off, as can be checked by

>>> import h5py
>>> with h5py.File('tmpbg.hdf', 'r') as file:
...     sample_rate = 1. / file['H1/1238205077'].attrs['delta_t']
...
>>> sample_rate
2047.9999999999995

This issue is caused by floating-point precision limitations. For a given requested duration, one can check whether the sample rate will be off by testing whether 1. / (1. / (duration + 256.)) equals duration + 256. exactly. If it does, the issue should not appear; otherwise it should. A short sketch of this check follows.
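A minimal sketch of that check, assuming the 256-second padding mentioned above (the function name is made up):

def sample_rate_is_exact(duration, padding=256.):
    # Round-trip the total length through a reciprocal; if the result is
    # not bit-exact, the stored delta_t (and hence the sample rate) will
    # be slightly off for this duration.
    total = duration + padding
    return 1. / (1. / total) == total

print(sample_rate_is_exact(600.))   # expected False: the 600 s dataset is affected
print(sample_rate_is_exact(660.))   # expected True: 660 s is reported unaffected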
