jpata / particleflow
Machine-learned, GPU-accelerated particle flow reconstruction
License: Apache License 2.0
Very low priority, but we could add a pre-commit hook to run black, flake8, isort, etc. on the repo to enforce a common style and minimal diffs.
It could also run automatically to lint PRs.
We are out of LFS space: at some point I put weight files on GitHub using LFS, and they still count against the quota even after being removed.
According to the GitHub documentation, the solution is to delete and recreate the repo, but I'd like to avoid that if possible.
Downloading models/acat2022_20221004_model40M/cms-gen_20220923_163529_426249.gpu0.local/logs/train/events.out.tfevents.1663940144.gpu0.local.2696240.0.v2 (3.0 MB)
Error downloading object: models/acat2022_20221004_model40M/cms-gen_20220923_163529_426249.gpu0.local/logs/train/events.out.tfevents.1663940144.gpu0.local.2696240.0.v2 (521a8e0): Smudge error: Error downloading models/acat2022_20221004_model40M/cms-gen_20220923_163529_426249.gpu0.local/logs/train/events.out.tfevents.1663940144.gpu0.local.2696240.0.v2 (521a8e0dd0f705506862114f11c9856f82bae55543a94041dd0665eea5183cb6): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
Errors logged to /home/joosep/test/particleflow/.git/lfs/logs/20230915T104049.046923961.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: models/acat2022_20221004_model40M/cms-gen_20220923_163529_426249.gpu0.local/logs/train/events.out.tfevents.1663940144.gpu0.local.2696240.0.v2: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
https://github.com/pbelcak/fastfeedforward
Currently we are using a simple matching-based loss on individual particles/elements.
It might be interesting to try a contrastive event-based loss, such as the one proposed in this paper:
https://openaccess.thecvf.com/content/CVPR2023/papers/Huang_Learning_To_Measure_the_Point_Cloud_Reconstruction_Loss_in_a_CVPR_2023_paper.pdf
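As a strawman, an event-level contrastive term could look something like the sketch below. This is not the method from the paper: emb_pred and emb_true are hypothetical per-event embeddings (e.g. pooled particle features), and the InfoNCE form is just one possible choice.

import torch
import torch.nn.functional as F

def event_contrastive_loss(emb_pred, emb_true, temperature=0.1):
    """InfoNCE-style loss: the prediction for event i should be closest to the
    truth embedding of event i, and far from the truths of the other events.
    emb_pred, emb_true: (batch, dim) event-level embeddings (assumed)."""
    emb_pred = F.normalize(emb_pred, dim=-1)
    emb_true = F.normalize(emb_true, dim=-1)
    logits = emb_pred @ emb_true.t() / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(emb_pred.shape[0], device=emb_pred.device)
    return F.cross_entropy(logits, labels)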
Put this in the tfds download instructions:
https://zenodo.org/record/8414225
Add some instructions on how to train the hit-based model here:
https://github.com/jpata/particleflow/blob/main/README_tf.md
The files fail to download from Zenodo today.
wget --no-check-certificate -nc https://zenodo.org/record/4559324/files/tev14_pythia8_ttbar_0_0.pkl.bz2
tev14_pythia8_ttbar_0_0.pkl.bz2 1%[> ] 486.98K 35.9KB/s eta 22m 21s
Unsure if this is a temporary issue or a new rate limitation.
Once cms-sw/cmssw#36963 is available, investigate enabling the MLPFProducer in CMSSW to use GPUs for inference.
This would allow an apples-to-apples comparison of PF vs. MLPF on a fully loaded machine (CPU+GPU).
This tfds generation failed partway through.
tfds build hep_tfds/heptfds/cms_pf/qcd_high_pt
...
Traceback (most recent call last):
File "/usr/local/bin/tfds", line 8, in <module>
sys.exit(launch_cli())
File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/scripts/cli/main.py", line 102, in launch_cli
app.run(main, flags_parser=_parse_flags)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/scripts/cli/main.py", line 97, in main
args.subparser_fn(args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/scripts/cli/build.py", line 192, in _build_datasets
_download_and_prepare(args, builder)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/scripts/cli/build.py", line 342, in _download_and_prepare
builder.download_and_prepare(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 481, in download_and_prepare
self._download_and_prepare(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 1218, in _download_and_prepare
future = split_builder.submit_split_generation(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/core/split_builder.py", line 310, in submit_split_generation
return self._build_from_generator(**build_kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/core/split_builder.py", line 371, in _build_from_generator
for key, example in utils.tqdm(
File "/usr/local/lib/python3.8/dist-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/joosep/particleflow/hep_tfds/heptfds/cms_utils.py", line 191, in generate_examples
X, ygen, ycand = prepare_data_cms(str(fi), pad_size)
File "/home/joosep/particleflow/hep_tfds/heptfds/cms_utils.py", line 162, in prepare_data_cms
jet_constituents = [index_mapping[idx] for idx in constituent_idx[jet_idx]] # map back to constituent index *before* masking
File "/home/joosep/particleflow/hep_tfds/heptfds/cms_utils.py", line 162, in <listcomp>
jet_constituents = [index_mapping[idx] for idx in constituent_idx[jet_idx]] # map back to constituent index *before* masking
IndexError: index 4659 is out of bounds for axis 0 with size 4659
The other ones (ttbar, ztt, qcd) succeeded, so I'm debugging this a bit.
Implement the GNN-LSH model in PyTorch.
For the moment, we concatenate input elements of different types (tracks, clusters, ...) into a single feature matrix. This means that some features may be defined for one type but not for others. Our downstream ML model treats all input elements as nodes of the same type. In principle, splitting up the inputs from a single feature matrix into independent feature matrices for each element type should work better.
I believe Michael has a working version for the SSL studies - let's try it out in the codebase.
As suggested by JK here: https://indico.cern.ch/event/1189908/contributions/5008659/attachments/2494293/4283744/20220818_jk_mlpf.pdf,
it might be useful to improve the reconstruction of event-level quantities by computing and comparing local averages of energies for each true/reconstructed particle.
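A minimal sketch of what such a local-average comparison could look like, assuming per-particle eta/phi/energy tensors; the cone size and the exact form of the loss term are placeholders.

import torch

def local_energy_sums(eta, phi, energy, dr=0.4):
    """For each particle, sum the energy of all particles within a cone of
    radius dr in (eta, phi). Inputs are 1D tensors of equal length (assumed)."""
    deta = eta.unsqueeze(0) - eta.unsqueeze(1)
    # wrap the phi difference into [-pi, pi)
    dphi = torch.remainder(phi.unsqueeze(0) - phi.unsqueeze(1) + torch.pi, 2 * torch.pi) - torch.pi
    in_cone = (deta**2 + dphi**2) < dr**2
    return (in_cone.float() * energy.unsqueeze(0)).sum(dim=1)

# a possible extra loss term, comparing local sums around matched true/predicted particles:
# loss_local = torch.nn.functional.mse_loss(
#     local_energy_sums(eta_pred, phi_pred, e_pred),
#     local_energy_sums(eta_true, phi_true, e_true))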
Goal: reduce CPU inference time with the ONNX backend
We made some CPU inference performance results public for 2021 in CMS, https://cds.cern.ch/record/2792320/files/DP2021_030.pdf slide 16, “For context, on a single CPU thread (Intel i7-10700 @ 2.9GHz), the baseline PF requires approximately (9 ± 5) ms, the MLPF model approximately 320 ± 50 ms for Run 3 ttbar MC events”.
Now is a good time to make the ONNX inference as fast as possible, while minimizing any physics impact.
Resources:
Try whether we can use these kNN kernels in PyTorch:
https://github.com/tklijnsma/pytorch_cmspepr/tree/main/csrc
https://github.com/jpata/particleflow/blob/main/mlpf/pipeline.py#L244-L253
This fails on an Nvidia 1080; perhaps it needs to be enabled selectively, or with a different value on different cards.
Kenneth suggested uploading the Singularity image to CVMFS:
Currently the latest TF+nvidia image is here: https://hep.kbfi.ee/~joosep/tf-2.13.0.simg
The pytorch image is here: https://hep.kbfi.ee/~joosep/pytorch.simg
Implement an approximate EMD loss to compare the predicted particle set to the target, as in point cloud regression.
https://arxiv.org/pdf/1612.00603.pdf section 4.3
https://github.com/gpeyre/SinkhornAutoDiff/blob/master/sinkhorn_pointcloud.py
A first attempt is at https://github.com/jpata/particleflow/blob/endtoend_gnn/test/sinkhorn_pointcloud.py; it still needs testing.
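For reference, a minimal self-contained Sinkhorn sketch along the lines of the linked sinkhorn_pointcloud.py, assuming uniform weights over both sets; the version in the branch above is the one that should actually be tested.

import torch

def sinkhorn_emd(x, y, eps=0.1, n_iters=50):
    """Approximate EMD between point clouds x: (n, d) and y: (m, d) via
    entropy-regularized optimal transport (Sinkhorn iterations)."""
    cost = torch.cdist(x, y) ** 2  # (n, m) pairwise squared distances
    mu = torch.full((x.shape[0],), 1.0 / x.shape[0], device=x.device)
    nu = torch.full((y.shape[0],), 1.0 / y.shape[0], device=y.device)
    k = torch.exp(-cost / eps)  # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (k.t() @ u)
        u = mu / (k @ v)
    transport = u.unsqueeze(1) * k * v.unsqueeze(0)  # approximate transport plan
    return (transport * cost).sum()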
I saw this nice presentation from CHEP2019 about doing HGCAL clustering on a GPU: https://indico.cern.ch/event/773049/contributions/3473275/attachments/1936902/3210151/chep2019.pdf
Seems like this could also be repurposed for clustering the particle flow elements.
This works fine:
CUDA_VISIBLE_DEVICES=5,6,7,8,9 singularity exec --nv /home/software/singularity/base.simg:latest python3 mlpf/launcher.py --model-spec parameters/cms-gnn-skipconn-v2.yaml --action train
...
Model: "pf_net"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_encoding (InputEncodin multiple 0
_________________________________________________________________
sparse_hashed_nn_distance (S multiple 17825
_________________________________________________________________
gnn_id (EncoderDecoderGNN) multiple 2375680
_________________________________________________________________
sequential_2 (Sequential) (5, 6400, 8) 805384
_________________________________________________________________
sequential_3 (Sequential) (5, 6400, 1) 801793
_________________________________________________________________
gnn_reg (EncoderDecoderGNN) multiple 2379776
_________________________________________________________________
sequential_4 (Sequential) (5, 6400, 5) 807941
=================================================================
Total params: 7,188,399
Trainable params: 7,185,199
Non-trainable params: 3,200
________________________________
Epoch 1/500
258/3200 [=>............................] - ETA: 19:18 - loss: 92.0839 - charge_loss: 15.1783 - cls_loss: 29.7701 - cos_phi_loss: 14.1853 - energy_loss: 26.3809 - eta_loss: 51.6222 - pt_loss: 3.7117 - sin_phi_loss: 11.3279 - cls_acc_unweighted: 0.7688
and so does
CUDA_VISIBLE_DEVICES=5,6 singularity exec --nv /home/software/singularity/base.simg:latest python3 mlpf/launcher.py --model-spec parameters/cms-gnn-skipconn-v2.yaml --action train
...
Epoch 1/500
31/8000 [..............................] - ETA: 1:03:18 - loss: 124.7493 - charge_loss: 16.4039 - cls_loss: 42.0418 - cos_phi_loss: 17.1479 - energy_loss: 32.1395 - eta_loss: 130.0834 - pt_loss: 7.7385 - sin_phi_loss: 11.0048 - cls_acc_unweighted: 0.6926
while this doesn't:
CUDA_VISIBLE_DEVICES=5,6,7,8 singularity exec --nv -B /home /home/software/singularity/base.simg:latest python3 mlpf/launcher.py --model-spec parameters/cms-gnn-skipconn-v2.yaml --action train
...
Traceback (most recent call last):
File "mlpf/launcher.py", line 26, in <module>
main(args, yaml_path, config)
File "/home/joosep/particleflow/mlpf/tfmodel/model_setup.py", line 755, in main
fit_result = model.fit(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
tmp_logs = self.train_function(iterator)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
return self._stateless_fn(*args, **kwds)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 2942, in __call__
return graph_function._call_flat(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 555, in call
outputs = execute.execute(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 5 root error(s) found.
(0) Invalid argument: Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
[[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
[[replica_2/pf_net/sparse_hashed_nn_distance/map/while/body/_1664/replica_2/pf_net/sparse_hashed_nn_distance/map/while/map/while/cond/_8452/replica_2/pf_net/sparse_hashed_nn_distance/map/while/map/while/Less_1/_823]]
(1) Invalid argument: Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
[[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
[[replica_1/pf_net/sparse_hashed_nn_distance/SparseTensor/dense_shape/_864]]
(2) Invalid argument: Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
[[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
[[replica_1/pf_net/sparse_hashed_nn_distance/SparseTensor/dense_shape/_863]]
(3) Invalid argument: Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
[[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
[[pf_net/gnn_reg/StatefulPartitionedCall/conv_reg1/map/while/body/_3801/conv_reg1/map/while/SparseReshape/_2974]]
(4) Invalid argument: Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
[[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_335632]
Function call stack:
train_function -> train_function -> train_function -> train_function -> train_function
Evaluating the number of steps in a dataset is currently slow in TensorFlow when the loop is IO-bound in a single process, because we use tf.data.Dataset.from_generator, which uses Python underneath and doesn't release the GIL.
It might require some changes upstream in tfds to support ArrayRecordDataSource.as_dataset: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/file_adapters.py#L188
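For reference, the data source API already exposes the dataset length up front, so a sketch of the desired behavior could look like this (the dataset name and data_dir are examples):

import tensorflow_datasets as tfds

builder = tfds.builder("clic_edm_ttbar_pf:1.5.0", data_dir="/home/joosep/tensorflow_datasets/")
data_source = builder.as_data_source(split="train")

# the ArrayRecordDataSource knows its length without iterating a Python
# generator, so the number of steps per epoch can be computed directly
batch_size = 50
num_steps = len(data_source) // batch_size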
We want to improve the jet and MET resolution of MLPF, but these cannot be written directly into the loss function.
JD suggested that we could cluster the gen-particles into jets and save the jet clustering index for each particle.
Since we have a one-to-one matching between gen-particles, input elements, and predicted particles, we can add a loss term that compares jets built from the predicted particles (using the gen-particle clustering index) to jets built from the gen-particles.
For the TF model, this would mean running fastjet at the tfds stage, saving the jet clustering index for each particle, propagating it to the loss function, and adding an additional term.
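A sketch of what the extra term could look like, assuming each particle carries a precomputed jet clustering index jet_idx from fastjet (with -1 for unclustered particles); for simplicity this compares only scalar-summed jet pt rather than full four-vectors.

import torch

def jet_pt_loss(pt_pred, pt_true, jet_idx, n_jets):
    """Sum particle pt into jets using the gen-level clustering index, then
    compare predicted and true jet pt. jet_idx: (n_particles,) long tensor
    with values in [0, n_jets) or -1 for unclustered particles."""
    mask = jet_idx >= 0
    idx = jet_idx[mask]
    jets_pred = torch.zeros(n_jets, device=pt_pred.device).index_add_(0, idx, pt_pred[mask])
    jets_true = torch.zeros(n_jets, device=pt_true.device).index_add_(0, idx, pt_true[mask])
    return torch.nn.functional.mse_loss(jets_pred, jets_true)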
PyTorch and tfds now support loading data from an ArrayRecordDataSource, so we can use the same efficient ML input data for both PyTorch and TensorFlow.
Here's some example code for loading the dataset in PyTorch.
#!/usr/bin/env python
# coding: utf-8
import numpy as np
import tensorflow_datasets as tfds
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import TransformerEncoder, TransformerEncoderLayer
import ray
from ray.air import session, Checkpoint
from ray import train
from ray.train.torch import TorchTrainer, TorchConfig
from ray.air.config import ScalingConfig, RunConfig, CheckpointConfig


def collate_padded_batch(inputs):
    # pad each array in the batch to the length of the longest event
    elem_keys = list(inputs[0].keys())
    ret = {}
    for elem_key in elem_keys:
        batch = [Tensor(i[elem_key]) for i in inputs]
        max_seq_len = max([x.shape[0] for x in batch])
        padded_batch_data = [nn.functional.pad(x, (0, 0, 0, max_seq_len - x.shape[0])) for x in batch]
        ret[elem_key] = torch.stack(padded_batch_data, dim=0)
    # padded elements have all-zero features, so feature 0 identifies them
    ret["mask"] = ret["X"][:, :, 0] == 0
    return ret


def my_getitem(self, vals):
    records = self.data_source.__getitems__(vals)
    return [self.dataset_info.features.deserialize_example_np(record, decoders=self.decoders) for record in records]


class Dataset:
    def __init__(self, name="clic_edm_ttbar_pf:1.5.0", split="train"):
        builder = tfds.builder(name, data_dir="/home/joosep/tensorflow_datasets/")
        self.ds = builder.as_data_source(split=split)

        # to prevent a warning from tfds about accessing sequences of indices
        self.ds.__class__.__getitems__ = my_getitem

    def get_sampler(self):
        sampler = torch.utils.data.SequentialSampler(self.ds)
        return sampler

    def get_loader(self, batch_size=20, num_workers=0, prefetch_factor=None):
        return torch.utils.data.DataLoader(
            self.ds,
            batch_size=batch_size,
            collate_fn=collate_padded_batch,
            sampler=self.get_sampler(),
            num_workers=num_workers,
            prefetch_factor=prefetch_factor,
        )

    def __len__(self):
        return len(self.ds)

    def __repr__(self):
        return self.ds.__repr__()


class TransformerModel(nn.Module):
    def __init__(
        self,
        d_in: int,
        d_out: int,
        d_model: int,
        nhead: int,
        d_hid: int,
        nlayers: int,
        dropout: float = 0.5,
    ):
        super().__init__()
        self.linear_in = nn.Linear(d_in, d_model)
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout, batch_first=True)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.linear_out = nn.Linear(d_model, d_out)

    def forward(self, src: Tensor, src_key_padding_mask: Tensor) -> Tensor:
        src = self.linear_in(src)
        output = self.transformer_encoder(src, src_key_padding_mask=src_key_padding_mask)
        output = self.linear_out(output)
        return output


class InterleavedIterator(object):
    """Iterates over several data loaders of different lengths in an interleaved order."""

    def __init__(self, data_loaders):
        self.data_loaders_iter = [iter(dl) for dl in data_loaders]
        max_loader_size = max([len(dl) for dl in data_loaders])

        # interleave loaders of different length
        self.loader_ds_indices = []
        for i in range(max_loader_size):
            for iloader, loader in enumerate(data_loaders):
                if i < len(loader):
                    self.loader_ds_indices.append(iloader)

        self.cur_index = 0

    def __iter__(self):
        return self

    def __next__(self):
        # stop cleanly once all loaders are exhausted
        if self.cur_index >= len(self.loader_ds_indices):
            raise StopIteration
        iloader = self.loader_ds_indices[self.cur_index]
        self.cur_index += 1
        return next(self.data_loaders_iter[iloader])


# Define your train worker loop
def train_loop_per_worker():
    ds_train = [
        Dataset("clic_edm_ttbar_pf:1.5.0", "train"),
        Dataset("clic_edm_qq_pf:1.5.0", "train"),
        Dataset("clic_edm_ww_fullhad_pf:1.5.0", "train"),
        Dataset("clic_edm_zh_tautau_pf:1.5.0", "train"),
    ]
    ds_test = [
        Dataset("clic_edm_ttbar_pf:1.5.0", "test"),
        Dataset("clic_edm_qq_pf:1.5.0", "test"),
        Dataset("clic_edm_ww_fullhad_pf:1.5.0", "test"),
        Dataset("clic_edm_zh_tautau_pf:1.5.0", "test"),
    ]
    for ds in ds_train:
        print("train_dataset: {}, {}".format(ds, len(ds)))
    for ds in ds_test:
        print("test_dataset: {}, {}".format(ds, len(ds)))

    batch_size = 50
    num_workers = 0
    nepochs = 5
    prefetch_factor = None
    train_loaders = [ds.get_loader(batch_size=batch_size, num_workers=num_workers, prefetch_factor=prefetch_factor) for ds in ds_train]
    test_loaders = [ds.get_loader(batch_size=batch_size, num_workers=num_workers, prefetch_factor=prefetch_factor) for ds in ds_test]
    for dl in train_loaders:
        print("train_loader: {}, {}".format(dl.dataset, len(dl)))
    for dl in test_loaders:
        print("test_loader: {}, {}".format(dl.dataset, len(dl)))

    model = TransformerModel(17, 8, 128, 4, 128, 3, 0.1)
    device = torch.device("cuda")
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # Get dict of last saved checkpoint and restore from it, if it exists.
    initial_global_step_idx = None
    if ray.is_initialized():
        checkpoint = session.get_checkpoint()
        if checkpoint:
            print("checkpoint follows")
            data_dict = checkpoint.to_dict()
            initial_global_step_idx = data_dict["global_step_idx"]
            model.load_state_dict(data_dict["model"])
            optimizer.load_state_dict(data_dict["optimizer"])
            print("Checkpoint restored at global_step_idx={}".format(initial_global_step_idx))

    model_parameters = filter(lambda p: p.requires_grad, model.parameters())
    params = sum([np.prod(p.size()) for p in model_parameters])
    print("trainable parameters: {}".format(params))

    loss_fn = torch.nn.MSELoss()

    if ray.is_initialized():
        print("preparing dataset and model for ray")
        train_loaders = [train.torch.prepare_data_loader(dl) for dl in train_loaders]
        test_loaders = [train.torch.prepare_data_loader(dl) for dl in test_loaders]
        model = train.torch.prepare_model(model)
        for dl in train_loaders:
            print("train_loader: {}, {}".format(dl.dataset, len(dl)))
        for dl in test_loaders:
            print("test_loader: {}, {}".format(dl.dataset, len(dl)))

    global_step_idx = 0
    for epoch in range(nepochs):
        model.train()
        train_loss_vals = []
        steps_per_epoch = sum([len(loader) for loader in train_loaders])

        # skip epoch if it was already trained on
        if initial_global_step_idx and global_step_idx + steps_per_epoch < initial_global_step_idx:
            global_step_idx += steps_per_epoch
            print("skipping epoch {}".format(epoch))
            continue

        # recreate the interleaved iterator at the start of each epoch
        data_iterator = InterleavedIterator(train_loaders)
        for data in data_iterator:
            # skip batch if it was already trained on
            if initial_global_step_idx and global_step_idx < initial_global_step_idx:
                global_step_idx += 1
                continue

            optimizer.zero_grad()
            output = model(data["X"].to(device), data["mask"].to(device))
            loss = loss_fn(output, data["ygen"].to(device))
            loss.backward()
            optimizer.step()
            train_loss_vals.append(loss.item())

            if global_step_idx > 0 and global_step_idx % 1000 == 0:
                print("checkpoint at global_step={}/{}".format(
                    global_step_idx, nepochs * steps_per_epoch
                ))
                # save checkpoint
                if ray.is_initialized():
                    checkpoint = Checkpoint.from_dict(
                        dict(
                            global_step_idx=global_step_idx,
                            model=model.state_dict(),
                            optimizer=optimizer.state_dict(),
                        )
                    )
                    session.report(dict(loss=loss.item()), checkpoint=checkpoint)

            global_step_idx += 1

        # model.eval()
        # test_loss_vals = []
        # for test_loader in test_loaders:
        #     for data in test_loader:
        #         output = model(data["X"].to(device), data["mask"].to(device))
        #         loss = loss_fn(output, data["ygen"].to(device))
        #         test_loss_vals.append(loss.item())


scaling_config = ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.5})
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=1), verbose=2)

# trainer = TorchTrainer.restore("/home/joosep/ray_results/TorchTrainer_2023-08-25_13-56-05")
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    run_config=run_config,
    torch_config=TorchConfig(backend="gloo"),
)
result = trainer.fit()
print(result)

# train_loop_per_worker()
cc @farakiko
Currently, we define the ground truth based on the existing PFAlgo, i.e. the whole PF algorithm translates the set of PF elements to the set of PF candidates on an event-by-event basis.
Ultimately, the ground truth should be defined through generator particles suitably matched to PF elements. We need to understand where HGCAL is with this.
Via Kenichi:
A high-performance connected components implementation for GPUs: https://dl.acm.org/doi/10.1145/3208040.3208041
An Optimized Union-Find Algorithm for Connected Components Labeling Using GPUs: https://arxiv.org/abs/1708.08180
Get a tagged version of the code:
git checkout master
git pull
git submodule init
git submodule update
Download the training data:
rsync -r --progress lxplus.cern.ch:/eos/user/j/jpata/mlpf/cms/tensorflow_datasets ~/
Run the training on the full model using just the ttbar sample:
CUDA_VISIBLE_DEVICES=... python3 mlpf/pipeline.py train -c parameters/cms.yaml
Copy cms.yaml to cms-withgun.yaml and change it as follows:
training_datasets:
- cms_pf_ttbar
- cms_pf_single_pi
- cms_pf_single_electron
testing_datasets:
- cms_pf_ttbar
- cms_pf_single_pi
- cms_pf_single_electron
Train again:
CUDA_VISIBLE_DEVICES=... python3 mlpf/pipeline.py train -c parameters/cms-withgun.yaml
Note that the training is set currently for 1000 epochs, which you may want to abort early.
Look at the validation plots in experiments/cms*/history/epoch_N/..., especially the energy correlation plots under cls_2/energy_cls2*.
A fast/efficient graph constructor was recently proposed here:
https://github.com/mieskolainen/hypertrack
https://indico.jlab.org/event/459/contributions/11748/attachments/9580/14256/HyperTrack_Mieskolainen_CHEP2023_v1.pdf
We should try whether this works for our graph construction.
Provide instructions on how to use the pretrained model from https://zenodo.org/record/8328683.
The steps are:
Currently, the CMS ntuplization script mlpf/data/postprocessing2.py uses uproot3, which is the legacy package. Some minor API updates may be necessary.
Using a QCD_FlatPt_15_3000HS_14 sample, e.g. 20k events: given a list of tracks and clusters in the event, create a list of PFCandidates using a simple iterative algo. It could be something like the sketch below.
Define and measure the accuracy of this simple algo with respect to the original PFCandidates.
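A minimal sketch of such an iterative baseline, with an assumed data layout (numpy structured arrays with eta/phi/pt/energy fields) and placeholder matching criteria; the actual linking thresholds would need tuning.

import numpy as np

def simple_pf(tracks, clusters, dr_max=0.2):
    """Baseline sketch: link each track to its nearest unused cluster within
    dr_max and emit a charged candidate; leftover clusters become neutrals.
    tracks, clusters: numpy structured arrays (assumed fields: eta, phi, pt/energy)."""
    candidates = []
    used = np.zeros(len(clusters), dtype=bool)
    for trk in tracks:
        dphi = np.mod(clusters["phi"] - trk["phi"] + np.pi, 2 * np.pi) - np.pi
        dr = np.hypot(clusters["eta"] - trk["eta"], dphi)
        dr[used] = np.inf
        i = int(np.argmin(dr)) if len(dr) else -1
        if i >= 0 and dr[i] < dr_max:
            used[i] = True  # the linked cluster is consumed by this track
        # charged candidate from the track kinematics
        candidates.append(dict(pdgid=211, pt=trk["pt"], eta=trk["eta"], phi=trk["phi"]))
    for cl in clusters[~used]:
        # unlinked clusters become neutral candidates, pt = E / cosh(eta)
        candidates.append(dict(pdgid=130, pt=cl["energy"] / np.cosh(cl["eta"]),
                               eta=cl["eta"], phi=cl["phi"]))
    return candidates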
Since this question comes up often, we should have a running and well-understood benchmark of the baseline MLPF model on different computing devices and different datasets.
It should be a single script that loads an ONNX model and can be rerun easily on any new device.
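A sketch of such a benchmark script using onnxruntime directly; the model path and input shape are placeholders to be replaced with the exported MLPF model.

import time
import numpy as np
import onnxruntime as ort

# hypothetical model path; the input shape must match the exported model
session = ort.InferenceSession("mlpf.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 6400, 17).astype(np.float32)  # (batch, elements, features), assumed

# warm up, then time repeated inference
for _ in range(5):
    session.run(None, {input_name: x})
times = []
for _ in range(50):
    t0 = time.perf_counter()
    session.run(None, {input_name: x})
    times.append(time.perf_counter() - t0)
print("mean {:.1f} ms, std {:.1f} ms".format(np.mean(times) * 1e3, np.std(times) * 1e3))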
Once we have clustered the PF elements (tracks, clusters) into "blocks" based on the truth values in PFCandidate::elementsInBlocks, we have pairs of (block, candidates).
Each block usually consists of a few elements and produces a few candidates, for example:
(TRK, TRK, ECAL) -> (pi, pi)
(TRK, ECAL, HCAL) -> (K)
Therefore, we can run a regression across all blocks from all events to create the reconstructed candidates from each block. I've done a first version of this and it seems promising.
As input, we use all blocks of size <= 3, with the element type one-hot encoded and the kinematic vectors standardized.
As output, we have all the candidates (max 3) produced from those elements, with one-hot encoded pdgId and standardized kinematic vectors.
So far, a simple dense-net regression is generally able to predict the number of candidates in a block, as well as the first candidate's momentum (up to a linear transformation?).
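For illustration, the model can be as simple as the sketch below; the encoding sizes are placeholders, and the inputs/outputs are the flattened, zero-padded block encodings described above.

import torch.nn as nn

NELEM_TYPES, NCAND_TYPES, NKIN = 12, 9, 4  # hypothetical one-hot/kinematic sizes
MAX_ELEMS, MAX_CANDS = 3, 3

# input: up to 3 elements, each a one-hot type plus standardized kinematics, zero-padded
# output: up to 3 candidates, each a one-hot pdgId plus standardized kinematics
model = nn.Sequential(
    nn.Linear(MAX_ELEMS * (NELEM_TYPES + NKIN), 256),
    nn.ELU(),
    nn.Linear(256, 256),
    nn.ELU(),
    nn.Linear(256, MAX_CANDS * (NCAND_TYPES + NKIN)),
)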
Here's an attempt to put private discussions on ML-PF in public.
For ML-PF studies, one idea was to exploit the fact that in the standard PF algo, PFBlock -> elements -> candidates, such that a small set of elements is expected to produce a single PF candidate. Looking at the output of the algo via PFCandidate::elementsInBlocks,
I generally find that the association can be many-to-many, in the sense that one element may be associated with multiple candidates, and one candidate with multiple elements.
Just as an example, here is a partial event from the debugging ntuplizer on a RelValTTbar_13 PU25ns_110X_upgrade2018_realistic_v3 file:
cmsRun test/step3.py
python test/ntuplizer.py ./data/ step3_AOD.root
root -l ./data/step3_AOD.root
root [2] pftree->Scan("clusters_npfcands:clusters_ipfcand0:clusters_ipfcand1:clusters_ipfcand2:clusters_ipfcand3")
***********************************************************************************
* Row * Instance * clusters_ * clusters_ * clusters_ * clusters_ * clusters_ *
***********************************************************************************
* 0 * 0 * 3 * 381 * 827 * 974 * 0 *
* 0 * 1 * 1 * 2125 * 0 * 0 * 0 *
* 0 * 2 * 1 * 814 * 0 * 0 * 0 *
* 0 * 3 * 1 * 449 * 0 * 0 * 0 *
* 0 * 4 * 3 * 365 * 626 * 709 * 0 *
* 0 * 5 * 3 * 625 * 637 * 767 * 0 *
* 0 * 6 * 2 * 120 * 484 * 0 * 0 *
* 0 * 7 * 2 * 965 * 1057 * 0 * 0 *
* 0 * 8 * 6 * 307 * 470 * 823 * 890 *
Will need to look in more detail to understand if this is really what PFAlgo is doing, or perhaps some artifact.
cc @jmduarte @lgray @vlimant @pierinim, feel free to include others.
We had to disable ONNX export in TensorFlow due to this bug: onnx/tensorflow-onnx#2180
here: 279f6ae#diff-bc5644c81dfb5b7d28e0ddbb1272481f0874306c725b6de3a5f821e0671ce6bcR501
But it has recently been fixed in onnx/tensorflow-onnx#2225, and new versions have been released.
In the pipeline.py evaluate step, we could produce some printouts or plots of badly reconstructed cases for further debugging.
In the evaluation step, add true vs. predicted correlation plots for quantities like:
I found this interesting work by Giovanni on using FPGAs for PF in the L1 trigger: http://cds.cern.ch/record/2650974/files/CR2018_401.pdf and https://indico.cern.ch/event/587955/contributions/2935764/attachments/1686948/2713029/L1PF-chep-v2.pdf
Looks like https://github.com/calad0i/HGQ supports very fine grained quantization.
There's also https://github.com/fastmachinelearning/qonnx
One possible way to progress is to learn to group PF elements, based on their proximity, into smallish clusters of elements (miniblocks). We can also use the ground-truth clustering, in which elements form disjoint sets based on PFAlgo; ultimately, this is a semi-supervised clustering problem.
The inputs are elements: (nelem, nelem_feat), containing the raw set of all PF elements in the event, with the ground truth being element_block_id: (nelem), which associates elements to disjoint blocks or clusters. The sparse distance matrix between the elements induces an initial graph on the set.
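As a non-learned baseline, one could threshold the distance matrix, take connected components as predicted blocks, and measure the agreement with element_block_id; a sketch, assuming a dense distance matrix for simplicity:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def block_agreement(dist, element_block_id, threshold):
    """Threshold the element-element distance matrix into a graph, take its
    connected components as predicted blocks, and return the fraction of
    element pairs on which predicted and true co-clustering agree.
    dist: (nelem, nelem) distances (assumed), element_block_id: (nelem,)."""
    adj = csr_matrix(dist < threshold)
    _, pred_block_id = connected_components(adj, directed=False)
    same_true = element_block_id[:, None] == element_block_id[None, :]
    same_pred = pred_block_id[:, None] == pred_block_id[None, :]
    return (same_true == same_pred).mean()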
In progress in this issue using GNNs: #2
Recently we implemented saving the basic optimizer state in https://github.com/jpata/particleflow/blob/master/mlpf/tfmodel/model_setup.py#L69.
This works fine for Adam with a constant learning rate, but it needs more work to:
The following error was picked up by CI:
eta = awkward.to_numpy(-np.log(tt, where=tt > 0))
File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/awkward/highlevel.py", line 1428, in __array_ufunc__
with ak._errors.OperationErrorContext(name, inputs, kwargs):
File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/awkward/_errors.py", line 67, in __exit__
self.handle_exception(exception_type, exception_value)
File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/awkward/_errors.py", line 82, in handle_exception
raise self.decorate_exception(cls, exception)
RecursionError: maximum recursion depth exceeded in comparison
This error occurred while calling
numpy.log.__call__(
<Array [0.994, 2.12, 3, ..., 2.04, 2.75, 2.92] type='111 * float32'>
where = <Array [True, True, True, ..., True, True] type='111 * bool'>
)
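A possible workaround (an untested assumption, not a confirmed fix): materialize the awkward array as numpy before the masked ufunc call, and pass out= so the masked-off entries are defined.

import awkward
import numpy as np

tt = awkward.Array(np.array([0.994, 2.12, 3.0], dtype=np.float32))  # stand-in for the array in the traceback
tt_np = awkward.to_numpy(tt)
# applying the ufunc to a plain numpy array avoids the awkward dispatch;
# out= ensures the entries where the mask is False are initialized
eta = -np.log(tt_np, where=tt_np > 0, out=np.zeros_like(tt_np))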
One of the fundamental questions is the computation of the loss function between two sets of particles of different multiplicity.
For example, given the true set, with no natural ordering:
id=211 pt=123.0 eta=... charge=...
id=130 pt=... ...
id=22 ...
and a predicted set
id=22 pt=29.0 eta=...
id=22 ...
how to compute a differentiable loss function? The loss must also be computationally efficient, as we have O(10k) particles in the true and predicted sets.
For the moment, we use the object condensation approach, assigning each true particle to a particular "seed" input element, thus converting the problem into a per-element multi-classification with an additional "no-particle" class.
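A sketch of the per-element target construction this implies, with hypothetical shapes and class conventions (class 0 = "no particle"):

import torch
import torch.nn.functional as F

def per_element_targets(n_elem, matched_elem_idx, true_cls, true_kin):
    """Assign each true particle to its 'seed' input element; unmatched elements
    get the no-particle class 0. matched_elem_idx: (n_true,) long tensor with the
    element index of each true particle; true_cls: (n_true,); true_kin: (n_true, nkin)."""
    target_cls = torch.zeros(n_elem, dtype=torch.long)
    target_kin = torch.zeros(n_elem, true_kin.shape[1])
    target_cls[matched_elem_idx] = true_cls
    target_kin[matched_elem_idx] = true_kin
    return target_cls, target_kin

# the set-to-set loss then factorizes into per-element terms, e.g.:
# loss = F.cross_entropy(pred_cls, target_cls) \
#      + F.mse_loss(pred_kin[target_cls > 0], target_kin[target_cls > 0])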
Other options to investigate include: