particleflow's Issues

fix up git LFS

We are out of LFS space: at some point I put weight files on GitHub using LFS, and they still count against the quota even after being removed.
According to the GitHub documentation, the solution is to delete and recreate the repository, but I'd like to avoid that if possible.

Downloading models/acat2022_20221004_model40M/cms-gen_20220923_163529_426249.gpu0.local/logs/train/events.out.tfevents.1663940144.gpu0.local.2696240.0.v2 (3.0 MB)
Error downloading object: models/acat2022_20221004_model40M/cms-gen_20220923_163529_426249.gpu0.local/logs/train/events.out.tfevents.1663940144.gpu0.local.2696240.0.v2 (521a8e0): Smudge error: Error downloading models/acat2022_20221004_model40M/cms-gen_20220923_163529_426249.gpu0.local/logs/train/events.out.tfevents.1663940144.gpu0.local.2696240.0.v2 (521a8e0dd0f705506862114f11c9856f82bae55543a94041dd0665eea5183cb6): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to /home/joosep/test/particleflow/.git/lfs/logs/20230915T104049.046923961.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: models/acat2022_20221004_model40M/cms-gen_20220923_163529_426249.gpu0.local/logs/train/events.out.tfevents.1663940144.gpu0.local.2696240.0.v2: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

delphes test not running due to slow zenodo

The files fail to download from zenodo today:

wget --no-check-certificate -nc https://zenodo.org/record/4559324/files/tev14_pythia8_ttbar_0_0.pkl.bz2
tev14_pythia8_ttbar_0_0.pkl.bz2  1%[>        ] 486.98K  35.9KB/s    eta 22m 21s

Unsure whether this is a temporary issue or a new rate limitation.

CMS inference with GPU

Once cms-sw/cmssw#36963 is available, investigate enabling the MLPFProducer in CMSSW to use GPUs for inference.
This would allow an apples-to-apples comparison of PF vs. MLPF on a fully loaded machine (CPU+GPU).

TFDS dataset generation fails for qcd_high_pt in gen-jet clustering

This tfds build failed partway through:

tfds build hep_tfds/heptfds/cms_pf/qcd_high_pt
...
Traceback (most recent call last):                              
  File "/usr/local/bin/tfds", line 8, in <module>                       
    sys.exit(launch_cli())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/scripts/cli/main.py", line 102, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/scripts/cli/main.py", line 97, in main
    args.subparser_fn(args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/scripts/cli/build.py", line 192, in _build_datasets
    _download_and_prepare(args, builder)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/scripts/cli/build.py", line 342, in _download_and_prepare
    builder.download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 481, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 1218, in _download_and_prepare
    future = split_builder.submit_split_generation(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/core/split_builder.py", line 310, in submit_split_generation
    return self._build_from_generator(**build_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_datasets/core/split_builder.py", line 371, in _build_from_generator
    for key, example in utils.tqdm(
  File "/usr/local/lib/python3.8/dist-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/joosep/particleflow/hep_tfds/heptfds/cms_utils.py", line 191, in generate_examples
    X, ygen, ycand = prepare_data_cms(str(fi), pad_size)
  File "/home/joosep/particleflow/hep_tfds/heptfds/cms_utils.py", line 162, in prepare_data_cms
    jet_constituents = [index_mapping[idx] for idx in constituent_idx[jet_idx]] # map back to constituent index *before* masking
  File "/home/joosep/particleflow/hep_tfds/heptfds/cms_utils.py", line 162, in <listcomp>
    jet_constituents = [index_mapping[idx] for idx in constituent_idx[jet_idx]] # map back to constituent index *before* masking
IndexError: index 4659 is out of bounds for axis 0 with size 4659

The other ones (ttbar, ztt, qcd) succeeded, so I'm debugging this a bit.
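
One hypothetical mitigation (not necessarily the actual fix, since the out-of-range index may point to an upstream bug) would be to guard the failing line in cms_utils.py against constituent indices outside the mapping:

# hypothetical guard, assuming offending indices can simply be skipped
jet_constituents = [
    index_mapping[idx]
    for idx in constituent_idx[jet_idx]
    if idx < len(index_mapping)
]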

heterogeneous graphs

For the moment, we concatenate input elements of different types (tracks, clusters, ...) into a single feature matrix. This means some features are defined for one element type but not for others, and the downstream ML model treats all input elements as nodes of the same type. In principle, splitting the single feature matrix into independent feature matrices per element type should work better.
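
A minimal sketch of what per-type inputs could look like in pytorch; the feature sizes and latent dimension here are assumptions:

import torch
import torch.nn as nn

class PerTypeEncoder(nn.Module):
    # each element type gets its own projection into a shared latent space,
    # instead of one feature matrix with columns unused for some types
    def __init__(self, d_track=17, d_cluster=10, d_model=64):
        super().__init__()
        self.enc_track = nn.Linear(d_track, d_model)
        self.enc_cluster = nn.Linear(d_cluster, d_model)

    def forward(self, tracks, clusters):
        # (n_track, d_track), (n_cluster, d_cluster) -> (n_track + n_cluster, d_model)
        return torch.cat([self.enc_track(tracks), self.enc_cluster(clusters)], dim=0)

enc = PerTypeEncoder()
latent = enc(torch.randn(5, 17), torch.randn(7, 10))  # shape (12, 64)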

try the MDMM loss

I believe Michael has a working version for the SSL studies - let's try it out in the codebase.
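
I don't know the exact form of Michael's version; a toy sketch of the MDMM idea (modified differential method of multipliers), assuming a recent torch where optimizers accept maximize=True:

import torch

# toy MDMM sketch: minimize main_loss subject to a constraint g(w) <= 0,
# with a Lagrange multiplier trained by gradient ascent plus a damping term
torch.manual_seed(0)
w = torch.randn(2, requires_grad=True)
lmbda = torch.zeros(1, requires_grad=True)
opt_w = torch.optim.SGD([w], lr=0.1)
opt_l = torch.optim.SGD([lmbda], lr=0.1, maximize=True)  # gradient ascent
eps, damping = 0.1, 1.0
for step in range(100):
    main_loss = (w ** 2).sum()        # toy primary objective
    g = (w.sum() - 1.0) ** 2 - eps    # toy constraint, want g <= 0
    total = main_loss + lmbda * g + 0.5 * damping * g ** 2
    opt_w.zero_grad()
    opt_l.zero_grad()
    total.backward()
    opt_w.step()
    opt_l.step()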

end to end PF regression

  • Baseline end-to-end training with PFCandidate-to-element matching #12
  • address class imbalance
  • differentiable EMD approximation: #13
  • tensorflow implementation of training for CMSSW
    • spektral, DGL?

Provide an efficient GNN inference implementation using sparsification/quantization with ONNX

Goal: reduce CPU inference time with the ONNX backend

We made some CPU inference performance results in CMS public in 2021, https://cds.cern.ch/record/2792320/files/DP2021_030.pdf, slide 16: “For context, on a single CPU thread (Intel i7-10700 @ 2.9GHz), the baseline PF requires approximately (9 ± 5) ms, the MLPF model approximately (320 ± 50) ms for Run 3 ttbar MC events”.

Now is a good time to make the ONNX inference as fast as possible while minimizing any physics impact.
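
One starting point could be post-training dynamic quantization in onnxruntime; a minimal sketch, with "model.onnx" as a placeholder for the exported MLPF model:

from onnxruntime.quantization import QuantType, quantize_dynamic

# converts weights to int8 while keeping fp32 activations; the physics
# impact of this would need to be measured
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)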


PFNet with TF multi-GPU works with 2 and 5 GPUs, but not with 4

This works fine:

CUDA_VISIBLE_DEVICES=5,6,7,8,9 singularity exec --nv /home/software/singularity/base.simg:latest python3 mlpf/launcher.py --model-spec parameters/cms-gnn-skipconn-v2.yaml --action train

...

Model: "pf_net"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_encoding (InputEncodin multiple                  0
_________________________________________________________________
sparse_hashed_nn_distance (S multiple                  17825
_________________________________________________________________
gnn_id (EncoderDecoderGNN)   multiple                  2375680
_________________________________________________________________
sequential_2 (Sequential)    (5, 6400, 8)              805384
_________________________________________________________________
sequential_3 (Sequential)    (5, 6400, 1)              801793
_________________________________________________________________
gnn_reg (EncoderDecoderGNN)  multiple                  2379776
_________________________________________________________________
sequential_4 (Sequential)    (5, 6400, 5)              807941
=================================================================
Total params: 7,188,399
Trainable params: 7,185,199
Non-trainable params: 3,200
________________________________
Epoch 1/500
 258/3200 [=>............................] - ETA: 19:18 - loss: 92.0839 - charge_loss: 15.1783 - cls_loss: 29.7701 - cos_phi_loss: 14.1853 - energy_loss: 26.3809 - eta_loss: 51.6222 - pt_loss: 3.7117 - sin_phi_loss: 11.3279 - cls_acc_unweighted: 0.7688

and so does

CUDA_VISIBLE_DEVICES=5,6 singularity exec --nv /home/software/singularity/base.simg:latest python3 mlpf/launcher.py --model-spec parameters/cms-gnn-skipconn-v2.yaml --action train
...
Epoch 1/500
  31/8000 [..............................] - ETA: 1:03:18 - loss: 124.7493 - charge_loss: 16.4039 - cls_loss: 42.0418 - cos_phi_loss: 17.1479 - energy_loss: 32.1395 - eta_loss: 130.0834 - pt_loss: 7.7385 - sin_phi_loss: 11.0048 - cls_acc_unweighted: 0.6926

while this doesn't:

CUDA_VISIBLE_DEVICES=5,6,7,8 singularity exec --nv -B /home /home/software/singularity/base.simg:latest python3 mlpf/launcher.py --model-spec parameters/cms-gnn-skipconn-v2.yaml --action train

...

Traceback (most recent call last):
  File "mlpf/launcher.py", line 26, in <module>
    main(args, yaml_path, config)
  File "/home/joosep/particleflow/mlpf/tfmodel/model_setup.py", line 755, in main
    fit_result = model.fit(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 555, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 5 root error(s) found.
  (0) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
         [[replica_2/pf_net/sparse_hashed_nn_distance/map/while/body/_1664/replica_2/pf_net/sparse_hashed_nn_distance/map/while/map/while/cond/_8452/replica_2/pf_net/sparse_hashed_nn_distance/map/while/map/while/Less_1/_823]]
  (1) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
         [[replica_1/pf_net/sparse_hashed_nn_distance/SparseTensor/dense_shape/_864]]
  (2) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
         [[replica_1/pf_net/sparse_hashed_nn_distance/SparseTensor/dense_shape/_863]]
  (3) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
         [[pf_net/gnn_reg/StatefulPartitionedCall/conv_reg1/map/while/body/_3801/conv_reg1/map/while/SparseReshape/_2974]]
  (4) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_335632]

Function call stack:
train_function -> train_function -> train_function -> train_function -> train_function

tensorflow dataset iteration is slow when IO-bound in a single process

Iterating over the dataset (e.g. to count the number of steps) in tensorflow is currently slow when the loop is IO-bound in a single process, because we use
tf.data.Dataset.from_generator, which runs Python underneath and doesn't release the GIL.

See here: https://github.com/jpata/particleflow/blob/58001fa7d850c20b7d50d696478926ab9be8a41f/mlpf/tfmodel/datasets/BaseDatasetFactory.py#L79C1-L84

It might require some changes upstream in tfds to support ArrayRecordDataSource.as_dataset: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/file_adapters.py#L188
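
For reference, a minimal sketch of the random-access alternative (dataset name and path are placeholders); as_data_source is backed by ArrayRecord and avoids the GIL-bound generator:

import tensorflow_datasets as tfds

builder = tfds.builder("cms_pf_ttbar", data_dir="/path/to/tensorflow_datasets")
ds = builder.as_data_source(split="train")
# random access by index, no tf.data pipeline involved
print(len(ds), ds[0].keys())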

add a loss term wrt. genjet clustering

We want to improve the jet and MET resolution of MLPF, but these quantities cannot be written directly into the loss function.
JD suggested that we could cluster the gen-particles into jets and save the jet clustering index for each particle.

Since we have a one-to-one matching between genparticles, input elements, and predicted particles, we can add a loss term that compares jets built from the predicted particles (grouped by the genparticle clustering index) to jets built from the genparticles.

For the TF model, this would mean running fastjet at the tfds stage, saving the jet clustering index for each particle, propagating it to the loss function, and adding the extra term, as sketched below.
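
A toy sketch of the proposed term in pytorch (names and shapes are hypothetical): sum the predicted and gen particle four-momenta within each gen-jet index, then compare the resulting jets.

import torch

jet_idx = torch.tensor([0, 0, 1, 1, 1])          # per-particle gen-jet clustering index
pred_p4 = torch.randn(5, 4, requires_grad=True)  # predicted particle four-momenta
gen_p4 = torch.randn(5, 4)                       # genparticle four-momenta
njets = int(jet_idx.max()) + 1
# sum the four-momenta of the constituents of each jet
jets_pred = torch.zeros(njets, 4).index_add(0, jet_idx, pred_p4)
jets_gen = torch.zeros(njets, 4).index_add(0, jet_idx, gen_p4)
loss_jet = torch.nn.functional.huber_loss(jets_pred, jets_gen)
loss_jet.backward()  # gradients flow back to the predicted particles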

Quantized graph networks

We need the GNN to be fast - quantization at inference time would be useful.

Here are some references:

  • Degree-Quant: Quantization-Aware Training for Graph Neural Networks, ICML2020

@aaditep has been working on this.
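
Degree-Quant is quantization-aware training; as a first check, one could measure the accuracy impact of plain post-training dynamic quantization in pytorch (toy model shown, not our GNN):

import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
# converts Linear weights to int8 while keeping fp32 activations
qmodel = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
out = qmodel(torch.randn(2, 16))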

switch to tfds data loading for the pytorch backend

PyTorch and tfds now support loading data from ArrayRecordDataSource, so we can use the same efficient ML input data for both the pytorch and tensorflow backends.

Here's some example code for loading the dataset in pytorch.

#!/usr/bin/env python
# coding: utf-8

import os
import numpy as np

import random
import itertools
import tqdm

import tensorflow_datasets as tfds
import torch
import torch.nn as nn
from torch import Tensor

import ray
from ray.air import session, Checkpoint
from ray import train
from ray.train.torch import TorchTrainer, TorchConfig
from ray.air.config import ScalingConfig
from ray.air.config import RunConfig
from ray.air.config import CheckpointConfig
import ray.data



def collate_padded_batch(inputs):
    num_samples_in_batch = len(inputs)
    elem_keys = list(inputs[0].keys())
    ret = {}
    for elem_key in elem_keys:
        batch = [Tensor(i[elem_key]) for i in inputs]
        max_seq_len = max([x.shape[0] for x in batch])
        padded_batch_data = [nn.functional.pad(x, (0, 0, 0, max_seq_len - x.shape[0])) for x in batch]
        ret[elem_key] = torch.stack(padded_batch_data, dim=0)
    ret["mask"] = ret["X"][:, :, 0] == 0
    return ret

def my_getitem(self, vals):
    records = self.data_source.__getitems__(vals)
    return [self.dataset_info.features.deserialize_example_np(record, decoders=self.decoders) for record in records]

class Dataset:
    def __init__(self, name="clic_edm_ttbar_pf:1.5.0", split="train"):
        builder = tfds.builder(name, data_dir="/home/joosep/tensorflow_datasets/")

        self.ds = builder.as_data_source(split=split)

        #to prevent a warning from tfds about accessing sequences of indices
        self.ds.__class__.__getitems__ = my_getitem

    def get_sampler(self):
        sampler = torch.utils.data.SequentialSampler(self.ds)
        return sampler

    def get_loader(self, batch_size=20, num_workers=0, prefetch_factor=None):
        return torch.utils.data.DataLoader(
            self.ds,
            batch_size=batch_size,
            collate_fn=collate_padded_batch,
            sampler=self.get_sampler(),
            num_workers=num_workers,
            prefetch_factor=prefetch_factor
        )

    def __len__(self):
        return len(self.ds)

    def __repr__(self):
        return self.ds.__repr__()

from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset

class TransformerModel(nn.Module):

    def __init__(self,
                 d_in: int,
                 d_out: int,
                 d_model: int,
                 nhead: int,
                 d_hid: int,
                 nlayers: int,
                 dropout: float = 0.5):
        super().__init__()

        self.linear_in = nn.Linear(d_in, d_model)
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout, batch_first=True)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.linear_out = nn.Linear(d_model, d_out)

    def forward(self, src: Tensor, src_key_padding_mask: Tensor) -> Tensor:
        src = self.linear_in(src)
        output = self.transformer_encoder(src, src_key_padding_mask=src_key_padding_mask)
        output = self.linear_out(output)
        return output

class InterleavedIterator(object):
    def __init__(self, data_loaders):
        self.idx = 0
        self.data_loaders_iter = [iter(dl) for dl in data_loaders]
        max_loader_size = max([len(dl) for dl in data_loaders])

        #interleave loaders of different length
        self.loader_ds_indices = []
        for i in range(max_loader_size):
            for iloader, loader in enumerate(data_loaders):
                if i<len(loader):
                    self.loader_ds_indices.append(iloader)

        self.cur_index = 0

    def __iter__(self):
        return self

    def __next__(self):
        iloader = self.loader_ds_indices[self.cur_index]
        self.cur_index += 1
        return next(self.data_loaders_iter[iloader])


# Define your train worker loop
def train_loop_per_worker():
    ds_train = [
        Dataset("clic_edm_ttbar_pf:1.5.0", "train"),
        Dataset("clic_edm_qq_pf:1.5.0", "train"),
        Dataset("clic_edm_ww_fullhad_pf:1.5.0", "train"),
        Dataset("clic_edm_zh_tautau_pf:1.5.0", "train"),
    ]
    ds_test = [
        Dataset("clic_edm_ttbar_pf:1.5.0", "test"),
        Dataset("clic_edm_qq_pf:1.5.0", "test"),
        Dataset("clic_edm_ww_fullhad_pf:1.5.0", "test"),
        Dataset("clic_edm_zh_tautau_pf:1.5.0", "test"),
    ]
    for ds in ds_train:
        print("train_dataset: {}, {}".format(ds, len(ds)))
    for ds in ds_test:
        print("test_dataset: {}, {}".format(ds, len(ds)))

    batch_size = 50
    num_workers = 0
    nepochs = 5
    prefetch_factor = None
    
    train_loaders = [ds.get_loader(batch_size=batch_size, num_workers=num_workers) for ds in ds_train]
    test_loaders = [ds.get_loader(batch_size=batch_size, num_workers=num_workers) for ds in ds_test]
    for dl in train_loaders:
        print("train_loader: {}, {}".format(dl.dataset, len(dl)))
    for dl in test_loaders:
        print("test_loader: {}, {}".format(dl.dataset, len(dl)))

    model = TransformerModel(17, 8, 128, 4, 128, 3, 0.1)
    device = torch.device("cuda")
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # Get dict of last saved checkpoint.
    initial_global_step_idx = None
    initial_model_params = None

    #restore from checkpoint
    if ray.is_initialized():
        checkpoint = session.get_checkpoint()
        if checkpoint:
            print("checkpoint follows")
            data_dict = checkpoint.to_dict()
            initial_global_step_idx = data_dict["global_step_idx"]
            model.load_state_dict(data_dict["model"])
            optimizer.load_state_dict(data_dict["optimizer"])
            print("Checkpoint restored at global_step_idx={}".format(initial_global_step_idx))

    model_parameters = filter(lambda p: p.requires_grad, model.parameters())
    params = sum([np.prod(p.size()) for p in model_parameters])
    print("number of trainable parameters: {}".format(params))
    
    loss_fn = torch.nn.MSELoss()

    if ray.is_initialized():
        print("preparing dataset and model for ray")
        train_loaders = [train.torch.prepare_data_loader(dl) for dl in train_loaders]
        test_loaders = [train.torch.prepare_data_loader(dl) for dl in test_loaders]
        model = train.torch.prepare_model(model)

    for dl in train_loaders:
        print("train_loader: {}, {}".format(dl.dataset, len(dl)))
    for dl in test_loaders:
        print("test_loader: {}, {}".format(dl.dataset, len(dl)))

    data_iterator = InterleavedIterator(train_loaders)
    global_step_idx = 0
    for epoch in range(nepochs):
    
        model.train()
        train_loss_vals = []

        steps_per_epoch = sum([len(loader) for loader in train_loaders])

        #skip epoch if it was already trained on
        if initial_global_step_idx and global_step_idx + steps_per_epoch < initial_global_step_idx:
            global_step_idx += steps_per_epoch
            print("skipping epoch {}".format(epoch))
            continue

        for data in data_iterator:
            #skip batch if it was already trained on
            if initial_global_step_idx and global_step_idx < initial_global_step_idx:
                global_step_idx += 1
                continue

            optimizer.zero_grad()
            output = model(data["X"].to(device), data["mask"].to(device))
            loss = loss_fn(output, data["ygen"].to(device))
            loss.backward()
            optimizer.step()
            train_loss_vals.append(loss.item())

            if global_step_idx>0 and global_step_idx%1000 == 0:
                print("checkpoint at global_step={}/{}".format(
                    global_step_idx, nepochs*steps_per_epoch
                ))

                #save checkpoint
                if ray.is_initialized():
                    checkpoint = Checkpoint.from_dict(
                        dict(
                            global_step_idx=global_step_idx,
                            model=model.state_dict(),
                            optimizer=optimizer.state_dict()
                        )
                    )
                    session.report(dict(loss=loss.item()), checkpoint=checkpoint)

            global_step_idx += 1
    
        # model.eval()
        # test_loss_vals = []
        # for test_loader in test_loaders:
        #     for data in test_loader:
        #         output = model(data["X"].to(device), data["mask"].to(device))
        #         loss = loss_fn(output, data["ygen"].to(device))
        #         test_loss_vals.append(loss.item())
    

scaling_config = ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.5})
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=1), verbose=2)

# trainer = TorchTrainer.restore("/home/joosep/ray_results/TorchTrainer_2023-08-25_13-56-05")
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    run_config=run_config,
    torch_config=TorchConfig(backend="gloo")
)
result = trainer.fit()
print(result)


# train_loop_per_worker()

cc @farakiko

PF ground truth definition

Currently, we define the ground truth based on the existing PFAlgo, i.e. the full PF algorithm translates the set of PF elements into the set of PF candidates on an event-by-event basis.

Ultimately, the ground truth should be defined through generator particles suitably matched to PF elements. We need to understand where HGCAL stands with this.

CMS ML hackathon task: gun sample training

Get a tagged version of the code:

git checkout master
git pull
git submodule init
git submodule update

Download the training data:

rsync -r --progress lxplus.cern.ch:/eos/user/j/jpata/mlpf/cms/tensorflow_datasets ~/

Run the training of the full model using just the ttbar sample:

CUDA_VISIBLE_DEVICES=... python3 mlpf/pipeline.py train -c parameters/cms.yaml

Copy cms.yaml to cms-withgun.yaml and change as follows:

training_datasets:
  - cms_pf_ttbar
  - cms_pf_single_pi
  - cms_pf_single_electron

testing_datasets:
  - cms_pf_ttbar
  - cms_pf_single_pi
  - cms_pf_single_electron

Train again:

CUDA_VISIBLE_DEVICES=... python3 mlpf/pipeline.py train -c parameters/cms-withgun.yaml

Note that the training is set currently for 1000 epochs, which you may want to abort early.

Look at the validation plots in experiments/cms*/history/epoch_N/..., especially the energy correlation plots under cls_2/energy_cls2*.

implement a simple baseline

Given a list of tracks and clusters in the event, create a list of PFCandidates using a simple iterative algorithm. It could be something like:

  1. look for all HFEM, HFHAD, create candidate, remove from event
  2. look for all nearby Track(inner)-ECAL-track(outer)-HCAL, create candidate, remove from event
  3. look for all nearby track(inner)-ECAL pairs, remove from event
  4. look for all track(outer)-HCAL, remove from event
  5. create candidates from remaining track-only

Define and measure the accuracy of this simple algorithm with respect to the original PFCandidates; a toy sketch of such an algorithm follows.
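
A toy sketch (all names and thresholds hypothetical; steps 2-4 are collapsed into one nearest-neighbour linking pass):

import math
from collections import namedtuple

Element = namedtuple("Element", ["typ", "eta", "phi", "energy"])
Candidate = namedtuple("Candidate", ["kind", "eta", "phi", "energy"])

def deltar(a, b):
    dphi = math.remainder(a.phi - b.phi, 2 * math.pi)
    return math.hypot(a.eta - b.eta, dphi)

def simple_baseline(elements, dr_max=0.2):
    remaining = list(elements)
    candidates = []
    # 1. HFEM/HFHAD deposits become candidates directly
    for e in [e for e in remaining if e.typ in ("HFEM", "HFHAD")]:
        candidates.append(Candidate("HF", e.eta, e.phi, e.energy))
        remaining.remove(e)
    # 2.-4. link each track to its nearest ECAL and HCAL cluster within dr_max
    for t in [e for e in remaining if e.typ == "TRK"]:
        linked = [t]
        for ctyp in ("ECAL", "HCAL"):
            clusters = [e for e in remaining if e.typ == ctyp]
            if clusters:
                best = min(clusters, key=lambda c: deltar(t, c))
                if deltar(t, best) < dr_max:
                    linked.append(best)
        candidates.append(Candidate("chhad", t.eta, t.phi, sum(e.energy for e in linked)))
        for e in linked:
            remaining.remove(e)
    # 5. leftover clusters become neutral candidates
    for e in remaining:
        candidates.append(Candidate("neutral", e.eta, e.phi, e.energy))
    return candidates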

add an ongoing absolute runtime benchmark

Since this question comes up often, we should have a running and well-understood benchmark of the baseline MLPF model on different computing devices and different datasets.
It should be a single script that loads an ONNX model and can easily be rerun on any new device.
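
A minimal sketch of what such a script could look like (model path and input shape are placeholders):

import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("mlpf.onnx", providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0].name
x = np.random.randn(1, 6400, 17).astype(np.float32)
times = []
for _ in range(20):
    t0 = time.perf_counter()
    sess.run(None, {inp: x})
    times.append(time.perf_counter() - t0)
print("median inference time: {:.1f} ms".format(1000 * float(np.median(times))))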

block to candidate regression

Once we have clustered the PF elements (tracks, clusters) into "blocks" based on the truth values in PFCandidate::elementsInBlocks, we have pairs of (block, candidates).
Each block usually consists of a few elements and produces a few candidates, for example:

(TRK, TRK, ECAL) -> (pi, pi)
(TRK, ECAL, HCAL) -> (K)

Therefore, we can run a regression across all blocks from all events to create the reconstructed candidates from each block. I've done a first version of this and it seems promising.

As input, we use all blocks of size <=3, with the element type one-hot encoded and the kinematic vectors standardized.
As output, we have all the candidates (max 3) produced from the elements, with one-hot encoded pdgId and standardized kinematic vectors.

So far, a simple dense net regression is generally able to predict the number of candidates in a block, as well as the first candidate's momentum (up to a linear transformation?).
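
For illustration, a sketch of the described input encoding (the type vocabulary and feature count are assumptions): up to 3 elements per block, one-hot type plus standardized kinematics, zero-padded and flattened to a fixed-size vector for the dense net.

import numpy as np

ELEM_TYPES = ["TRK", "ECAL", "HCAL"]  # hypothetical type vocabulary
N_KIN = 4  # standardized kinematic features per element

def encode_block(block, max_elems=3):
    vec = np.zeros((max_elems, len(ELEM_TYPES) + N_KIN), dtype=np.float32)
    for i, (typ, kin) in enumerate(block[:max_elems]):
        vec[i, ELEM_TYPES.index(typ)] = 1.0
        vec[i, len(ELEM_TYPES):] = kin
    return vec.reshape(-1)  # fixed-size input regardless of block size

x = encode_block([("TRK", np.ones(N_KIN)), ("ECAL", np.zeros(N_KIN))])  # shape (21,)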

PFCandidate <-> element many-to-many association

Here's an attempt to put private discussions on ML-PF in public.

For ML-PF studies, an idea was to exploit the fact that in the standard PF algo, PFBlock -> elements -> candidates, such that a small set of elements is expected to produce a single PF candidate. Looking at the output of the algo via PFCandidate::elementsInBlocks, I generally find that the association can be many-to-many, in the sense that one element may be associated with multiple candidates, and one candidate with multiple elements.

Just as an example, here is a partial event from the debugging ntuplizer on a RelValTTbar_13 PU25ns_110X_upgrade2018_realistic_v3 file:

cmsRun test/step3.py
python test/ntuplizer.py ./data/ step3_AOD.root
root -l ./data/step3_AOD.root
root [2] pftree->Scan("clusters_npfcands:clusters_ipfcand0:clusters_ipfcand1:clusters_ipfcand2:clusters_ipfcand3")
***********************************************************************************
*    Row   * Instance * clusters_ * clusters_ * clusters_ * clusters_ * clusters_ *
***********************************************************************************
*        0 *        0 *         3 *       381 *       827 *       974 *         0 *
*        0 *        1 *         1 *      2125 *         0 *         0 *         0 *
*        0 *        2 *         1 *       814 *         0 *         0 *         0 *
*        0 *        3 *         1 *       449 *         0 *         0 *         0 *
*        0 *        4 *         3 *       365 *       626 *       709 *         0 *
*        0 *        5 *         3 *       625 *       637 *       767 *         0 *
*        0 *        6 *         2 *       120 *       484 *         0 *         0 *
*        0 *        7 *         2 *       965 *      1057 *         0 *         0 *
*        0 *        8 *         6 *       307 *       470 *       823 *       890 *

Will need to look in more detail to understand whether this is really what PFAlgo is doing, or perhaps some artifact.

cc @jmduarte @lgray @vlimant @pierinim, feel free to include others.

re-enable ONNX export in tensorflow

  • compile image with tensorflow 2.14.0, python 3.10 and ONNX libraries
    • currently blocked by array-record not supporting python 3.11, while the tensorflow docker image comes with 3.11
  • test that ONNX export works
  • re-enable the code

We had to disable ONNX export in tensorflow due to this bug: onnx/tensorflow-onnx#2180
here: 279f6ae#diff-bc5644c81dfb5b7d28e0ddbb1272481f0874306c725b6de3a5f821e0671ce6bcR501

This has recently been fixed in onnx/tensorflow-onnx#2225, and new versions have been released.
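
Once the image is rebuilt, the export could be re-tested with a sketch along these lines (toy model and opset are placeholders, not the actual MLPF export code):

import tensorflow as tf
import tf2onnx

model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(17,))])
onnx_model, _ = tf2onnx.convert.from_keras(model, opset=17)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())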

PF element clustering

One possible way to progress is to learn to group PF elements based on their proximity into smallish clusters of elements (miniblocks). We can also use the ground-truth clustering in which elements form disjoint sets based on PFAlgo - ultimately, this is a semi-supervised clustering problem.

The inputs are elements: (nelem, nelem_feat), containing the raw set of all PF elements in the event; the ground truth is element_block_id: (nelem), which associates each element with a disjoint block or cluster. The sparse distance matrix between the elements induces an initial graph on the set.
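
A toy sketch of how the distance matrix induces the blocks (values hypothetical): thresholding pairwise distances gives a sparse adjacency whose connected components are the disjoint element_block_id labels.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

nelem = 5
dist = np.full((nelem, nelem), 99.0)
dist[0, 1] = dist[1, 0] = 0.05  # elements 0 and 1 are close
dist[2, 3] = dist[3, 2] = 0.08  # elements 2 and 3 are close
adj = csr_matrix(dist < 0.1)
nblocks, element_block_id = connected_components(adj, directed=False)
print(nblocks, element_block_id)  # 3 blocks: [0 0 1 1 2]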

In progress in this issue using GNNs: #2

meta-issue: improvements to pytorch backend

  • use tfds dataset for pytorch: #201
  • use Ray for distributing, monitoring and resuming training
  • allow long-running trainings to be resumed in the middle of the epoch
  • implement GNN LSH model in pytorch
  • support fp16 and bf16 mixed-precision training

data preprocessing fails with latest awkward

Picked up by CI:

    eta = awkward.to_numpy(-np.log(tt, where=tt > 0))
  File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/awkward/highlevel.py", line 1428, in __array_ufunc__
    with ak._errors.OperationErrorContext(name, inputs, kwargs):
  File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/awkward/_errors.py", line 67, in __exit__
    self.handle_exception(exception_type, exception_value)
  File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/awkward/_errors.py", line 82, in handle_exception
    raise self.decorate_exception(cls, exception)
RecursionError: maximum recursion depth exceeded in comparison

This error occurred while calling

    numpy.log.__call__(
        <Array [0.994, 2.12, 3, ..., 2.04, 2.75, 2.92] type='111 * float32'>
        where = <Array [True, True, True, ..., True, True] type='111 * bool'>
    )

Trying to fix by 249c028 and 0dc04b7.
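
A sketch of one possible workaround (an assumption on my part, this may differ from the actual commits): apply the masked log on a flat numpy array with an explicit out buffer, instead of passing where through the awkward ufunc dispatch.

import awkward
import numpy as np

tt = awkward.to_numpy(awkward.Array([0.994, 2.12, 3.0]))
eta = -np.log(tt, where=tt > 0, out=np.zeros_like(tt))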

improved loss function

One of the fundamental questions is the computation of the loss function between two sets of particles of different multiplicity.

For example, given the true set, with no natural ordering:

id=211 pt=123.0 eta=... charge=... 
id=130 pt=... ...
id=22 ...

and a predicted set

id=22 pt=29.0 eta=... 
id=22 ...

how do we compute a differentiable loss function? The loss must also be computationally efficient, as we have O(10k) particles in the true and predicted sets.

For the moment, we use the object condensation approach, assigning each true particle to a particular "seed" input element, thus converting the problem to multiclass classification with a "no-particle" class.

Other options to investigate include:

  • optimal transport, see earlier thread #13 (comment)
  • sliced Wasserstein distance (see the sketch after this list)
  • maximum mean discrepancy
  • GAN loss
  • ...
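
As an illustration of one option from the list, a toy sliced Wasserstein distance in pytorch (the feature layout is an assumption); it is differentiable and handles different multiplicities by comparing 1D quantiles of random projections:

import torch

def sliced_wasserstein(a, b, nproj=64, nq=32):
    # a: (na, d), b: (nb, d) particle feature matrices
    proj = torch.randn(a.shape[1], nproj)
    proj = proj / proj.norm(dim=0, keepdim=True)
    q = torch.linspace(0.0, 1.0, nq)
    qa = torch.quantile(a @ proj, q, dim=0)  # (nq, nproj)
    qb = torch.quantile(b @ proj, q, dim=0)
    return ((qa - qb) ** 2).mean()

true_set = torch.randn(10, 5)                     # e.g. (pt, eta, sin phi, cos phi, energy)
pred_set = torch.randn(7, 5, requires_grad=True)  # different multiplicity
loss = sliced_wasserstein(true_set, pred_set)
loss.backward()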
