
ndlar_flow


environment

First, download this code:

git clone https://github.com/larpix/ndlar_flow
cd ndlar_flow

To install proper dependencies, use the provided conda environment file env.yaml:

conda env create -f env.yaml -n <environment name>
conda activate <environment name>

To update an existing environment:

conda env update -f env.yaml -n <environment name>

If MPI is not available, you may use the alternative environment file (env-nompi.yaml) that does not install parallel-HDF5. The module0 flow code is built off of h5flow [https://github.com/peter-madigan/h5flow], so you will also need to install this in order to run any of the workflows described here.

To install:

pip install .

tips for installing at NERSC

Because CORI uses a special build of MPICH MPI, you will need to follow the instructions at [https://docs.nersc.gov/development/languages/python/parallel-python/] to install a parallelized version of h5py before installing h5flow. Note that this environment will then only be usable on the CORI compute nodes, unless you have h5flow>=0.2.1 which provides flags for running without MPI.

If you'd like to set up a debugging environment that works on the CORI login nodes, install using the *-nompi environment files. This will not allow you to take advantage of the parallelism of h5flow, and so isn't recommended for production jobs.

usage

There is no "executable" for module0_flow; instead, it uses h5flow workflow yaml files and one of the entry points to h5flow: python -m h5flow -c ..., run_h5flow.py -c ..., or h5flow -c .... When run, h5flow sets up a loop from the yaml file following the h5flow specification: a loop dataset, a series of "stages" to run on that dataset, and a set of "resources" made available to each stage. h5flow handles the instantiation of python objects, access to the data file, and the workflow sequencing, while module0_flow provides the workflow descriptions (under yamls/module0_flow/workflows/), the module configurations (under yamls/module0_flow/{reco,resources,...}), and the source code for each module in those workflows (under module0_flow/). The intention behind using h5flow and separating the reconstruction and calibration into modules is to allow for:

  1. flexibility - a purely modular workflow has better separation of code, which enables easier collaboration between multiple developers and allows for code reuse by making modules generic and re-usable for different purposes.
  2. portability - the only dependency for reading h5flow files is HDF5, which makes interfacing with module0_flow files easier; module-wise versioning makes developers think about compatibility early and often; and module-wise persistency makes adding new data objects easier, often with no reprocessing of data.
  3. incrementalism - intermediate data objects can be saved and built upon, so development can occur from any intermediate step and data processing can be checkpointed at any stage.

That is the intention at any rate - if you have questions/comments/suggestions for improvement, please feel welcome to open an issue at [https://github.com/peter-madigan/module0_flow/issues] :)

There is a tutorial repo at [https://github.com/peter-madigan/module0_flow_tutorial] that should help you get started, but below are the basic building blocks for module0_flow:

The module0_flow reconstruction chain is broken up into separate workflows for each component of the reconstruction. For charge-only reconstruction, there are two workflows:

  1. charge_event_building.yaml
  2. charge_event_reconstruction.yaml

which are run in sequence. For light-only reconstruction, there are corresponding workflows:

  1. light_event_building_adc64.yaml (and light_event_building_mc.yaml)
  2. light_event_reconstruction.yaml

Finally, to perform the combined reconstruction using information from both sub-systems, first generate the list of associations between the detectors using charge_light_association.yaml. Then run the reconstruction with combined_reconstruction.yaml.

charge event building

To run charge event builder:

mpiexec h5flow -c yamls/module0_flow/workflows/charge/charge_event_building.yaml \
    -i <input file> -o <output file>

This generates the charge/raw_events and charge/packets datasets. The input file is a "datalog"- (a.k.a "packet"-) formatted LArPix data file. This workflow step is the same for data and simulation.

charge event reconstruction

To run charge reconstruction:

mpiexec h5flow -c yamls/module0_flow/workflows/charge/charge_event_reconstruction.yaml \
    -i <input file> -o <output file>

This generates charge/packets_corr_ts, charge/ext_trigs, charge/hits, and charge/events datasets. The input file is a charge event built module0_flow file.

light event building

To run light event builder on data:

mpiexec h5flow -c yamls/module0_flow/workflows/light/light_event_building_adc64.yaml \
    -i <input file> -o <output file>

This generates the light/events and light/wvfm datasets. The input file is a raw ADC64-formatted .data file.

To run light event builder on simulation:

mpiexec h5flow -c yamls/module0_flow/workflows/light/light_event_building_mc.yaml \
    -i <input file> -o <output file>

This generates the same light/events and light/wvfm datasets as for data, but the input file is a larnd-sim HDF5 file.

light event reconstruction

To run light reconstruction:

mpiexec h5flow -c yamls/module0_flow/workflows/light/light_event_reconstruction.yaml \
    -i <input file> -o <output file>

This generates light/t_ns and light/hits datasets. The input file is a light event built module0_flow file. The light event reconstruction also removes raw waveforms from the file.

charge-to-light association

To associate charge events to light events, run:

mpiexec h5flow -c yamls/module0_flow/workflows/charge/charge_light_association.yaml \
    -i <input file> -o <output file>

This creates references between charge/ext_trigs and light/events as well as charge/events and light/events. Both charge and light reconstructed events are expected in the input file, which can be arranged by running both the charge and light reconstruction flows on the same output file or by using the h5copy HDF5 tool:

# copy light data from a source file
h5copy -v -f ref -s light -d light -i <light event file> \
    -o <destination file>

# copy charge data from a source file
h5copy -v -f ref -s charge -d charge -i <charge event file> \
    -o <destination file>

merged event reconstruction

To generate T0s and tracks, run:

mpiexec h5flow -c yamls/module0_flow/workflows/combined/combined_reconstruction.yaml \
    -i <input file> -o <output file>

minimal staging

Running these commands one after the other can be tedious, but as of h5flow version 0.1.8 you can combine them into only two commands:

output_file=<output file>

mpiexec h5flow -c \
    yamls/module0_flow/workflows/light/light_event_building_adc64.yaml \
    yamls/module0_flow/workflows/light/light_event_reconstruction.yaml \
    -i <input light file> \
    -o $output_file

mpiexec h5flow -c \
    yamls/module0_flow/workflows/charge/charge_event_building.yaml \
    yamls/module0_flow/workflows/charge/charge_event_reconstruction.yaml \
    yamls/module0_flow/workflows/charge/charge_light_association.yaml \
    yamls/module0_flow/workflows/combined/combined_reconstruction.yaml \
    -i <input charge file> \
    -o $output_file

file structure and access

Let's walk through an example of how to access and use the HDF5 files containing both light and charge data using two different approaches: the first is more verbose but more flexible, while the second requires minimal code but has some limitations. As an example, we will perform a mock analysis to compare the light system waveform integrals to the larpix charge sum.

Let's start with the first approach: we'll open up the file using h5py:

import h5py
f = h5py.File('<example file>.h5','r')

And list the available datasets using visititems, which calls a given function on every dataset and group within the file. In particular, let's have it print out the name of every dataset:

my_func = lambda name,dset : print(name) if isinstance(dset, h5py.Dataset) \
    else None
f.visititems(my_func)

This will print out quite a number of things, but you'll notice three different types of paths:

  1. paths that end in .../data
  2. paths that end in .../ref
  3. paths that end in .../ref_region

The first type contains the primitive data for that particular object as a 1D structured array. For our example we want to access the charge sum for each event, so first let's check what fields are available in the 'charge/events/data' dataset:

print(f['charge/events/data'].dtype.names)

And then we can access the data by the field name:

charge_qsum = f['charge/events/data']['q']
print(charge_qsum.shape, charge_qsum.dtype)

The second type of path (ending in .../ref) contains bi-directional references between two datasets. In particular, the paths to these datasets are structured like <parent dataset name>/ref/<child dataset name>/ref. Each entry in the .../ref dataset corresponds to a single link between the parent and child datasets:

f['charge/events/ref/light/events/ref'][0]
# returns something like [1, 2]

By convention, the first value corresponds to the index into the charge/events/data dataset and the second value corresponds to the index into the light/events/data dataset. To use them, you can pass these references directly as indices into the corresponding datasets:

ref = f['charge/events/ref/light/events/ref'][0]
# get the first charge event that has a light event associated with it
f['charge/events/data'][ref[0]]
# get the light event associated with the first charge event
f['light/events/data'][ref[1]]

You could loop over these references and load the rows of the dataset in that way, but it would be very slow. Instead, h5flow offers a helper function (dereference) to load references:

from h5flow.data import dereference

# reference dataset you want to use
ref = f['charge/events/ref/light/events/ref']
# data you want to load
dset = f['light/events/data']
# parent indices you want to use (i.e. event id 0)
sel = 0

# this will load *ALL* the references
# and then find the data related to your selection
data = dereference(sel, ref, dset)

# other selections are possible, either slices or iterables
dereference(slice(0,100), ref, dset)
dereference([0,1,2,3,1,0], ref, dset)

Data is loaded as a numpy masked array with shape (len(sel), max_ref). So if there are at most 5 light events associated with any of the 100 charge events we selected before:

print(data.shape, data.dtype) # e.g. (100, 5)

The first dimension corresponds to our charge event selection and the second dimension corresponds to the light event(s) that are associated with a given charge event.
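Since the result is a masked array, the mask tells you which (charge event, light event) slots actually contain an associated light event. As a small hedged example (the 'id' field name is an assumption here; check the dataset's dtype.names for the real fields), you can count the associations per charge event:

import numpy as np

# True wherever a light event is actually associated with the charge event
valid = ~np.ma.getmaskarray(data['id'])
light_per_charge = valid.sum(axis=-1)
print(light_per_charge)  # number of associated light events per charge event
print(np.count_nonzero(light_per_charge == 0))  # charge events with no light match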

We can also load references with the opposite orientation (e.g. light/events -> charge/events), by using the ref_direction argument:

# we use the same reference dataset as before
ref = f['charge/events/ref/light/events/ref']
# but now we load from the charge dataset
dset = f['charge/events/data']
# and the parent indices correspond to positions within the light events
sel = 0 # get charge events associated with the first light event

# to load, we modify the reference direction from (0,1) [default] to (1,0)
# since we want to use the second index of the ref dset as the "parent" and
# the first index as the "child"
data = dereference(sel, ref, dset, ref_direction=(1,0))
print(data.shape, data.dtype)

Loading references can take some time if you have a very large reference dataset (>50k rows). To speed things up, we can use the .../ref_region datasets to find out where in the reference dataset we need to look for each item. In particular, this dataset provides a 'start' and 'stop' index for each item:

# get the bounds for where the first charge event references exist within
# the ref dataset
sel = 0
region = f['charge/events/ref/light/events/ref_region'][sel]

# the first index in ref that is associated with charge event 0
print(region['start'])
# the last index + 1 in ref that is associated with charge event 0
print(region['stop'])

# gets all references that *might* be associated with charge event 0
ref = f['charge/events/ref/light/events/ref'][region['start']:region['stop']]
print(ref)

You can use this dataset with the helper function to load referred data in an efficient way (this is the recommended approach):

sel = 0
ref = f['charge/events/ref/light/events/ref']
dset = f['light/events/data']

region = f['charge/events/ref/light/events/ref_region']

# this will load only necessary references and then find the data related
# to your selection
data = dereference(sel, ref, dset, region=region)

For datasets with a trivial 1:1 relationship (light/events/data and light/wvfm/data in this case), you can directly use the references for one of the datasets for any of the others:

light_events = dereference(sel, ref, f['light/events/data'], region=region)
light_wvfms = dereference(sel, ref, f['light/wvfm/data'], region=region)

Now that we have both the event information and the waveform data, we can compare the charge sum of an event to the integral of the raw waveforms:

import numpy as np
import numpy.ma as ma # use masked arrays

# we'll only look at events 0-1000 since the raw waveforms will use a
# lot of memory
sel = slice(0,1000)

# first get the data
ref = f['charge/events/ref/light/events/ref']
dset = f['light/events/data']
region = f['charge/events/ref/light/events/ref_region']

charge_events = f['charge/events/data'][sel]
light_events = dereference(sel, ref, f['light/events/data'], region=region)
light_wvfms = dereference(sel, ref, f['light/wvfm/data'], region=region)

print('charge_events:',charge_events.shape)
print('light_events:',light_events.shape)
print('light_wvfms:',light_wvfms.shape)

# now apply a channel mask to the waveforms to ignore certain channels
# and waveforms
valid_wvfm = light_events['wvfm_valid'].astype(bool)
# (event index, light event index, adc index, channel index)
print('valid_wvfm',valid_wvfm.shape)
channel_mask = np.zeros_like(valid_wvfm)
sipm_channels = np.array(
    [2,3,4,5,6,7] + [18,19,20,21,22,23] + [34,35,36,37,38,39] + \
    [50,51,52,53,54,55] + \
    [9,10,11,12,13,14] + [25,26,27,28,29,30] + [41,42,43,44,45,46] + \
    [57,58,59,60,61,62]
)
channel_mask[:,:,:,sipm_channels] = True

samples = light_wvfms['samples']
# (event index, light event index, adc index, channel index, sample index)
print('samples:',samples.shape)
# numpy masked arrays use the mask convention: True == invalid
samples.mask = samples.mask | np.expand_dims(~channel_mask,-1) | \
    np.expand_dims(~valid_wvfm,-1)

# now we can subtract the pedestals (using the mean of the first 50 samples)
samples = samples.astype(float) - samples[...,:50].mean(axis=-1, keepdims=True)

# and we can integrate over each of the dimensions:
# axis 4 = integral over waveform, axis 3 = sum over valid channels,
# axis 2 = sum over valid adcs, axis 1 = sum over light events associated
#          to a charge event
light_integrals = samples.sum(axis=4).sum(axis=3).sum(axis=2).sum(axis=1)

# we can either create a mask for only the valid entries (i.e. the charge-
# to-light association exists)
valid_event_mask = ~light_integrals.mask
# or we can zero out the invalid entries (beware: this will update the
# light_integral.mask to indicate that these are now valid entries)
light_integrals[light_integrals.mask] = 0.

And we plot the correlation between the charge and light systems:

import matplotlib.pyplot as plt

plt.ion()
plt.figure()
plt.hist2d(charge_qsum[sel][valid_event_mask], light_integrals[valid_event_mask],
    bins=(1000,1000))
plt.xlabel('Charge sum [mV]')
plt.ylabel('Light integral [ADC]')

h5flow also has the capability of traversing multiple references using the dereference_chain helper function. I will leave it to you to visit the h5flow docs and to play around with this functionality.
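As a starting point, here is a rough sketch of chaining from charge/events through light/events to light/wvfm. The argument names are assumed by analogy with dereference, and the light/events -> light/wvfm references are assumed to exist in the file, so check the h5flow documentation before relying on this:

from h5flow.data import dereference_chain

sel = 0  # first charge event

# chain of reference datasets to traverse, in order
refs = [
    f['charge/events/ref/light/events/ref'],
    f['light/events/ref/light/wvfm/ref'],
]
regions = [
    f['charge/events/ref/light/events/ref_region'],
    f['light/events/ref/light/wvfm/ref_region'],
]

wvfms = dereference_chain(sel, refs, f['light/wvfm/data'], regions=regions)
# expected shape: (len(sel), max light events per charge event, max waveforms per light event)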

Ok, so that's how to access data using the verbose and flexible approach. Now let's do it the quick and easy way.

We'll use an H5FlowDataManager object to help:

from h5flow.data import H5FlowDataManager
dm = H5FlowDataManager('<input file>', 'r', mpi=False)

This object has built-in smart reference traversal via the __getitem__ special method. If one argument is specified, it acts as a pass-through to an underlying h5py.File:

dm['light/events/data'] # get the light events dataset
dm['light/events'] # get the light events group
dm['light/events'].attrs # get light event attributes

But when using multiple arguments, it will load references:

# again let's get the first 1000 charge events
charge_events = dm['charge/events', sel]
# (event index,)

# and now we use the fancy access method
light_events = dm['charge/events','light/events',sel]
# (event index, light event index)

# and we can also get the waveforms, but only if the light/events -> light/wvfm references exist
light_wvfm = dm['charge/events','light/events','light/wvfm',sel]
# (event index, light event index, light waveform index)

That's certainly much cleaner! But in this case, you are limited to traversing references that are explicitly defined, so references can't do double duty for multiple datasets. You also can't load the reference indices by themselves. So this approach might not be suited to every situation.

There is also a plotting script at scripts/map_file.py which will generate a map of all of the references included in the file. You will need networkx installed in order to run this. Run with:

python scripts/map_file.py <file>

And that concludes the intro to data access!

For more details on what different fields in the datatypes mean, look at the module-specific documentation. For more details on how to use the dereferencing schema, look at the h5flow documentation [https://h5flow.readthedocs.io/en/latest/].

Contributors

peter-madigan, diaza, krwood, liviocali, mjkramer, yifanc, alexbooth92, edhinkle, cuddandr, awh1t3, jaafar-chakrani, ronandoherty1, seg188, kvtsang, silentkartographer


ndlar_flow's Issues

Hit merging

One parameter we can potentially tune is the event chunk size here, because the hits are padded to the maximum number of prompt hits of any event in the chunk. I think it would be helpful to have a smaller chunk size.
My naive understanding is that these three parameters are not independent.
I think the minimum time between hits is ~15 ticks (please correct me), so there won't be more than 5 merge steps, and max_contrib_segments can be 5 x 10 (where 5 is the merge steps and 10 is a placeholder for the number of stored segments per prompt hit).

proto_nd_flow mangled truth information

The truth information in proto_nd_flow has never been set up properly. Some care should be taken to implement and validate all of the backtracking capabilities we want at this stage.

Various MiniRun5 issues with trajectory assignments and final hit associated segments

I'm currently using the flow data manager to try and look at prompt hit segments and trajectory segments (and final hit associated segments/tracks). I have run into a few different issues. I'm not sure if they're related to one another, but figured I'd mention all of them at once (I didn't see any of these brought up as issues elsewhere).

Just as setup, here are the different variables that I am plotting throughout the following examples:

f_name = 'C:/Users/AJN/Documents/DUNE/sample_files/MiniRun5_1E19_RHC.flow.0000001.FLOW.hdf5'
f_manager = h5flow.data.H5FlowDataManager(f_name, 'r')
PromptHits_ev = f_manager["charge/events", "charge/calib_prompt_hits", i_evt]
Segs_PromptHits = f_manager["charge/events", "charge/calib_prompt_hits", "charge/packets", "mc_truth/segments", i_evt]
Segs_FinalHits = f_manager["charge/events", "charge/calib_final_hits", "charge/calib_prompt_hits", "charge/packets", "mc_truth/segments", i_evt]
Trajectories_PromptHits = f_manager["charge/events","charge/calib_prompt_hits", "charge/packets",  "mc_truth/segments", "mc_truth/trajectories", i_evt]
Trajectories_FinalHits = f_manager["charge/events",  "charge/calib_final_hits", "charge/calib_prompt_hits", "charge/packets",  "mc_truth/segments", "mc_truth/trajectories", i_evt]

First: it seems that, with Segs_FinalHits, which I am plotting in the exact same way as the Flow tutorial from today (and Segs_PromptHits is being plotted correctly with the exact same code), although one end of each segment is accurately being plotted with the prompt hits (which align with the final hits), the other end is always being plotted towards a point in the center.

Secondly: it seems that Trajectories_PromptHits and Trajectories_FinalHits both have the wrong units on them somewhere.

Even if I divide all x/y/z coordinates of the trajectories by 100 to display them a bit better, they still don't seem to match up with the segments.

Add light detector geometry information

In order to extract some useful quantities like the visible energy, it is necessary to include some information about the light detector geometry. A possible approach would be:

  • create a yaml file format similar to the pixel layout, containing adc # and channel # mapped to detector #, and detector # mapped to positions and geometries (a rough, hypothetical sketch of such a mapping is shown below)
  • add the necessary methods to resources/geometry.py to read in and interpret the geometry file description
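As a purely illustrative sketch (the field names and structure below are hypothetical, not an agreed schema), the mapping such a file could encode might look like:

# hypothetical structure: (adc #, channel #) -> detector #, and detector # -> position/shape
light_geometry = {
    'adc_channel_to_detector': {
        (0, 2): 0,   # (adc index, channel index) -> detector number
        (0, 3): 0,
        (1, 2): 1,
    },
    'detector_geometry': {
        0: dict(center_xyz=[0.0, 0.0, 0.0], shape='rect', size_xy=[10.0, 30.0]),
        1: dict(center_xyz=[0.0, 0.0, 30.0], shape='rect', size_xy=[10.0, 30.0]),
    },
}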

Add option to load charge and light geometry info separately

Right now the workflow will run charge and light in parallel, which means the charge and light geometry information is loaded twice (or even three times, if charge-light matching requires it too), and some information is loaded but not needed.

Clock Synchronization

Revisit timestamp calibration to synchronize the various clocks we have in the various readout electronics hardware

  • LRO and each PACMAN controller have their own dedicated clocks that tick at slightly different frequencies
  • This was carefully calibrated in module0 data, but never revisited for the module1-3 data

In addition to extracting the calibration constraints from mod1-3 data, we need to automate the way these are extracted and applied in the processing routines.

Improve the end-to-end flow

Include or link the code that extracts the configuration or dependency files in the flow, with clear documentation.

Hugely inflated converted light files for single module runs

I have noticed when converting light data files from module1 that the converted files are much larger than their original .data files. For example, the original .data files for timestamp 20220208_124818 are 15.2 + 15.2 = 30.4 GB, while the data converted directly with the adc64 formatter tool is 6.6 + 6.7 = 13.3 GB. Using ndlar_flow's version of the converter, the new file is 80.8 GB. Then when I try to run the light event building + reconstruction, the new file ends up being ~150 GB. I haven't checked with the other modules, but I suspect the issue will exist with them too (should be checked). I don't think the issue exists for 2x2: I looked at a file from the first ramp up (mpd_run_hvramp_rctl_079_p0.data), and the .data file is 4.5 GB while the converted file is 4.4 GB. So the issue seems to be isolated to the adc64 formatter in flow, not the mpd version for 2x2.

Event t0

When we use SymmetricWindowRawEventBuilder, if there's no external trigger, the time of the first arriving hit is taken as the event t0, which can be horrendously wrong.

  1. Shall we explicitly label events without external triggers downstream from the flow stage?
  2. It might not be a good idea, but for cathode-crossing events, shall we try stitching the hits at the cathode and consider that as the t0?

Time corrector values need to be updated

A comparison to the light system PPS timestamps gives that the parameters should be:

  • io_group=1: p0=9.49ticks, p1=3.83e-6
  • io_group=2: p0=9.42ticks, p1=1.10e-6

And the correction function needs to take into account the overall offset:

ts_corr = p0 + p1 * ts_raw

where the units are all in ticks.
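A minimal sketch of encoding these numbers (how the correction term is ultimately combined with the raw timestamp in the flow module is not specified here and should be confirmed against the timestamp corrector code):

# proposed parameters, keyed by io_group; p0 in ticks, p1 dimensionless
params = {
    1: (9.49, 3.83e-6),
    2: (9.42, 1.10e-6),
}

def timestamp_correction(ts_raw, io_group):
    # correction term (in ticks) for a raw timestamp (also in ticks)
    p0, p1 = params[io_group]
    return p0 + p1 * ts_raw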

Port Over / Adapt / Accommodate Low Energy Reconstruction

It has been suggested to adapt or port my reconstruction code for low-energy-event charge-to-light matching over to flow. This would make reconstruction more efficient, so that when we run flow for 2x2 (and other ND-LAr prototypes) we can also reconstruct the low energy events as well as other interesting events. We should consider how this would best be done. I think there are components of the reconstruction, like some of the charge reconstruction and the charge-light matching, that could just use flow's built-in modules. The important aspect would be to ensure we can do clustering of the hits and that we retain the small clusters. Many of the events would constitute a single 1-3 hit cluster in a drift window, so we need to make sure that the event building we use in flow does not inadvertently get rid of low energy events.

https://github.com/DUNE/ndlar_39Ar_reco

Manage the printout

Set verbose option for the printouts instead of the default in the current workflow.

Provide support for full ndlar geometries

There is interest in leveraging the work done for the 2x2 simulation+calibration+reconstruction chain for ND-LAr. At the moment a [minor] missing piece to having that in hand is at the "flow" stage. It should be easy to create modules and configurations for an ND-LAr geometry by copying over the proto_nd_flow (= 2x2) code; however, this would also be a good chance to address Issue #58 as well.

Inconsistencies in proto_nd_flow Geometry Resource Source Code Labeling

The proto_nd_flow geometry resource source code is located in ndlar_flow/src/proto_nd_flow/resources/geometry.py. Inconsistencies include:

  • The create_regions method still assumes that the drift direction is z. This also affects the regions property, which saves the active TPC volume from the geometry input file. Additionally, this affects the in_fid method, which checks whether or not points are in the detector fiducial volume given the active TPC volume and configurable fiducial volume cuts. From my initial checks, the in_fid method only comes up in analysis source code and in the electron_lifetime tool, neither of which are fully integrated into proto_nd_flow at the moment.

  • Throughout the file, methods are labeled with the old coordinate convention. For instance, anode_z is a lookup table for the anode “z” coordinate, which is really the anode x coordinate in the new geometry. Other examples include the pixel_xy and get_z_coordinate methods. As Kevin noted in the meeting, these methods are used correctly in alignment with the new geometry in files such as ndlar_flow/src/proto_nd_flow/reco/charge/calib_prompt_hits.py, but this could cause future confusion (i.e. see here where the x coordinate in the output dataset is assigned from the z array).

  • Also throughout the file, distance units are noted to be mm instead of cm. This can be seen, for example, in the assignment of the pixel_pitch property. Hit datasets currently use units of cm, and I can see how this might cause future confusion if someone goes back to the geometry source code to check units. There seems to be confusion as to whether cm or mm are the correct units to use within flow, so this question must also be discussed and resolved.

Issues listed are limited to those related to charge methods/properties. Furthermore, this isn’t necessarily a comprehensive list of inconsistencies within the charge-related methods/properties associated with the Geometry resource.

Bug in charge raw event builder

Plotting the event timestamp vs event id shows a regular, recurring pattern of timestamps (with repetition 2 on a 2-core machine). This suggests that the slicing of the packets dataset or the saving of the events objects is not being handled properly.
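A hedged sketch of the diagnostic described above (the 'id' and 'ts_start' field names are assumptions; check f['charge/events/data'].dtype.names for the actual fields):

import h5py
import matplotlib.pyplot as plt

with h5py.File('<event built file>.h5', 'r') as f:
    events = f['charge/events/data']
    # a regular, recurring pattern in this scatter would indicate duplicated or
    # mis-sliced events from the raw event builder
    plt.plot(events['id'], events['ts_start'], '.', markersize=1)
plt.xlabel('event id')
plt.ylabel('event timestamp')
plt.show()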

proto_nd_flow using too much memory when running charge reconstruction

I have been processing some samples made with essentially a particle gun through proto_nd_flow. It is a sample of 1000 events with 1-5GeV vertically oriented MIP muons in 2x2. During the charge_event_building stage, during the run on charge/raw_events, the memory usage is between about 75 GB and 150 GB -- obviously this is way too large. It does take a while to run this step, but I'm not sure if this is the expected amount of time or if it is related to the memory issue. I am running on Perlmutter using the develop branch of ndlar_flow. The file I'm using is at /global/cfs/cdirs/dune/users/sfogarty/verticalMuons_1000_events_2x2_1GeV_to_5GeV_larndsim_newLightNoise.h5 if someone wants to try to see if they can replicate the issue.

The flow output I get is:

WARNING:root:Running without mpi4py because No module named 'mpi4py'
~~~ H5FLOW ~~~
output file: /global/cfs/cdirs/dune/users/sfogarty//verticalMuons_1000_events_2x2_1GeV_to_5GeV_larndsim_newLightNoise_flow.h5
input file: /global/cfs/cdirs/dune/users/sfogarty//verticalMuons_1000_events_2x2_1GeV_to_5GeV_larndsim_newLightNoise.h5
~~~~~~~~~~~~~~
~~~ WORKFLOW (1/5) ~~~
yamls/proto_nd_flow/workflows/charge/charge_event_building.yaml
~~~~~~~~~~~~~~~~

~~~ INIT ~~~
Hello
create RunData() /global/cfs/cdirs/dune/users/sfogarty//verticalMuons_1000_events_2x2_1GeV_to_5GeV_larndsim_newLightNoise.h5
create RawEventGenerator(charge/raw_events) /global/cfs/cdirs/dune/users/sfogarty//verticalMuons_1000_events_2x2_1GeV_to_5GeV_larndsim_newLightNoise.h5
RunData.init(charge/raw_events)
WARNING:root:Source dataset charge/raw_events has no inputfile in metadata stored under 'input_filename', using {self.input_filename} for RunData lookup
WARNING:root:Could not find row matching /global/cfs/cdirs/dune/users/sfogarty//verticalMuons_1000_events_2x2_1GeV_to_5GeV_larndsim_newLightNoise.h5 in data/proto_nd_flow/runlist-2x2-mcexample.txt
RawEventGenerator.init()
generating truth references: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 999/999 [00:08<00:00, 112.76it/s]
~~~~~~~~~~~~

~~~ RUN ~~~
Run loop on charge/raw_events:
  
  0%|                                                                                                                                        | 0/2 [00:00<?, ?it/s]

The macro used to make the sample is the following:

/edep/random/timeRandomSeed
/edep/gdml/read Merged2x2MINERvA_v3_withRock.gdml
/edep/phys/ionizationModel 0
/edep/hitSeparation volLArActive -1 mm
/edep/update
/gps/pos/type Volume
/gps/pos/shape Para
/gps/pos/centre 0.0 -100.0 1300 cm
/gps/pos/halfx 64.365 cm
/gps/pos/halfy 1.0 cm
/gps/pos/halfz 64.750 cm
#/gps/ang/type iso
/gps/direction 0 -1 0
/gps/particle mu-
/gps/ene/type User
/gps/hist/type energy
/gps/hist/point 1000 0.2
/gps/hist/point 2000 0.2
/gps/hist/point 3000 0.2
/gps/hist/point 4000 0.2
/gps/hist/point 5000 0.2

Here is the script I use to run ndlar_flow:


#!/usr/bin/env bash

module load python

source flow.venv/bin/activate

inDir=/global/cfs/cdirs/dune/users/sfogarty/
outDir=/global/cfs/cdirs/dune/users/sfogarty/

inFile=$inDir/verticalMuons_1000_events_2x2_1GeV_to_5GeV_larndsim_newLightNoise.h5
outFile=$outDir/verticalMuons_1000_events_2x2_1GeV_to_5GeV_larndsim_newLightNoise_flow.h5

# charge workflows
workflow1='yamls/proto_nd_flow/workflows/charge/charge_event_building.yaml'
workflow2='yamls/proto_nd_flow/workflows/charge/charge_event_reconstruction.yaml'
workflow3='yamls/proto_nd_flow/workflows/combined/combined_reconstruction.yaml'
workflow4='yamls/proto_nd_flow/workflows/charge/prompt_calibration.yaml'
workflow5='yamls/proto_nd_flow/workflows/charge/final_calibration.yaml'

# light workflows
workflow6='yamls/proto_nd_flow/workflows/light/light_event_building_mc.yaml'
workflow7='yamls/proto_nd_flow/workflows/light/light_event_reconstruction.yaml'

cd ndlar_flow
pip install .

h5flow -c $workflow1 $workflow2 $workflow3 $workflow4 $workflow5 \
    -i $inFile -o $outFile
h5flow -c $workflow6 $workflow7 \
    -i $inFile -o $outFile

Hit merging: second bumps in Q and E distributions

Currently the hit merger (https://github.com/DUNE/ndlar_flow/blob/develop/src/proto_nd_flow/reco/charge/calib_hit_merger.py) yields a second bump in the Q and E distributions towards higher energy deposition in the calib_final_hits. This shoulder feature appears in both the data files and the simulation files and seems to be unphysical (https://indico.fnal.gov/event/63652/contributions/286004/attachments/175752/238516/20240227_2x2_M1_reflow_checkup.pdf).
I think the current hit merger (5febf1c) only uses the time differences of packets on the same pixel. Perhaps we can also judge based on charge whether they should be merged or not. Extending to the truth backtracking/labeling, perhaps in the simulation we could label "induction" and "collection" charge. If necessary, we can investigate merging across neighbouring pixels as well.

External trigger event builder

Create an external trigger-based event builder that uses external trigger packets to identify charge that should be collected into events.

"crs_geometry_file" to support previous configuration

Perhaps we should revert the name of crs_geometry_files to crs_geometry_file, and allow a single string entry in addition to a list, so we can continue to use the same configuration for single-module setups.

crs_geometry_files: ['data/proto_nd_flow/multi_tile_layout-2.4.16.yaml',
'data/proto_nd_flow/multi_tile_layout-2.5.16.yaml']
crs_geometry_to_module: [0,0,1,0]

crs_geometry_files=self.crs_geometry_files,
crs_geometry_to_module=self.crs_geometry_to_module,
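A minimal sketch (not the actual ndlar_flow implementation) of how the resource could accept both the old single-string value and the new list-valued one:

def normalize_crs_geometry_files(value):
    # accept either a single path (old crs_geometry_file style) or a list of
    # paths (new crs_geometry_files style) and always return a list
    if isinstance(value, str):
        return [value]
    return list(value)

# both of these would then be valid configurations
normalize_crs_geometry_files('data/proto_nd_flow/multi_tile_layout-2.4.16.yaml')
normalize_crs_geometry_files(['data/proto_nd_flow/multi_tile_layout-2.4.16.yaml',
                              'data/proto_nd_flow/multi_tile_layout-2.5.16.yaml'])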

Vertices information not propagated in the flow

  1. Do we want to have vertex time information? Currently it's not propagated in ndlar_flow. I imagine that if we care about different interactions in the same spill/event, this information may be useful.
  2. I have disabled recording of the vertex information if it's not from a neutrino generator, because the interaction information is inherited from mc_hdr. Perhaps I should copy the vertices dataset to the flow output even for non-neutrino simulation.

Fix geometry_info dataset in output

The objects in the geometry_info dataset in the output file are LUT objects as defined in proto_nd_flow.util.lut. At the moment only the data array, and not the metadata, is written to file. Without the metadata it is impossible to load the LUT in an external script.

proto_nd_flow documentation

We need to setup some documentation that clearly describes the content of the various datasets produced by proto_nd_flow. Event reconstruction packages and other consumers of proto_nd_flow's output should have a clear place to determine exactly what is contained inside the datasets. Setting up a readthedocs to automatically generate and update this documentation would be ideal, to keep the code and the documentation in sync.

Reduce redundant code by making common areas

ndlar_flow has different subdirectories inside of ndlar_flow/src and ndlar_flow/yamls that specify whether we are running on 2x2 or a single module. In the future, there will be even more geometries and associated workflows, e.g. for FSD, ND-LAr, etc. It would be good to have a common area, e.g. ndlar_flow/{src,yamls}/common, for the code that can be used in a way that is agnostic to this choice.

remove hardcoded geometry assumptions from geometry resource

We should have a common geometry resource module (living somewhere like ndlar_flow/src/common/resource/geometry.py) that is completely void of any hardcoded assumptions about the geometry, and takes everything in from the appropriate yaml. For example, there are hardcoded assumptions about number of tiles per module and number of modules per detector here: https://github.com/DUNE/ndlar_flow/blob/develop/src/proto_nd_flow/resources/geometry.py#L730C69-L734

Event building improvement

  1. Move it to proto_nd_flow.
  2. Configure the readout window / drift window to be v_drift (e_field) dependent.
  3. If it is a beam event, set the t0 to the middle of the beam window.
  4. Remove nhit_cut for external triggers (I think setting nhit_cut to 1 is clean enough to achieve the goal).

Documentation updates needed

  • Stale geometry_file field in charge_event_reconstruction.yaml:hit_builder:params
  • Add NERSC links to download external data files
  • Remove requirement of combined/t0 from combined_reconstruction.yaml:tracklet_reco
  • Add docstring to geometry.py:Geometry
  • Add docstring to disabled_channels.py:DisabledChannels

Propagate new MC truth information

The latest version of larnd-sim includes both the charge fraction for true tracks and trajectory information. This needs to be carried along to the output file.

Use a different YAML parser

This is really an h5flow issue but I can't seem to create GitHub issues for h5flow. Anyway. Quoting @diaza:

Hey guys, I think I caught a small bug in the gain inputs. When I was using these gain values for the reflow: [...]
I kept getting cwfm outputs with channels 8 and 9 at zero. I think it's because the leading 0s for channels 4-9 are making python read them as octals, and 08 and 09 are undefined as octals


PyYAML only supports YAML 1.1 instead of the slightly-less-insane YAML 1.2, where a leading zero doesn't mean "I'm octal!" Who knows what other silent bugs are being caused by such features of the YAML spec.

ruamel.yaml is probably a good alternative: https://stackoverflow.com/questions/36514166/yaml-octal-integer-generate-errors
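A quick way to see the difference (assuming PyYAML and ruamel.yaml are installed; exact behaviour may vary between versions):

import yaml                   # PyYAML: YAML 1.1 resolvers
from ruamel.yaml import YAML  # ruamel.yaml: YAML 1.2 by default

doc = "gains: [010, 08, 8]"

# PyYAML treats 010 as octal (-> 8) and silently falls back to the string '08' for 08
print(yaml.safe_load(doc))
# ruamel.yaml in YAML 1.2 mode reads both as plain decimals (10 and 8)
print(YAML(typ='safe', pure=True).load(doc))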

There's even https://github.com/meggiman/ruamel.yaml-include to replace pyyaml-include

Add workflow tests

Adding a workflow to process an example set of small data files would help ensure that the full workflow runs. To do this, I would need a small charge data file and a light data file that are associated with each other.

Remove outdated workflow

There is a light workflow file (reco/light/light_event_reconstruction.yaml) that is out of date and should be deleted.
