In this project I adapt the ACT cuts pipeline code (moby2) to make it easier to integrate with a machine learning pipeline. In particular, the existing cuts pipeline is broken into pieces, which lets us quickly build more sophisticated pipelines on top of what already exists. The output of the pipeline is typically a standalone file that summarizes all the statistics of interest, possibly together with a piece of the actual TOD or Fourier-transformed TOD data. This file can be supplied as input to a machine learning pipeline such as mlpipe. Note that the code is meant to run on the Feynman cluster.
The code here depends on the todloop project, which is essentially a large loop that iterates through a list of TODs. To install todloop:
pip install git+https://github.com/guanyilun/todloop
It also depends on
- h5py (to generate summary file for machine learning pipeline)
- inquirer (optional: used in one of the executable scripts to inspect datasets)
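Before running a pipeline it can be handy to confirm that these dependencies are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of this repository.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Package names taken from the dependency list above.
    print("missing required:", missing_modules(["todloop", "h5py"]))
    print("missing optional:", missing_modules(["inquirer"]))
```

An empty list means everything in that group can be imported.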
Most of the scripts in this repo contain only definitions of different `Routine`s (more on this later). These routines can be added to a pipeline that executes on a list of TODs.
The adapted cuts pipelines are defined in the project root directory. By convention, each pipeline is named after the tag that the original list of TODs belongs to. For example, `pa3_f90_s16_c10_v1.py` defines the pipeline for the tag `pa3_f90_s16_c10_v1`. To run it:
python pa3_f90_s16_c10_v1.py
This runs the cuts pipeline on some predefined lists of TODs in the inputs folder, in this case:
pa3_f90_s16_c10_v1_test.txt
pa3_f90_s16_c10_v1_train.txt
pa3_f90_s16_c10_v1_validate.txt
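The naming convention above can be captured in a small helper. This is only a sketch: the function `tod_list_paths` is hypothetical, and the `inputs` directory name and the three split suffixes are assumptions read off the example file names.

```python
import os


def tod_list_paths(tag, inputs_dir="inputs", splits=("test", "train", "validate")):
    """Build the expected TOD-list paths for a tag, e.g. inputs/<tag>_test.txt."""
    return [os.path.join(inputs_dir, "{}_{}.txt".format(tag, split))
            for split in splits]


if __name__ == "__main__":
    for path in tod_list_paths("pa3_f90_s16_c10_v1"):
        print(path)
```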
The basic unit in the analysis is called a `Routine`. It represents a set of operations related to a specific task or analysis. For instance, the routines applied to each TOD before the multi-frequency analysis include applying MCE cuts, planet cuts, and source cuts, detrending, and applying filter cuts. Each of these steps is a separate `Routine`. This gives a more modular design, since the routines are independent of one another and each targets a specific task. It also makes the code more readable and reusable.
Routines communicate with each other through a shared `DataStore` object. One can think of it as a giant dictionary (it actually is) that every routine can access. A typical workflow is to load some data associated with a key from the data store, process it, and store the output under another key so that other routines can access it if needed.
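To make the workflow concrete, here is a minimal sketch of a dictionary-backed store. `SimpleDataStore` is a hypothetical stand-in written for illustration, not the actual todloop class; only the `get`/`set` method names are taken from the routine examples in this README.

```python
class SimpleDataStore(object):
    """Hypothetical stand-in for the shared data store: a thin dict wrapper."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        """Store `value` under `key`, overwriting any previous value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value stored under `key`, or `default` if absent."""
        return self._data.get(key, default)


# Typical workflow: load data by key, process it, store the result
# under another key for later routines.
store = SimpleDataStore()
store.set("tod", [1.0, 2.0, 3.0, 6.0])

tod = store.get("tod")
mean = sum(tod) / len(tod)
store.set("tod-demeaned", [x - mean for x in tod])
```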
Most of the analysis related to cuts involves looping through a list of TODs, which is taken care of by the underlying framework that this project is built on (called TODLoop). In this framework, each routine has a few hooks into the different stages of the analysis. For example, an empty routine looks like this:
class AnEmptyRoutine(Routine):
    def __init__(self, **params):
        Routine.__init__(self)

    def initialize(self):
        """Scripts that run before processing the first TOD"""
        pass

    def execute(self, store):
        """Scripts that run for each TOD"""
        pass

    def finalize(self):
        """Scripts that run after processing all TODs"""
        pass
`initialize` contains code that is run before any TOD is processed. `execute` contains code that is run for each TOD, and it is typically where most of the analysis is performed. It also receives a reference to the shared data store, so one can retrieve relevant information provided by other routines. `finalize` contains code that runs after looping through all TODs, typically a clean-up stage.
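The order in which these hooks fire can be illustrated with a toy driver. Everything below is a hypothetical sketch: `Routine`, `RecordCalls`, and `run_loop` are stand-ins written for this example, not the todloop implementation, and creating a fresh store for each TOD is an assumption.

```python
class Routine(object):
    """Hypothetical minimal base class with the three hooks described above."""

    def initialize(self):
        pass

    def execute(self, store):
        pass

    def finalize(self):
        pass


class RecordCalls(Routine):
    """Toy routine that records the order in which its hooks are called."""

    def __init__(self):
        self.log = []

    def initialize(self):
        self.log.append("initialize")

    def execute(self, store):
        self.log.append("execute:" + store["tod_name"])

    def finalize(self):
        self.log.append("finalize")


def run_loop(routines, tod_names):
    """Call initialize once, execute once per TOD, then finalize once."""
    for routine in routines:
        routine.initialize()
    for name in tod_names:
        store = {"tod_name": name}  # assumption: one fresh store per TOD
        for routine in routines:
            routine.execute(store)
    for routine in routines:
        routine.finalize()


recorder = RecordCalls()
run_loop([recorder], ["tod_a", "tod_b"])
print(recorder.log)
```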
A working example of a routine looks like this:
class FindJumps(Routine):
    def __init__(self, **params):
        Routine.__init__(self)
        self.inputs = params.get('inputs', None)
        self.outputs = params.get('outputs', None)
        self._dsStep = params.get('dsStep', None)
        self._window = params.get('window', None)

    def execute(self, store):
        tod = store.get(self.inputs.get('tod'))

        # find jumps
        jumps = moby2.libactpol.find_jumps(tod.data,
                                           self._dsStep,
                                           self._window)
        # store the jumps values
        crit = {
            'jumpLive': jumps,
            'jumpDark': jumps,
        }

        # save to data store
        store.set(self.outputs.get('jumps'), crit)
This showcases the typical structure of a routine: one first retrieves the relevant data from the data store, processes it, and then exports the transformed data back to the data store.
The parameters are provided in a driver program; the relevant section looks like this:
# add a routine to find jumps in TOD
jump_params = {
    'inputs': {
        'tod': 'tod'
    },
    'outputs': {
        'jumps': 'jumps'
    },
    'dsStep': 4,
    'window': 1,
}
loop.add_routine(FindJumps(**jump_params))
Here `loop` refers to an underlying loop that iterates over a list of TODs. Routine-specific parameters are supplied as a dictionary. `inputs` specifies the data that the routine requires and where to find it in the shared data store. Similarly, `outputs` specifies the data that the routine exports and where other routines can access it. Consider another example:
routine1_param = {
    'outputs': {
        'tod': 'tod-key'
    }
}
loop.add_routine(Routine1(**routine1_param))
This adds a routine called `Routine1` to the pipeline. It requires no inputs and exports `tod` data. This data is stored in the shared data store that all routines can access, under the key 'tod-key'. If another routine requires tod data as an input,
routine2_param = {
    'inputs': {
        'tod': 'tod-key'
    },
    'outputs': {
        'processed_tod': 'another-key'
    }
}
loop.add_routine(Routine2(**routine2_param))
One can specify the associated key in the inputs and the data will be accessible. Similarly the processed data can be exported again and be accessible by other routines. The purpose of this is to have better encapsulation of various independent routine components.
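The key wiring can be tried out without todloop or moby2. The following self-contained sketch uses hypothetical stand-ins (`Store`, `LoadFakeTOD`, `RemoveMean`) and a plain list in place of real TOD data; only the `inputs`/`outputs` parameter convention is taken from the examples above.

```python
class Store(dict):
    """Hypothetical data store: a dict with an added `set` method."""

    def set(self, key, value):
        self[key] = value


class LoadFakeTOD(object):
    """Plays the role of Routine1: no inputs, exports 'tod'."""

    def __init__(self, **params):
        self.outputs = params.get('outputs', {})

    def execute(self, store):
        store.set(self.outputs.get('tod'), [3.0, 5.0, 10.0])


class RemoveMean(object):
    """Plays the role of Routine2: reads 'tod', exports 'processed_tod'."""

    def __init__(self, **params):
        self.inputs = params.get('inputs', {})
        self.outputs = params.get('outputs', {})

    def execute(self, store):
        tod = store.get(self.inputs.get('tod'))
        mean = sum(tod) / len(tod)
        store.set(self.outputs.get('processed_tod'), [x - mean for x in tod])


store = Store()
pipeline = [
    LoadFakeTOD(outputs={'tod': 'tod-key'}),
    RemoveMean(inputs={'tod': 'tod-key'},
               outputs={'processed_tod': 'another-key'}),
]
for routine in pipeline:
    routine.execute(store)
```

Because each routine only knows the keys it was given, changing 'tod-key' in both parameter dictionaries rewires the pipeline without touching the routine code.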
A complete pipeline definition for the previous example would look like
from todloop import TODLoop

# initialize loop
loop = TODLoop()

# specify tod list to process
loop.add_tod_list("your_list_of_tods.txt")

# add routine 1
routine1_param = {
    'outputs': {
        'tod': 'tod-key'
    }
}
loop.add_routine(Routine1(**routine1_param))

# add routine 2
routine2_param = {
    'inputs': {
        'tod': 'tod-key'
    },
    'outputs': {
        'processed_tod': 'another-key'
    }
}
loop.add_routine(Routine2(**routine2_param))

# execute pipeline for the first 100 TODs in the list
loop.run(0,100)
Here is a rough sketch of some of the routines in the existing pipeline and their whereabouts in this repository.
Step applied | moby2 | This repo | Routine |
---|---|---|---|
cut mce | process_cuts.py | routines/cuts.py | CutPartial |
cut planets | process_cuts.py | routines/cuts.py | CutPlanets |
cut sources | process_cuts.py | routines/cuts.py | CutSources |
cut glitches | process_cuts.py | routines/cuts.py | CutPartial |
remove hwp | process_cuts.py | routines/cuts.py | SubstractHWP |
remove mean | process_cuts.py | routines/tod.py | TransformTOD |
detrend | process_cuts.py | routines/tod.py | TransformTOD |
remove filter gain | process_cuts.py | routines/tod.py | TransformTOD |
downsample | process_cuts.py | routines/tod.py | TransformTOD |
find zero detectors | pathologies.py | routines/tod.py | GetDetectors |
find jumps | pathologies.py | routines/cuts.py | FindJumps |
calibrate to pW | pathologies.py | routines/tod.py | CalibrateTOD |
analyze scans | pathologies.py | routines/analysis.py | AnalyzeScan |
Fourier transform | pathologies.py | routines/tod.py | FouriorTransform |
multi-freq analysis | pathologies.py | routines/analysis.py | AnalyzeDarkLF … |
- `routines/cuts.py`: cuts related routines
- `routines/tod.py`: tod related routines
- `routines/analysis.py`: mainly the multi-freq analysis, also some temperature analysis, scan analysis, etc.
- `routines/utils.py`: some utility functions, such as `nextregular` for fft, and pre-selection functions
- `routines/report.py`: routines related to reporting the results of analysis
- `routines/features.py`: design new features that may be useful
- `TAGNAME.py`: the driver programs for running the pipeline on Feynman; each defines the pipeline and specifies the input parameters for each routine.
Currently the pipeline consists of the following routines (example output):
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: TODLoader
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CutSources
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CutPlanets
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: RemoveSyncPickup
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CutPartial
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: TransformTOD
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: AnalyzeScan
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: GetDetectors
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CalibrateTOD
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: FindJumps
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: FouriorTransform
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeDarkLF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeLiveLF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: GetDriftErrors
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeLiveMF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeHF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: JesseFeatures
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: Summarize
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: PrepareDataLabelNew
The parameters that can be computed include
['darkRatioLive', 'corrLive', 'corrDark', 'kurtLive', 'normLive',
 'kurtpLive', 'normDark', 'MFELive', 'skewLive', 'gainLive',
 'DELive', 'gainDark', 'jumpDark', 'rmsDark', 'jumpLive',
 'rmsLive', 'darkSel', 'skewpLive',
 'feat1',  # 4 additional features implemented by Jesse
 'feat2',  # for more details find JesseFeatures routine
 'feat3',  # in routines/features.py
 'feat5']
A brief description of each of these routines and where to find it:
Routine | Description | Location |
---|---|---|
TODLoader | Load TOD into data store | todloop |
CutSources | Remove sources from TOD data | cuts.py |
CutPlanets | Remove planets from TOD data | cuts.py |
RemoveSyncPickup | Remove sync pickup from TOD data | cuts.py |
CutPartial | Remove glitches and MCE errors | cuts.py |
TransformTOD | Downsample, detrend, remove mean, etc. | tod.py |
AnalyzeScan | Find scan frequency and other scan parameters | analysis.py |
GetDetectors | Find live and dark detector candidates | tod.py |
CalibrateTOD | Calibrate to pW using flatfield and responsivity | tod.py |
FindJumps | Find jumps and calculate jumpLive, jumpDark | cuts.py |
FouriorTransform | Simple Fourier transform | tod.py |
AnalyzeDarkLF | Study dark detectors at low frequency; calculate corrDark, normDark, gainDark | analysis.py |
AnalyzeLiveLF | Study live detectors at low frequency; calculate corrLive, normLive, gainLive, darkRatioLive | analysis.py |
GetDriftErrors | Study the slow modes and calculate DELive | analysis.py |
AnalyzeLiveMF | Study the live detectors at mid frequency; calculate MFELive | analysis.py |
AnalyzeHF | Study both the live and dark detectors at high frequency; calculate rmsLive, kurtLive, skewLive, rmsDark | analysis.py |
Summarize | Get the results from previous routines and combine them into a dictionary | report.py |
PrepareDataLabel (New) | Load analysis results and sel from a pickle file to create an h5 file that can be supplied to the mlpipe pipeline | report.py |
JesseFeatures | Calculate the 4 features that Jesse came up with | features.py |
While breaking the moby2 cuts code into individual components, some changes were made to the pipeline for exploration. Here is a list of them:
- Pre-selection
  Pre-selection in moby2 requires fine tuning of parameters. In particular, the `presel_by_group` function alone requires 5 parameters to tune, and the `presel_by_median` function requires 3. Since our objective is to reduce human intervention as much as possible, the pre-selection is removed. The idea is to replace this fine-tuning process with smarter algorithms. More on this later.
- Partial statistics
  For the high-frequency analysis, the original pipeline in moby2 performs the analysis on each chunk of the scan (between turning points). This is not enabled for now, for simplicity.
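For readers unfamiliar with the idea of per-chunk statistics, the sketch below splits a scan at its turning points and computes one RMS value per chunk. It is a simplified illustration in pure Python, not the moby2 implementation: the function names are hypothetical, turning points are detected naively as sign changes in the azimuth difference (flat stretches are not handled), and RMS is picked only as an example statistic.

```python
import math


def turning_points(az):
    """Indices where the scan direction reverses (sign change of the az step)."""
    points = []
    for i in range(1, len(az) - 1):
        before = az[i] - az[i - 1]
        after = az[i + 1] - az[i]
        if before * after < 0:
            points.append(i)
    return points


def chunk_rms(signal, az):
    """RMS of `signal` within each chunk delimited by turning points of `az`."""
    edges = [0] + turning_points(az) + [len(az)]
    values = []
    for start, stop in zip(edges[:-1], edges[1:]):
        chunk = signal[start:stop]
        values.append(math.sqrt(sum(x * x for x in chunk) / len(chunk)))
    return values


# Toy scan: az sweeps up, down, then up again.
az = [0, 1, 2, 3, 2, 1, 0, 1, 2]
signal = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0]
print(turning_points(az))
print(chunk_rms(signal, az))
```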
The `PrepareDataLabel` routine paves the way for the machine learning study by preparing an h5 file with all the data necessary to train machine learning models. This file can be supplied directly to the machine learning pipeline code (mlpipe). An example output from this machine learning pipeline is shown below:
== VALIDATION RESULTS: ==

epoch | batch | model | loss | base | accuracy | tp | tn | fp | fn | precision | recall | f1 | time/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | KNNModel-3 | 2.05077 | 0.422877 | 0.940625 | 6864 | 9089 | 699 | 308 | 0.907576 | 0.957055 | 0.931659 | 2.09715 |
0 | 0 | KNNModel-7 | 1.7005 | 0.422877 | 0.950767 | 7088 | 9037 | 751 | 84 | 0.904197 | 0.988288 | 0.944374 | 2.09129 |
0 | 0 | RandomForest-5 | 1.41335 | 0.422877 | 0.95908 | 7154 | 9112 | 676 | 18 | 0.913665 | 0.99749 | 0.95374 | 0.0665109 |
0 | 0 | KNNModel-5 | 1.81862 | 0.422877 | 0.947347 | 7012 | 9055 | 733 | 160 | 0.905358 | 0.977691 | 0.940135 | 2.08214 |
0 | 0 | XGBoost | 1.38688 | 0.422877 | 0.959847 | 7157 | 9122 | 666 | 15 | 0.914866 | 0.997909 | 0.954585 | 0.0552425 |
0 | 0 | DecisionTree | 1.86952 | 0.422877 | 0.945873 | 6862 | 9180 | 608 | 310 | 0.918608 | 0.956776 | 0.937304 | 0.0112839 |
0 | 0 | RandomForest-20 | 1.40724 | 0.422877 | 0.959257 | 7153 | 9116 | 672 | 19 | 0.914121 | 0.997351 | 0.953924 | 0.178965 |
0 | 0 | SVCModel | 1.76771 | 0.422877 | 0.948821 | 7172 | 8920 | 868 | 0 | 0.89204 | 1 | 0.94294 | 5.48178 |
0 | 0 | RandomForest-10 | 1.40521 | 0.422877 | 0.959316 | 7157 | 9113 | 675 | 15 | 0.913815 | 0.997909 | 0.954012 | 0.102943 |
It shows that even after removing some major fine-tuning steps we can achieve reasonably good results. This is a hint that the existing cuts pipeline can be simplified further with the help of machine learning techniques.
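The derived columns in the validation results follow from the four confusion-matrix counts using the standard definitions, which can be checked with a few lines of Python. This is a sketch for reading the results, not code from mlpipe; the function name is hypothetical, and interpreting the `base` column as the fraction of positive samples is an assumption (it does match the reported numbers).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification metrics from confusion counts."""
    tp, tn, fp, fn = float(tp), float(tn), float(fp), float(fn)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "base": (tp + fn) / total,  # assumed: fraction of positive samples
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts taken from the XGBoost row of the validation results.
xgb = classification_metrics(tp=7157, tn=9122, fp=666, fn=15)
for name in ("base", "accuracy", "precision", "recall", "f1"):
    print("{:10s} {:.6f}".format(name, xgb[name]))
```

Since roughly 42% of the samples are positive under this reading, an accuracy near 96% is well above what a trivial classifier would reach by always predicting the majority class.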