act-cutflow

In this project I am trying to adapt the cuts pipeline codes in ACT (moby2) and make it easier to integrate with machine learning pipeline. In particular, the current cuts pipeline is broken into pieces which will allow us to build more sophisticated pipelines on top of what’s existing quickly. The results of the pipeline is often a standalone file that summarizes all the statistics of interests and maybe a piece of the actual TOD or fourior transformed TOD data. This file can be supplied as an input to the machine learning pipeline like mlpipe. Note that the codes are meant to run on Feynman cluster.

Dependencies

The codes here depend on the todloop project which is basically a giant loop to iterate through a list of TODs. To install todloop,

pip install git+https://github.com/guanyilun/todloop

It also depends on

h5py (to generate summary file for machine learning pipeline)
inquirer (optional: used in one of the executable scripts to inspect datasets)

Run the codes

Most of the scripts in this repo contain only definition of different Routine (more on this later). These routines can be added to a pipeline that executes on a list of TODs.

The adapted cuts pipelines are defined in the project root directory. Specifically, I am using a convention that each pipeline is named after the tag that the original list of TODs belongs to. For example, pa3_f90_s16_c10_v1.py defines the pipeline for tag pa3_f90_s16_c10_v1, to run it,

python pa3_f90_s16_c10_v1.py

This will run the cut pipeline on on some predefined lists of TODs defined in the inputs folder, and in this case,

pa3_f90_s16_c10_v1_test.txt
pa3_f90_s16_c10_v1_train.txt
pa3_f90_s16_c10_v1_validate.txt

Architecture

The basic unit in the analysis is called a Routine. It represents a set of operations related to a specific task or analysis. For instance, some of the routines that are applied to each TOD before the multi-frequencies analysis includes applying MCE cuts, planet cuts, source cuts, detrends, applying filter cuts, etc. Each of these steps will be a separate Routine. In this way we can have a more modularized design as all the routines are independent of the other ones, and they are targeted at a specific task. This also makes the codes more readable and reusable.

The way that different routines communicate with each other is through a shared DataStore object. One can think of it as a giant dictionary (it actually is) that every routine can access. A typical workflow is to load some data associated with a key through the data store and process it, the output can be stored under another key in the data store so other routines can access it if needed.

Since most of the analysis related to cuts will involve looping through a list of TODs, and this is taken care by the underlying framework that this project is built on (called TODLoop). In this framework, each routine has a few hooks to the different stages of the analysis, for example, an empty routine looks like this.

class AnEmptyRoutine(Routine):
    def __init__(self, **params):
        Routine.__init__(self)
      
    def initialize(self):
        """Scripts that run before processing the first TOD"""
        pass

    def execute(self, store):
        """Scripts that run for each TOD"""
        pass

    def finalize(self):
        """Scripts that run after processing all TODs"""
        pass

initialize contains scripts that will be ran before processing all tods, execute function contains script that will be ran for each TOD, and it is typically where most of the analysis is performed. It also provides a reference to the shared data store so one can retrieve relevant information that are provided by other routines. finalize function contains the codes that should run after looping through all tods, kind of clean-up stages.

An actual working example of a routine is like this

class FindJumps(Routine):
    def __init__(self, **params):
        Routine.__init__(self)
        self.inputs = params.get('inputs', None)
        self.outputs = params.get('outputs', None)
        self._dsStep = params.get('dsStep', None)
        self._window = params.get('window', None)

    def execute(self, store):
        tod = store.get(self.inputs.get('tod'))

        # find jumps
        jumps = moby2.libactpol.find_jumps(tod.data,
                                           self._dsStep,
                                           self._window)
        # store the jumps values
        crit = {
            'jumpLive': jumps,
            'jumpDark': jumps,
        }
        
        # save to data store
        store.set(self.outputs.get('jumps'), crit)

This showcases the typical structure of a routine. One first retrieves the relevant data from the data store, and export the transformed data back to the data store after processing.

The parameters are provided in a driver program, with the relevant section that looks like

# add a routine to find jumps in TOD
jump_params = {
    'inputs': {
        'tod': 'tod'
    },
    'outputs':{
        'jumps': 'jumps'
    },
    'dsStep': 4,
    'window': 1,
}
loop.add_routine(FindJumps(**jump_params))

Here loop refers to an underlying loop that will iterate over a list of TODs. Routine specific parameters are supplied by a dictionary. inputs here contains the data that the routine requires and where to find it in the shared data store. Similarly, outputs here specifies the data that the routine exports and where other routines can access it. Consider another example,

routine1_param = {
    'outputs': {
        'tod': 'tod-key'
    }
}
loop.add_routine(Routine1(**routine1_param))

This adds a routine called Routine1 in the pipeline. It demands no inputs and exports a tod data. This data will be stored in the shared data store that all routines can access, and it is associated with a key ‘tod-key’. If another routine requires tod data as an input,

routine2_param = {
    'inputs': {
        'tod': 'tod-key'
    },
    'outputs': {
        'processed_tod': 'another-key'
    }
}
loop.add_routine(Routine2(**routine2_param))

One can specify the associated key in the inputs and the data will be accessible. Similarly the processed data can be exported again and be accessible by other routines. The purpose of this is to have better encapsulation of various independent routine components.

A complete pipeline definition for the previous example would look like

from todloop import TODLoop

# initialize loop
loop = TODLoop()

# specify tod list to process
loop.add_tod_list("your_list_of_tods.txt")

# add routine 1
routine1_param = {
    'outputs': {
        'tod': 'tod-key'
    }
}
loop.add_routine(Routine1(**routine1_param))

# add routine 2
routine2_param = {
    'inputs': {
        'tod': 'tod-key'
    },
    'outputs': {
        'processed_tod': 'another-key'
    }
}
loop.add_routine(Routine2(**routine2_param))

# execute pipeline for the first 100 TODs in the list
loop.run(0,100)

The Cut Pipeline

Here is a rough sketch of some of the routines in the existing pipeline and their whereabouts in this repository.

steps applied	moby2	here	name
cut mce	process_cuts.py	routines/cuts.py	CutPartial
cut planets	process_cuts.py	routines/cuts.py	CutPlanets
cut sources	process_cuts.py	routines/cuts.py	CutSources
cut glitches	process_cuts.py	routines/cuts.py	CutPartial
remove hwp	process_cuts.py	routines/cuts.py	SubstractHWP
remove mean	process_cuts.py	routines/tod.py	TransformTOD
detrend	process_cuts.py	routines/tod.py	TransformTOD
remove filter gain	process_cuts.py	routines/tod.py	TransformTOD
downsample	process_cuts.py	routines/tod.py	TransformTOD
find zero detectors	pathologies.py	routines/tod.py	GetDetectors
find jumps	pathologies.py	routines/cuts.py	FindJumps
calibrate to pW	pathologies.py	routines/tod.py	CalibrateTOD
analyze scans	pathologies.py	routines/analysis.py	AnalyzeScan
fourior transform	pathologies.py	routines/tod.py	FouriorTransform
multi-freq analysis	pathologies.py	routines/analysis.py	AnalyzeDarkLF …

Files

routines/cuts.py: cuts related routines
routines/tod.py: tod related routines
routines/analysis.py: mainly the multi-freq analysis, also some temperature analysis, scan analysis, etc.
routines/utils.py: some utility functions such nextregular for fft preselection functions
routines/report.py: routines related to reporting the results of analysis
routines/features.py: design new features that may be useful
TAGNAME.py: the driver programs for running the pipeline on feynman, it defines the pipeline and specifies the parameters inputs for each routine.

Status Quo

Currently the pipeline consists of the following routines (example output):

2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: TODLoader
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CutSources
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CutPlanets
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: RemoveSyncPickup
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CutPartial
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: TransformTOD
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: AnalyzeScan
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: GetDetectors
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CalibrateTOD
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: FindJumps
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: FouriorTransform
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeDarkLF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeLiveLF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: GetDriftErrors
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeLiveMF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeHF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: JesseFeatures
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: Summarize
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: PrepareDataLabelNew

The parameters that can be computed include

['darkRatioLive',
 'corrLive',
 'corrDark',
 'kurtLive',
 'normLive',
 'kurtpLive',
 'normDark',
 'MFELive',
 'skewLive',
 'gainLive',
 'DELive',
 'gainDark',
 'jumpDark',
 'rmsDark',
 'jumpLive',
 'rmsLive',
 'darkSel',
 'skewpLive',
 'feat1',  # 4 additional features implemented by Jesse
 'feat2',  # for more details find JesseFeatures routine
 'feat3',  # in routines/features.py
 'feat5']

Descriptions of Routines

A brief description of each of these routines and where to find it

Routine	Description	Location
TODLoader	Load TOD into data store	todloop
CutSources	Remove sources from TOD data	cuts.py
CutPlanets	Remove planet from TOD data	cuts.py
RemoveSyncPickup	Remove sync pickup from TOD data	cuts.py
Cut Partial	Remove glitches and MCE errors	cuts.py
TransformTOD	Downsampling, detrend, remove mean, etc	tod.py
AnalyzeScan	Find scan freq and other scan parameters	analysis.py
GetDetectors	Find live and dark detector candidates	tod.py
CalibrateTOD	Calibrate to pW using flatfield and responsivity	tod.py
FindJumps	Find jumps and calculate jumpLive, jumpDark	cuts.py
FouriorTransform	Simple fourior transform	tod.py
AnalyzeDarkLF	Study dark detectors in low frequency, calculate	analysis.py
	corrDark, normDark, gainDark
AnalyzeLiveLF	Study live detectors in low frequency, calculate	analysis.py
	corrLive, normLive, gainLive, darkRatioLive
GetDriftErrors	Study the slow modes and calculate DELive	analysis.py
AnalyzeLiveMF	Study the live detectors in mid frequency,	analysis.py
	calculate MFELive
AnalyzeHF	Study both the live and dark detectors in high	analysis.py
	frequency and calculate rmsLive, kurtLive,
	skewLive, rmsDark
Summarize	Get the results from previous routine and combine	report.py
	them into a dictionary
PrepareDataLabel	Load analysis results and sel from Pickle file	report.py
(New)	to create an h5 file which can be supplied to
	the mlpipe pipeline
JesseFeatures	Calculate the 4 features that Jesse came up with	features.py

Major Differences

While breaking the moby2 cuts codes into individual components. There are some changes made to the pipeline for exploration. Here is a list of them:

Pre-selection

Pre-selection in moby2 requires a fine tuning of parameters. In particular, the presel_by_group function alone requires 5 parameters to tune. The presel_by_median function requires 3 parameters to tune. Since our objective is to reduce the human intervention as much as possible, the pre-selection is removed. The idea is to use some smarter algorithms to replace this fine tuning process. More on this later.

Partial statistics

For the high frequency analysis, the original pipeline in moby2 performs the analysis on each chunk in the scan (between turning points). This is not enabled for now for simplicity.

Machine Learning

The PrepareDataLabel routine makes way for the machine learning study by preparing an h5 file with all the necessary data to train machine learning models that can directly be supplied to the machine learning pipeline codes (mlpipe). An example output from this machine learning pipeline is shown below

== VALIDATION RESULTS: ==

  epoch    batch  model               loss      base    accuracy    tp    tn    fp    fn    precision    recall        f1     time/s
-------  -------  ---------------  -------  --------  ----------  ----  ----  ----  ----  -----------  --------  --------  ---------
      0        0  KNNModel-3       2.05077  0.422877    0.940625  6864  9089   699   308     0.907576  0.957055  0.931659  2.09715
      0        0  KNNModel-7       1.7005   0.422877    0.950767  7088  9037   751    84     0.904197  0.988288  0.944374  2.09129
      0        0  RandomForest-5   1.41335  0.422877    0.95908   7154  9112   676    18     0.913665  0.99749   0.95374   0.0665109
      0        0  KNNModel-5       1.81862  0.422877    0.947347  7012  9055   733   160     0.905358  0.977691  0.940135  2.08214
      0        0  XGBoost          1.38688  0.422877    0.959847  7157  9122   666    15     0.914866  0.997909  0.954585  0.0552425
      0        0  DecisionTree     1.86952  0.422877    0.945873  6862  9180   608   310     0.918608  0.956776  0.937304  0.0112839
      0        0  RandomForest-20  1.40724  0.422877    0.959257  7153  9116   672    19     0.914121  0.997351  0.953924  0.178965
      0        0  SVCModel         1.76771  0.422877    0.948821  7172  8920   868     0     0.89204   1         0.94294   5.48178
      0        0  RandomForest-10  1.40521  0.422877    0.959316  7157  9113   675    15     0.913815  0.997909  0.954012  0.102943

It shows that even after removing some major fine tuning steps we can achieve reasonably good results. This is a hint that the existing cut pipeline can be simplified furthur with the help of machine learning techniques.

guanyilun / act-cutflow Goto Github PK

act-cutflow's Introduction

act-cutflow

Dependencies

Run the codes

Architecture

The Cut Pipeline

Files

Status Quo

Descriptions of Routines

Major Differences

Machine Learning

act-cutflow's People

Contributors

Watchers

act-cutflow's Issues

Find rebias gap time

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent