act-cutflow

In this project I am trying to adapt the cuts pipeline codes in ACT (moby2) to make them easier to integrate with a machine learning pipeline. In particular, the current cuts pipeline is broken into pieces, which allows us to quickly build more sophisticated pipelines on top of what exists. The result of the pipeline is typically a standalone file that summarizes all the statistics of interest, and possibly a piece of the actual TOD or Fourier-transformed TOD data. This file can be supplied as an input to a machine learning pipeline like mlpipe. Note that the codes are meant to run on the Feynman cluster.

Dependencies

The codes here depend on the todloop project, which is essentially a giant loop that iterates through a list of TODs. To install todloop:

pip install git+https://github.com/guanyilun/todloop
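
To verify the installation, a quick sanity check is to import the package (the import name matches the usage shown later in this README):

python -c "from todloop import TODLoop"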

It also depends on

  • h5py (to generate the summary file for the machine learning pipeline)
  • inquirer (optional: used in one of the executable scripts to inspect datasets)

Run the codes

Most of the scripts in this repo contain only definitions of different Routines (more on this later). These routines can be added to a pipeline that executes on a list of TODs.

The adapted cuts pipelines are defined in the project root directory. Specifically, I follow the convention that each pipeline is named after the tag that the original list of TODs belongs to. For example, pa3_f90_s16_c10_v1.py defines the pipeline for the tag pa3_f90_s16_c10_v1. To run it:

python pa3_f90_s16_c10_v1.py

This will run the cut pipeline on some predefined lists of TODs in the inputs folder; in this case:

pa3_f90_s16_c10_v1_test.txt
pa3_f90_s16_c10_v1_train.txt
pa3_f90_s16_c10_v1_validate.txt

Architecture

The basic unit of the analysis is called a Routine. It represents a set of operations related to a specific task or analysis. For instance, the routines applied to each TOD before the multi-frequency analysis include applying MCE cuts, planet cuts, source cuts, detrending, applying filter cuts, etc. Each of these steps is a separate Routine. This gives a more modular design, since each routine is independent of the others and targeted at a specific task. It also makes the codes more readable and reusable.

The way that different routines communicate with each other is through a shared DataStore object. One can think of it as a giant dictionary (it actually is one) that every routine can access. A typical workflow is to load some data associated with a key from the data store and process it; the output can then be stored under another key in the data store so other routines can access it if needed.
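
Conceptually, the store is a thin wrapper around a Python dictionary. A minimal sketch of the interface (not the actual todloop implementation) looks like this:

class DataStore:
    """Conceptual sketch only; the real DataStore lives in todloop."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        # retrieve data that an earlier routine stored under this key
        return self._store.get(key)

    def set(self, key, data):
        # expose data under a key for downstream routines
        self._store[key] = data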

Most of the analysis related to cuts involves looping through a list of TODs, and this is taken care of by the underlying framework that this project is built on (called TODLoop). In this framework, each routine has a few hooks into the different stages of the analysis. For example, an empty routine looks like this:

class AnEmptyRoutine(Routine):
    def __init__(self, **params):
        Routine.__init__(self)
      
    def initialize(self):
        """Scripts that run before processing the first TOD"""
        pass

    def execute(self, store):
        """Scripts that run for each TOD"""
        pass

    def finalize(self):
        """Scripts that run after processing all TODs"""
        pass

initialize contains code that runs before the first TOD is processed. execute contains code that runs for each TOD, and it is typically where most of the analysis is performed; it also receives a reference to the shared data store so one can retrieve relevant information provided by other routines. finalize contains code that runs after looping through all TODs, as a kind of clean-up stage.
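
As an illustration of the three hooks, here is a hypothetical routine (not part of this repository) that counts how many TODs pass through the loop:

class CountTODs(Routine):
    def __init__(self, **params):
        Routine.__init__(self)
        self._count = 0

    def initialize(self):
        # runs once, before the first TOD is processed
        self._count = 0

    def execute(self, store):
        # runs once per TOD
        self._count += 1

    def finalize(self):
        # runs once, after the last TOD is processed
        print("Processed %d TODs" % self._count)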

An actual working example of a routine looks like this:

import moby2

class FindJumps(Routine):
    def __init__(self, **params):
        Routine.__init__(self)
        self.inputs = params.get('inputs', None)
        self.outputs = params.get('outputs', None)
        self._dsStep = params.get('dsStep', None)
        self._window = params.get('window', None)

    def execute(self, store):
        # retrieve the TOD from the shared data store
        tod = store.get(self.inputs.get('tod'))

        # find jumps
        jumps = moby2.libactpol.find_jumps(tod.data,
                                           self._dsStep,
                                           self._window)
        # store the jump values
        crit = {
            'jumpLive': jumps,
            'jumpDark': jumps,
        }

        # save to data store
        store.set(self.outputs.get('jumps'), crit)

This showcases the typical structure of a routine: one first retrieves the relevant data from the data store, then exports the transformed data back to the data store after processing.

The parameters are provided in a driver program; the relevant section looks like this:

# add a routine to find jumps in TOD
jump_params = {
    'inputs': {
        'tod': 'tod'
    },
    'outputs':{
        'jumps': 'jumps'
    },
    'dsStep': 4,
    'window': 1,
}
loop.add_routine(FindJumps(**jump_params))

Here loop refers to an underlying loop that will iterate over a list of TODs. Routine-specific parameters are supplied as a dictionary. inputs contains the data that the routine requires and where to find it in the shared data store; similarly, outputs specifies the data that the routine exports and where other routines can access it. Consider another example:

routine1_param = {
    'outputs': {
        'tod': 'tod-key'
    }
}
loop.add_routine(Routine1(**routine1_param))

This adds a routine called Routine1 to the pipeline. It demands no inputs and exports TOD data. This data will be stored in the shared data store that all routines can access, associated with the key 'tod-key'. If another routine requires the TOD data as an input:

routine2_param = {
    'inputs': {
        'tod': 'tod-key'
    },
    'outputs': {
        'processed_tod': 'another-key'
    }
}
loop.add_routine(Routine2(**routine2_param))

One can specify the associated key in inputs and the data will be accessible. Similarly, the processed data can be exported again and accessed by other routines. The purpose of this is better encapsulation of the various independent routine components, as sketched below.
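
For illustration, a minimal sketch of what Routine2 could look like internally (the processing step is a placeholder, not actual code from this repository):

class Routine2(Routine):
    def __init__(self, **params):
        Routine.__init__(self)
        self.inputs = params.get('inputs', None)
        self.outputs = params.get('outputs', None)

    def execute(self, store):
        # read the TOD that Routine1 exported under 'tod-key'
        tod = store.get(self.inputs.get('tod'))

        # placeholder for the actual processing
        processed_tod = tod

        # export the result under 'another-key' for downstream routines
        store.set(self.outputs.get('processed_tod'), processed_tod)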

A complete pipeline definition for the previous example would look like

from todloop import TODLoop

# initialize loop
loop = TODLoop()

# specify tod list to process
loop.add_tod_list("your_list_of_tods.txt")

# add routine 1
routine1_param = {
    'outputs': {
        'tod': 'tod-key'
    }
}
loop.add_routine(Routine1(**routine1_param))

# add routine 2
routine2_param = {
    'inputs': {
        'tod': 'tod-key'
    },
    'outputs': {
        'processed_tod': 'another-key'
    }
}
loop.add_routine(Routine2(**routine2_param))

# execute pipeline for the first 100 TODs in the list
loop.run(0,100)

The Cut Pipeline

Here is a rough sketch of some of the routines in the existing pipeline and their whereabouts in this repository.

steps applied         moby2             here                   name
cut mce               process_cuts.py   routines/cuts.py       CutPartial
cut planets           process_cuts.py   routines/cuts.py       CutPlanets
cut sources           process_cuts.py   routines/cuts.py       CutSources
cut glitches          process_cuts.py   routines/cuts.py       CutPartial
remove hwp            process_cuts.py   routines/cuts.py       SubstractHWP
remove mean           process_cuts.py   routines/tod.py        TransformTOD
detrend               process_cuts.py   routines/tod.py        TransformTOD
remove filter gain    process_cuts.py   routines/tod.py        TransformTOD
downsample            process_cuts.py   routines/tod.py        TransformTOD
find zero detectors   pathologies.py    routines/tod.py        GetDetectors
find jumps            pathologies.py    routines/cuts.py       FindJumps
calibrate to pW       pathologies.py    routines/tod.py        CalibrateTOD
analyze scans         pathologies.py    routines/analysis.py   AnalyzeScan
fourier transform     pathologies.py    routines/tod.py        FouriorTransform
multi-freq analysis   pathologies.py    routines/analysis.py   AnalyzeDarkLF …

Files

  • routines/cuts.py: cuts related routines
  • routines/tod.py: tod related routines
  • routines/analysis.py: mainly the multi-freq analysis, also some temperature analysis, scan analysis, etc.
  • routines/utils.py: utility functions, such as nextregular for FFT, and pre-selection functions
  • routines/report.py: routines related to reporting the results of analysis
  • routines/features.py: design new features that may be useful
  • TAGNAME.py: the driver programs for running the pipeline on Feynman; each defines a pipeline and specifies the input parameters for each routine.

Status Quo

Currently the pipeline consists of the following routines (example output):

2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: TODLoader
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CutSources
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CutPlanets
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: RemoveSyncPickup
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CutPartial
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: TransformTOD
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: AnalyzeScan
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: GetDetectors
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: CalibrateTOD
2019-03-09 10:02:05,156 [INFO] TODLoop: Added routine: FindJumps
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: FouriorTransform
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeDarkLF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeLiveLF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: GetDriftErrors
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeLiveMF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: AnalyzeHF
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: JesseFeatures
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: Summarize
2019-03-09 10:02:05,157 [INFO] TODLoop: Added routine: PrepareDataLabelNew

The parameters that can be computed include

['darkRatioLive',
 'corrLive',
 'corrDark',
 'kurtLive',
 'normLive',
 'kurtpLive',
 'normDark',
 'MFELive',
 'skewLive',
 'gainLive',
 'DELive',
 'gainDark',
 'jumpDark',
 'rmsDark',
 'jumpLive',
 'rmsLive',
 'darkSel',
 'skewpLive',
 'feat1',  # 4 additional features implemented by Jesse
 'feat2',  # for more details find JesseFeatures routine
 'feat3',  # in routines/features.py
 'feat5'] 
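
These statistics are written into the h5 summary file, which can be inspected with h5py. A quick sketch (the filename is a placeholder; the actual output path is configured in the driver program):

import h5py

# print every group and dataset name in a summary file;
# 'summary.h5' stands in for the actual pipeline output
with h5py.File('summary.h5', 'r') as f:
    f.visit(print)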


Descriptions of Routines

A brief description of each of these routines and where to find them:

Routine               Description                                          Location
TODLoader             Load TOD into data store                             todloop
CutSources            Remove sources from TOD data                         cuts.py
CutPlanets            Remove planets from TOD data                         cuts.py
RemoveSyncPickup      Remove sync pickup from TOD data                     cuts.py
CutPartial            Remove glitches and MCE errors                       cuts.py
TransformTOD          Downsampling, detrend, remove mean, etc.             tod.py
AnalyzeScan           Find scan freq and other scan parameters             analysis.py
GetDetectors          Find live and dark detector candidates               tod.py
CalibrateTOD          Calibrate to pW using flatfield and responsivity     tod.py
FindJumps             Find jumps and calculate jumpLive, jumpDark          cuts.py
FouriorTransform      Simple Fourier transform                             tod.py
AnalyzeDarkLF         Study dark detectors at low frequency; calculate     analysis.py
                      corrDark, normDark, gainDark
AnalyzeLiveLF         Study live detectors at low frequency; calculate     analysis.py
                      corrLive, normLive, gainLive, darkRatioLive
GetDriftErrors        Study the slow modes and calculate DELive            analysis.py
AnalyzeLiveMF         Study live detectors at mid frequency; calculate     analysis.py
                      MFELive
AnalyzeHF             Study both live and dark detectors at high           analysis.py
                      frequency; calculate rmsLive, kurtLive,
                      skewLive, rmsDark
Summarize             Combine the results from previous routines into      report.py
                      a dictionary
PrepareDataLabelNew   Load analysis results and sel from pickle file       report.py
                      to create an h5 file for the mlpipe pipeline
JesseFeatures         Calculate the 4 features that Jesse came up with     features.py

Major Differences

While breaking the moby2 cuts codes into individual components, some changes were made to the pipeline for exploration. Here is a list of them:

  • Pre-selection

Pre-selection in moby2 requires fine tuning of parameters. In particular, the presel_by_group function alone has 5 parameters to tune, and the presel_by_median function has 3. Since our objective is to reduce human intervention as much as possible, the pre-selection is removed. The idea is to use smarter algorithms to replace this fine-tuning process. More on this later.

  • Partial statistics

For the high frequency analysis, the original pipeline in moby2 performs the analysis on each chunk in the scan (between turning points). This is not enabled for now for simplicity.

Machine Learning

The PrepareDataLabel routine paves the way for the machine learning study by preparing an h5 file with all the data needed to train machine learning models; this file can be supplied directly to the machine learning pipeline codes (mlpipe). An example output from this machine learning pipeline is shown below.

== VALIDATION RESULTS: ==

  epoch    batch  model               loss      base    accuracy    tp    tn    fp    fn    precision    recall        f1     time/s
-------  -------  ---------------  -------  --------  ----------  ----  ----  ----  ----  -----------  --------  --------  ---------
      0        0  KNNModel-3       2.05077  0.422877    0.940625  6864  9089   699   308     0.907576  0.957055  0.931659  2.09715
      0        0  KNNModel-7       1.7005   0.422877    0.950767  7088  9037   751    84     0.904197  0.988288  0.944374  2.09129
      0        0  RandomForest-5   1.41335  0.422877    0.95908   7154  9112   676    18     0.913665  0.99749   0.95374   0.0665109
      0        0  KNNModel-5       1.81862  0.422877    0.947347  7012  9055   733   160     0.905358  0.977691  0.940135  2.08214
      0        0  XGBoost          1.38688  0.422877    0.959847  7157  9122   666    15     0.914866  0.997909  0.954585  0.0552425
      0        0  DecisionTree     1.86952  0.422877    0.945873  6862  9180   608   310     0.918608  0.956776  0.937304  0.0112839
      0        0  RandomForest-20  1.40724  0.422877    0.959257  7153  9116   672    19     0.914121  0.997351  0.953924  0.178965
      0        0  SVCModel         1.76771  0.422877    0.948821  7172  8920   868     0     0.89204   1         0.94294   5.48178
      0        0  RandomForest-10  1.40521  0.422877    0.959316  7157  9113   675    15     0.913815  0.997909  0.954012  0.102943

It shows that even after removing some major fine-tuning steps we can achieve reasonably good results. This is a hint that the existing cut pipeline can be simplified further with the help of machine learning techniques.
