
ChessAnalysisPipeline (CHAP)

CHAP is a package that provides a framework for executing data analysis pipelines. The package can be found on PyPI and conda-forge.

Subpackages

There are several subpackages within CHAP that contain specialized items to handle specific types of data processing in the CHAP framework. Dependencies for these subpackages can be found in CHAP/<subpackage_name>/environment.yml.

Documentation

Documentation for the latest version can be found on this project's github pages site.

Galaxy

The galaxy-tools/ directory contains a set of CHAP-based tools for use in the Galaxy science gateway.

Contributors

keara-soloway, rolfverberg, vkuznet, arwoll


Issues

Request option to set detector config defaults in EDD Calibration

The new / recommended workflow for EDD detector calibration calls for an Energy calibration step followed by a tth calibration. Currently, the output file from Energy calibration includes defaults for, e.g., tth_initial_guess, which in general will be among the first things the user will want to specify.

The current Energy config file reader permits these defaults to be set only at the per-detector level, as in the snippet below.

pipeline:
  # Perform energy calibration first
  - edd.MCAEnergyCalibrationProcessor:
      # Ce lines: Ka2=34.28, Ka1=34.72, Kb1=39.257, Kb2=40.236
      peak_energies: [34.72, 34.28,  39.257, 40.236]
      max_peak_index: 0
      fit_index_ranges: [[650, 850]]
      config:
        spec_file: raw_data/edd23-char-1/spec.log
        scan_number: 87
        scan_step_index: 40
        flux_file: reduced_data/analysis_ARW/flux.dft
        detectors:
          - detector_name: 0
            tth_initial_guess: 7.105
            include_energy_ranges: [[50, 100]]
            tth_max: 10.0
          - detector_name: 2
            tth_initial_guess: 7.105
            include_energy_ranges: [[50, 100]]
            tth_max: 10.0
...

Proposal: permit specification of defaults for all detectors. For instance, any item in the "detectors" section that is NOT "detector_name" could be interpreted as a default for all detectors, UNLESS overwritten in the detector-specific location, e.g.:

pipeline:
  # Perform energy calibration first
  - edd.MCAEnergyCalibrationProcessor:
      # Ce lines: Ka2=34.28, Ka1=34.72, Kb1=39.257, Kb2=40.236
      peak_energies: [34.72, 34.28,  39.257, 40.236]
      max_peak_index: 0
      fit_index_ranges: [[650, 850]]
      config:
        spec_file: raw_data/edd23-char-1/spec.log
        scan_number: 87
        scan_step_index: 40
        flux_file: reduced_data/analysis_ARW/flux.dft
        detectors:
          - tth_initial_guess: 7.105
          - include_energy_ranges: [[50, 100]]
          - tth_max: 10.0
          - detector_name: 0             # <----  detector "0" uses all three defaults above
          - detector_name: 2             # <---- detector "2" uses the defaults for tth_initial_guess and tth_max, but overrides the energy range, e.g. due to some errant peak in that particular case
            include_energy_ranges: [[50, 90]]
...
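For illustration, here is a minimal Python sketch of the proposed merging logic (an assumption about one possible implementation, not existing CHAP code): entries in the detectors list without a detector_name act as shared defaults, and detector-specific entries override them.

def merge_detector_defaults(detectors):
    """Fold default entries (those without a 'detector_name') into every
    detector-specific entry; per-detector values win over the defaults."""
    defaults, merged = {}, []
    for entry in detectors:
        if 'detector_name' in entry:
            merged.append({**defaults, **entry})  # entry overrides defaults
        else:
            defaults.update(entry)
    return merged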

TOMO: detector pixels frame of reference

TOMO: report all detector pixel coordinates relative to the detector frame, instead of to the cropped image bounds. This applies to all figures, but also to NeXus output quantities like center_rows.
Also add local/long names to relevant NeXus output fields.
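As a one-line sketch of the requested convention (crop_row_offset is a hypothetical name for the first detector row kept after cropping):

# Shift a quantity reported in cropped-image coordinates back into the
# full detector frame:
center_row_detector = crop_row_offset + center_row_cropped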

Use MapReader with SMB-style .par files

MapReader should be able to accept a full map config dict, or accept a .par file and a list of columns to be used as independent dimensions and construct a MapConfig from those (see the sketch below). See edd.models.StrainAnalysisConfig for an example.
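A hedged sketch of the .par-file half of this (illustrative names, not the actual CHAP API): each selected column becomes one independent-dimension entry of the kind MapConfig already expects.

def par_columns_to_independent_dimensions(columns):
    """Turn a list of SMB .par column names into MapConfig-style
    independent_dimensions entries ('units' would still need to be
    supplied per column; omitted here)."""
    return [{'label': col, 'data_type': 'smb_par', 'name': col}
            for col in columns]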

Previously working EDD map pipeline now fails

Describe the bug
A previously working EDD map pipeline, constructed in December, now fails.

To Reproduce
Provide a minimal pipeline configuration in which the bug appears:

config:
  root: /nfs/chess/auxiliary/cycles/2023-3/id1a3/ko-3538-b/
  inputdir: raw_data/edd23-char-1/
  outputdir: reduced_data/analysis_ARW/
  log_level: debug

pipeline:
  - common.MapReader:
      map_config:
        title: edd-char-1-87
        station: id1a3
        experiment_type: EDD
        sample:
          name: ceria
        spec_scans:
        - spec_file: spec.log
          scan_numbers: 
          - 87
        independent_dimensions:
        - label: samy
          units: mm
          data_type: spec_motor
          name: sampYcp
        # ADD ANOTHER IND. DIM. IF NEEDED
      detector_names: [0,2,3,5,6,7,8,10,13,14,16,17,18,19,21,22]
  - common.NexusWriter:
      filename: edd-char-1-87-TEST.nxs
      force_overwrite: true

Expected behavior
Should produce the nexus file listed above, edd-char-1-87-TEST.nxs.

Screenshots
(CHAP_edd) [aw30@lnx-id3b-2 analysis_ARW]$ CHAP edd-char-1-87-ceria-map-pipeline.yaml
CHAP.runner : INFO: Input pipeline configuration: [{'common.MapReader': {'map_config': {'title': 'edd-char-1-87', 'station': 'id1a3', 'experiment_type': 'EDD', 'sample': {'name': 'ceria'}, 'spec_scans': [{'spec_file': 'spec.log', 'scan_numbers': [87]}], 'independent_dimensions': [{'label': 'samy', 'units': 'mm', 'data_type': 'spec_motor', 'name': 'sampYcp'}]}, 'detector_names': [0, 2, 3, 5, 6, 7, 8, 10, 13, 14, 16, 17, 18, 19, 21, 22]}}, {'common.NexusWriter': {'filename': 'edd-char-1-87-TEST.nxs', 'force_overwrite': True}}]

CHAP.runner : INFO: Loaded <CHAP.common.reader.MapReader object at 0x7fdb7a015510>
CHAP.runner : INFO: Loaded <CHAP.common.writer.NexusWriter object at 0x7fdb7a0156f0>
CHAP.runner : INFO: Loaded <CHAP.pipeline.Pipeline object at 0x7fdb7a0153c0> with 2 items

CHAP.runner : INFO: Calling "execute" on <CHAP.pipeline.Pipeline object at 0x7fdb7a0153c0>
Pipeline : INFO: Executing "execute"

Pipeline : INFO: Calling "execute" on <CHAP.common.reader.MapReader object at 0x7fdb7a015510>
MapReader : DEBUG: Executing "read" with {'inputdir': '/nfs/chess/previousid1a3/2023-3/ko-3538-b/edd23-char-1', 'map_config': {'title': 'edd-char-1-87', 'station': 'id1a3', 'experiment_type': 'EDD', 'sample': {'name': 'ceria'}, 'spec_scans': [{'spec_file': 'spec.log', 'scan_numbers': [87]}], 'independent_dimensions': [{'label': 'samy', 'units': 'mm', 'data_type': 'spec_motor', 'name': 'sampYcp'}]}, 'detector_names': [0, 2, 3, 5, 6, 7, 8, 10, 13, 14, 16, 17, 18, 19, 21, 22]}
MapReader : INFO: Executing "read"
Traceback (most recent call last):
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/common/models/map.py", line 308, in validate_for_spec_scans
    self.get_value(scans, scan_number, index)
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/common/models/map.py", line 387, in get_value
    return get_spec_motor_value(spec_scans.spec_file,
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/common/models/map.py", line 446, in get_spec_motor_value
    scanparser.get_spec_scan_motor_vals(
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/utils/scanparsers.py", line 1213, in get_spec_scan_motor_vals
    raise NotImplementedError('Only relative motor values are available.')
NotImplementedError: Only relative motor values are available.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nfs/chess/user/aw30/miniconda3/envs/CHAP_edd/bin/CHAP", line 33, in <module>
    sys.exit(load_entry_point('ChessAnalysisPipeline', 'console_scripts', 'CHAP')())
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/runner.py", line 100, in main
    runner(run_config, pipeline_config)
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/runner.py", line 115, in runner
    run(pipeline_config,
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/runner.py", line 218, in run
    pipeline.execute()
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/pipeline.py", line 44, in execute
    data = item.execute(data=data, **kwargs)
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/pipeline.py", line 168, in execute
    data = method(**args)
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/common/reader.py", line 121, in read
    map_config = MapConfig(**map_config, inputdir=inputdir)
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1076, in pydantic.main.validate_model
  File "pydantic/fields.py", line 895, in pydantic.fields.ModelField.validate
  File "pydantic/fields.py", line 928, in pydantic.fields.ModelField._validate_sequence_like
  File "pydantic/fields.py", line 1094, in pydantic.fields.ModelField._validate_singleton
  File "pydantic/fields.py", line 898, in pydantic.fields.ModelField.validate
  File "pydantic/fields.py", line 1151, in pydantic.fields.ModelField._apply_validators
  File "pydantic/class_validators.py", line 339, in pydantic.class_validators._generic_validator_basic.lambda14
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/common/models/map.py", line 568, in validate_data_source_for_map_config
    return _validate_data_source_for_map_config(data_source, values)
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/common/models/map.py", line 566, in _validate_data_source_for_map_config
    data_source.validate_for_spec_scans(values.get('spec_scans'))
  File "/nfs/chess/user/aw30/Git_Repos/ChessAnalysisPipeline/CHAP/common/models/map.py", line 310, in validate_for_spec_scans
    raise RuntimeError(
RuntimeError: Could not find data for sampYcp (data_type "spec_motor") on scan number 87 for index 0 in spec file /nfs/chess/previousid1a3/2023-3/ko-3538-b/edd23-char-1/spec.log

Environment (please complete the following information):
This was run on the cluster using the current version of the edd-spring2024 branch.

Additional context
I briefly thought the path was not properly resolving: my pipeline specifies

root: /nfs/chess/auxiliary/cycles/2023-3/id1a3/ko-3538-b/ 
inputdir: raw_data/edd23-char-1/

whereas the Traceback complains about the location:

/nfs/chess/previousid1a3/2023-3/ko-3538-b/edd23-char-1/spec.log

but these are clearly resolving to the same file.

Based on the error message "Only relative motor values are available" I suspect that this error is caused by changes made to accommodate the current EDD workflow.

Link to galaxy tool

Create a generic tool (in common or utils? as a processor or a writer?) to create a history and upload files from the command line, given a map and potentially additional input files.

See /nfs/chess/user/rv43/Tomo/workflow/link_to_galaxy.py for the former Tomo tool that does something similar.

EDD Energy / tth calibration option to specify & combine more than one scan point for better statistics

In the current EDD calibration workflow, the option exists to calibrate using a specific point of a scan, e.g. as follows:

pipeline:
  # Perform energy calibration first
  - edd.MCAEnergyCalibrationProcessor:
      # Ce lines: Ka2=34.28, Ka1=34.72, Kb1=39.257, Kb2=40.236
      peak_energies: [34.72, 34.28,  39.257, 40.236]
      max_peak_index: 0
      fit_index_ranges: [[650, 850]]
      config:
        spec_file: raw_data/edd23-char-1/spec.log
        scan_number: 87
        scan_step_index: 40
        flux_file: reduced_data/analysis_ARW/flux.dft
        detectors:
          - detector_name: 0
            tth_initial_guess: 7.105
            include_energy_ranges: [[50, 100]]
            tth_max: 10.0
...

Propose change "scan_step_index" to "scan_step_indices" to allow the user to specify performing the calibration with a SUM of spectra from multiple points in scan, e.g.:

pipeline:
...
      config:
        spec_file: raw_data/edd23-char-1/spec.log
        scan_number: 87
        scan_step_indices: [38,39,40,41,42]
...

Or better yet, something functionally equivalent to

pipeline:
...
      config:
        spec_file: raw_data/edd23-char-1/spec.log
        scan_number: 87
        scan_step_indices: list(range(30,51))
...

This would permit calibration with better statistics.
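As a minimal sketch of the summing this would enable (function and argument names here are illustrative, not the CHAP API):

import numpy as np

def summed_spectrum(spectra, scan_step_indices):
    """Sum MCA spectra over the selected scan steps for better statistics.
    `spectra` has shape (num_scan_steps, num_channels)."""
    return np.asarray(spectra)[scan_step_indices].sum(axis=0)

# e.g. summed_spectrum(spectra, [38, 39, 40, 41, 42])
# or   summed_spectrum(spectra, list(range(30, 51)))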

Bugs in docs/installation.md

The values supplied to options for the literalinclude directive need to be enclosed in quotes.

Language identifiers for code blocks should not be in braces.

CHAP.runner.parser should be a function that returns an argparse.ArgumentParser

Broken action: "Deploy Sphinx documentation to Pages"

Github's new deployment protection rule breaks the "Deploy Sphinx documentation to Pages" action.

From github's blog:

"...We are also preventing tags with the same name as a protected branch from deploying to the environments with branch policies around protected branches."

conda-forge and galaxy-dev ci/cd

There are still manual steps in the CHAP release -> galaxy tool pipeline:

  1. Approval of merge request from the conda-forge regro-cf-autotick-bot (usually happens after ~2 hours)
  2. Installation of galaxy tools on galaxy-dev (must happen only after 1. is done)

According to Installing Tools into Galaxy:

Automated installation - The process of installing tools from Tool Shed can be performed in an automated way using a set of scripts. This is particularly useful if you are trying to install a large number of tools. The required scripts are available as an Ansible playbook from here. Please see that page for complete instructions.

Energy Calibration output image files need "detector_name" included.

In the spring 2024 branch, Rolf helped me change line 1266 from this:

            figfile = os.path.join(outputdir, 'energy_calibration_fit.png')

to this:

            figfile = os.path.join(outputdir, f'energy_calibration_fit_{detector.detector_name}.png')

to disambiguate output files when fitting more than one detector. (I forgot to make a new branch prior to making this change so am suggesting it here rather than with a pull request...)

EDD fall 2023

Remaining tasks for the fall 2023 EDD workflow:

  • Add new Processor to refine lattice parameters.

StrainAnalysisProcessor & MCACeriaCalibrationProcessor:

  • select mask & HKLs for fitting in one interactive plot (not two separate ones).

DiffractionVolumeLengthProcessor:

  • Include text annotation on final plot: measured DVL
  • Include in results: parameters of the gaussian fit
  • Account for sample thickness

StrainAnalysisProcessor:

  • new interaction point: materials parameters (lattice params & space group) selection
  • allow for a variable tth angle to be used at each point in the map
  • show a flattened map (2d image) of all MCA data underneath the 1D reference spectrum shown when selecting HKLs for fitting
  • indicate the mask used for calibration when selecting the mask to use for strain analysis
  • place fit metadata, results, residuals, chisq values, etc. into the structure returned by get_nxroot. Include: goodness of fit value (redchi), success / failure flag.
  • include extra dataset in resulting nexus structure: map of integral of each MCA spectrum
  • outstanding question for CB, PK, KN: how to handle "jagged" maps or data taken at duplicate coordinates from input par files?
  • Generate plot: resulting strains
  • Generate plots: raw data & best fit at each map point
  • Fix bug: Material name in GUI for selecting material parameters is always "Ni"
  • Fix bug: in the returned nexus structure, initial guesses for centers on the unconstrained fits are not being recorded properly (uniform fits are okay, though)
  • Indicate that the microstrains are calculated from the unconstrained fits in the resulting nexus structure
  • Be more specific about what the reference spectrum shown is in the mask & HKL selection window
  • (non-urgent) add option to click on a location on the y-axis on the lower 2D reference map & make that the reference spectrum shown on the upper 1D plot
  • zero-pad the frame_n.png filenames

CI/CD for galaxy tools

Add a github workflow that uses planemo to deploy the CHAP galaxy tools to the usual toolshed. On every commit? Or on every new release?

CHAP.version

Add a CHAP.version attribute that holds the version string.
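A possible sketch, assuming the installed distribution is named ChessAnalysisPipeline (the PyPI name mentioned in the README above); reading the string from the installed package metadata avoids maintaining it in two places:

# CHAP/__init__.py (sketch)
from importlib.metadata import version as _distribution_version

version = _distribution_version('ChessAnalysisPipeline')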

PipelineItem.get_configs

Add a get_configs method to PipelineItem. Use it to replace the nearly identical get_configs methods currently duplicated across many of the implemented Processors.

Roughly:

def get_configs(self, data, schema):
    """Look through `data` for an item whose value for the `'schema'`
    key matches `schema`. Convert the value for that item's `'data'`
    key into the configuration `BaseModel` identified by `schema` and
    return it.

    :param data: input data from a previous `PipelineItem`
    :type data: list[PipelineData]
    :param schema: name of the `BaseModel` class to match in `data` & return
    :type schema: str
    :raises ValueError: if there's no match for `schema` in `data`
    :return: matching configuration model
    :rtype: BaseModel
    """
    # Sketch of one possible body (an assumption, not settled API): resolve
    # a schema like 'edd.MCACeriaCalibrationConfig' to a class exported by
    # the corresponding CHAP subpackage, then validate the payload with it.
    from importlib import import_module
    for item in data:
        if item.get('schema') == schema:
            module_name, class_name = schema.rsplit('.', 1)
            model_class = getattr(
                import_module(f'CHAP.{module_name}'), class_name)
            return model_class(**item.get('data'))
    raise ValueError(f'No match for schema {schema} in data')

NB: This would impact pipeline configuration: schema will now need to give the full module path to the BaseModel of interest. For example, in examples/edd/pipeline.yaml, L14:
schema: MCACeriaCalibrationConfig
would need to become
schema: edd.MCACeriaCalibrationConfig
assuming we also add the members of CHAP/edd/models.py to CHAP/edd/__init__.py.

Refactor Tomo from the NXTomo structure to a MapConfig compatible structure

Refactor Tomo from the NXTomo structure to a MapConfig structure, including using independent_dimensions for SMB. This has some issues with the theta dimension that would need a few hard-wired code parts (get_spec_scan_npts and getting theta associated with a fake motor mnemonic), but otherwise looks pretty straightforward.

For SMB you would have in the map yaml something like:
independent_dimensions:
  - label: theta
    units: degrees
    data_type: spec_motor
    name: ome
  - label: horizontal_shift
    units: mm
    data_type: smb_par
    name: labx
  - label: vertical_shift
    units: mm
    data_type: smb_par
    name: labz
scalar_data:
  - label: theta_start
    units: degrees
    data_type: smb_par
    name: omestart
  - label: theta_end
    units: degrees
    data_type: smb_par
    name: omeend
  - label: num_theta
    units: '-'
    data_type: smb_par
    name: nframes

unit tests

Add unit tests for the CHAP module; integrate their execution with CI/CD.

  • Enumerate testable entities
  • Decide what metrics need to be tested on each one
  • Implement

Flexible spec_numbers input in pipeline.yaml

Right now you enter spec info in the pipeline.yaml like:

spec_scans:
- spec_file: set2_c1-1/spec.log
  scan_numbers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

Create an option where you can enter the scan_numbers to include as a range, along the lines of:

spec_scans:
- spec_file: set2_c1-1/spec.log
  scan_number_range: [1, 13]

Or something similar (one possible sketch follows).
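One possible sketch with pydantic v1 (which the tracebacks in this tracker show in use); the class and field names are assumptions, not the existing CHAP schema:

from typing import List, Optional, Tuple
from pydantic import BaseModel, root_validator

class SpecScans(BaseModel):
    spec_file: str
    scan_numbers: List[int] = []
    scan_number_range: Optional[Tuple[int, int]] = None

    @root_validator
    def expand_scan_number_range(cls, values):
        # Expand an inclusive [first, last] range into explicit scan numbers.
        rng = values.get('scan_number_range')
        if rng and not values.get('scan_numbers'):
            first, last = rng
            values['scan_numbers'] = list(range(first, last + 1))
        return values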

Galaxy tool tests

Make sure every galaxy tool .xml file has something in its <tests> section.

CI/CD for pylint

Add a github action to run pylint on the CHAP module on every commit. Use the existing .pylintrc file to configure pylint, but change the fail-under parameter on line 42 to 8.

Docs

Add docs, build w/ sphinx & deploy to github pages w/ CI/CD.

Add cursory galaxy docs for users to the CLASSE computing wiki.

Add configuration pipeline item

Describe intended use of the requested PipelineItem
To do proper integration with CHAPBook and allow portable configuration and location of pipeline workflows, it will be useful to develop a configuration module (with some defaults) which will perform the following:

  • set up the CHAP ROOT location, which will be used by the CHAP tool and pipelines to load various workflows
  • define the location of the workflows area
  • optionally define additional flags, like profile, etc.

Then, we can use such a configuration pipeline module/item in every configuration, e.g.

# configuration
config:
  root: /path/workflows
  profile: true
  verbose_level: 1
  interactive_prompt: true
  input: /path/to/input/location
  output: /path/to/output/location

pipeline:
  # Collect map data
  - common.YAMLReader:
      filename: examples/saxswaxs/map_1d.yaml
      schema: MapConfig

In this example, root defines the main root area CHAP will use, and all other pipeline files will use it, i.e. examples/saxswaxs/map_1d.yaml will be resolved not within the current working directory but with respect to the root path. The profile option will run the profiler, so that we do not need to specify this parameter at the CLI, etc. We may have further extensions to the configuration module, e.g. run an interactive pipeline, or use a verbosity level.
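A minimal sketch of the root-relative resolution described above (the function name is hypothetical):

import os

def resolve_path(path, root):
    """Interpret `path` relative to the configured CHAP root unless it is
    already absolute."""
    return path if os.path.isabs(path) else os.path.join(root, path)

# resolve_path('examples/saxswaxs/map_1d.yaml', '/path/workflows')
# -> '/path/workflows/examples/saxswaxs/map_1d.yaml'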

Additional context
Such a configuration module will allow integration with the CHAPBook service and/or relocation of some parts of CHAP, e.g. examples, workflows, etc., to a different location.

CHAP.common.StrainAnalysisProcessor

Implement CHAP.common.StrainAnalysisProcessor. This needs to be ready to go for the 2023-3 run (first day of users: Wednesday, October 18th).

Issue templates

Create issue templates for this project -- bug report, feature request, etc.
