ctlearn-project / ctlearn

Deep Learning for IACT Event Reconstruction

License: BSD 3-Clause "New" or "Revised" License

Python 90.58% Jupyter Notebook 8.85% Shell 0.56%

ctlearn's Issues

Fix bug with DataLoader metadata

The following code fails with a KeyError:

>>> from ctalearn.data_loading import HDF5DataLoader
>>> data_loader = HDF5DataLoader(['/home/shevek/datasets/sample_prototype/gamma_20deg_0deg_srun4-100___cta-prod3_desert-2150m-Paranal-HB9_cone10.h5', '/home/shevek/datasets/sample_prototype/proton_20deg_0deg_srun1-10___cta-prod3_desert-2150m-Paranal-HB9.h5'])
>>> data_loader.get_metadata()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/shevek/brill/ctalearn/ctalearn/data_loading.py", line 312, in get_metadata
    metadata['total_aux_params'] += metadata['num_additional_aux_params']
KeyError: 'num_additional_aux_params'
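
A minimal sketch of one way to avoid this kind of KeyError, assuming the optional key should count as zero when it is absent. The helper name and the dictionary contents are hypothetical and are not the actual ctalearn code.

```python
def add_additional_aux_params(metadata):
    """Add the optional auxiliary parameter count to the running total.

    Assumption: a missing 'num_additional_aux_params' key means that
    there are no additional auxiliary parameters.
    """
    metadata['total_aux_params'] += metadata.get('num_additional_aux_params', 0)
    return metadata


# Hypothetical metadata that lacks the optional key
metadata = add_additional_aux_params({'total_aux_params': 3})
```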

Support for channels-first mode

When training on GPU, changing the dimension order can give a training speedup. It is unclear whether we are limited by I/O or by actual training speed, so the effect on final training performance may be limited.

See https://www.tensorflow.org/performance/performance_guide#data_formats for explanation.
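
As an illustration of what changing the dimension order means, the sketch below converts a batch from channels-last (NHWC) to channels-first (NCHW) with NumPy. The batch shape is a made-up example and the function name is hypothetical.

```python
import numpy as np


def to_channels_first(batch):
    """Convert a batch from NHWC (channels-last) to NCHW (channels-first)."""
    return np.transpose(batch, (0, 3, 1, 2))


# Hypothetical batch: 16 images of 48x48 pixels with 1 channel
batch_nhwc = np.zeros((16, 48, 48, 1), dtype=np.float32)
batch_nchw = to_channels_first(batch_nhwc)
```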

Make the Model model_directory point to ctlearn/models by default

Make this setting optional, with the default pointing to the ctlearn/models directory wherever CTLearn is installed. This can be done in a similar way to f4a420f and ffcfda8. This simplifies using default models and makes one less thing that the user has to change to replicate results using provided config files.
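
A possible sketch of such a default, assuming the models directory sits next to the installed package code. The function name and the example paths are hypothetical, and this is not necessarily how f4a420f and ffcfda8 do it.

```python
import os


def resolve_model_directory(model_settings, package_dir):
    """Return the configured model directory, or <package_dir>/models if unset."""
    configured = model_settings.get('model_directory')
    if configured:
        return os.path.abspath(configured)
    return os.path.abspath(os.path.join(package_dir, 'models'))


# Inside the package, package_dir could be os.path.dirname(os.path.abspath(__file__)).
default_dir = resolve_model_directory({}, '/opt/ctlearn/ctlearn')
```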

Implementing CNNBlock Class

Implement CNNBlock class to define a standard interface for all CNN blocks.

Add removal instructions

Some users were concerned that the recommended installation procedure takes up a lot of disk space. Instructions for removing ctalearn and cleaning up all the dependencies that were installed along with it would be appreciated.

Make mapping tables configurable

At present, the mapping tables from pixel vectors to camera shapes are fixed. The following configuration options may be added:

Additional padding around the camera image (default: none). This could be useful for matching a shower image to the fixed size expected by a model if resizing is not preferred, or for constraining images from different cameras to have the same shape. This applies to all telescope types.
Hexagonal to square pixel conversion method, such as oversampling or warping, to apply (default: uncertain). Each method will require its own parameters, e.g. whether to apply smoothing when oversampling and which technique to use for it. This applies only to telescope types with hexagonal pixels.
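
The padding option could look roughly like the following sketch, assuming images are stored as [height, width, channels] arrays and that padding means the same number of zero-valued pixels on every side. The function name is hypothetical.

```python
import numpy as np


def pad_image(image, padding):
    """Zero-pad the height and width of an [H, W, C] image by `padding` pixels per side."""
    return np.pad(image, ((padding, padding), (padding, padding), (0, 0)), mode='constant')


image = np.ones((8, 8, 1))
padded = pad_image(image, 2)  # shape becomes (12, 12, 1)
```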

Correct telescope sorting options

Make the sort_telescopes_by_trigger option do what it says, and include the current functionality as a separate option sort_telescopes_by_size.
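
To make the distinction concrete, here is a rough sketch of the two orderings, assuming that "size" means the total charge in a telescope image and that triggers are 0/1 flags. The function and its signature are hypothetical, not the existing ctlearn code.

```python
import numpy as np


def sort_telescopes(images, triggers, by='trigger'):
    """Reorder telescope images along axis 0.

    by='trigger': triggered telescopes first (stable within each group).
    by='size': descending total image charge.
    """
    triggers = np.asarray(triggers, dtype=int)
    if by == 'trigger':
        order = np.argsort(-triggers, kind='stable')
    elif by == 'size':
        sizes = images.reshape(len(images), -1).sum(axis=1)
        order = np.argsort(-sizes, kind='stable')
    else:
        raise ValueError("by must be 'trigger' or 'size'")
    return images[order], triggers[order]


# Three hypothetical 2x2 images with total charge 0, 1 and 3
images = np.array([np.full((2, 2), v) for v in (0.0, 0.25, 0.75)])
triggers = [0, 1, 1]
```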

Reproduction of benchmark 0.2.0 results

I have reproduced the benchmark 0.2.0 results in the UCM server (unsure of the NVIDIA GPU model @nietootein ?).

Input	Telescope Type	Train Events	Val Events	Train steps	Batch size	Train time	Val Acc	Val Gamma Acc	Val Proton Acc	Val AUC
Single	LST	161631	17960	37500	16	1h 0m 55s	0.7034521	0.63914883	0.7676981	0.7905172
Single	MSTF	666288	74032	37500	16	1h 25m 39s	0.7445564	0.8048835	0.68821025	0.8311684
Single	MSTN	772385	85821	37500	16	1h 29m 53s	0.7803801	0.82118773	0.742127	0.8679488
Single	MSTS	541990	60222	37500	16	1h 14m 38s	0.78469664	0.8445853	0.7228224	0.86962146
Single	SST1	379611	42180	37500	16	55m 59s	0.7793741	0.8133864	0.74446315	0.8592271
Single	SSTA	404866	44986	37500	16	1h 7m 45s	0.72549236	0.66570693	0.78389066	0.8107811
Single	SSTC	392626	43626	37500	16	1h 5m 31s	0.7493009	0.75805485	0.7405917	0.8215633
Array	LST	76860	8541	37500	16	37m 27s	0.7280178	0.8020253	0.6643433	0.82055163
Array	MSTF	224831	24982	37500	16	2h 7m 11s	0.80393887	0.8210466	0.78773	0.895706
Array	MSTN	242425	26937	37500	16	2h 11m 18s	0.8277462	0.87108314	0.7881501	0.9178416
Array	MSTS	200745	22306	37500	16	2h 31m 13s	0.8198691	0.8415895	0.7982652	0.9099872
Array	SST1	178090	19788	37500	16	3h 13m 27s	0.7991712	0.7722142	0.8288231	0.89470106
Array	SSTA	165302	18367	37500	16	2h 9m 23s	0.76920563	0.7152889	0.82571363	0.86622685
Array	SSTC	171574	19064	37500	16	1h 51m 9s	0.8208665	0.81669563	0.8252668	0.90955627

The results are comparable to those reported in config/v_0_2_0_benchmark/readme.md.

The models where the difference in accuracy was more than 1% are: SSTC single telescope, SST1 array, SSTA array, SSTC array. In all those cases the difference in accuracy was below 2%.

The difference in AUROC was below 0.01 in magnitude for all models except SSTA array, where it was 0.01062685.

The difference in train times is more significant, but this can be easily explained by the use of different machines.

Refactor data loading

Refactor load_HDF5_data.py to have a data loader class that parses settings and has methods to return numpy arrays of data, instead of a set of free-floating functions that must be called in an undocumented order. Instead of being stored in separate external dictionaries, the metadata, auxiliary data, and processed parameters should be stored internally in the class. The class should have a legible storage structure that distinguishes between parameters inherent in the dataset (metadata, auxiliary data) and those that rely on the settings specified by the user (the settings arguments and "processed parameters").

A suggested API is as follows:
class HDF5_data_loader(data_files, data_loading_settings, data_processing_settings, image_mapping_settings)
The settings arguments are dictionaries with settings for methods implemented directly in this module, in process_data.py, and in image_mapping.py, respectively.

HDF5_data_loader.load_data(filename, index)
Return the data from the specified filename and index as a numpy array. Note that because data_loader already knows whether single or multiple tel data are requested as a setting, and the metadata and auxiliary data are stored in the class, it's now only necessary to specify the filename and index. The method should automatically return the correct kind of data.

HDF5_data_loader.get_generators(training=False, validation=False, test=False)
Returns the specified generators. Allowed arguments are training=True, validation=False, test=False; training=True, validation=True, test=False; training=False, validation=False, test=True; all other combinations raise an error.
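
The allowed argument combinations could be checked with something like the sketch below. The constant and function names are hypothetical, and the choice of ValueError is an assumption; the proposal only says that other combinations raise an error.

```python
ALLOWED_COMBINATIONS = {
    (True, False, False),   # training only
    (True, True, False),    # training and validation
    (False, False, True),   # test only
}


def check_generator_request(training=False, validation=False, test=False):
    """Raise ValueError unless the requested combination of generators is allowed."""
    request = (bool(training), bool(validation), bool(test))
    if request not in ALLOWED_COMBINATIONS:
        raise ValueError(
            "Unsupported generator combination: "
            "training={}, validation={}, test={}".format(*request))
    return request
```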

Depends on #14.

Rewrite MobileNet implementation

The current MobileNet code has a dependency on Tensorflow slim, with the implication that the train op must be implemented in slim as well. Rewrite it using the standard layers API.

Add time channel to images

This is a request for a new feature.

So far only single-channel images, where that channel contains the image charge, are loaded and passed to the networks. The arrival time of such a charge is also available for most telescope types (except for ASTRI) and is actually stored in the DL1 h5 files. These arrival times have been used in the past to help "clean" the charge image, since the arrival times for pixels receiving most of their charge from shower photons are correlated, whereas the arrival times for neighboring pixels illuminated just by night sky background are typically uncorrelated. Thus, it would be interesting to be able to parse our data as two-channel images, one channel containing the charge and the other containing the arrival times, in the hope that this additional timing channel improves on the event reconstruction performed using the charge alone.

Implementation of this new feature seems more natural after #29 is resolved.

Make train.py compatible with the data_loader and data_processor classes

Split data_processing_settings into data_loading_settings and data_processing_settings. There should be four dictionaries of data-related settings:
data_input_settings: for TensorFlow Estimator input_fn
data_loading_settings: for loading HDF5 data (methods directly implemented in load_HDF5_data.py)
Includes validation_split, min_num_tels and cut_condition, use_telescope_position, chosen_telescope_types, model_type
data_processing_settings: for processing data in process_data.py
Includes sort_telescopes_by_trigger, crop_images, log_normalize_charge, all cleaning and cropping options
image_mapping_settings: for mapping pixels to arrays in image_mapping.py, to be implemented in a separate issue #10

Rewrite the section starting on line 140 with "if data_format == 'hdf5':" to use the data_loader class methods.

Depends on #15 and #16.

Specify prediction output format

Specify the predict output format and make predict.py able to both return data in that format and write it to file. A suggested format is a list of filenames and a numpy array of file_index, event_index, predictions, classifier values, where file_index is the index in the list of filenames and event_index corresponds to the index in the event table for event classification and the index in a telescope table for single telescope classification. Predictions and classifier values are the contents of the dictionary returned by tf.Estimator.predict(). The output format needs to handle both the case in which the true labels are available (simulations) and the case in which they are not (data). The long-term aim is for it to be easy for ctapipe to read in and translate the predictions into its native format.
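
One way the suggested format might be laid out is a NumPy structured array next to the list of filenames. This is only a sketch: the field names, dtypes, example values, and the use of -1 for a missing true label are all assumptions.

```python
import numpy as np

# Hypothetical list of input files; file_index refers to positions in this list
filenames = ['gamma.h5', 'proton.h5']

prediction_dtype = np.dtype([
    ('file_index', np.int64),
    ('event_index', np.int64),
    ('prediction', np.int64),
    ('classifier_value', np.float32),
    ('true_label', np.int64),  # assumption: -1 when no true label is available
])

# Made-up example rows, not real results
records = np.array([(0, 12, 0, 0.91, 0), (1, 3, 1, 0.27, -1)], dtype=prediction_dtype)

# Look up the file that the second prediction came from
source_file = filenames[records[1]['file_index']]
```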

Make visualize_bounding_boxes.py use the HDF5_data_loader class

Depends on #15.

ZeroDivisionError in `_apply_cuts` during `HDF5DataLoader` initialization

Configuration file of the run:
20180905_231431_config.txt

(it's actually a .yml file, but I had to change the extension to upload it)

List of example files used (they are part of the benchmark):
sample_files.txt

Traceback:

Traceback (most recent call last):
  File "/home/jsevillamol/Documentos/ctlearn_clean/ctlearn/run_model.py", line 459, in <module>
    run_model(config, mode=args.mode, debug=args.debug, log_to_file=args.log_to_file)
  File "/home/jsevillamol/Documentos/ctlearn_clean/ctlearn/run_model.py", line 130, in run_model
    **data_loading_settings)
  File "/home/jsevillamol/anaconda3/envs/ctlearn/lib/python3.6/site-packages/ctlearn/data_loading.py", line 136, in __init__
    self._apply_cuts()
  File "/home/jsevillamol/anaconda3/envs/ctlearn/lib/python3.6/site-packages/ctlearn/data_loading.py", line 536, in _apply_cuts
    self.class_weights.append(num_examples/float(self.passing_num_examples_by_particle_id[particle_id]))
ZeroDivisionError: float division by zero
Closing remaining open files:/home/jsevillamol/Documentos/datasample/gamma_20deg_0deg_srun103-219___cta-prod3_desert-2150m-Paranal-HB9_cone10.h5...done/home/jsevillamol/Documentos/datasample/proton_20deg_0deg_srun1-10___cta-prod3_desert-2150m-Paranal-HB9.h5...done

You can replicate it also just running this:

data_files = ['/home/jsevillamol/Documentos/ctlearn/datasample/gamma_20deg_0deg_srun103-219___cta-prod3_desert-2150m-Paranal-HB9_cone10.h5', '/home/jsevillamol/Documentos/ctlearn/datasample/proton_20deg_0deg_srun1-10___cta-prod3_desert-2150m-Paranal-HB9.h5']
image_mapping_settings = {'hex_conversion_algorithm': 'oversampling', 'padding': {'LST': 2, 'MSTF': 1, 'MSTN': 2, 'MSTS': 4, 'SST1': 1, 'SSTA': 0, 'SSTC': 0, 'VTS': 1}}
data_loading_settings = {'cut_condition': '(mc_energy > 1.0) & (h_first_int < 20000)', 'example_type': 'array', 'min_num_tels': 1, 'seed': 1234, 'selected_tel_ids': [1, 2, 3, 4], 'selected_tel_type': 'LST', 'validation_split': 0.1}
    
data_loader = HDF5DataLoader(
            data_files,
            mode='train',
            image_mapper=ImageMapper(**image_mapping_settings),
            **data_loading_settings)

(change data_files to whichever location has the relevant files)
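
The failing line divides by the number of examples of each particle type that pass the cuts, so the error suggests that one class has no passing examples. A hedged sketch of a guard follows, assuming num_examples is the total number of passing examples; the function name and the error message are hypothetical.

```python
def compute_class_weights(passing_counts):
    """Return total/count for each particle id, failing clearly on empty classes."""
    total = sum(passing_counts.values())
    empty = [pid for pid, count in passing_counts.items() if count == 0]
    if empty:
        raise ValueError(
            "No examples pass the cuts for particle id(s) {}; check "
            "cut_condition, min_num_tels and the selected telescopes.".format(empty))
    return [total / float(passing_counts[pid]) for pid in sorted(passing_counts)]
```

With hypothetical counts {0: 30, 1: 10} this returns the weights for particle ids 0 and 1 in order, and with {0: 5, 1: 0} it raises a ValueError that names the empty class instead of a bare ZeroDivisionError.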

Clean up repository

Ensure that the Readme is up to date with all the changes for v0.2.0. Update the version number in setup.py.

Create a directory called deprecated in the models directory and move all the unused models there. The unused models are all but cnn_rnn.py, single_tel.py, and variable_input_model.py. The deprecated models will be removed in the next release if not used by then.

Remove the unnecessary ctalearn/ctalearn/scripts directory. Put train.py and predict.py in ctalearn/ctalearn/. (These will be merged into a single module, see #22.)

Create a ctalearn/config directory and move example_config.ini there. This is also where config files for standard networks will go.

Create a ctalearn/scripts directory and move plot_classifier_value.py, plot_roc_curves.py, visualize_bounding_boxes.py, train_configurations.py, and test_metadata.py there.

Remove plot_gpu_util.py as it is not relevant for using ctalearn.

Move ctalearn/misc/images/ to ctalearn/images and delete the ctalearn/misc/ directory.

Make plot_classifier_value.py and plot_roc_curves.py compatible with predict output

Depends on #23.

Allow array-level models to handle additional auxiliary info

In the current implementation of combine_telescopes in the variable input model, the only possible auxiliary info are the telescope positions and telescope triggers. Because cropping is now an option, the shower centroid (x, y) should be provided as additional auxiliary info so that the model knows where in the camera the shower was detected.

One implementation to cleanly allow this is to rename the telescope_positions to telescope_auxiliary_info and concatenate the telescope positions with the shower positions to produce a single auxiliary info tensor. The model should be provided with some structure allowing it to parse the tensor into its components if needed. It is also critical that the total number of auxiliary inputs per telescope be passed into the network, which will require the metadata to be updated after the data processing arguments are known. Therefore this logic should be handled in add_processed_parameters() in data.py.

Change gen_fn_HDF5 and load_data functions so they don't return/require a bytestring filename.

Previous versions of TF have required that string tensors produced and modified using Datasets, map, and pyfunc be converted to bytestrings (no automatic conversion of Python strings). It is unclear whether this issue is fixed in Tensorflow 1.5.1, but if so, these functions should be changed to use Python strings rather than bytestrings.

Replace ConfigObj config file format with YAML

train_configurations.py script must be rewritten to use YAML-formatted config file. Example config file needs to be re-written. Any relevant value-checking/exceptions must be added. Loading/reading config in run_model.py must also be refactored. Variable input model must be updated to match new configuration option names.

The standard pattern for loading configuration options is as follows: the settings used for each part of the pipeline/each class (DataLoader, DataProcessor, ImageMapper) should be collected in separate sections within the config file and read in as a complete dictionary. This dictionary is then unpacked directly into the constructor of the corresponding class to pass all desired settings (which should all be implemented as keyword arguments).
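
The pattern can be sketched as follows. The class here is a hypothetical stand-in rather than the real DataProcessor, the section name 'Data Processing' is an assumption, and a plain dictionary stands in for what a YAML loader would return.

```python
class DataProcessor:
    """Hypothetical stand-in; every setting is a keyword argument with a default."""

    def __init__(self, crop_images=False, log_normalize_charge=False):
        self.crop_images = crop_images
        self.log_normalize_charge = log_normalize_charge


# Stand-in for the parsed config file
config = {'Data Processing': {'crop_images': True}}

# Unpack the whole section into the constructor; missing options fall back to defaults
data_processor = DataProcessor(**config.get('Data Processing', {}))
```

A side effect of this design is that a misspelled option name fails immediately with a TypeError from the constructor instead of being silently ignored.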

Interpreting output of training

I have run one of the benchmark configurations (concretely config/v_0_2_0_benchmarks/LST_cnn_rnn_config.yml) on training mode.

The beginning of the logfile.log produced is as follows:

INFO:Batch size: 16
INFO:Training and evaluating...
INFO:Total number of training events: 76860
INFO:Total number of validation events: 8541
INFO:Number of training steps per epoch: 4803
INFO:Number of training steps per validation: 2500
INFO:Total number of examples: 377098
INFO:Number of gamma (class 0) examples: 189633 (50.287%)
INFO:Number of proton (class 1) examples: 187465 (49.713%)

This prompts the following questions:

I would have assumed that the Total number of examples was the summation of training and validation events, but it is not. What does it represent then? In single telescope runs it coincides with the summation of training and validation events.
Minor suggestion: Number of training steps per validation should be changed to Number of training steps *between* validations. I got very confused trying to understand what it meant.

After each evaluation, a line is produced saying things like INFO:Saving dict for global step 2500: accuracy = 0.6158129, accuracy_gamma = 0.6657754, accuracy_proton = 0.5658949, auc = 0.6625105, global_step = 2500, loss = 1.3074992.

I assume that those are the metrics on the validation set (correct?). Are they just stored in the eval_validation/events.out.tfevents file?

If so, how could I produce something similar to the plots in config/v_0_2_0_benchmark/readme.md? The plotting scripts in scripts seem to expect a list of .csv files as an argument. I assume this would be the output of running a trained net in prediction mode, but how do I feed the validation set I used for training to the trained net for predictions?

Also, what is the easiest way of reading the training time off the output?

One more question: since the number of training steps per epoch is different but the number of epochs is the same in each config.yml, how come that the last checkpoint is always in step 37500?

Refactor data processing

Refactor process_data.py to have a data processor class for data manipulation and augmentation with methods that accept numpy arrays of unprocessed data and return numpy arrays of processed and/or augmented data. The implementation should be generic with no dependencies on other ctalearn modules.

A suggested API is as follows:

class data_processor(data_processing_settings)
The data_processing_settings are a dictionary that is passed from data_loader.

data_processor.process_data(data)
Argument: numpy array of data
Returns a numpy array of processed data.

data_processor.augment_data(data)
Argument: numpy array of data
Returns a numpy array of augmented data. Since no data augmentation is currently implemented, this function doesn't do anything, but it is where this functionality will be put in the next version.

Depends on #14.

Add config files and optional graphs for important results

Add config files for any results that we use as benchmarks or have shown in presentations. This will aid in archiving and reproducing the results. We should show benchmarks for the newest version of ctalearn, so we can rerun a single telescope network when the other updates have been made and use that as the new benchmark. We should also include the config files for the CNN_RNN results that have been shown in presentations and posters. Optionally, graphs and descriptions can be added as well.

The config files should go in ctalearn/config/ and any plots in ctalearn/images/ (see #25).

Add mapping tables for all telescope types

At present the only mapping table in image.py from vectors of pixels to the image shape is for the SCT (MSTS). In order to process data for other telescope types, mapping tables for other telescope types must be added. It would make sense to start with square-pixel telescopes, as there isn't yet a standard method for converting hexagonal pixels to a square grid.

Make package compatible with ConfigObj

Example configuration files are currently being stored in two locations, ctalearn/config and ctalearn/ctalearn/config. Determine which location is better and merge the two directories.

The script train_configurations.py for automating hyperparameter searches relies heavily on the ConfigParser file format. Update it to match the new configuration file format.

As currently written, run_models.py requires an additional command line argument for the config spec. This is unnecessary because the configuration spec won't change from run to run. Instead hardcode the path to the config spec. See f4a420f and ffcfda8 for a good way to do this.

Update variable_input_model.py to use the new configuration option names.

Add a comment in example_config.conf to indicate that the allowed values and types for each option are identified in config_spec.ini.

Fix or remove test_metadata.py

The use case of the script test_metadata.py is unclear and needs to be clearly defined.

Assuming it's worth keeping, several problems need to be fixed. As written, sections are missing and it doesn't run. There is a hardcoded path on line 26 that needs to be removed. The name "test_metadata" is ambiguous and the script should be renamed to remove the word "test". It also should be updated to be compatible with the HDF5_data_loader class (see #15).

Alternatively, if the script isn't worth keeping, remove it.

Implement Single Telescope training capability

Currently the approach used for training single tel models is to attach a logits layer directly to an array-model CNN-block. It may be worthwhile to implement a way to train arbitrary single tel models without this restriction.

Add default thresholds for two level cleaning for all telescope types

This will allow reasonable cropping in DataProcessor by default with telescopes other than MSTS. The visualize_bounding_boxes.py script can be used to manually determine reasonable thresholds.

Deep copy nested dicts in run_multiple_configurations.py run combinations file

Dictionary config arguments in the run combinations file outputted by run_multiple_configurations.py are sometimes not deep copied, with references appearing instead of the actual configuration parameter. For example, this occurs in the layers parameter recorded in run_combinations.py created when running the v0.2.0 benchmark config (snippet below). This issue doesn't seem to affect performance, just the appearance of the parameters in the saved file.

run00:
  batch_size: 64
  example_type: single_tel
  layers: &id001
  - filters: 32
    kernel_size: 3
  - filters: 32
    kernel_size: 3
  - filters: 64
    kernel_size: 3
  - filters: 128
    kernel_size: 3
  learning_rate: 5.0e-05
  model:
    function: single_tel_model
    module: single_tel
  sorting: null
  tel_type: LST
run01:
  batch_size: 16
  example_type: array
  layers: *id001
  learning_rate: 0.0001
  model:
    function: cnn_rnn_model
    module: cnn_rnn
  sorting: size
  tel_type: LST
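
Anchors and aliases such as &id001 and *id001 are how a YAML dumper typically writes a Python object that is referenced from more than one place. The sketch below reproduces the shared reference with plain dictionaries and removes it with copy.deepcopy; the run names and layer values are illustrative only.

```python
import copy

layers = [{'filters': 32, 'kernel_size': 3}, {'filters': 64, 'kernel_size': 3}]

# Shared reference: both runs point at the very same list object
shared = {'run00': {'layers': layers}, 'run01': {'layers': layers}}

# Deep copy per run: each run owns an independent copy of its nested settings
independent = {name: copy.deepcopy(run) for name, run in shared.items()}
```

After the deep copy the runs compare equal but no longer share objects, so a dumper has nothing to alias.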

Add option to list data files directly in configuration file

Since YAML provides a convenient way to include lists directly in the config file, allow Data:file_list to accept either a path to a file containing file paths (the current method) or a list of file paths written directly in the config file.
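
A small sketch of how both forms might be accepted, assuming a string value is a path to a text file with one data file path per line. The function name is hypothetical.

```python
def resolve_file_list(file_list):
    """Return a list of data file paths from either a list or a path to a list file."""
    if isinstance(file_list, str):
        # A string is treated as a file containing one path per line
        with open(file_list) as f:
            return [line.strip() for line in f if line.strip()]
    # Anything else is treated as an iterable of paths given directly in the config
    return [str(path) for path in file_list]
```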

Make variable_input_model.py compatible with model importing

Telescope IDs are not unique, causing critical error

The current design of DataLoader relies on the tel_id parameter as a unique key to index the telescopes, making checks that each tel_id corresponds to a telescope of the correct type. However, this assumption is invalid. The telescope IDs in the MLProto dataset are not unique. The conflict is with SST1. IDs 1-4 are used by both SST1 and LST; 5-29 by SST1 and MSTF; and 30-33 by SST1 and MSTN.

For example, tel_id 1 is assigned to SST1, so when running a model using LST data, run_model.py crashes with the following output:

Traceback (most recent call last):
  File "ctlearn/ctlearn/run_model.py", line 456, in <module>
    run_model(config, mode=args.mode, debug=args.debug, log_to_file=args.log_to_file)
  File "ctlearn/ctlearn/run_model.py", line 130, in run_model
    **data_loading_settings)
  File "/home/shevek/software/anaconda3/envs/testing-brill/lib/python3.6/site-packages/ctlearn/data_loading.py", line 125, in __init__
    self._select_telescopes(selected_tel_type, tel_ids=selected_tel_ids)
  File "/home/shevek/software/anaconda3/envs/testing-brill/lib/python3.6/site-packages/ctlearn/data_loading.py", line 366, in _select_telescopes
    raise ValueError("Selected tel id {} is of wrong tel type {}.".format(tel_id, all_tel_ids[tel_id]))
ValueError: Selected tel id 1 is of wrong tel type LST.

The treatment of tel_ids in DataLoader needs to be rewritten to accommodate this. The best way may be to move away from integer tel_ids, so that the overall tel_id could be either a string of the telescope type concatenated with the tel_id number, perhaps with a separator, or a tuple of (tel_type, tel_id).
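
The tuple option could be sketched like this, with telescopes stored in a dictionary keyed by (tel_type, tel_id) so that the same integer can appear under several types. The function name and the dictionary values are hypothetical.

```python
def select_telescopes(all_telescopes, tel_type, tel_ids=None):
    """Return the (tel_type, tel_id) keys to use for the requested telescope type."""
    if tel_ids is None:
        # No explicit selection: use every telescope of the requested type
        return sorted(key for key in all_telescopes if key[0] == tel_type)
    selected = [(tel_type, tel_id) for tel_id in tel_ids]
    missing = [key for key in selected if key not in all_telescopes]
    if missing:
        raise ValueError("Unknown telescopes: {}".format(missing))
    return selected


# tel_id 1 exists for both LST and SST1 without any conflict
all_telescopes = {('LST', 1): 'LST 1', ('LST', 2): 'LST 2', ('SST1', 1): 'SST1 1'}
```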

Append CTLearn version to the copy of the configuration file generated for each training

In order to ensure the reproducibility of each training run a copy of the parsed config file is stored along with the rest of training outputs. Ideally, such reproducibility should not depend on the version of the code that was used for training, although this dependency might be present during the pre-release development phase. Therefore, appending the code version to the copy of the configuration file may be helpful.

Define API for accessing DataLoader class attributes

All HDF5DataLoader class attributes are currently specific to that class as opposed to DataLoader, but are accessed outside the class in run_model.py.

When defining data_loader, generator_output_dtypes, map_fn_output_dtypes, output_names, and output_is_label are all accessed. It doesn't seem these are specific to the data format, so if they are made part of the DataLoader base class, then this entire section except for data_loader = HDF5DataLoader() can be pulled out of the if clause, resulting in cleaner code.

Similarly, when getting the event indices in predict mode, example_type and examples are accessed and the event index names are manually defined. Again, it may be possible to define all of these in such a way that the event indices, whatever they may be, could be accessed for any DataLoader without relying on attributes specific to each DataLoader subclass.

This issue is probably best tackled at the time we actually add another DataLoader subclass.

Add missing requirements

Add the following missing requirements to the requirements files: scipy, configobj, validate.

Improve configurability of training hyperparameters

At present, most of the training hyperparameters are required config options. Make these optional, providing reasonable defaults.

Add the ability to choose any optimizer available in TensorFlow (or at least all the commonly used ones), as well as the ability to configure their parameters. Since there are a variety of optimizers and they all take different arguments, this could be accomplished by just having an optimizer_arguments dictionary that is passed directly to the optimizer on initialization without any intermediate parsing. This would replace the current adam_epsilon configuration parameter.
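
The idea of passing optimizer_arguments straight through can be sketched without TensorFlow. FakeAdam below is a stand-in for a real optimizer class such as tf.train.AdamOptimizer, and the registry and function names are hypothetical.

```python
class FakeAdam:
    """Stand-in for a real optimizer class; it only stores its arguments."""

    def __init__(self, learning_rate=0.001, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon


# In real code this would map config names to TensorFlow optimizer classes
OPTIMIZERS = {'Adam': FakeAdam}


def build_optimizer(name, learning_rate, optimizer_arguments=None):
    """Instantiate the named optimizer, forwarding extra arguments unparsed."""
    if name not in OPTIMIZERS:
        raise ValueError("Unknown optimizer: {}".format(name))
    return OPTIMIZERS[name](learning_rate=learning_rate, **(optimizer_arguments or {}))


optimizer = build_optimizer('Adam', 5.0e-05, {'epsilon': 1.0e-08})
```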

Add options for additional training hyperparameters such as regularization type and strength and learning rate annealing.

Reproducibility

Hi,

I'm trying to run your models and reproduce your results, but it wasn't possible because there aren't example datasets in your repository.

Can you provide links where we can download some simtel or hdf5 files you use to train the models?

Greetings,
-- mavillan

Add Unit Tests

Although other priorities (reorganizing the code, implementing new functionality, and making the project accessible and available for outside use) have been our primary focus so far, as the codebase becomes larger and we consider the possibility of outside contributions it seems like a good idea to begin looking seriously at implementing tests for maintaining and monitoring the code quality.

The project is relatively small and there is no need for the sort of serious testing that is used in more complicated software. However, with several interacting components, it seems like a few simple tests may be appropriate and helpful. Testing portions of the code in isolation should make it easier to recognize bugs and easier to test new changes without having to run the full training pipeline and search for errors by hand. Tests may also be helpful as a way of evaluating pull requests and preventing code regression/breaking changes.

The framework for testing and continuous integration with pytest and TravisCI is already in place, all that remains to be written are the tests themselves. The overall approach was based on the implementation of tests in ctapipe, which seems like it may be a helpful example.

Maybe we can discuss below what (if any) tests might be appropriate/helpful and who would be willing to write them.

Add iPython notebook demonstrating loading data with CTLearn

Demonstrate the DataLoader API and how it can be used with the HDF5 format. Perhaps calculate class balance and multiplicity distribution for each telescope type. The print_dataset_metadata.py script could be integrated into this.

Move open_file and close_file calls outside load_data functions

If Dataset.from_generator is changed to support multi-threading like Dataset.map, calls to open HDF5 files should preferably be moved outside of the load_data functions, i.e. all HDF5 files should be opened in advance and the file handles passed into the load_data functions instead of filename strings. This avoids a lot of unnecessary file open and close calls.

On a test of loading 1000 examples from a single HDF5 file with load_data_single_tel_HDF5, moving the file open and closing outside of load_data_single_tel_HDF5 (instead of calling them for each example) reduced runtime by ~30%. Improvement is probably much larger when >>1000 examples are read per HDF5 file.

Fix logging of array examples by class

When running an array-level model HDF5DataLoader.log_class_breakdown() incorrectly reports the number of examples by class. For example, from the logfiles produced when running the LST single tel and CNN-RNN v0.2.0 benchmarks, single tel has the correct behavior:

INFO:Batch size: 64
INFO:Training and evaluating...
INFO:Total number of training events: 161631
INFO:Total number of validation events: 17960
INFO:Number of training steps per epoch: 2525
INFO:Number of training steps per validation: 2500
INFO:Total number of examples: 179591
INFO:Number of gamma (class 0) examples: 89165 (49.649%)
INFO:Number of proton (class 1) examples: 90426 (50.351%)

but for the array-level model, CNN-RNN, the total number of examples and examples by class are much larger than the actual number of events used (in this case, 76860+8541=85401):

INFO:Batch size: 16
INFO:Training and evaluating...
INFO:Total number of training events: 76860
INFO:Total number of validation events: 8541
INFO:Number of training steps per epoch: 4803
INFO:Number of training steps per validation: 2500
INFO:Total number of examples: 377098
INFO:Number of gamma (class 0) examples: 189633 (50.287%)
INFO:Number of proton (class 1) examples: 187465 (49.713%)

This may be because the function is reporting the total number of examples in the dataset before the min_num_tels and possibly cut_condition cuts are applied.

Benchmark in peak_times channel mode

input_type	tel_type	auroc
array	LST	0.8286985
array	MSTF	0.89471704
array	MSTN	0.9177482
array	MSTS	0.91144246
array	SST1	0.8937345
array	SSTA	0.86309916
array	SSTC	0.90838027
single_tel	LST	0.7877764
single_tel	MSTF	0.8263566
single_tel	MSTN	0.8663868
single_tel	MSTS	0.866212
single_tel	SST1	0.85580283
single_tel	SSTA	0.8018345
single_tel	SSTC	0.80345446

Troubleshooting the "Run a Model" section of the readme

I am trying to run a sample model with the example_config.yml provided in the repo.

With a terminal opened in the root directory of the repo (/ctlearn), I ran the following two commands:


export CTLEARN_DIR=/home/jsevillamol/Documentos/ctlearn/ctlearn
python $CTLEARN_DIR/run_model.py config/example_config.yml

This produces the following error:

Traceback (most recent call last):
  File "/home/jsevillamol/Documentos/ctlearn/ctlearn/run_model.py", line 459, in <module>
    run_model(config, mode=args.mode, debug=args.debug, log_to_file=args.log_to_file)
  File "/home/jsevillamol/Documentos/ctlearn/ctlearn/run_model.py", line 65, in run_model
    model_module = importlib.import_module(config['Model']['model']['module'])
  File "/home/jsevillamol/anaconda3/envs/ctlearn/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'cnn_rnn'

But the cnn_rnn.py file is in its proper folder (ctlearn/models).

What am I doing wrong?

I am using Python 3.6 on Ubuntu 16.04, with Tensorflow-gpu 1.10.
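One possible cause, offered as an assumption rather than a confirmed diagnosis: the traceback shows `importlib.import_module` being called with the bare module name, and Python resolves a bare name only against the directories in `sys.path`. If the directory holding `cnn_rnn.py` is not on that path (for example because the `model_directory` setting in the config points somewhere else), the import fails even though the file exists. The sketch below reproduces the mechanism with a temporary directory and a placeholder module called `demo_model`, both hypothetical.

```python
import importlib
import os
import sys
import tempfile

# Stand-in for the ctlearn/models directory; the module name and its
# contents are placeholders, not the real cnn_rnn.py.
model_directory = tempfile.mkdtemp()
with open(os.path.join(model_directory, "demo_model.py"), "w") as f:
    f.write("def build():\n    return 'model built'\n")

# Without this line, import_module raises ModuleNotFoundError because the
# directory is not on the import search path.
sys.path.append(model_directory)
importlib.invalidate_caches()

model_module = importlib.import_module("demo_model")
print(model_module.build())
```

Checking that the configured model directory is an absolute path to ctlearn/models on this machine would be a reasonable first step.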

Make data loading use image_mapper class

Depends on #20.

Merge predict.py into train.py

Combine the functionality of predict.py into train.py. Prediction will be toggled on as a mode, train.py --predict. The current implementation has a very large amount of duplicated code that changes often, which is unsustainable. Using a script named train.py for prediction is a slight misnomer, but ignore this for now. Add a [Predict] section to the config file with options for whether to export predictions to a file, the filename, and whether the true label is present in the data files.
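A minimal sketch of how the mode flag and the new config section could fit together. Only the `--predict` flag and the `[Predict]` section name come from this issue; the option names `export_as_file`, `prediction_file_path`, and `true_labels_given` are hypothetical placeholders.

```python
import argparse
import configparser

# Example config text; the option names are assumptions for illustration.
EXAMPLE_CONFIG = """
[Predict]
export_as_file = yes
prediction_file_path = predictions.csv
true_labels_given = no
"""


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train a model or run prediction.")
    parser.add_argument("--predict", action="store_true",
                        help="run prediction instead of training")
    return parser.parse_args(argv)


def load_predict_settings(config_text):
    """Read the [Predict] section, falling back to defaults for missing options."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    section = config["Predict"]
    return {
        "export_as_file": section.getboolean("export_as_file", fallback=False),
        "prediction_file_path": section.get("prediction_file_path", fallback=None),
        "true_labels_given": section.getboolean("true_labels_given", fallback=True),
    }


args = parse_args(["--predict"])
mode = "predict" if args.predict else "train"
# Shared setup (data loading, model building) would run once for both modes.
settings = load_predict_settings(EXAMPLE_CONFIG) if args.predict else {}
print(mode, settings)
```

Keeping the shared setup in one place and branching only on `mode` is what removes the duplicated code.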

Refactor image.py to use a class

To reduce ambiguity, rename the module to image_mapping.py.

A suggested API is as follows:

class image_mapper(image_mapping_settings)
Argument: dictionary of settings. Right now there isn't anything relevant implemented. See #10 for more details on what the settings should be.

image_mapper.map_image(pixels, telescope_type)
Arguments: pixels is a numpy array of values for each pixel, in order of pixel index. The array has dimensions [N_pixels, N_channels] where N_channels is e.g. 1 when just using charges and 2 when using charges and peak arrival times. telescope_type is a string specifying the telescope type as defined in the HDF5 format, for example 'MSTS' for SCT data, which is the only currently implemented telescope type.
Returns: A numpy array of data with shape [length, width, depth] corresponding to a telescope image mapped to an array.
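The suggested API could be sketched roughly as follows. This is an illustration under stated assumptions, not the intended implementation: the class is written as `ImageMapper` to follow Python naming conventions, and the 4-pixel lookup table for 'MSTS' is invented, since a real table would have to come from the camera geometry.

```python
import numpy as np


class ImageMapper:
    """Maps a flat array of pixel values onto a 2D image array."""

    def __init__(self, image_mapping_settings=None):
        # No settings are defined yet, so the dictionary is only stored.
        self.settings = dict(image_mapping_settings or {})
        # Hypothetical lookup tables: image shape and the (row, column)
        # position of every pixel index, per telescope type.
        self.image_shapes = {"MSTS": (2, 2)}
        self.pixel_positions = {
            "MSTS": np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
        }

    def map_image(self, pixels, telescope_type):
        if telescope_type not in self.pixel_positions:
            raise ValueError(f"Unsupported telescope type: {telescope_type}")
        positions = self.pixel_positions[telescope_type]
        pixels = np.asarray(pixels)
        if pixels.ndim != 2 or pixels.shape[0] != len(positions):
            raise ValueError("pixels must have shape [N_pixels, N_channels]")
        length, width = self.image_shapes[telescope_type]
        # Output has shape [length, width, depth], with depth = N_channels.
        image = np.zeros((length, width, pixels.shape[1]), dtype=pixels.dtype)
        image[positions[:, 0], positions[:, 1], :] = pixels
        return image


mapper = ImageMapper()
charges = np.array([[1.0], [2.0], [3.0], [4.0]])  # 4 pixels, 1 channel
image = mapper.map_image(charges, "MSTS")
print(image.shape)
```

Pixels without an entry in the table stay at zero, which is one simple way to pad a non-rectangular camera into a rectangular array.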

Implement loading variables from single tel checkpoints to array-level models.

See https://www.tensorflow.org/api_docs/python/tf/contrib/framework/init_from_checkpoint.

Clean up input_fn in run_model.py

Use dictionary unpacking to pass in settings instead of passing the dictionary directly. This provides a convenient way to supply default arguments, making it easier to have optional data input settings. At least one option that is currently configurable should be hardcoded: whether to use dataset.map(), which is already constrained by the choice of data format. Also, investigate whether prefetching should be a configurable option or whether there is a best practice for it that could simply be hardcoded.
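As a rough sketch of the dictionary-unpacking idea, the function below takes its settings as keyword arguments with defaults, so any key missing from the config falls back automatically. The parameter names and default values are hypothetical and do not reflect the actual signature of input_fn in run_model.py.

```python
def input_fn(data_source, batch_size=64, shuffle=True, prefetch_buffer_size=1):
    """Hypothetical signature: keyword defaults make every setting optional."""
    # A real implementation would build the dataset here; this sketch only
    # returns the resolved settings so the effect of the defaults is visible.
    return {
        "data_source": data_source,
        "batch_size": batch_size,
        "shuffle": shuffle,
        "prefetch_buffer_size": prefetch_buffer_size,
    }


# Settings as they might be read from the config file; only one is given.
data_input_settings = {"batch_size": 16}

# Unpacking passes each key as a keyword argument; the rest use defaults.
resolved = input_fn("training_data", **data_input_settings)
print(resolved)
```

A side benefit is that a misspelled or unsupported key in the config raises a TypeError at the call site instead of being silently ignored.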

Minimize dependencies in virtual environment

We should filter out from requirements.txt all those dependencies that are not actually necessary for ctalearn to properly run, so the installation is lighter and faster. Some users have requested instructions for a clean uninstall that will remove not just the virtual environment but also the packages that were downloaded to set it up (without conflicting with other environments).

Split data.py into data loading and data processing modules

Currently, data.py combines data loading, which is specific to our HDF5 file format, with data processing, which is independent of the data format. Split it into two separate files, load_HDF5_data.py and process_data.py, where load_HDF5_data.py includes all functionality specific to loading numpy arrays of data from the HDF5 files and process_data.py includes all data processing (and potentially augmentation) functionality that is independent of the format.

load_HDF5_data.py should include the various HDF5 data loading helper functions, load_data_eventwise_HDF5, load_data_single_tel_HDF5, load_auxiliary_data_HDF5, load_metadata_HDF5, add_processed_parameters, load_image_HDF5, apply_cuts_HDF5, get_data_generators_HDF5 and helper functions. Functionality from the load_data functions involving processing the data (cropping, normalization) should be split out as a call to process_data.py. Since HDF5 is in the name of the module, drop "HDF5" from all the function names.

process_data.py should include crop_image and any split out functionality from the load_data functions.
