This project presents a means of accelerating several image-processing steps using Apache Spark. The project considers:
- The optimal number of Spark partitions for the workload.
- The best compute configuration for the Dataproc cluster to complete the task as quickly as possible.
- Analysis of how key variables ('batch size', 'dataset size', 'program repetitions', and 'experiment repetitions') impact the 'images per second' read speed.
This project uses the GCP services Dataproc and Colab.
The full code can be found at the following link: https://drive.google.com/file/d/1pT7TLdvCHU_vipxJ2LvYLywIf0rYTupT/view?usp=sharing
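The 'images per second' metric above can be sketched as a simple timing loop. This is a minimal illustration only: the `read_batch` callable and the parameter values are hypothetical stand-ins, not taken from the project's actual code.

```python
import time

def measure_read_speed(read_batch, dataset_size, batch_size, repetitions):
    """Time repeated passes over a dataset and report images per second.

    read_batch(n) is a hypothetical callable that reads n images;
    in the real experiment it would pull batches from TFRecord files
    or raw image files.
    """
    start = time.perf_counter()
    for _ in range(repetitions):
        images_read = 0
        while images_read < dataset_size:
            # Read at most batch_size images, fewer on the final batch.
            n = min(batch_size, dataset_size - images_read)
            read_batch(n)
            images_read += n
    elapsed = time.perf_counter() - start
    total_images = dataset_size * repetitions
    return total_images / elapsed

# Example with a stand-in reader that performs no real I/O:
rate = measure_read_speed(lambda n: None,
                          dataset_size=1000, batch_size=64, repetitions=3)
```

In the project, the same loop would be run once per experiment repetition, and the resulting rates averaged to smooth out cluster-level variance.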
Figure 1: Analysis of preprocessed TFRecord files
Figure 2: Analysis of unprocessed images