This is a small, work-in-progress project of mine, written mostly as a hobby. Due to time constraints, I wasn't able to gather enough data to train my own models, so it relies on pre-trained speech recognition and scene detection models.
DeepSpeech (GPU) v0.9.3 - https://github.com/mozilla/DeepSpeech/tree/v0.9.3
TensorFlow (GPU) v2.3.0 - https://github.com/tensorflow/tensorflow/tree/v2.3.0-rc2
PySceneDetect - https://github.com/Breakthrough/PySceneDetect
DeepSpeech acoustic model - https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
DeepSpeech language model - https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
"Note that the model currently performs best in low-noise environments with clear recordings and has a bias towards US male accents. This does not mean the model cannot be used outside of these conditions, but that accuracy may be lower. Some users may need to train the model further to meet their intended use-case."
Release Page - https://github.com/mozilla/DeepSpeech/releases/tag/v0.9.3
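
For orientation, here is a minimal sketch of how the two DeepSpeech files listed above could be loaded with the DeepSpeech 0.9.3 Python bindings. It is not necessarily how this project wires things up. The helper names (`read_pcm16_mono`, `transcribe`) and the `audio.wav` file name are hypothetical, and the sketch assumes the input is 16 kHz, mono, 16-bit PCM audio, which is what the released English model expects.

```python
import wave

# Assumption: the released 0.9.3 English model expects 16 kHz audio.
EXPECTED_RATE = 16000


def read_pcm16_mono(path):
    """Return the raw PCM bytes of a 16 kHz, mono, 16-bit WAV file.

    Raises ValueError if the file is in any other format, since feeding
    mismatched audio to the model tends to produce poor transcripts.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError("expected a 16 kHz sample rate")
        return wav.readframes(wav.getnframes())


def transcribe(wav_path, model_path, scorer_path):
    """Hypothetical helper: run speech-to-text on a single WAV file."""
    # Imported lazily so the WAV check above works without these packages.
    import numpy as np
    from deepspeech import Model

    model = Model(model_path)                # acoustic model (.pbmm)
    model.enableExternalScorer(scorer_path)  # language model (.scorer)
    audio = np.frombuffer(read_pcm16_mono(wav_path), dtype=np.int16)
    return model.stt(audio)
```

With both model files downloaded into the working directory, a call might look like `transcribe("audio.wav", "deepspeech-0.9.3-models.pbmm", "deepspeech-0.9.3-models.scorer")`. Audio in another format would need to be converted first.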
Adopted (not my models) Image & Language Recognition Neural Network