- This tinyML system uses a Raspberry Pi Pico and TensorFlow Lite for Microcontrollers to detect loud talking. It can be used in restaurants and cafes to encourage people to eat quietly, helping prevent the spread of the coronavirus
- It detects "talking" when people talk loudly
- It doesn't report "talking" when people talk quietly or when the sound is not speech (e.g. noise or music)
- Deep Learning Model
- A deep learning model is created to classify 10 or 5 seconds of audio into two types of sound ("Talking", "Not Talking")
- Change the values of `CLIP_DURATION` and `kClipDuration` to switch the clip duration (10 sec / 5 sec)
- The model is converted to TensorFlow Lite for Microcontrollers format
- The training runs on Google Colaboratory
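As a rough sketch (not code from this repo), a clip-duration constant like `CLIP_DURATION` in the training script would determine how many audio samples each clip contains at the 16 kHz rate used by the micro_speech pipeline; the names and values below are assumptions for illustration.

```python
# Hypothetical sketch: how a clip-duration constant (in ms) maps to
# sample counts at a 16 kHz sample rate.
SAMPLE_RATE = 16000       # Hz, as used by the micro_speech pipeline
CLIP_DURATION = 10000     # ms; set to 5000 for the 5-second model

def clip_samples(duration_ms, sample_rate=SAMPLE_RATE):
    """Number of audio samples in one clip of the given duration."""
    return duration_ms * sample_rate // 1000

print(clip_samples(CLIP_DURATION))  # 160000 samples for a 10-second clip
```

The firmware-side constant `kClipDuration` would need to be kept consistent with the value used for training.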
- Device
- The model is deployed to a Raspberry Pi Pico
- A microphone and a display are connected to the Raspberry Pi Pico
- The Raspberry Pi Pico captures sound from the microphone, judges whether it's loud talking and outputs a result to the display
- Not Talking:
- Quiet
- Voice from a distance
- Music
- Noise
- Others
- Talking:
- Voice
- Voice (more than one person)
- Voice + Music
- Voice + Noise
- Raspberry Pi Pico
- PDM microphone (SPM0405HD4H)
- OLED (SSD1306, I2C, 128x64)
```sh
git clone https://github.com/iwatake2222/pico-loud_talking_detector.git
cd pico-loud_talking_detector
git submodule update --init
cd pico-sdk && git submodule update --init && cd ..
mkdir build && cd build

# For Windows Visual Studio 2019 (Developer Command Prompt for VS 2019)
# cmake .. -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Debug -DPICO_DEOPTIMIZED_DEBUG=on
cmake .. -G "NMake Makefiles"
nmake

# For Windows MSYS2 (run the following commands in MSYS2)
# cmake .. -G "MSYS Makefiles" -DCMAKE_BUILD_TYPE=Debug -DPICO_DEOPTIMIZED_DEBUG=on
cmake .. -G "MSYS Makefiles"
make
```
- If you want to debug this project on a PC, run cmake in the `pj_loud_talking_detector/pj_loud_talking_detector` directory
- The created project uses TestBuffer instead of a microphone. You can change the audio data by modifying the C array in `test_audio_data.h`; `wave2array.py` is useful for converting a wave file to a C array
- I tested on Visual Studio 2019 and I'm not sure it works in other environments
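As a rough illustration of what a wave-to-C-array converter like `wave2array.py` might do (this is a hedged sketch, not the actual script, whose interface may differ):

```python
# Hypothetical sketch of a wave-file-to-C-array converter.
# The real wave2array.py in the repo may use different names and options.
import wave

def wave_to_c_array(path, name="g_test_audio_data"):
    """Read a WAV file and emit its raw frames as a C unsigned char array."""
    with wave.open(path, "rb") as wf:
        data = wf.readframes(wf.getnframes())
    body = ",".join(str(b) for b in data)
    return (f"const unsigned char {name}[] = {{{body}}};\n"
            f"const int {name}_size = {len(data)};\n")
```

The resulting text could be pasted into `test_audio_data.h` to feed TestBuffer.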
- Run the training script 01_script/training/train_micro_speech_model_talking_10sec.ipynb on Google Colaboratory. It takes around 10 hours to train the model using a GPU instance
- The original script is https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/micro_speech/train/train_micro_speech_model.ipynb
- I made some modifications:
- Use my dataset
- Mix noise manually:
- Prepare original data: [Talking]
- Mix background: [Talking, Talking + Background]
- Mix noise: [Talking, Talking + Background, Talking + Noise, Talking + Background + Noise]
- Separate test data from training data completely
- Some clips are generated from the same video, so data leakage may happen if I randomly separate data following the original script
- Change the wanted word list from [yes, no] to [talking, not_talking]
- Remove the "SILENCE" and "UNKNOWN" categories
- Because "SILENCE" and "UNKNOWN" are part of "Not Talking"
- Change clip duration from 1 sec to 10 sec
- Increase training steps
- Note: You cannot run the script because data download will fail. I don't share dataset due to copyright restrictions
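The manual noise-mixing augmentation described above could be sketched as follows; this is a hedged illustration (function names, gain value, and array handling are assumptions, not the author's code):

```python
# Hypothetical sketch of the manual mixing augmentation: each clean
# "Talking" clip is expanded into four variants.
import numpy as np

def mix(clip, overlay, gain=0.3):
    """Mix an overlay (background or noise) into a clip at reduced gain."""
    n = min(len(clip), len(overlay))
    return clip[:n] + gain * overlay[:n]

def augment(clip, background, noise):
    """Return [clean, +background, +noise, +background+noise]."""
    return [clip,
            mix(clip, background),
            mix(clip, noise),
            mix(mix(clip, background), noise)]
```

Expanding each clip this way roughly quadruples the training data while keeping the "Talking" label valid for every variant.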
- Details
- Talking:
- Talk show from YouTube
- TV show
- Not Talking:
- https://research.google.com/audioset/dataset/index.html
- Music from YouTube
- Background (for augmentation and "Not Talking")
- Restaurant / Coffee shop ambience
- Noise (for augmentation and "Not Talking")
- White noise, pink noise
- Number of clips
- 10 sec model
- Talking: 22,144
- Not Talking: 20,364
- 5 sec model
- Talking: 38,854
- Not Talking: 36,557
The original project is from https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/micro_speech .
- AudioBuffer:
- provides an interface to access stored audio data in a ring block buffer
- has three implementations: ADC (for an analog mic connected to the ADC), PDM (for a PDM mic) and TestBuffer (a prepared data array). I use PDM in this project
- RingBlockBuffer:
- consists of several blocks of 512 bytes each; this size equals both the DMA transfer size and the buffer size of the PDM mic module
- 512 bytes (32 msec @ 16 kHz) is also convenient for FeatureProvider, which generates feature data from 30 msec of audio at 20 msec intervals
- AudioProvider:
- copies data from the ring block buffer to a local buffer for the requested time range
- converts data from uint8_t to int16_t if needed
- places the data at sequential memory addresses
- FeatureProvider:
- almost the same as the original code.
- Judgement:
- judges whether the captured sound is loud "talking" using all of the following conditions:
- current score of "talking" >= 0.8
- average score of "talking" >= 0.6
- sound level >= -20 [dB] (0 [dB] is the maximum input value (65536))
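The three judgement conditions above could be combined roughly as follows; this is a hedged Python sketch (the actual firmware is C++, and the function and parameter names here are assumptions):

```python
# Hypothetical sketch of the judgement logic: all three conditions
# must hold for the sound to be classified as loud talking.
import math

def is_loud_talking(current_score, avg_score, peak_amplitude, full_scale=65536):
    """peak_amplitude is the input level; 0 dB corresponds to full scale (65536)."""
    db = 20 * math.log10(max(peak_amplitude, 1) / full_scale)
    return current_score >= 0.8 and avg_score >= 0.6 and db >= -20.0
```

Requiring both the current and averaged score avoids flickering on single noisy inferences, while the level threshold filters out quiet speech.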
| | 10 sec model | 5 sec model |
| --- | --- | --- |
| Accuracy | 96.1 [%] | 94.8 [%] |
| Processing time | | |
| __Total | 554 [msec] | 298 [msec] |
| __Preprocess | 61 [msec] | 31 [msec] |
| __Inference | 455 [msec] | 226 [msec] |
| __Other | 38 [msec] | 41 [msec] |
| Power consumption | 3.3 [V] x 21 [mA] | 3.3 [V] x 22 [mA] |
Note: Power consumption is measured without OLED (with OLED, it's around 26 [mA]).
- This is a very tiny system (it fits in a FRISK case!), so it could be built into an order-call system in restaurants to encourage customers to eat quietly and prevent the spread of the coronavirus
- Need to reduce power consumption
- The current system continuously captures audio and runs inference. However, a quick response is not important in many cases, so the inference frequency could be reduced; once every several seconds or once a minute is probably enough
- Alternatively, an analog circuit could monitor the voice level and wake the Pico from sleep mode
- Need to improve accuracy
- So far, the training data is very limited (most of it is Japanese speech)
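The duty-cycling idea from the power-consumption note above might look like this; it is a hedged sketch (names and period are assumptions), using Python stand-ins for what would be firmware sleep calls on the Pico:

```python
# Hypothetical sketch of duty-cycled inference: run the model
# periodically instead of continuously.
import time

INFERENCE_PERIOD_S = 5  # e.g. once every several seconds

def duty_cycled_loop(run_inference, iterations, sleep=time.sleep):
    """Run inference, then sleep, `iterations` times.

    On the Pico, `sleep` would be replaced by a low-power sleep or
    dormant mode to actually cut power between inferences.
    """
    for _ in range(iterations):
        run_inference()
        sleep(INFERENCE_PERIOD_S)
```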
- pico-sdk
- https://github.com/raspberrypi/pico-sdk
- Copyright (c) 2020 Raspberry Pi (Trading) Ltd.
- pico-tflmicro
- https://github.com/raspberrypi/pico-tflmicro
- Copyright 2019 The TensorFlow Authors. All Rights Reserved.
- pico-microphone
- https://github.com/sandeepmistry/pico-microphone
- Copyright (c) 2021 Arm Limited and Contributors. All rights reserved.
- Illustrations