
howl's Introduction

Howl

PyPI License: MPL 2.0

Wake word detection modeling for Firefox Voice, supporting open datasets like Google Speech Commands and Mozilla Common Voice.

Citation:

@inproceedings{tang-etal-2020-howl,
    title = "Howl: A Deployed, Open-Source Wake Word Detection System",
    author = "Tang, Raphael and Lee, Jaejun and Razi, Afsaneh and Cambre, Julia and Bicking, Ian and Kaye, Jofish and Lin, Jimmy",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.9",
    doi = "10.18653/v1/2020.nlposs-1.9",
    pages = "61--65"
}

Training Guide

Installation

  1. git clone https://github.com/castorini/howl && cd howl

  2. Install PyTorch by following your platform-specific instructions.

  3. Install PyAudio and its dependencies through your distribution's package system.

  4. pip install -r requirements.txt -r requirements_training.txt (some apt packages might need to be installed)

  5. ./download_mfa.sh to set up the Montreal Forced Aligner (MFA) for dataset generation (a combined sketch of these steps follows)
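Putting the steps above together, a rough end-to-end installation sketch on a Debian-like system might look like the following (the apt package names are assumptions based on what users have reported; install PyTorch per the official instructions for your platform):

git clone https://github.com/castorini/howl && cd howl
# install PyTorch first, following the platform-specific instructions
sudo apt-get install portaudio19-dev libpulse-dev libopenblas-dev swig python3-pyaudio   # assumed package names for PyAudio and its dependencies
pip install -r requirements.txt -r requirements_training.txt
./download_mfa.sh   # sets up the Montreal Forced Aligner (MFA) and the English pronunciation dictionary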

Preparing a Dataset

Generating a dataset for a custom wake word requires three steps:

  1. Generate a raw audio dataset that Howl can load from open datasets.
  2. Generate orthographic transcription alignments for each audio file.
  3. Attach the alignments to the raw audio dataset generated in step 1.

We recommend the Common Voice dataset as the open audio dataset and the Montreal Forced Aligner (MFA) for transcription alignment. Downloading MFA can be done simply by running the download_mfa.sh script; along with the aligner, the script will download the necessary English pronunciation dictionary.

Once they are ready, a dataset can be generated using the following script.

./generate_dataset.sh <common voice dataset path> <underscore separated wakeword (e.g. hey_fire_fox)> <inference sequence (e.g. [0,1,2])> <(Optional) "true" to skip negative dataset generation>

For a detailed explanation, please refer to How to generate a dataset for a custom wake word.
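For example, a hypothetical invocation for the wake word "hey fire fox" against a local Common Voice download (the dataset path is illustrative) might be:

./generate_dataset.sh /data/common-voice/en hey_fire_fox [0,1,2]   # path and values are illustrative

Passing "true" as a fourth argument would skip negative dataset generation.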

Training and Running a Model

  1. Source the relevant environment variables for training the res8 model: source envs/res8.env.
  2. Train the model: python -m training.run.train -i datasets/fire/positive datasets/fire/negative --model res8 --workspace workspaces/fire-res8. It's recommended to also use --use-stitched-datasets if the training datasets are small.
  3. For the CLI demo, run python -m training.run.demo --model res8 --workspace workspaces/fire-res8.

train_model.sh is also available, which encapsulates the individual commands into a single bash script:

./train_model.sh <env file path (e.g. envs/res8.env)> <model type (e.g. res8)> <workspace path (e.g. workspaces/fire-res8)> <dataset1 (e.g. datasets/fire-positive)> <dataset2 (e.g. datasets/fire-negative)> ...
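For example, a training run for the res8 model on the fire datasets generated earlier might look like this (paths are illustrative):

./train_model.sh envs/res8.env res8 workspaces/fire-res8 datasets/fire/positive datasets/fire/negative   # paths are illustrative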

Pretrained Models

howl-models contains workspaces with pretrained models

To get the latest models, simply run git submodule update --init --recursive

VOCAB='["hey","fire","fox"]' INFERENCE_SEQUENCE=[0,1,2] INFERENCE_THRESHOLD=0 NUM_MELS=40 MAX_WINDOW_SIZE_SECONDS=0.5 python -m training.run.demo --model res8 --workspace howl-models/howl/hey-fire-fox

Installing Howl using pip

  1. Install PyAudio and PyTorch 1.5+ through your distribution's package system.

  2. Install Howl using pip

pip install howl
  3. To immediately use a pre-trained Howl model for inference, we provide the client API. The following example (also found under examples/hey_fire_fox.py) loads the "hey_fire_fox" pretrained model with a simple callback and starts the inference client.
from howl.client import HowlClient

def hello_callback(detected_words):
    print("Detected: {}".format(detected_words))

client = HowlClient()
client.from_pretrained("hey_fire_fox", force_reload=False)
client.add_listener(hello_callback)
client.start().join()

Reproducing Paper Results

First, follow the installation instructions in the quickstart guide.

Google Speech Commands

  1. Download the Google Speech Commands dataset and extract it.
  2. Source the appropriate environment variables: source envs/res8.env
  3. Set the dataset path to the root folder of the Speech Commands dataset: export DATASET_PATH=/path/to/dataset
  4. Train the res8 model: NUM_EPOCHS=20 MAX_WINDOW_SIZE_SECONDS=1 VOCAB='["yes","no","up","down","left","right","on","off","stop","go"]' BATCH_SIZE=64 LR_DECAY=0.8 LEARNING_RATE=0.01 python -m training.run.pretrain_gsc --model res8

Hey Firefox

  1. Download the Hey Firefox corpus, licensed under CC0, and extract it.
  2. Download our noise dataset, built from Microsoft SNSD and MUSAN, and extract it.
  3. Source the appropriate environment variables: source envs/res8.env
  4. Set the noise dataset path to the root folder: export NOISE_DATASET_PATH=/path/to/snsd
  5. Set the firefox dataset path to the root folder: export DATASET_PATH=/path/to/hey_firefox
  6. Train the model: LR_DECAY=0.98 VOCAB='["hey","fire","fox"]' USE_NOISE_DATASET=True BATCH_SIZE=16 INFERENCE_THRESHOLD=0 NUM_EPOCHS=300 NUM_MELS=40 INFERENCE_SEQUENCE=[0,1,2] MAX_WINDOW_SIZE_SECONDS=0.5 python -m training.run.train --model res8 --workspace workspaces/hey-ff-res8

Hey Snips

  1. Download the hey snips dataset.
  2. Process the dataset to a format Howl can load:
VOCAB='["hey","snips"]' INFERENCE_SEQUENCE=[0,1] DATASET_PATH=datasets/hey-snips python -m training.run.deprecated.create_raw_dataset --dataset-type 'hey-snips' -i ~/path/to/hey_snips_dataset
  3. Generate a mock alignment for the dataset, where we don't care about alignment:
python -m training.run.attach_alignment \
  --input-raw-audio-dataset datasets/hey-snips \
  --token-type word \
  --alignment-type stub
  4. Use MFA to generate alignments for the dataset:
mfa_align datasets/hey-snips/audio eng.dict pretrained_models/english.zip datasets/hey-snips/alignments
  5. Attach the MFA alignment to the dataset:
python -m training.run.attach_alignment \
  --input-raw-audio-dataset datasets/hey-snips \
  --token-type word \
  --alignment-type mfa \
  --alignments-path datasets/hey-snips/alignments
  6. Source the appropriate environment variables: source envs/res8.env
  7. Set the noise dataset path to the root folder: export NOISE_DATASET_PATH=/path/to/snsd
  8. Set the hey snips dataset path to the root folder: export DATASET_PATH=/path/to/hey-snips
  9. Train the model: LR_DECAY=0.98 VOCAB='["hey","snips"]' USE_NOISE_DATASET=True BATCH_SIZE=16 INFERENCE_THRESHOLD=0 NUM_EPOCHS=300 NUM_MELS=40 INFERENCE_SEQUENCE=[0,1] MAX_WINDOW_SIZE_SECONDS=0.5 python -m training.run.train --model res8 --workspace workspaces/hey-snips-res8

Generating dataset for Mycroft-precise

Howl also provides a script for transforming a Howl dataset into a Mycroft Precise dataset:

VOCAB='["hey","fire","fox"]' INFERENCE_SEQUENCE=[0,1,2] python -m training.run.generate_precise_dataset --dataset-path /path/to/howl_dataset

Experiments

To verify the correctness of our implementation, we first train and evaluate our models on the Google Speech Commands dataset, for which many known results exist. Next, we curate a wake word detection dataset and report the resulting model quality.

For both experiments, we generate reports in Excel format. The experiments folder includes sample outputs for each experiment, and the corresponding workspaces can be found here.

commands_recognition

For command recognition, we train four different models (res8, LSTM, LAS encoder, MobileNetv2) to detect twelve different keywords: “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, “go”, unknown, or silence.

python -m training.run.eval_commands_recognition --num_iterations n --dataset_path < path_to_gsc_datasets >
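For instance, a hypothetical run averaging over three iterations, with the Speech Commands data extracted to datasets/gsc, might be:

python -m training.run.eval_commands_recognition --num_iterations 3 --dataset_path datasets/gsc   # argument values are illustrative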

word_detection

In this experiment, we train our best commands recognition model, res8, for hey firefox and hey snips, and evaluate them with different thresholds.

Two performance reports are generated: one with clean audio and one with noisy audio.

python -m training.run.eval_wake_word_detection --num_models n --hop_size < number between 0 and 1 > --exp_type < hey_firefox | hey_snips > --dataset_path "x" --noiseset_path "y"
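As an illustrative example (all argument values are hypothetical), an evaluation run for hey firefox might look like:

python -m training.run.eval_wake_word_detection --num_models 10 --hop_size 0.05 --exp_type hey_firefox --dataset_path datasets/hey-firefox --noiseset_path datasets/noise   # values are illustrative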

We also provide a script for generating the ROC curve. The exp_timestamp can be found in the reports generated by the previous command.

python -m training.run.generate_roc --exp_timestamp < experiment timestamp > --exp_type < hey_firefox | hey_snips >
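For example, with a hypothetical experiment timestamp taken from the generated report:

python -m training.run.generate_roc --exp_timestamp 20201028-120000 --exp_type hey_firefox   # timestamp is hypothetical; copy it from your report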

howl's People

Contributors

bdytx5, daemon, edwinzhng, jacobk52, ljj7975, mic92


howl's Issues

Model trained successfully but no detection result

Hello, I managed to train the model and I have attached it at the following link: https://drive.google.com/drive/folders/1UE5NydQhPB8-TeLeLeoc7GOaLTMIBwfP?usp=sharing.

But when I run howl.run.demo, it doesn't detect anything. I am wondering whether it is a microphone issue or whether something went wrong when I trained my model. Could you give my trained model a try? Would you mind sharing a pretrained model as well?

The train: ConfusionMatrix(tp=0, fp=0, tn=0, fn=0) is always zero on the positive dataset. Is this correct?

Dev negative: 100%|█████████████| 773/773 [01:24<00:00,  9.12it/s, mcc=0.0, c=ConfusionMatrix(tp=0, fp=0, tn=773, fn=0)]
2020-10-29 16:23:26 [INFO] train: ConfusionMatrix(tp=0, fp=0, tn=773, fn=0)
Test positive: 100%|██████████████████| 2/2 [00:00<00:00,  4.64it/s, mcc=0.0, c=ConfusionMatrix(tp=0, fp=0, tn=0, fn=2)]
2020-10-29 16:23:27 [INFO] train: ConfusionMatrix(tp=0, fp=0, tn=0, fn=2)
Test negative: 100%|████████████| 804/804 [01:22<00:00,  9.79it/s, mcc=0.0, c=ConfusionMatrix(tp=0, fp=0, tn=804, fn=0)]

Implementing auto stopping

Currently, the training process only uses the positive dev/test sets for intermediate evaluation.
We will first need to update the code to use some of the negative sets before we can implement proper auto-stopping.

Browser deployment

Hi, I've trained a model and now would like to convert it into weights for browser deployment like hey_firefox.js. Could you help me understand how that file was generated?

Training without GPU

Hi, I currently don't have a GPU but a high-performance CPU cluster.

It is asking me to install NVIDIA drivers - any way around that?

speeding up create_raw_dataset.py

create_raw_dataset.py takes quite a long time to generate datasets.

I think multi-threading the AudioDatasetMetadataWriter writes will do the job.

Also, this process terminates with a segfault.

Streamline preprocessing pipeline

Data preprocessing is currently split into multiple steps, i.e.,

  1. Download the datasets (where?).
  2. Run run.preprocess_dataset.
  3. Write the corresponding *.lab files using run.export_mfa.
  4. Download Montreal Forced Aligner (MFA) and the corresponding CMU phonetic dictionary.
  5. Run MFA (mfa_align) over the speech corpus.
  6. Convert the output TextGrids to our jsonl format (run.attach_mfa_alignment).

We should make this process easier and document it somewhere.

French dataset

Hello,

Can I build my own French dataset for some keywords using download_mfa.sh and generate_dataset.sh, as in "Preparing a Dataset"? If yes, can you give me some tips for doing that?

Have a good day! Thank you so much!
Minh Toan.

Example of the CTC loss usage

Great project.

I've noticed that the CTC loss is implemented in the code, but no usage example is provided in the README.
I tried it myself but got some errors; it also seems that the labels are still binary.

It would be great to get an example of how to use this setup.

restructuring/refactoring files for howl/training

Some files under howl/data/dataset (labeller, phone) should be moved out as a common module that can be added separately.

I have noticed that some classes are incorrectly located and could be organized better.

Errors when reproducing Hey Firefox results

Just wanted to note down a few issues trying to reproduce the Howl results:

  1. It seems like the -i/--dataset_paths flag for setting the dataset path doesn't pick up the values (I tried on both Mac and Linux). Instead, it works fine if I just set DATASET_PATH as an environment variable.

So I did

DATASET_PATH=/path/to/hey/firefox LR_DECAY=0.98 VOCAB='[" hey","fire","fox"]' USE_NOISE_DATASET=True BATCH_SIZE=16 INFERENCE_THRESHOLD=0 NUM_EPOCHS=300 NUM_MELS=40 INFERENCE_SEQUENCE=[0,1,2] MAX_WINDOW_SIZE_SECONDS=0.5 python -m howl.run.train --model res8 --workspace workspaces/hey-ff-res8

Instead of

LR_DECAY=0.98 VOCAB='[" hey","fire","fox"]' USE_NOISE_DATASET=True BATCH_SIZE=16 INFERENCE_THRESHOLD=0 NUM_EPOCHS=300 NUM_MELS=40 INFERENCE_SEQUENCE=[0,1,2] MAX_WINDOW_SIZE_SECONDS=0.5 python -m howl.run.train --model res8 --workspace workspaces/hey-ff-res8 -i /path/to/hey/firefox

Stack trace:

Traceback (most recent call last):
  File "/home/edwinzhang64/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/edwinzhang64/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/edwinzhang64/howl/howl/run/train.py", line 217, in <module>
    main()
  File "/home/edwinzhang64/howl/howl/run/train.py", line 93, in main
    opt('--dataset-paths', '-i', type=str, nargs='+', default=[SETTINGS.dataset.dataset_path]),
  File "/home/edwinzhang64/howl/howl/settings.py", line 72, in dataset
    self._dataset = DatasetSettings()
  File "pydantic/env_settings.py", line 28, in pydantic.env_settings.BaseSettings.__init__
  File "pydantic/main.py", line 338, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for DatasetSettings
dataset_path
  field required (type=value_error.missing)
  2. It seems like there is an error that occurs on line 35 in workspace.py when I try to train the model by following the Hey Firefox replication steps. It isn't able to serialize some PosixPath to JSON when calling json.dump(gather_dict(args), f, indent=2). Training works if I just comment out the line in write_args.
train: {'zmuv_mean': tensor([-1.7890], device='cuda:0'), 'zmuv_std': tensor([3.9339], device='cuda:0')}
Traceback (most recent call last):
  File "/home/edwinzhang64/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/edwinzhang64/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/edwinzhang64/howl/howl/run/train.py", line 217, in <module>
    main()
  File "/home/edwinzhang64/howl/howl/run/train.py", line 178, in main
    ws.write_args(args)
  File "/home/edwinzhang64/howl/howl/model/workspace.py", line 35, in write_args
    json.dump(gather_dict(args), f, indent=2)
  File "/home/edwinzhang64/anaconda3/lib/python3.7/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/home/edwinzhang64/anaconda3/lib/python3.7/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/home/edwinzhang64/anaconda3/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/home/edwinzhang64/anaconda3/lib/python3.7/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/home/edwinzhang64/anaconda3/lib/python3.7/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/home/edwinzhang64/anaconda3/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type PosixPath is not JSON serializable

size mismatch RuntimeError running CLI demo

Hey there, I was just trying to recreate the Hey Firefox model using the provided datasets. Training seems to go smoothly using PR #32 but when attempting to run the CLI demo with

python -m howl.run.demo --model res8 --workspace workspaces/hey-ff-res8

I'm getting the following RuntimeError:

Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/howl/howl/run/demo.py", line 106, in <module>
    main()
  File "/home/user/howl/howl/run/demo.py", line 85, in main
    ws.load_model(model, best=True)
  File "/home/user/howl/howl/model/workspace.py", line 47, in load_model
    model.load_state_dict(torch.load(self.model_path(best=best), lambda s, l: s))
  File "/home/user/howl/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 846, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Res8:
	size mismatch for output.weight: copying a param with shape torch.Size([4, 45]) from checkpoint, the shape in current model is torch.Size([2, 45]).
	size mismatch for output.bias: copying a param with shape torch.Size([4]) from checkpoint, the shape in current model is torch.Size([2]).

The CLI demo ran on the first model I trained (with custom data); however, I couldn't get it to recognize anything, so I am trying to verify my setup with the Hey Firefox example first.

Any ideas would be greatly appreciated.

Thanks

pip install howl does not work

As the title says - running pip install howl ends up with the following output:

ERROR: Could not find a version that satisfies the requirement howl
ERROR: No matching distribution found for howl

data sets

Dear Author,
Thanks for sharing your package. In your example of generating the dataset, "fire" has two parts of data, positive and negative. What is the positive data? Was it pre-recorded? Also, if I have a new word to detect, for example "hakunamatata", how do I obtain the datasets?

Thanks,
WWY

Howl for different languages

I am currently building a pipeline for a research project which requires KWS, and I am not sure which approach would be the better fit.

In our use case, we want to identify keywords over streams of audio data rather than in a wake word setting. Can I use Howl for that purpose?
The model will be served via an API, and since it is supervised learning, we also want to be able to readily add new words over time.

heysnips dataset process

Hi,
I tried the code for processing the hey snips dataset and get the following error. Does Howl no longer support the hey snips dataset?

VOCAB='["hey","snips"]' INFERENCE_SEQUENCE=[0,1] DATASET_PATH=datasets/hey-snips python -m training.run.deprecated.create_raw_dataset --dataset-loader-type 'hey-snips' -i /home/shenchena/Downloads/dataset/hey_snips_kws_4.0/hey_snips_research_6k_en_train_eval_clean_ter/
usage: create_raw_dataset.py [-h] [--negative-pct NEGATIVE_PCT]
[--positive-pct POSITIVE_PCT]
[--input-audio-dataset-path INPUT_AUDIO_DATASET_PATH]
[--dataset-loader-type {mozilla-cv,raw}]
create_raw_dataset.py: error: argument --dataset-loader-type: invalid choice: 'hey-snips' (choose from 'mozilla-cv', 'raw')

pretrained hey snips model?

Dear author:
Thanks for making this open and providing the hey firefox checkpoint. I wonder, do you have pretrained hey snips models? If so, would you kindly share them with me? Thank you very much.

AlignedAudioClipMetadata class not present

DATASET_PATH=data/fire-negative python -m howl.run.attach_alignment --align-type stub
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/arvind/beatthat/beatthat/howl/howl/run/attach_alignment.py", line 9, in
from howl.data.dataset import AudioClipDatasetLoader, AudioDatasetMetadataWriter, AlignedAudioClipMetadata
ImportError: cannot import name 'AlignedAudioClipMetadata'

Dataset Generation Question

First off, thanks for this awesome repo! Helping me a lot with my project!!!

Anyway, I'm a bit confused as to how the program is generating the samples that it does. For example, I chose a single wake word and generated a dataset from the speech commands dataset. For the positive set, I get

Generate training datasets: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 509/509 [01:03<0
"Number of speakers in corpus: 1, average number of utterances per speaker: 518.0."

However, when I follow the rest of the generation steps, I end up with a dataset of 10K examples. I'm just a bit confused as to where these extra samples came from. Are they duplicates or some sort of augmented versions of themselves? In the paper you mention:
"For improved robustness and better quality, we implement a set of popular augmentation routines: time stretching, time shifting, synthetic noise addition, recorded noise mixing, SpecAugment (no time warping; Park et al., 2019), and vocal tract length perturbation (Jaitly and Hinton, 2013). These are readily extensible, so practitioners may easily add new augmentation modules."

I am mainly using this repo for dataset generation, so I wasn't sure if this was just referring to your model preprocessing, or if you perhaps implemented this in your dataset generation code. I would dig through the code a bit more, but I figured it would be a pretty quick/straightforward question for you and possibly useful for someone else down the line.

Thanks,
Brett

Using model on the mobile device (Android/iOS using TensorFlow Lite)

Would it be possible to use your pretrained model in the TensorFlow? To be more specific, I was thinking about the possibility to use it in TensorFlow Lite on mobile devices (Android/iOS). I would like to try to build real-time, offline keyword spotting system and starting from detecting "Hey, Firefox" using your model on mobile device would be great.

improving vocab searching logic

When we search for the target vocab in the transcripts, we use a minimal string search.

As a result, we generate a dataset with incorrect samples.

For example, we are capturing samples with "they" for "hey" and "young" for "you".

This has to be fixed to improve the dataset quality.

pydantic. Preparing a Dataset problems

Hello, I am trying to prepare a dataset as in the instructions, but I am having problems with pydantic. Some of them I solved by using
from pydantic_settings import BaseSettings
instead of
from pydantic import BaseSettings
in the scripts settings.py and config.py, but I still get an error:

2023-11-09 19:54:55 WARNING setup_logger(30) Removing existing handlers from generate_raw_audio_dataset.py logger
2023-11-09 19:54:55,642 INFO setup_logger(54) Set up logger (generate_raw_audio_dataset.py), output path: None
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/artem/Artem/gaz/wake_w/howl/howl/training/run/generate_raw_audio_dataset.py", line 139, in
main(
File "/home/artem/Artem/gaz/wake_w/howl/howl/training/run/generate_raw_audio_dataset.py", line 35, in main
raw_dataset_generator = RawAudioDatasetGenerator(input_audio_dataset_path, dataset_type, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/artem/Artem/gaz/wake_w/howl/howl/howl/dataset/raw_audio_dataset_generator.py", line 35, in init
self.inference_ctx = InferenceContext(vocab=SETTINGS.training.vocab, token_type=SETTINGS.training.token_type)
^^^^^^^^^^^^^^^^^
File "/home/artem/Artem/gaz/wake_w/howl/howl/howl/settings.py", line 139, in training
self._training = TrainingSettings()
^^^^^^^^^^^^^^^^^^
File "/home/artem/anaconda3/lib/python3.11/site-packages/pydantic_settings/main.py", line 71, in init
super().init(
File "/home/artem/anaconda3/lib/python3.11/site-packages/pydantic/main.py", line 164, in init
pydantic_self.pydantic_validator.validate_python(data, self_instance=pydantic_self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for TrainingSettings
phone_dictionary
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.4/v/string_type

Unable to stitch vocab samples

Dear Author,
Thanks for sharing your package. However, I cannot stitch vocab samples.
When I use python -m training.run.stitch_vocab_samples --dataset-path "datasets/fire/positive", I get this error: RuntimeError: Error opening '/tmp/temp.wav': System error.
Maybe it is because I use Windows and downloaded the Windows version of MFA.
I am looking forward to your reply; please forgive my poor English.
Thanks,
XZR

missing prerequisites

hi,

first - this is a really great project.
second, I found I had to manually install the following:

  • libopenblas-dev
  • libpulse-dev
  • portaudio19-dev
  • swig
  • swig-sphinxbase

Maybe consider adding these to the README?
Also, in the README you refer to the "mfa_align" command; it should be "mfa align".

stitching audio samples to generate diverse positive dataset

For some wake words, we only get a few dev/test samples, as their transcript must be equal to the wake word.

For example, when I attempted to generate a dataset for "love you", the dev and test datasets only contained two samples each.

Given that the train set contains samples whose transcripts contain at least one of the vocab words and are aligned with the audio (by MFA),

we can possibly stitch some of these samples together to generate synthetic wake word samples;

for example, stitching "hey baby", "I saw fire", and "wow, there was a fox" to generate a sample for "hey firefox".

pocketsphinx throwing NoneType error while stitching samples

Running training.run.stitch_vocab_samples gave an error after generating the first few (1-10) training .wav files:

File "/home/nsl/asr/howl/howl/utils/sphinx_keyword_detector.py", line 25, in detect
result = phrase.segments(detailed=True)
File "/home/nsl/.local/lib/python3.8/site-packages/pocketsphinx/init.py", line 134, in segments
return [
TypeError: 'NoneType' object is not iterable

Could you please help with this issue?

Pretrained model streaming runtime error.

I wanted to see a demo of the project using the pre-trained model. But this error occurred:

2022-04-13 20:36:43 WARNING setup_logger(30) Removing existing handlers from HowlClient logger
2022-04-13 20:36:43,874 INFO setup_logger(54) Set up logger (HowlClient), output path: None
Using cache found in /home/adib/.cache/torch/hub/castorini_howl_master
2022-04-13 20:36:44.069002: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-04-13 20:36:44 INFO _init_num_threads(157) NumExpr defaulting to 4 threads.
2022-04-13 20:36:45 INFO init(97) target hey is assigned to label 0
2022-04-13 20:36:45 INFO init(97) target fire is assigned to label 1
2022-04-13 20:36:45 INFO init(97) target fox is assigned to label 2
2022-04-13 20:36:45 INFO init(97) target [OOV] is assigned to label 3
ALSA lib pcm_dsnoop.c:638:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1075:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2660:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2660:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2660:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_dmix.c:1075:(snd_pcm_dmix_open) unable to open slave
2022-04-13 20:36:45,478 INFO start(140) Starting Howl inference client...
torch.Size([8000])
torch.Size([1, 40, 41])
Traceback (most recent call last):
File "/home/adib/Projects/wake word detection/howl/howl/client/howl_client.py", line 95, in _on_audio
if self.engine.infer(inp):
File "/home/adib/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/adib/Projects/wake word detection/howl/howl/model/inference.py", line 240, in infer
self.ingest_frame(window.squeeze(0), self.curr_time)
File "/home/adib/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/adib/Projects/wake word detection/howl/howl/model/inference.py", line 263, in ingest_frame
transformed_frame = self.zmuv(self.std(frame.unsqueeze(0)))
File "/home/adib/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/adib/Projects/wake word detection/howl/howl/data/transform/transform.py", line 77, in forward
x = self.passthrough(x, **kwargs)
File "/home/adib/Projects/wake word detection/howl/howl/data/transform/transform.py", line 241, in passthrough
return self.execute_op(self.spec_transform, audio, **kwargs)
File "/home/adib/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/adib/Projects/wake word detection/howl/howl/data/transform/transform.py", line 229, in execute_op
    if not deltas_only: log_mels = op(audio).add_(1e-7).log_().contiguous()
File "/home/adib/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/adib/anaconda3/lib/python3.8/site-packages/torchaudio/transforms.py", line 480, in forward
specgram = self.spectrogram(waveform)
File "/home/adib/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/adib/anaconda3/lib/python3.8/site-packages/torchaudio/transforms.py", line 96, in forward
return F.spectrogram(
File "/home/adib/anaconda3/lib/python3.8/site-packages/torchaudio/functional/functional.py", line 91, in spectrogram
spec_f = torch.stft(
File "/home/adib/anaconda3/lib/python3.8/site-packages/torch/functional.py", line 578, in stft
input = F.pad(input.view(extended_shape), [pad, pad], pad_mode)
File "/home/adib/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 4006, in _pad
return torch._C._nn.reflection_pad1d(input, pad)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (256, 256) at dimension 2 of input [1, 120, 41]
Traceback (most recent call last):
File "test.py", line 9, in
client.start().join()
File "/home/adib/Projects/wake word detection/howl/howl/client/howl_client.py", line 148, in join
time.sleep(0.04)
RuntimeError

Do you know how I can solve it?

not able to create negative dataset

Hi,
Thanks for the great work. When I try to create the positive dataset using the README for the keyword 'fire', it works fine, but when I try to create the negative dataset it hangs forever. Any idea where the problem might be?

zero result on hey_firefox training?

Dear author:
I succeeded in starting training on the hey firefox dataset, but I noticed the evaluation result is all zero. Can anyone give some tips? Thank you very much!

Training: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1664/1664 [00:23<00:00, 69.67it/s, loss=0.00108]
Dev positive: 0it [00:00, ?it/s]███████████████████████████████████████████████████████████████████████████████████████████████████████▉| 1663/1664 [00:23<00:00, 95.07it/s, loss=0.00108]
2021-11-29 18:36:12,894 INFO evaluate_engine(77) **ConfusionMatrix(tp=0, fp=0, tn=0, fn=0)**
Training: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1664/1664 [00:22<00:00, 72.56it/s, loss=0.00111]
Dev positive: 0it [00:00, ?it/s]██████████████████████████████████████████████████████████████████████████████████████████████████████▉| 1663/1664 [00:22<00:00, 119.07it/s, loss=0.00111]
2021-11-29 18:36:35,836 INFO evaluate_engine(77) **ConfusionMatrix(tp=0, fp=0, tn=0, fn=0)**

Unable to generate raw dataset.

Hi, I'm trying to generate a custom dataset as instructed.
After the filtering process, I get this error:

Traceback (most recent call last):
File "/home/boaz/miniconda3/envs/howlenv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/boaz/miniconda3/envs/howlenv/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/boaz/howl/training/run/generate_raw_audio_dataset.py", line 139, in
main(
File "/home/boaz/howl/training/run/generate_raw_audio_dataset.py", line 56, in the main
raw_dataset_generator.generate_datasets(positive_dataset_path, SampleType.POSITIVE, percentage=positive_pct)
File "/home/boaz/howl/howl/dataset/raw_audio_dataset_generator.py", line 92, in generate_datasets
dataset.print_stats(self.logger, word_searcher=word_searcher, compute_length=True)
File "/home/boaz/howl/howl/data/dataset/dataset.py", line 233, in print_stats
log_msg = header + " "
TypeError: unsupported operand type(s) for +: 'Logger' and 'str'

I'm using ubuntu 22.04.1
python 3.9
pytorch 1.10
pyaudio-0.2.11

Thanks in advance for the help :)

Not able to run hey_fire_fox demo

I'm following the steps described in the Quickstart Guide section, but unfortunately it doesn't work for me. I successfully installed all of the required dependencies, but when I try to run the hey_fire_fox demo I'm facing the following issue:

engine, ctx = _load_model(pretrained, "res8", "howl/hey-fire-fox", **kwargs)
TypeError: _load_model() missing 1 required positional argument: 'device'

In the hubconf.py file I can see that the _load_model function takes a device argument which is not present on line 32. I can fix it myself (by adding cpu to the function call) after the script downloads the latest version of the code, since I'm not overriding it (force_reload=False). However, then I get another error:

  File "/usr/local/lib/python3.9/site-packages/howl/model/inference.py", line 251, in __init__
    super().__init__(*args, **kwargs)
TypeError: __init__() missing 1 required positional argument: 'negative_label'

After I fix this one (by putting negative_label = 1), then I get the following one:

  File "/usr/local/lib/python3.9/site-packages/howl/model/inference.py", line 136, in to
    self.zmuv = self.zmuv.to(device)
AttributeError: 'InferenceContext' object has no attribute 'to'

I checked the ZmuvTransform class and it in fact doesn't have a to function, but I don't know how to fix this issue.

I'm using howl 0.1.2 and python 3.9.7

Howl on CPU

The instructions may not be up to date for the CPU version.

Unknown Error in MFA Preventing Dataset Creation

Hi,

I am trying to generate a custom dataset for the wake word "scissors" according to the dataset generation instructions. However, I am encountering this error which complains about a missing file. The error concerns MFA.

/home/mago3421/howl/montreal-forced-aligner$ ./bin/mfa_align --num_jobs 12 ../datasets/scissors/positive/audio librispeech-lexicon.txt pretrained_models/english.zip ../datasets/scissors/positive/alignment
align.py:60: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 12.0
/home/mago3421/howl/montreal-forced-aligner/lib/aligner/models.py:87: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Creating dictionary information...
Setting up training data...
There were words not found in the dictionary. Would you like to abort to fix them? (Y/N)N
Calculating MFCCs...
Traceback (most recent call last):
  File "aligner/command_line/align.py", line 186, in <module>
  File "aligner/command_line/align.py", line 142, in validate_args
  File "aligner/command_line/align.py", line 94, in align_corpus
  File "aligner/aligner/pretrained.py", line 74, in __init__
  File "aligner/aligner/pretrained.py", line 122, in setup
  File "aligner/aligner/base.py", line 89, in setup
  File "aligner/corpus.py", line 979, in initialize_corpus
  File "aligner/corpus.py", line 852, in create_mfccs
  File "aligner/corpus.py", line 863, in _combine_feats
FileNotFoundError: [Errno 2] No such file or directory: '/home/mgonza927/Documents/MFA/audio/train/mfcc/raw_mfcc.0.scp'
[1929911] Failed to execute script align

Thanks in advance.

window size

Why is the max window size 500 ms?
Is it because the duration of a word is about 500 ms?
