Hello Tarteel team, I would like to thank you for your hard work.
Currently I am experimenting with a Qur'an tutor for Surat Al-Ikhlas, the same idea as Surat Al-Fatihah but with a different audio set recorded by well-known reciters.
I prepared all the files for training, but I am facing a problem in the training phase.
!python /content/OpenNMT-py/train.py -model_type audio -enc_rnn_size 512 -dec_rnn_size 512 -audio_enc_pooling 1,2 -dropout 0 -enc_layers 2 -dec_layers 1 -rnn_type LSTM -data /content/OpenNMT-py/data/speech/demo -save_model demo-model -global_attention mlp -gpu_ranks 0 -batch_size 8 -optim adam -max_grad_norm 100 -learning_rate 0.0003 -learning_rate_decay 0.8 -train_steps 2000
[2020-03-04 21:03:57,891 INFO] * tgt vocab size = 15
[2020-03-04 21:03:57,892 INFO] Building model...
[2020-03-04 21:04:02,067 INFO] NMTModel(
(encoder): AudioEncoder(
(W): Linear(in_features=512, out_features=512, bias=False)
(batchnorm_0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(rnn_0): LSTM(161, 512)
(pool_0): MaxPool1d(kernel_size=1, stride=1, padding=0, dilation=1, ceil_mode=False)
(rnn_1): LSTM(512, 512)
(pool_1): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(batchnorm_1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(decoder): InputFeedRNNDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(15, 500, padding_idx=1)
)
)
)
(dropout): Dropout(p=0.0, inplace=False)
(rnn): StackedLSTM(
(dropout): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0): LSTMCell(1012, 512)
)
)
(attn): GlobalAttention(
(linear_context): Linear(in_features=512, out_features=512, bias=False)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(v): Linear(in_features=512, out_features=1, bias=False)
(linear_out): Linear(in_features=1024, out_features=512, bias=True)
)
)
(generator): Sequential(
(0): Linear(in_features=512, out_features=15, bias=True)
(1): Cast()
(2): LogSoftmax()
)
)
[2020-03-04 21:04:02,067 INFO] encoder: 3747840
[2020-03-04 21:04:02,067 INFO] decoder: 4190555
[2020-03-04 21:04:02,067 INFO] * number of parameters: 7938395
[2020-03-04 21:04:02,068 INFO] Starting training on GPU: [0]
[2020-03-04 21:04:02,068 INFO] Start training loop and validate every 10000 steps...
[2020-03-04 21:04:02,069 INFO] Loading dataset from /content/OpenNMT-py/data/speech/demo.train.0.pt
[2020-03-04 21:04:02,070 INFO] number of examples: 15
Traceback (most recent call last):
File "/content/OpenNMT-py/train.py", line 6, in <module>
main()
File "/content/OpenNMT-py/onmt/bin/train.py", line 204, in main
train(opt)
File "/content/OpenNMT-py/onmt/bin/train.py", line 88, in train
single_main(opt, 0)
File "/content/OpenNMT-py/onmt/train_single.py", line 143, in main
valid_steps=opt.valid_steps)
File "/content/OpenNMT-py/onmt/trainer.py", line 244, in train
report_stats)
File "/content/OpenNMT-py/onmt/trainer.py", line 365, in _gradient_accumulation
with_align=self.with_align)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/content/OpenNMT-py/onmt/models/model.py", line 45, in forward
enc_state, memory_bank, lengths = self.encoder(src, lengths)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/content/OpenNMT-py/onmt/encoders/audio_encoder.py", line 119, in forward
memory_bank = pool(memory_bank)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/pooling.py", line 76, in forward
self.return_indices)
File "/usr/local/lib/python3.6/dist-packages/torch/_jit_internal.py", line 181, in fn
return if_false(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 457, in _max_pool1d
input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: Given input size: (7x1x1). Calculated output size: (7x1x0). Output size is too small
I know the problem comes from the pooling size (the encoder's time dimension shrinks to a single frame before the kernel-size-2 pooling layer), but I don't know how to fix it.
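If it helps, here is a minimal sketch of the length arithmetic behind the error, assuming (as the model printout suggests) that each entry of -audio_enc_pooling becomes a MaxPool1d(kernel_size=p, stride=p) after the corresponding RNN layer, and that PyTorch computes the pooled length as floor((L - p) / p) + 1. The helper names below (pooled_length, encoder_lengths) are hypothetical, just for illustration; with a very short utterance the time dimension can reach 1 before pool_1, and a kernel of 2 then yields length 0, which is exactly the "Output size is too small" RuntimeError. A likely workaround is to use pooling factors of 1 (e.g. -audio_enc_pooling 1,1) or to make sure every training clip has enough audio frames to survive the configured pooling.

```python
def pooled_length(length, kernel):
    """Output length of MaxPool1d(kernel_size=kernel, stride=kernel):
    floor((length - kernel) / kernel) + 1."""
    return (length - kernel) // kernel + 1

def encoder_lengths(length, pooling):
    """Trace the time dimension through each pooling layer in order."""
    lengths = [length]
    for p in pooling:
        lengths.append(pooled_length(lengths[-1], p))
    return lengths

# A 1-frame sequence survives pool_0 (kernel 1) but dies at pool_1 (kernel 2):
print(encoder_lengths(1, [1, 2]))   # [1, 1, 0] -> length 0 triggers the error
# With pooling factors 1,1 every length is preserved:
print(encoder_lengths(1, [1, 1]))   # [1, 1, 1]
```

So the crash is not in the model itself but in the interaction between the utterance lengths in the data and the -audio_enc_pooling 1,2 setting.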