Speaker Similarity

pip install -r requirements.txt

Simple Jupyter notebook

Extracts MFCCs, delta1 and zero-crossing-rate from audio clips.
Simple Neural Network
Gaussian Mixture Model

Mel-frequency cepstral coefficients (MFCCs)

Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal spectrum. This frequency warping can allow for better representation of sound, for example, in audio compression.
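To make the frequency warping concrete, here is a minimal sketch of one common Hz-to-mel formula, m = 2595 * log10(1 + f / 700). This particular formula and the helper names (hz_to_mel, mel_to_hz, mel_band_edges) are assumptions for illustration only; the notebook in this repository may rely on a library that uses a different mel variant.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_bands):
    """Return n_bands + 2 band edges in Hz, equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 2)]


# Ten bands between 0 Hz and 8 kHz: equal steps in mels, unequal steps in Hz.
edges = mel_band_edges(0.0, 8000.0, 10)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

The band widths in Hz grow with frequency, so low frequencies get finer resolution than high ones, which is the property the mel scale is meant to capture.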

Zero-crossing rate (ZCR)

The zero-crossing rate is the rate at which a signal changes its sign from positive to negative or vice versa within a given time frame.

Zero-crossing rate can be seen as a rough measure of how noisy a signal is: it shows higher values when noise is present. It also reflects the spectral characteristics of a signal. It finds use in applications such as speech-music discrimination, speech detection (as in our case) and music genre classification.
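The definition can be sketched in a few lines of plain Python. The function name zero_crossing_rate and the test signals (a 100 Hz tone and uniform noise at an assumed 8 kHz sample rate) are illustrative assumptions, not the code used in the notebook.

```python
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


sample_rate = 8000  # assumed sample rate in Hz
n_samples = 800     # 0.1 seconds

# A clean low-frequency tone crosses zero only a few times.
tone = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(n_samples)]

# Uniform white noise flips sign roughly every other sample.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

zcr_tone = zero_crossing_rate(tone)
zcr_noise = zero_crossing_rate(noise)
```

The noise frame yields a much higher rate than the tone, which matches the observation that ZCR rises when noise is present.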

Linear Prediction Coefficients (LPC)

Linear prediction is a mathematical method where future values of a discrete-time signal are estimated as a linear function of previous samples.

Linear predictive coding (LPC) is widely used in audio signal processing and speech processing to represent the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model.

LPC assumes that a speech signal is produced by a buzzer at the end of a tube, with occasional added hissing and popping sounds. Although seemingly crude, this model is a close approximation of how speech is actually produced. The glottis produces the buzz, which is characterized by its loudness and frequency. The vocal tract forms the tube, which is characterized by its resonances. These resonances give rise to formants, or enhanced frequency bands in the sound produced. Hisses and pops are generated by the action of the tongue, lips and throat during sibilants and plosives.

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered modeled signal is called the residue.

The numbers that describe the intensity and frequency of the buzz, the formants, and the residue signal can be stored or transmitted elsewhere. LPC synthesizes the speech signal by reversing the process: use the buzz parameters and the residue to create a source signal, use the formants to create a filter, and run the source through the filter, resulting in speech.

Because speech signals vary with time, this process is done on short chunks of the speech signal, which are called frames. Generally, 30 to 50 frames per second give intelligible speech with good compression.
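As a rough illustration of framing, the sketch below cuts a signal into consecutive, non-overlapping frames at a chosen frame rate. The function name split_into_frames, the 8 kHz sample rate, and the placeholder signal are assumptions; practical systems usually use overlapping, windowed frames instead.

```python
def split_into_frames(signal, sample_rate, frames_per_second=50):
    """Split a sequence of samples into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_length = sample_rate // frames_per_second
    n_frames = len(signal) // frame_length
    return [
        signal[i * frame_length:(i + 1) * frame_length]
        for i in range(n_frames)
    ]


sample_rate = 8000                 # assumed sample rate in Hz
signal = list(range(sample_rate))  # one second of placeholder samples
frames = split_into_frames(signal, sample_rate, frames_per_second=50)
```

At 50 frames per second and 8 kHz, each frame holds 160 samples (20 ms), and LPC analysis would then run on each frame separately.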

We computed the LPCs with the librosa library. Librosa estimates the linear prediction coefficients via Burg’s method: its LPC function applies Burg’s method to estimate the coefficients of a linear filter of a given order (the order parameter) on the input signal y. Burg’s method is an extension of the Yule-Walker approach; both are sometimes referred to as LPC parameter estimation by autocorrelation.
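For intuition about what such coefficients represent, here is a minimal sketch of the simpler Yule-Walker (autocorrelation) approach solved with the Levinson-Durbin recursion. It is not Burg's method and not librosa's implementation; the helper names and the synthetic test signal are assumptions. The returned list starts with 1.0, so sample x[n] is predicted as -(a[1] * x[n-1] + ... + a[p] * x[n-p]).

```python
import random


def autocorrelation(x, max_lag):
    """Raw autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    return [
        sum(x[i] * x[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_autocorrelation(x, order):
    """Estimate prediction-error filter coefficients [1, a1, ..., ap]."""
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    error = r[0]
    if error == 0.0:
        return a  # silent frame: nothing to predict
    for i in range(1, order + 1):
        # Reflection coefficient for stage i of the Levinson-Durbin recursion.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        error *= 1.0 - k * k  # remaining prediction error power
    return a


# Synthetic signal where each sample really is a linear function of the
# previous two samples plus noise: x[n] = 1.5 x[n-1] - 0.7 x[n-2] + e[n].
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(4000):
    x.append(1.5 * x[-1] - 0.7 * x[-2] + rng.gauss(0.0, 1.0))

coeffs = lpc_autocorrelation(x, order=2)
```

Because of the sign convention, the estimate for this signal should land near [1.0, -1.5, 0.7], recovering the two weights used to generate it.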

Deep Learning Model

Gaussian Mixture Model

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians. A GMM attempts to find a mixture of multi-dimensional Gaussian probability distributions that best model any input dataset.

A GMM uses an expectation–maximization approach which qualitatively does the following:

1. Choose starting guesses for the location and shape.
2. Repeat until converged:
   - E-step: for each point, find weights encoding the probability of membership in each cluster.
   - M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights.

The result of this is that each cluster is associated not with a hard-edged sphere, but with a smooth Gaussian model. Just as in the k-means expectation–maximization approach, this algorithm can sometimes miss the globally optimal solution, and thus in practice multiple random initializations are used.
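The expectation-maximization loop described above can be sketched for one-dimensional data in plain Python. This is an illustrative toy, not the model used in the notebook: the function names, the quantile-based starting guesses, the fixed iteration count (standing in for a convergence test), and the synthetic two-cluster data are all assumptions.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, n_components, n_iter=100):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    n = len(data)
    ordered = sorted(data)
    # Starting guesses: evenly spaced quantiles as means, shared variance.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):  # fixed count instead of a convergence check
        # E-step: membership probability of each point in each cluster.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: update normalization, location, and shape from the weights.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = sum(
                r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)
            ) / nk + 1e-6  # small floor keeps the variance positive
    return weights, means, variances


# Synthetic data: two clusters centered at -4 and +4 with unit variance.
rng = random.Random(1)
data = [rng.gauss(-4.0, 1.0) for _ in range(300)]
data += [rng.gauss(4.0, 1.0) for _ in range(300)]

weights, means, variances = fit_gmm_1d(data, n_components=2)
```

This sketch uses a single deterministic initialization to keep the result reproducible; with random starting guesses, one would typically run it several times and keep the fit with the highest likelihood.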

App

We used the Flask micro web framework to build a basic UI. The user can paste a YouTube URL, and the server downloads only the audio track using the youtube-dl library. Alternatively, the user can record and upload a clip using the webcam's microphone.

vassilispapadop / speaker-similarity