
encodec's Introduction

EnCodec: High Fidelity Neural Audio Compression


This is the code for the EnCodec neural codec presented in the paper High Fidelity Neural Audio Compression [abs]. We provide our two multi-bandwidth models:

  • A causal model operating at 24 kHz on monophonic audio trained on a variety of audio data.
  • A non-causal model operating at 48 kHz on stereophonic audio trained on music-only data.

The 24 kHz model can compress to 1.5, 3, 6, 12 or 24 kbps, while the 48 kHz model supports 3, 6, 12 and 24 kbps. We also provide a pre-trained language model for each model, which can further compress the representation by up to 40% without any further loss of quality.

For reference, we also provide the code for our novel MS-STFT discriminator and the balancer.

Schema representing the structure of EnCodec: a convolutional+LSTM encoder, a Residual Vector Quantization in the middle, followed by a convolutional+LSTM decoder. A multiscale complex spectrogram discriminator is applied to the output, along with objective reconstruction losses. A small transformer model is trained to predict the RVQ output.

Samples

Samples including baselines are provided on our sample page. You can also have a quick demo of what we achieve for 48 kHz music with EnCodec, along with entropy coding, by clicking the thumbnail (original tracks provided by Lucille Crew and Voyageur I).

Thumbnail for the sample video: you will first hear the ground truth, then ~3 kbps, then 12 kbps, for two songs.

🤗 Transformers

Encodec has now been added to Transformers. For more information, please refer to Transformers' Encodec docs.

You can find both the 24 kHz and 48 kHz checkpoints on the 🤗 Hub.

Using 🤗 Transformers, you can leverage Encodec at scale along with all the other supported models and datasets. ⚡️ Alternatively, you can also directly use the encodec package, as detailed in the Usage section.

To use it, first set up your development environment:

pip install -U datasets 
pip install git+https://github.com/huggingface/transformers.git@main

Then, start embedding your audio datasets at scale!

from datasets import load_dataset, Audio
from transformers import EncodecModel, AutoProcessor

# dummy dataset; you can swap this for any dataset on the 🤗 Hub or bring your own
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# load the model + processor (for pre-processing the audio)
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# cast the audio data to the correct sampling rate for the model
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
audio_sample = librispeech_dummy[0]["audio"]["array"]

# pre-process the inputs
inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")

# explicitly encode then decode the audio inputs
encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]

# or the equivalent with a forward pass
audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values

# you can also extract the discrete codebook representation for LM tasks
# output: concatenated tensor of all the representations
audio_codes = model(inputs["input_values"], inputs["padding_mask"]).audio_codes

What's up?

See the changelog for details on releases.

Installation

EnCodec requires Python 3.8 and a reasonably recent version of PyTorch (ideally 1.11.0). To install EnCodec, you can run one of the following:

pip install -U encodec  # stable release
pip install -U git+https://[email protected]/facebookresearch/encodec#egg=encodec  # bleeding edge
# or if you cloned the repo locally
pip install .

Supported platforms: we officially support only Mac OS X (you might need Xcode installed if running on a non-Intel Mac) and recent versions of mainstream Linux distributions. We will try to help out on Windows but cannot provide strong support. Other platforms (iOS / Android / embedded ARM) are not supported.

Usage

You can then use the EnCodec command, either as

python3 -m encodec [...]
# or
encodec [...]

If you want to use the compression API directly, check out encodec.compress and encodec.model. See below for instructions on how to extract the discrete representation.
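If it helps, here is a minimal sketch of a compress/decompress round trip from Python. It assumes encodec.compress exposes compress and decompress helpers mirroring the CLI (the exact signatures, including the use_lm argument shown here, should be checked in encodec/compress.py):

import torchaudio

from encodec import EncodecModel
from encodec.compress import compress, decompress
from encodec.utils import convert_audio

# Load the 24 kHz model and pick a target bandwidth.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load the audio and convert it to the model's sample rate / channel count.
wav, sr = torchaudio.load("<PATH_TO_AUDIO_FILE>")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

data = compress(model, wav, use_lm=False)  # bytes in the .ecdc format (use_lm mirrors the --lm flag)
out, out_sample_rate = decompress(data)    # reconstructed waveform and its sample rate
torchaudio.save("roundtrip.wav", out, out_sample_rate)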

Model storage

The models will be automatically downloaded on first use via Torch Hub. For more information on where those models are stored, or how to customize the storage location, check out the Torch Hub documentation.
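For instance, since the checkpoints are fetched through Torch Hub, the standard PyTorch cache controls apply; the snippet below only uses generic PyTorch mechanisms (TORCH_HOME and torch.hub.set_dir), nothing EnCodec-specific:

import os
import torch

# Either set the cache root before the first checkpoint is downloaded...
os.environ["TORCH_HOME"] = "/path/to/cache"
# ...or point Torch Hub at a directory programmatically.
torch.hub.set_dir("/path/to/cache/hub")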

Compression

encodec [-b TARGET_BANDWIDTH] [-f] [--hq] [--lm] INPUT_FILE [OUTPUT_FILE]

Given any audio file supported by torchaudio on your platform, this compresses it with EnCodec to the target bandwidth (default is 6 kbps; can be 1.5, 3, 6, 12 or 24). OUTPUT_FILE must end in .ecdc. If not provided, it defaults to INPUT_FILE with the extension replaced by .ecdc. To use the model operating at 48 kHz on stereophonic audio, pass the --hq flag. The -f flag forces overwriting an existing output file. Use the --lm flag to use the pretrained language model with entropy coding (expect it to be much slower).

If the sample rate or number of channels of the input doesn't match that of the model, the command will automatically resample / reduce channels as needed.

Decompression

encodec [-f] [-r] ENCODEC_FILE [OUTPUT_WAV_FILE]

Given a previously generated .ecdc file, this will decode it to the given output wav file. If not provided, the output defaults to the input with a .wav extension. Use the -f flag to force overwriting the output file (be careful, when compressing and then decompressing, not to overwrite your original file!). Use the -r flag if you experience clipping; it rescales the output file to avoid it.

Compression + Decompression

encodec [-r] [-b TARGET_BANDWIDTH] [-f] [--hq] [--lm] INPUT_FILE OUTPUT_WAV_FILE

When OUTPUT_WAV_FILE has the .wav extension (as opposed to .ecdc), the encodec command will instead compress and immediately decompress without storing the intermediate .ecdc file.

Extracting discrete representations

The EnCodec model can also be used to extract discrete representations from the audio waveform.

from encodec import EncodecModel
from encodec.utils import convert_audio

import torchaudio
import torch

# Instantiate a pretrained EnCodec model
model = EncodecModel.encodec_model_24khz()
# The number of codebooks used will be determined by the bandwidth selected.
# E.g. for a bandwidth of 6 kbps, `n_q = 8` codebooks are used.
# Supported bandwidths are 1.5 kbps (n_q = 2), 3 kbps (n_q = 4), 6 kbps (n_q = 8), 12 kbps (n_q = 16) and 24 kbps (n_q = 32).
# For the 48 kHz model, only 3, 6, 12, and 24 kbps are supported. The number
# of codebooks for each is half that of the 24 kHz model, as the frame rate is twice as high.
model.set_target_bandwidth(6.0)

# Load and pre-process the audio waveform
wav, sr = torchaudio.load("<PATH_TO_AUDIO_FILE>")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)

# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)  # [B, n_q, T]

Note that the 48 kHz model processes the audio in chunks of 1 second, with an overlap of 1%, and renormalizes the audio to have unit scale. For this model, the output of model.encode(wav) would be a list (one entry per 1-second frame) of tuples (codes, scale), with scale a scalar tensor.
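Continuing the snippet above, here is a sketch of the same extraction with the 48 kHz model, where each element of the returned list is a (codes, scale) tuple; keep the scales around if you plan to call model.decode later:

# 48 kHz stereo model; imports and convert_audio are the same as in the snippet above.
model_48k = EncodecModel.encodec_model_48khz()
model_48k.set_target_bandwidth(6.0)

wav48, sr = torchaudio.load("<PATH_TO_AUDIO_FILE>")
wav48 = convert_audio(wav48, sr, model_48k.sample_rate, model_48k.channels).unsqueeze(0)

with torch.no_grad():
    frames = model_48k.encode(wav48)                       # list of (codes, scale), one per ~1 s chunk
codes = torch.cat([codes for codes, _ in frames], dim=-1)  # [B, n_q, T]
scales = [scale for _, scale in frames]                    # one scalar tensor per chunk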

Installation for development

This will install encodec in developer mode (changes to the files are reflected directly), along with the dependencies needed to run the unit tests.

pip install -e '.[dev]'

Test

You can run the unit tests with

make tests

FAQ

Please check this section before opening an issue.

Out of memory errors with long files

We do not try to be smart about long files: we apply the model to the entire file at once. This can lead to high memory usage and result in the process being killed. At the moment we will not support this use case.
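A possible workaround is to split the waveform yourself and run the model chunk by chunk. This is only a sketch and an approximation: each chunk is compressed independently rather than in a single pass, so small boundary artifacts are possible.

import torch
import torchaudio

from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("<PATH_TO_LONG_AUDIO_FILE>")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

chunk = 10 * model.sample_rate  # 10-second chunks, an arbitrary choice
pieces = []
with torch.no_grad():
    for start in range(0, wav.shape[-1], chunk):
        segment = wav[:, :, start:start + chunk]
        decoded = model.decode(model.encode(segment))
        pieces.append(decoded[..., :segment.shape[-1]])  # trim any padding added by the model
out = torch.cat(pieces, dim=-1)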

Bad interactions between DistributedDataParallel and the RVQ code

We do not use DDP; instead we recommend using the routines in encodec/distrib.py, in particular encodec.distrib.sync_buffer and encodec.distrib.sync_grad.
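For illustration only, a hedged sketch of where those helpers might sit in a hand-rolled training step; the exact signatures of sync_grad and sync_buffer should be checked in encodec/distrib.py, and this is not the repository's actual training code:

from encodec import distrib

def training_step(model, optimizer, loss_fn, batch):
    out = model(batch)
    loss = loss_fn(out, batch)
    loss.backward()
    # Instead of wrapping the model in DDP, all-reduce the gradients manually...
    distrib.sync_grad(model.parameters())
    optimizer.step()
    optimizer.zero_grad()
    # ...and keep non-gradient state (e.g. EMA codebook buffers) in sync across workers.
    distrib.sync_buffer(model.buffers())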

Citation

If you use this code or results in your paper, please cite our work as:

@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}

License

The code in this repository is released under the MIT license as found in the LICENSE file.

encodec's People

Contributors

0xflotus, adefossez, eltociear, lwprogramming, vaibhavs10


encodec's Issues

Trying to compress, what's the formula?

❓ Questions

I'm on Linux Mint with Python 3.8.
In the terminal, inside the EnCodec directory, I have an audio file sound.mp3.

What is the terminal command to make it work?
I can't figure it out...
Can you help?

Non-free LICENSE

Non-commercial (NC) clauses are non-free: they are not free software according to the FSF, not open source according to the OSI, and not free culture according to Freedom Defined. I would recommend using CC-BY-SA-4.0, CC-BY-4.0, or CC0-1.0 instead, which are free culture Creative Commons licenses.

EOFError encountered for some special audio lengths

πŸ› Bug Report

EOFError encountered for some special audio lengths.
The reason is that, when calculating the number of decompressed frames, the wrong order of operations results in floating point errors. E.g.:

math.ceil(53760 / 24000 * 75) evals to 169
math.ceil(53760 * 75 / 24000) evals to 168

frame_length = int(math.ceil(this_segment_length / model.sample_rate * model.frame_rate))

This line of code should be changed to
frame_length = int(math.ceil(this_segment_length * model.frame_rate / model.sample_rate ))

To Reproduce

Create a .wav file with duration 2.24s and content all zeros.

import numpy as np
from scipy.io import wavfile
wavfile.write('bug.wav', 24000, data=np.zeros(53760, dtype=np.int16))

Run encodec compress and decompress

import subprocess
subprocess.run('encodec bug.wav bug.encodec.wav'.split())

gives error message:

Traceback (most recent call last):
  File "/home/chenjiasheng/.local/bin/encodec", line 33, in <module>
    sys.exit(load_entry_point('encodec', 'console_scripts', 'encodec')())
  File "/mnt/d/code/encodec/encodec/__main__.py", line 117, in main
    out, out_sample_rate = decompress(compressed)
  File "/mnt/d/code/encodec/encodec/compress.py", line 185, in decompress
    return decompress_from_file(fo, device=device)
  File "/mnt/d/code/encodec/encodec/compress.py", line 147, in decompress_from_file
    raise EOFError("The stream ended sooner than expected.")
EOFError: The stream ended sooner than expected.

Your Environment

repo version:

commit c79ba28c9199494d106d2c7f56006260528d7b16 (HEAD -> main, origin/main, origin/HEAD)
Author: Alexandre Défossez <[email protected]>
Date:   Tue Jan 24 14:07:56 2023 +0100

Could this be used to compare audio similarity?

❓ Questions

I'm curious how to extract embeddings, whether that's the output of the compress function / command-line tool, and whether it could be used to compare, via cosine similarity, how similar two audio files are?
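For illustration (this is not an official recipe): one way to get a fixed-size embedding to compare would be to average the encoder's continuous output over time and take a cosine similarity; whether this correlates with perceptual similarity is exactly the open question asked above. The snippet bypasses the normalization and segmentation done by model.encode.

import torch
import torchaudio

from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()

def embed(path):
    wav, sr = torchaudio.load(path)
    wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)
    with torch.no_grad():
        emb = model.encoder(wav)        # [B, D, T] continuous latent, before quantization
    return emb.mean(dim=-1).squeeze(0)  # [D] time-averaged embedding

sim = torch.nn.functional.cosine_similarity(embed("a.wav"), embed("b.wav"), dim=0)
print(float(sim))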

Real-time usage example, and permissive licensing question

❓ Questions

Thanks for releasing this - very exciting work! I have two questions:

  • Do you have examples for real-time usage, or is it currently only set up for conversion of pre-recorded audio files?
  • Lyra V2 is permissively licensed (Apache-2.0), and I'm in the process of getting an open-source demo of it working on the web so that people can use it in their web applications. Would you consider using a permissive license (e.g. CC-BY/MIT/Apache) so that your work can more broadly benefit the open source ecosystem? I'd love to create a JS package for this that anyone can use in their web app.

Could this be used for a better quality audiolm

❓ Questions

Within the SoundStream paper, the Google team used the residual vector quantization model to generate music.

I was wondering, since the architecture is very similar, whether you have thought about using it to generate music.

I've tried OpenAI's Jukebox and the output is fairly noisy.

Kudos on this great work!

About audio quality evaluation

❓ Questions

Thank you for nice work.

I have some question about objective evaluation metrics.

  1. Are these metrics (SI-SNR and ViSQOL) consistent with audio quality perceptually?

I know that it is very difficult to evaluate audio quality, so I'm curious how you evaluate the model during ablation studies or during training.

  2. MS-STFT discriminator (complex) vs. MS-STFT discriminator (real)

How is the quality of the model with an MS-STFT discriminator using only real values? It would be appreciated if you could share such information.

Thank you!

Pre-trained discriminator?

❓ Questions

Do I need to use a pre-trained discriminator, or can I use an untrained discriminator to calculate the adversarial losses?

Using the causal model at 24 kHz results in the process being "Killed"

πŸ› Bug Report

Using the causal model at 24 kHz results in the process being killed after ~10 seconds. It doesn't matter whether the language model (--lm) is used or not. The 48 kHz mode works as expected.

To Reproduce

I've used the following command and got these results:
python3 -m encodec -r -b 1.5 --lm '/home/user/Downloads/Audiofile.wav' '/home/user/Downloads/Audiofile_EnCodec.wav';

Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /home/user/checkpoints/encodec_24khz-d7cc33bc.th 100.0%

Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_lm_24khz-1608e3c0.th" to /home/user/checkpoints/encodec_lm_24khz-1608e3c0.th 100.0%

Killed

user@user:~$

I even deleted the checkpoints folder to make sure that wasn't the issue. I've tested with 16- and 24-bit PCM WAV files with a mono channel. Using --hq and --lm together works as expected, so I believe the 24 kHz model is the culprit.

Expected behavior

It should not kill the process and result in a properly encoded and decoded file.

Actual Behavior

It kills the process.

Your Environment

  • Python and PyTorch version: Python 3.11.0, PyTorch: 1.13.0
  • Operating system and version (desktop or mobile): Ubuntu 24.04.01 LTS
  • Hardware (gpu or cpu, amount of RAM etc.): Virtual machine (AMD 5900X, RTX 3080, 16 GB RAM assigned to the VM)

Low quality audio in the demo clip

πŸ™‹β€β™‚οΈ Suggestion

I noticed that the audio on the linked demo (final.mp4) sounds very compressed, even the original part. Even on a phone speaker I can hear a difference compared to test_48k.wav, not to mention with headphones.

I'm aware this is NOT a finished product, but maybe something above 128 kbps AAC would be better for comparing audio compression.

Reconstruction Loss

❓ Questions

Hello, my question is about the reconstruction loss in the frequency domain. In paragraph 3.4 it is stated that you use a "mel-spectrogram using a normalized STFT"; what type of normalization is meant here? Is it sufficient to use the normalized flag of
torchaudio.transforms.MelSpectrogram, which normalizes "by magnitude after stft"?
Also, in practice the STFT loss is sometimes computed on the log mel-spectrogram for better convergence, so I want to clarify: in your implementation, is S_i from formula 1 a mel-spectrogram or a log mel-spectrogram?

What is the ECDC file?

❓ Questions

Hey, thank you for sharing your work. Sorry for asking what seems like a simple question, but can you correct the two statements below if they are wrong?
An .ecdc file can't be played directly and must be decompressed first to produce a WAV file that can be played?
An .ecdc file is like a zip file: it holds your audio, but you can't actually use it until you decompress it?
Are the above two statements correct?

get_bandwidth_per_quantizer is incorrect.

πŸ› Bug Report

    def get_bandwidth_per_quantizer(self, sample_rate: int):
        """Return bandwidth per quantizer for a given input sample rate.
        """
        return math.log2(self.bins) * sample_rate / 1000

it should be

        return math.log2(self.bins) * sample_rate / 1000 / self.hop_length

How to call backward on the balancer when using huggingface accelerate

❓ Questions

When training encodec using huggingface's accelerate package, is it not possible to use the balancer?

this is part of my training script

            self.balancer._set_losses_and_input(
                losses={'t': recon_loss, 'f': m_recon_loss, 'g': ads_loss, 'feat': rfm_loss},
                input=output
            )
            # self.balancer.backward()
            self.accelerator.backward(self.balancer)

and I changed the Balancer a little:

    def __mul__(self, other):
        for name, loss in self.losses.items():
            self.losses[name] = loss * other
        
    
    def __truediv__(self, other):
        for name, loss in self.losses.items():
            self.losses[name] = loss / other


    def _set_losses_and_input(self, losses: tp.Dict[str, torch.Tensor], input: torch.Tensor):
        self.losses = losses
        self.input = input
        
    @property
    def metrics(self):
        return self._metrics

    def backward(self):
        losses = self.losses
        input = self.input
        
        norms = {}
        grads = {}
        for name, loss in losses.items():
            grad, = autograd.grad(loss, [input], retain_graph=True)
            if self.per_batch_item:
                dims = tuple(range(1, grad.dim()))
                norm = grad.norm(dim=dims).mean()
            else:
                norm = grad.norm()
            norms[name] = norm
            grads[name] = grad

        count = 1
        if self.per_batch_item:
            count = len(grad)
        avg_norms = average_metrics(self.averager(norms), count)
        total = sum(avg_norms.values())

        self._metrics = {}
        if self.monitor:
            for k, v in avg_norms.items():
                self._metrics[f'ratio_{k}'] = v / total

        total_weights = sum([self.weights[k] for k in avg_norms])
        ratios = {k: w / total_weights for k, w in self.weights.items()}

        out_grad: tp.Any = 0
        for name, avg_norm in avg_norms.items():
            if self.recale_grads:
                scale = ratios[name] * self.total_norm / (self.epsilon + avg_norm)
                grad = grads[name] * scale
            else:
                grad = self.weights[name] * grads[name]
            out_grad += grad

        input.backward(out_grad)

Inference using a GPU

❓ Questions

Hello, my question is the following: I would like to know if it's possible in any way to use CUDA acceleration for compressing and decompressing with Encodec? From what I've been reading, I couldn't find anything that mentioned inference on GPU, so I wanted to know if it's possible to use the GPU, which would be even more productive and useful. Awaiting an answer, thank you in advance, att. Lucas.

Questions about the pre-trained language model

❓ Questions

Thanks for the great work and the shared code! I have some questions about the pre-trained transformer language model:

Could you explain more details about the supervision for training the transformer (shown as L_l in Fig 1 in your paper)? My understanding is that you use a pre-trained language model and train some linear layers to model the distribution of codewords for each frame, but is there any other supervision for modeling the distribution, or is the transformer also jointly optimized with the whole encoder and decoder?

Looking forward to your reply!

Real-world Balancer usage question

❓ Questions

Hi, first of all, thanks for sharing the code! It's quite well-written.

I've been trying to understand the practical use of the Balancer class.

In pseudo-code, what I understood from the documentation is that my use of it in a real training loop should look something like this:

z = encoder(input)
reconstruction = decoder(z)
loss_l1 = F.l1_loss(reconstruction, input)
loss_l2 = F.l2_loss(reconstruction, input)
balancer.backward({'l1' : loss_l1, 'l2' : loss_l2}, reconstruction)

So far so good?

Assuming I understood correctly, I still run into an issue once I add a loss term on the latent representation (z) in the above. e.g.

z = encoder(input)
reconstruction = decoder(z)
loss_l1 = F.l1_loss(reconstruction, input)
loss_l2 = F.l2_loss(reconstruction, input)
loss_latent = some_regularization_loss(z)
balancer.backward({'l1' : loss_l1, 'l2' : loss_l2, 'z' : loss_latent}, reconstruction)

Now, the loss on z does not depend on the reconstruction so this won't work. An alternative is to pass z to balancer.backward() however that comes with the cost of backpropagating through the decoder multiple times. What did you do regarding the quantization loss?

Questions about dataset mixture strategy

❓ Questions

Thanks for your amazing work!
I have a question about your dataset mixing strategy. Does mixture here refer to:

  1. a combination of audios from different datasets so no change per training audio, or
  2. a superposition of different audio tracks, so each training audio contains multiple tracks from multiple datasets?


Number of codebooks and calculation of bitrates

❓ Questions

I don't understand how you come to the smallest bitrate of 1.5 kbps for the 24kHz model:

If I understand correctly, we take a number of codebooks that is a multiple of 4 (4, 8, 12, ..., so 4 would be the minimum), and we have 10 bits per codebook (2^10 = 1024 entries), and for the 24 kHz model 75 latent codes per second, giving us the smallest possible bit rate:
4 * 10 bits * 75 1/s = 3 kbps

However, both the paper and the README state that the lowest bitrate is 1.5 kbps.
Looking at the bitrate progression (1.5, 3, 6, 12, 24), which doubles at each step, wouldn't that rather correspond to 2, 4, 8, 16, 32 codebooks being used? Maybe I am just misinterpreting or missing something; could you please clarify this point?
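For reference, plugging the codebook counts listed in the usage notes above (n_q = 2, 4, ..., 32 for the 24 kHz model) into the same arithmetic reproduces the advertised rates:

2 codebooks * 10 bits * 75 1/s = 1500 bit/s = 1.5 kbps
4 codebooks * 10 bits * 75 1/s = 3000 bit/s = 3 kbps
32 codebooks * 10 bits * 75 1/s = 24000 bit/s = 24 kbps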

Channel-mismatch `RuntimeError` when extracting embedding with 24kHz model

πŸ› Bug Report

Following the "Extracting discrete representations" section in README, I tried to extract the encoded embedding myself. However, running the exact code snippet gave me an error: RuntimeError: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 144006] to have 1 channels, but got 2 channels instead.

To Reproduce

from encodec import EncodecModel
from encodec.utils import convert_audio

import torchaudio
import torch

# Instantiate a pretrained EnCodec model
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load and pre-process the audio waveform
wav, sr = torchaudio.load("test.wav")
wav = wav.unsqueeze(0)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)  # [B, n_q, T]

where test.wav is any WAV file. I tried with one on the sample page.

Expected behavior

I should be able to get the representation in [B, n_q, T] as described in the code itself.

Actual Behavior

Full traceback:

Traceback (most recent call last):
  File "/home/ubuntu/test.py", line 18, in <module>
    encoded_frames = model.encode(wav)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/model.py", line 144, in encode
    encoded_frames.append(self._encode_frame(frame))
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/model.py", line 161, in _encode_frame
    emb = self.encoder(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/modules/seanet.py", line 144, in forward
    return self.model(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/modules/conv.py", line 210, in forward
    return self.conv(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/modules/conv.py", line 120, in forward
    x = self.conv(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 144006] to have 1 channels, but got 2 channels instead

Your Environment

  • Python and PyTorch version: Python 3.10.8 (conda 22.11.1), PyTorch 1.13.0 with CUDA 11.7
  • Operating system and version (desktop or mobile): Ubuntu 22.04.1 on WSL2 (Windows 10 Build 19045)
  • Hardware (gpu or cpu, amount of RAM etc.): RTX 2070 SUPER with 32 GB RAM

VQ code vectors

❓ Questions

Thanks for the nice paper/work!

I've a question:
How do I print the VQ code vectors of EnCodec?

the definition of loss_l

❓ Questions

It seems the definition of 'loss_l' in Figure 1, which connects the VQ and the transformer in the quantizer block, is not described in the paper.

Is there some description of it? Thanks.

SoundStream improved reimplementation

Thanks for publishing this! In the encodec paper you write

For fair evaluation, we also compare EnCodec to our reimplementation of
SoundStream (Zeghidour et al., 2021). [...] Finally, we compare EnCodec to the
SoundStream model from the official implementation available in Lyra 2 at 3.2 kbps and 6 kbps on audio
upsampled to 32 kHz. We also reproduced a version of SoundStream (Zeghidour et al., 2021) with minor
improvements.
Namely, we use the relative feature loss introduced in Section 3.4, and layer normalization
(applied separately for each time step) in the discriminators, except for the first and last layer, which improved
the audio quality during our preliminary studies.

And on https://ai.honu.io/papers/encodec/samples.html you show samples of this reimplementation.
Could you share the source code of your SoundStream reimplementation so this work can be reproduced?

Entropy coding

❓ Questions

Hi, thank you for a great work.

(I)
I could not figure out the necessity of predicting codebook logits via the transformer.
Why couldn't we use the empirical distribution of codebook usage (frequencies) over a validation set?
I feel like I am missing something here.

(II)
Also, [3] and your work show that entropy coding does not improve much, while [2] demonstrates significantly better results (excluding latency, most likely). Also, most of the literature on neural image codecs shows that most of the gains are achieved through entropy coding. Any comments on that? Do you think a different implementation is required to obtain higher compression, or is it just not that important for the audio domain?

(III)
Also, I am writing on behalf of a small, non-commercial, independent research group. We are interested in working further in this direction; if there are possibilities to collaborate, that would be wonderful. We have some ideas and the resources to test them. Your expertise could help us save some time.

@adefossez.
Actually, I am sort of a fan of yours since the speech de-noising paper [1] and later DiffQ. Not surprised you used it in the VQ-VAE. RNNs within auto-encoders look like a signature move, not sure if it is your idea though. ))

[1] Real Time Speech Enhancement in the Waveform Domain
[2] LMCODEC: A LOW BITRATE SPEECH CODEC WITH CAUSAL TRANSFORMER MODELS
[3] SoundStream: An End-to-End Neural Audio Codec

Invalid file: WindowsPath error in encodec on Windows

πŸ› Bug Report

When I tried to encode from a WAV to an ECDC file, Python gave me an "invalid file" error on Windows. On Linux it works well; only Windows does not work.

To Reproduce

  1. Install encodec using pip install .
  2. Install pysoundfile using pip
  3. Encode from uncompressed WAV to encodec (.ecdc)

Expected behavior

Encoding succeeds from WAV to ECDC.

Actual Behavior

It says the file is invalid when trying to load the WAV file.

Your Environment

The following error occurs on Windows only:

encodec -b 24 -r MSESTONIA.wav MSESTONIA.ecdc
Traceback (most recent call last):
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\Scripts\encodec.exe\__main__.py", line 7, in <module>
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\site-packages\encodec\__main__.py", line 109, in main
    wav, sr = torchaudio.load(args.input)
  File "C:\Users\marti\AppData\Roaming\Python\Python310\site-packages\torchaudio\backend\soundfile_backend.py", line 205, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\site-packages\soundfile.py", line 740, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\site-packages\soundfile.py", line 1263, in _open
    raise TypeError("Invalid file: {0!r}".format(self.name))
TypeError: Invalid file: WindowsPath('MSESTONIA.wav')
  • Python and PyTorch version: Python 3.10.8 and Pytorch 1.13.0+cu117

  • Operating system and version (desktop or mobile): Windows 11 (desktop)

  • Hardware (gpu or cpu, amount of RAM etc.): NVIDIA RTX 3060 Laptop GPU, 16 GB RAM.

  • Martin Eesmaa

Incorrect timing when expiring codes

πŸ› Bug Report

In the current implementation the following incorrect behavior occurs:
an embedding that was not chosen in previous steps will be expired here, even though it was chosen on that very step a couple of lines above; since self.cluster_size is only updated in the following lines, the embedding gets expired when it should not be.

This can be easily checked by adding these 2 lines
expired_but_taken = (self.cluster_size < self.threshold_ema_dead_code) & (embed_onehot.sum(0) > 0)
assert torch.any(expired_but_taken)
somewhere here

I think the correct code should be like in vector-quantize-pytorch repo, namely this line.

Motivation behind `layer_state` in `StreamingTransformerEncoder`

❓ Questions

I would like to hear a bit more about the motivation behind the usage of layer_state in the StreamingTransformerEncoder.

My understanding so far is that, for each layer, the previous input x_past is concatenated with x for the keys and values, but not for the queries. This effectively means that the matmul between queries and keys does not just attend to the current input but also to (part of) the previous input x_past.

I'm not entirely sure how to interpret this, and that may be due to me not being able to introspect your training strategy. To my understanding, x and x_past should be independent token sequences, in which case it seems strange to allow the transformer to attend to a concatenation of these sequences. Alternatively, x and x_past originate from the same audio clip, in which case I don't understand why you wouldn't just increase the context length explicitly.

I tried to find other transformer implementations that do something similar, and the only thing that comes close is Transformer-XL. There is a major difference, however: they propagate the output of the transformer layer stack to the next step, whereas your implementation propagates the input.

I may be missing something entirely, so please excuse my ignorance in that case; nonetheless, I would really appreciate it if you could shed some light on this 😇
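Purely as an illustration of the mechanism described above (an assumption, not the repository's StreamingTransformerEncoder code): queries come only from the current chunk x, while keys and values also cover the cached previous input x_past. A real streaming setup would additionally apply a causal mask.

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4)

def streaming_step(x, x_past):
    # x: [T, B, C] current chunk; x_past: [T_past, B, C] cached input of the previous chunk.
    kv = x if x_past is None else torch.cat([x_past, x], dim=0)
    out, _ = attn(x, kv, kv, need_weights=False)
    # Cache this step's *input* (not its output), mirroring the behavior discussed above.
    return out, x

x0 = torch.randn(10, 1, 64)
y0, past = streaming_step(x0, None)
x1 = torch.randn(10, 1, 64)
y1, past = streaming_step(x1, past)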

Finetuning possible?

❓ Questions

Is it possible to fine-tune the model? If so, how?

Or is only the compression/decompression model provided?

Do we have training examples?

❓ Questions

Hi EnCodec team,

great job!

We want to reproduce the results, train a new model on a wider range of sound datasets, and add some denoising/dereverberation functionality. Could you please add some training examples?

Changing existing models and training them

❓ Questions

Hello! Could you tell me how I can change your models (24 kHz and 48 kHz) to 16 kHz or 8 kHz, because sometimes we don't use those sample rates (24 kHz and 48 kHz)?
Also, how can I train the models?

Memory leak in decoding process

πŸ› Bug Report

I'm trying to evaluate Encodec for streaming audio use cases; however, I noticed that the decoding step seems to accumulate memory very quickly over time. If I turn off decoding, memory usage stays constant. I looked at the code, though, and I'm not sure why.

To Reproduce

from encodec import EncodecModel
from encodec.utils import convert_audio

import torchaudio
import torch

# Instantiate a pretrained EnCodec model
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load and pre-process the audio waveform
wav, sr = torchaudio.load("CantinaBand60.wav")
wav = wav.unsqueeze(0)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Extract discrete codes from EnCodec
frames = []
for offset in range(0, 1440000, 480):
    print(offset)

    encoded_frames = model._encode_frame(wav[:, :, offset: offset + 480])

    frames.append(model._decode_frame(encoded_frames))

# merge the decoded frames
decoded = torch.cat(frames, dim=2)

# save the decoded audio
torchaudio.save("decoded.wav", decoded, model.sample_rate)

Expected behavior

Memory shouldn't be persisted across frames.

Actual Behavior

Memory usage grows over time eventually resulting in OOM.

Wrong bar chart on announcement

This report is regarding the announcement of this project in the blog post. I know this is the wrong place to report it, but I don't know where the right place would be.

The chart on the page is misleading and wrong. It shows an x-axis that indicates that the 10× gain is for MP3. Either the bar lengths have to be swapped or the caption has to change:


About the stability of the VQ based approach for codec

❓ Questions

Thanks for sharing this amazing speech codec. Since Encodec and SoundStream use RVQ to quantize the latent representations, I'm worried about the stability of (R)VQ.

I evaluated Lyra2 at 9.2 kbps and Encodec at 12 kbps with high-quality data and found that irregular harmonics appear (an example). I guess this is caused by the VQ process; do you have any view on this?

Question about bitrate choices

❓ Questions

Hello! Do you intend to train models with a target bitrate above 24 kbps? I didn't see anything in the paper, but maybe I missed it.

I'd be curious to see how 48 and 96 kbps models compare to MP3s at higher bitrates.

Thanks for the great work!

A question about adversarial loss

❓ Questions

In Section 3.4 "Discriminative Loss" of the paper, the adversarial loss is constructed as $l_g(\hat{x}) = \mathbb{E}[\max(0, 1 - D_k(\hat{x}))]$, but in the original hinge loss paper, the adversarial loss is constructed as $-\mathbb{E}[D(\hat{x})]$.

So I want to know: why is the adversarial loss in this paper different from the original hinge loss?

Wrong device in average_metrics function

πŸ› Bug Report

Since the average_metrics function is called from backward, which should be called from every GPU, the device that is created here should correspond to the current rank; otherwise torch.distributed.all_reduce will be stuck forever.

Can the decoder models run on Android and iOS?

❓ Questions

Google's Lyra provides support for Android via TFLite but not for iOS (yet). Does Facebook's EnCodec provide models that can run on edge devices for both Android and iOS? If not, are there any estimated timelines for when this could be available?

Some details about RVQ code

❓ Questions

Hi, when I try to reproduce the training code based on the released parts, I run into a question when training with multiple GPUs: I find that https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L150 and https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L168 cause DDP training to stop. The problem is that this code makes the multiple GPUs wait for each other, so I deleted these lines, and now it can be trained with torch DDP. But I don't know whether removing them will affect performance. Can you give me some advice on whether these lines can be deleted?

Question about better quality models for 48 kHz

Do you have better quality models for 48 kHz?

It would be helpful if you added 32 kbps, 48 kbps, 64 kbps and 96 kbps models to compare to MP3 / AAC / Windows Media Audio.

Thanks for the support!

During training, the model's loss doesn't converge

❓ Questions

I have written my custom training loop with a reconstruction loss function, but the loss doesn't converge and bounces around between 600 and 700 (with my loss function) even after 3 to 4 epochs. Can you please explain why that happens?

I first wanted to check the loss convergence of the model only for reconstruction loss without the discriminators.

Loss function code (from the SoundStream paper):

import torch
from torchaudio.transforms import MelSpectrogram

def L_G_rec(x, G_x, eps=1e-4):
    L = 0
    sr = 16000
    for i in range(6, 12):
        s = 2 ** i
        alpha_s = (s / 2) ** 0.5
        melspec = MelSpectrogram(sample_rate=sr, n_fft=s, hop_length=s//4, center=True, pad_mode="reflect", power=2.0, norm="slaney", onesided=True, n_mels=8, mel_scale="htk")
        S_x = melspec(x)
        S_G_x = melspec(G_x)
        loss1 = (S_x - S_G_x).abs().mean()
        loss2 = alpha_s * (((torch.log(S_x.abs() + eps) - torch.log(S_G_x.abs() + eps)) ** 2).sum(dim=-2).add(eps).abs() ** 0.5).mean()
        loss = loss1 + loss2
        L = L + loss
    return L

Training loop code :

model = EncodecModel._get_model(24)
. . . 
for epoch in range(0, EPOCH):
  print("EPOCH number = %d" % epoch)
  for i, input in enumerate(iter(dataloader)):
    with torch.autograd.set_detect_anomaly(True):

      recon_output = model(input)

      gen_loss = L_G_rec(input, recon_output)

      generator_optim.zero_grad()
      gen_loss.backward()
      generator_optim.step()
      if i % 50 == 0:
        print("Training generator loss = %d" % gen_loss)
      save_model(model, generator_optim)

What is the problem with the training loop? Do I need to include a discriminator for convergence?
