
passt's Introduction

PaSST: Efficient Training of Audio Transformers with Patchout

This is the implementation for Efficient Training of Audio Transformers with Patchout

Patchout significantly reduces the training time and GPU memory requirements to train transformers on audio spectrograms, while improving their performance.

Patchout works by dropping out some of the input patches during training, either in an unstructured way (randomly, similar to dropout) or by dropping entire time frames or frequency bins of the extracted patches (similar to SpecAugment), which corresponds to rows/columns in step 3 of the figure below.


Pre-trained models for Inference and embeddings extractions

If you only want to use the embeddings generated by the pretrained models, plug them into your own fine-tuning framework, or run inference only, you can find a stripped-down version of this repo here. The package follows the HEAR 2021 NeurIPS Challenge API and can be installed with:

pip install hear21passt

This repo is a complete framework for training the models from scratch and fine-tuning models pre-trained on Audioset on downstream tasks.
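
As a quick orientation, here is a minimal sketch of extracting clip-level embeddings through the package's HEAR-style API (it mirrors the usage shown in the issues further below; the exact embedding dimension depends on the package version, so treat this as an assumption rather than a guarantee):

from hear21passt.base import get_basic_model, get_scene_embeddings
import torch

# wrapper with the mel front-end and the default pre-trained transformer
model = get_basic_model(mode="logits")
model.eval()
model = model.cuda()

with torch.no_grad():
    # dummy batch of 3 clips, 10 seconds each, sampled at 32 kHz
    audio_batch = torch.ones((3, 32000 * 10)).cuda() * 0.5
    embeddings = get_scene_embeddings(audio_batch, model)  # [batch, embed_dim]
print(embeddings.shape)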

Getting the logits from the pretrained models

from hear21passt.base import get_basic_model, get_model_passt
import torch
# get the PaSST model wrapper, includes Melspectrogram and the default pre-trained transformer
model = get_basic_model(mode="logits")
print(model.mel) # Extracts mel spectrogram from raw waveforms.
print(model.net) # the transformer network.

# example inference
model.eval()
model = model.cuda()
with torch.no_grad():
    # audio_wave has shape [batch, seconds*32000]; the sampling rate is 32 kHz
    # example audio_wave with batch=3 and 10 seconds
    audio = torch.ones((3, 32000 * 10)) * 0.5
    audio_wave = audio.cuda()
    logits = model(audio_wave)
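
Since Audioset tagging is multi-label, the 527 logits are typically mapped to per-class probabilities with a sigmoid. Continuing the snippet above (the top-k inspection is only illustrative):

probs = torch.sigmoid(logits)           # [batch, 527] per-class probabilities
top5 = torch.topk(probs, k=5, dim=1)    # the 5 highest-scoring Audioset classes per clip
print(top5.indices, top5.values)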

Getting a pre-trained model for fine tuning

from hear21passt.base import get_basic_model, get_model_passt
import torch
# get the PaSST model wrapper, includes Melspectrogram and the default pre-trained transformer
model = get_basic_model(mode="logits")
print(model.mel) # Extracts mel spectrogram from raw waveforms.

# optionally replace the transformer with one that has the required number of classes, e.g. 50
model.net = get_model_passt(arch="passt_s_swa_p16_128_ap476", n_classes=50)
print(model.net) # the transformer network.


# now model contains mel + the pre-trained transformer, ready to be fine-tuned.
# It still expects input of shape [batch, seconds*32000] with a 32 kHz sampling rate.

model.train()
model = model.cuda()
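
To make the fine-tuning step concrete, here is a minimal training-loop sketch. The train_loader yielding (waveform [batch, seconds*32000] at 32 kHz, integer label) batches is a placeholder for your own data pipeline, and the optimizer and loss choices are illustrative, not the training recipe from the paper (that lives in the experiment files, e.g. ex_audioset.py):

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)

for waveform, target in train_loader:       # placeholder DataLoader
    waveform, target = waveform.cuda(), target.cuda()
    logits = model(waveform)                # [batch, n_classes], here n_classes=50
    loss = F.cross_entropy(logits, target)  # single-label case, e.g. ESC-50
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()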

Development environment

If you want to use the same environment as in the paper, you can follow the instructions below.

Setting up the development experiments environment

For training models from scratch or fine-tuning using the same setup as in the paper:

  1. If needed, create a new environment with python 3.8 and activate it:
conda create -n passt python=3.8
conda activate passt
  2. Install the pytorch build that suits your system. For example:
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
  3. Install the requirements:
pip install -r requirements.txt

Setting up using the exported conda environment

Alternatively, you can use the exported conda environment environment.yml to create the environment.

For setting up the environment, Mamba is recommended, since it works faster than conda:

conda install mamba -n base -c conda-forge

Now you can import the environment from environment.yml

mamba env create -f environment.yml

Now you have an environment named ba3l.

Checking the environment

In order to check whether your environment matches the environment we used in our runs, please check the environment.yml and pip_list.txt files, which were exported using:

conda env export --no-builds | grep -v "prefix" > environment.yml
pip list > pip_list.txt

Getting started

If you want to use your own setup and only take the models from this repo, you can get the models, train them from scratch, or fine-tune them on your own dataset as explained above in Pre-trained models for Inference and embeddings extractions. The rest of this section explains how to use this repo for training and fine-tuning the models. For that, you first need to set up the development environment as explained above.

General information

The repo is built using sacred for experiment management and configuration, pytorch-lightning for training, and wandb for logging.

Each dataset has a main experiment file, such as ex_audioset.py or ex_openmic.py, and a dataset folder. The experiment file contains the main training and validation logic. The dataset folder contains the dataset-specific code needed to download, preprocess, and load the dataset for training.

In general, you can probe the experiment file for help; this will print the available commands and basic options:

python ex_audioset.py help

Configuring the experiment

Each experiment has a set of default configuration options, defined in the experiment file, e.g. ex_audioset.py. You can override any of the configuration options using the sacred syntax. You can use the print_config command to print the configuration values without training a model:

 python ex_audioset.py print_config

You can then use the command-line interface to override any of the configuration options (sacred syntax) using with, e.g.:

python ex_audioset.py with trainer.precision=16 

This will train on Audioset using 16-bit precision.

The overall configurations look like this:

  ...
  seed = 542198583                  # the random seed for this experiment
  slurm_job_id = ''
  speed_test_batch_size = 100
  swa = True
  swa_epoch_start = 50
  swa_freq = 5
  use_mixup = True
  warm_up_len = 5
  weight_decay = 0.0001
  basedataset:
    base_dir = 'audioset_hdf5s/'     # base directory of the dataset, change it or make a link
    eval_hdf5 = 'audioset_hdf5s/mp3/eval_segments_mp3.hdf'
    wavmix = 1
    ....
    roll_conf:
      axis = 1
      shift = None
      shift_range = 50
  datasets:
    test:
      batch_size = 20
      dataset = {CMD!}'/basedataset.get_test_set'
      num_workers = 16
      validate = True
    training:
      batch_size = 12
      dataset = {CMD!}'/basedataset.get_full_training_set'
      num_workers = 16
      sampler = {CMD!}'/basedataset.get_ft_weighted_sampler'
      shuffle = None
      train = True
  models:
    mel:
      freqm = 48
      timem = 192
      hopsize = 320
      htk = False
      n_fft = 1024
      n_mels = 128
      norm = 1
      sr = 32000
      ...
    net:
      arch = 'passt_s_swa_p16_128_ap476'
      fstride = 10
      in_channels = 1
      input_fdim = 128
      input_tdim = 998
      n_classes = 527
      s_patchout_f = 4
      s_patchout_t = 40
      tstride = 10
      u_patchout = 0
      ...
  trainer:
    accelerator = None
    accumulate_grad_batches = 1
    amp_backend = 'native'
    amp_level = 'O2'
    auto_lr_find = False
    auto_scale_batch_size = False
    ...

There are many things that can be updated from the command line. In short:

  • All the configuration options under trainer are passed to the PyTorch Lightning Trainer API. For example, to turn off cuda benchmarking, add trainer.benchmark=False to the command line.
  • wandb is the wandb configuration. For example, to change the wandb project, add wandb.project="test_project" to the command line.
  • models.net are the PaSST (or the chosen NN) options. Examples: models.net.u_patchout, models.net.s_patchout_f and models.net.s_patchout_t control the unstructured patchout and the structured patchout over frequency and time. input_fdim and input_tdim are the input spectrogram dimensions over frequency and time. models.net.fstride and models.net.tstride are the strides of the input patches over frequency and time; setting these to 16 means no patch overlap.
  • models.mel are the preprocessing options (mel spectrograms). mel.sr is the sampling rate, mel.hopsize is the hop size of the STFT window, mel.n_mels is the number of mel bins, and mel.freqm and mel.timem are the frequency and time masking parameters of SpecAugment. A combined example command is shown below.
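
For example, a single run that overrides options from several of these groups at once (the specific values are only illustrative):

python ex_audioset.py with trainer.precision=16 models.net.s_patchout_t=60 models.net.s_patchout_f=6 models.mel.freqm=48 models.mel.timem=192 -p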

There are many pre-defined configuration bundles (called named_configs) in config_updates.py. These include different models, setups, etc. You can list these configurations with:

python ex_audioset.py print_named_configs

For example, passt_s_20sec is a configuration bundle that sets the model to PaSST-S pre-trained on Audioset and accepts clips of up to 20 seconds.
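
A usage sketch combining this named config with the overrides shown above (the exact combination is illustrative):

python ex_audioset.py with passt_s_20sec trainer.precision=16 -p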

Training on Audioset

Download and prepare the dataset as explained on the audioset page.

The base PaSST model can be trained, for example, like this:

python ex_audioset.py with trainer.precision=16  models.net.arch=passt_deit_bd_p16_384 -p

For example, using only unstructured patchout of 400:

python ex_audioset.py with trainer.precision=16  models.net.arch=passt_deit_bd_p16_384  models.net.u_patchout=400  models.net.s_patchout_f=0 models.net.s_patchout_t=0 -p

Multi-gpu training can be enabled by setting the environment variable DDP, for example with 2 gpus:

 DDP=2 python ex_audioset.py with trainer.precision=16  models.net.arch=passt_deit_bd_p16_384 -p -c "PaSST base 2 GPU"

Examples with Pre-trained models

Please check the releases page to download pre-trained models. In general, you can get a model pretrained on Audioset using:

from models.passt import get_model
model  = get_model(arch="passt_s_swa_p16_128_ap476", pretrained=True, n_classes=527, in_channels=1,
                   fstride=10, tstride=10,input_fdim=128, input_tdim=998,
                   u_patchout=0, s_patchout_t=40, s_patchout_f=4)

This will automatically download PaSST pretrained on Audioset with an mAP of 0.476. The model was trained with s_patchout_t=40, s_patchout_f=4, but you can change these to better fit your task/computational needs.

There are several pretrained models available with different strides (overlap) and with/without SWA (stochastic weight averaging): passt_s_p16_s16_128_ap468, passt_s_swa_p16_s16_128_ap473, passt_s_swa_p16_s14_128_ap471, passt_s_p16_s14_128_ap469, passt_s_swa_p16_s12_128_ap473, passt_s_p16_s12_128_ap470. For example, in passt_s_swa_p16_s16_128_ap473: p16 means the patch size is 16x16, s16 means no overlap (stride=16), 128 is the number of mel bands, and ap473 refers to the performance of this model on Audioset (mAP=0.473).

In general, you can get a pretrained model using:

from models.passt import get_model
passt = get_model(arch="passt_s_swa_p16_s16_128_ap473", fstride=16, tstride=16)

Using the framework, you can evaluate this model using:

python ex_audioset.py evaluate_only with  trainer.precision=16  passt_s_swa_p16_s16_128_ap473 -p

Ensembles of these models are provided as well. A large ensemble giving mAP=.4956:

python ex_audioset.py evaluate_only with  trainer.precision=16 ensemble_many

An ensemble of 2 models with stride=14 and stride=16 giving mAP=.4858:

python ex_audioset.py evaluate_only with  trainer.precision=16 ensemble_s16_14

Other ensembles, such as ensemble_4 and ensemble_5, are available as well and are invoked the same way, as shown below.
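
For instance, following the pattern of the commands above (the ensemble name is taken from the list; the rest of the command is a sketch):

python ex_audioset.py evaluate_only with  trainer.precision=16 ensemble_4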

Examples of fine-tuning on downstream datasets

  1. ESC-50: Dataset for Environmental Sound Classification
  2. OpenMIC-2018 dataset
  3. FSD50K

Citation

Citation for the paper accepted at Interspeech 2022:

@inproceedings{koutini22passt,
  author       = {Khaled Koutini and
                  Jan Schl{\"{u}}ter and
                  Hamid Eghbal{-}zadeh and
                  Gerhard Widmer},
  title        = {Efficient Training of Audio Transformers with Patchout},
  booktitle    = {Interspeech 2022, 23rd Annual Conference of the International Speech
                  Communication Association, Incheon, Korea, 18-22 September 2022},
  pages        = {2753--2757},
  publisher    = {{ISCA}},
  year         = {2022},
  url          = {https://doi.org/10.21437/Interspeech.2022-227},
  doi          = {10.21437/Interspeech.2022-227},
}

Contact

The repo will be updated; in the meantime, if you have any questions or problems, feel free to open an issue on GitHub or contact the authors directly.


passt's Issues

Pre-trained models on ESC-50

Hi Khaled,

I want to use the following checkpoints.
[image]

Just to make sure, when you say pre-trained models on ESC-50 in this case, you mean (in chronological order):

  1. Using a model trained on ImageNet
  2. To then train it on Audioset
  3. And later fine-tune it on ESC-50

If so, how can I know which config of default_cfgs in model.py was used for these checkpoints above?

Also, have you pre-trained on all ESC-50 folds at once? During a cross-validation in machine learning with sklearn's GridSearch, the model is ultimately refit on all folds with the best hyperparams config found. Shouldn't we do the same in Deep Learning?

Cheers

Antoine

Inference Issue

Hello,

First of all, thank you for the awesome and very well-written paper and repo.

I currently want to use the embedding of these pre-trained models for my project. The following is the inference code I wrote for fsd50k.

import torch
import numpy as np
import librosa
from hear21passt.base import get_basic_model, get_model_passt, get_scene_embeddings, get_timestamp_embeddings, load_model

model = get_basic_model(mode="logits")
model.net = get_model_passt(arch="fsd50k-n",  n_classes=200, fstride=16, tstride=16)
model.eval()
model = model.cuda()

audio, sr = librosa.load("../dataset/fsd50k/mp3/FSD50K.dev_audio/102863.mp3", sr = 32000, mono=True)
audio = torch.from_numpy(np.array([audio]))
audio_batch = torch.cat((audio, audio, audio), 0).cuda()

embed = get_scene_embeddings(audio_batch, model)
model(audio_batch)

When I do embed.shape I get torch.Size([3, 1295]), so I basically get what I need already. But when I double-check by trying to get the logits through model(), it gives me the following error:

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_13937/329924078.py in <module>
----> 1 model(audio_batch)

/data/scratch/ngop/.envs/vqgan2/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/data/scratch/ngop/src/hear21passt/hear21passt/wrapper.py in forward(self, x)
     36         specs = self.mel(x)
     37         specs = specs.unsqueeze(1)
---> 38         x, features = self.net(specs)
     39         if self.mode == "all":
     40             embed = torch.cat([x, features], dim=1)

/data/scratch/ngop/.envs/vqgan2/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/data/scratch/ngop/src/hear21passt/hear21passt/models/passt.py in forward(self, x)
    525         if first_RUN: print("x", x.size())
    526 
--> 527         x = self.forward_features(x)
    528 
    529         if self.head_dist is not None:

/data/scratch/ngop/src/hear21passt/hear21passt/models/passt.py in forward_features(self, x)
    472             time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]
    473             if first_RUN: print(" CUT time_new_pos_embed.shape", time_new_pos_embed.shape)
--> 474         x = x + time_new_pos_embed
    475         if first_RUN: print(" self.freq_new_pos_embed.shape", self.freq_new_pos_embed.shape)
    476         x = x + self.freq_new_pos_embed

RuntimeError: The size of tensor a (135) must match the size of tensor b (99) at non-singleton dimension 3

However, I tried a few other audio files in fsd50k; some were able to give me logits and the correct prediction, but some just give errors like this. What could the issue be? Do I need to worry about it, or can I just use the embeddings? My other question is whether the input batch size is fixed. For the model I loaded, I have to input a batch of 3 audio clips. Is there a way for me to use a different batch size?

Inference on AudioSet

Thank you for the code and inference script.
I understand that the PaSST model has been trained on AudioSet with sampling rate of 32kHz.
I am trying to make inference using the pre trained model.
Could you please let me know if I have to retrain the model with AudioSet (sampling rate of 16kHz) data to use it to make inference on 16kHz data or is there any other way?

Also, curious to know why did you use 32kHz instead of already available 16kHz AudioSet data?

Thanks in advance.

Where is input normalization applied?

Hi Khaled,

Could you please point me to where normalization is applied to inputs? (for the esc50 case or any other cases)

I am talking about channels mean and std such as written in the code below:

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)
IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5)
IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5)


def _cfg(url='', **kwargs):
    return {
        'url': url,
        'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': None,
        'crop_pct': .9, 'interpolation': 'bicubic', 'fixed_input_size': True,
        'mean': IMAGENET_INCEPTION_MEAN, 'std': IMAGENET_INCEPTION_STD,
        'first_conv': 'patch_embed.proj', 'classifier': 'head',
        **kwargs
    }

If the first training was done on ImageNet, then I guess the ImageNet channel mean and std are applied to Audioset's inputs when finetuning on this dataset, and also to ESC-50 inputs if further finetuning on that one. Am I correct?

Again, I am trying to refactor your code so that only the portion we need fits into our already existing training scripts. But I don't see where those means and standard deviations are applied, whether in the dataset or in AugmentMel.

Thanks a lot (again)

Antoine

time_new_pos_embed

Hi Khaled,

I am playing with your code a bit and I struggle to understand these few lines below:

        # Adding Time/Freq information
        if first_RUN: print(" self.time_new_pos_embed.shape", self.time_new_pos_embed.shape)
        time_new_pos_embed = self.time_new_pos_embed
        if x.shape[-1] < time_new_pos_embed.shape[-1]:
            if self.training:
                toffset = torch.randint(1 + time_new_pos_embed.shape[-1] - x.shape[-1], (1,)).item()
                if first_RUN: print(f" CUT with randomoffset={toffset} time_new_pos_embed.shape",
                                    time_new_pos_embed.shape)
                time_new_pos_embed = time_new_pos_embed[:, :, :, toffset:toffset + x.shape[-1]]
            else:
                time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]
            if first_RUN: print(" CUT time_new_pos_embed.shape", time_new_pos_embed.shape)
        else:
            warnings.warn(
                f"the patches shape:{x.shape} are larger than the expected time encodings {time_new_pos_embed.shape}, x will be cut")
            x = x[:, :, :, :time_new_pos_embed.shape[-1]]
        x = x + time_new_pos_embed

Especially the slicing of time_new_pos_embed with toffset. I understand the slicing in the first else and the second else but I don't get why the slicing is randomized at training. If it's a position embedding surely it shouldn't be random right?

Many thanks in advance.

Antoine

Changing tdim for pretrained model

Thanks for sharing such great work! I want to use the pre-trained model, but changing input_tdim gives an error. My audio clips are relatively short, and hence I need a smaller input_tdim. How do I do that? The error I get is due to the pretrained layer's size not being equal to the current size of the model (after changing input_tdim).

difference of fine-tuning the pretrained models

I'm sorry to bother you. I want to ask about the difference between the two ways to get pre-trained models; I don't know if I understand correctly.
The first is in the "Getting a pre-trained model for fine tuning" part. The code is:

from hear21passt.base import get_basic_model,get_model_passt
import torch
# get the PaSST model wrapper, includes Melspectrogram and the default pre-trained transformer
model = get_basic_model(mode="logits")
print(model.mel) # Extracts mel spectrogram from raw waveforms.

# optional replace the transformer with one that has the required number of classes i.e. 50
model.net = get_model_passt(arch="passt_s_swa_p16_128_ap476",  n_classes=50)
print(model.net) # the transformer network.


# now model contains mel + the transformer pre-trained model ready to be fine tuned.
# It's still expecting input of the shape [batch, seconds*32000] sampling rate is 32k

model.train()
model = model.cuda()

The second is in the "Pre-trained models" part.

from models.passt import get_model
model  = get_model(arch="passt_s_swa_p16_128_ap476", pretrained=True, n_classes=527, in_channels=1,
                   fstride=10, tstride=10,input_fdim=128, input_tdim=998,
                   u_patchout=0, s_patchout_t=40, s_patchout_f=4)

I have two questions. Does the first way obtain the pre-trained model and only fine-tune the layers in the transformer blocks related to n_classes, leaving the other layers' weights unchanged?
And does the second way load the weights of all layers and train them all again? Or are the two ways the same?

RuntimeError: The size of tensor a (2055) must match the size of tensor b (99) at non-singleton dimension 3

I use a trained model for inference and I encounter this problem when the file length is long.
Traceback (most recent call last):
File "", line 1, in
File "/home/xingyum/anaconda3/envs/ba3l/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingyum/models/PaSST/output/openmic2008/_None/checkpoints/src/hear21passt/hear21passt/wrapper.py", line 38, in forward
x, features = self.net(specs)
File "/home/xingyum/anaconda3/envs/ba3l/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingyum/models/PaSST/output/openmic2008/_None/checkpoints/src/hear21passt/hear21passt/models/passt.py", line 507, in forward
x = self.forward_features(x)
File "/home/xingyum/models/PaSST/output/openmic2008/_None/checkpoints/src/hear21passt/hear21passt/models/passt.py", line 454, in forward_features
x = x + time_new_pos_embed
RuntimeError: The size of tensor a (2055) must match the size of tensor b (99) at non-singleton dimension 3
[image]

From ViT models to audio

Hi Khaled,

In your code, there is the possibility to create a ViT architecture and load the corresponding pretrained weights (like "vit_tiny_patch16_224").

Do we agree that such architectures only work with similar-size inputs (224x224, for example)? If so, how did you finetune a model on Audioset that was initially trained on ImageNet (going from 224x224 to 128x998, for example)? Is this procedure in some code in your repo?

I read the AST paper, which I guess you took inspiration from, and they talk about it in some detail.
I was just wondering how I would do the whole process (ImageNet -> AudioSet -> ESC50) on my end.

Thanks a lot.

Antoine

OpenMic fine-tuned model?

Do you mind releasing the OpenMic fine-tuned model? So OpenMic style predictions can be made out of the box, without any training?

Fine tuning on novel dataset

Hello. Firstly, thank you for this great work! I've already had very promising results looking at the "scene" embeddings from these models, and I'm looking to fine-tune a model on a new dataset, similar to ESC-50 and others. (As a side note, using scene embeddings and a logistic regression I'm having acceptably good results; however, I'm convinced true fine-tuning would be significantly better.)

I'm having a bit of trouble interpreting the example scripts. Are you able to give a simple explanation of what is required for fine-tuning (e.g. the data format, directories vs. a JSON file, the format of the labels CSV, etc.)? It's quite hard to reverse-engineer this from the code. I have a directory of files and known labels, and simply want to fine-tune a model on it. And once the data is in place, which functions/CLI scripts should be invoked?

Many thanks, and if I'm missing something obvious, apologies. I know the Audioset page has a few more details but it's still not crystal clear how to proceed. Cheers!

Training Logs

Hi authors, thanks for the great work!
Are there any log files from the training stage?
I didn't find any.

RuntimeError: stft requires the return_complex parameter be given for real inputs

Hello!
I am using the following code:

from hear21passt.base import get_basic_model,get_model_passt
import torch
# get the PaSST model wrapper, includes Melspectrogram and the default pre-trained transformer
model = get_basic_model(mode="logits")
print(model.mel) # Extracts mel spectrogram from raw waveforms.
print(model.net) # the transformer network.

# example inference
model.eval()
model = model.cuda()
with torch.no_grad():
    # audio_wave has the shape of [batch, seconds*32000] sampling rate is 32k
    # example audio_wave of batch=3 and 10 seconds
    audio = torch.ones((3, 32000 * 10))*0.5
    audio_wave = audio.cuda()
    logits=model(audio_wave) 

I am getting the following error:

RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.

How can I solve this issue please?
Thank you!

Pretrained models config

Hi, how can I know the configurations used for the pre-trained models?
e.g. u_patchout, s_patchout_t, s_patchout_f, etc.

Thank you!

audio inference

@kkoutini
Thanks for sharing this nice work. I want to know how to read an audio file and do full inference. Can you show me an example? How should the preprocessing be done?

kaggle

Excuse me, I would like to know how to set up PaSST on Kaggle. I have tried several times, but I failed.

Getting started with a custom dataset

Hi,

Thank you for your great work!

I want to use PaSST for my custom dataset, different classification task.

Are there any minimal instructions/code for running the model for a different dataset? From which file should I start?

Does PaSST support multi-channel audio wav?

Best

Changing the depth of PASST.

I want to change the depth of the transformer while finetuning the model. I am using the following command (inspired by the ESC-50 example):

python3 ex_dcase.py with models.net.s_patchout_t=10 models.net.s_patchout_f=5 basedataset.fold=1 -p

I have already prepared the ex_dcase.py and dataset.py files for the DCASE2020 dataset (inspired by the ESC-50 files provided by you), and I have already been able to finetune the whole model once. Now I want to add a depth parameter to the command line of the finetune script, so that I can control how many blocks of the architecture I want to finetune.
Currently I change the depth by changing the depth variable of the desired architecture here.
Please suggest the required changes I need to make so that I can execute a command on the command line and only finetune selected layers.

The meaning of "swa"

When using your code to train a model, there is "swa": true in the config file. What does "swa" mean?

Binarizing linear predictions

Dear authors,

Thank you for the great work!
I would like to know the best way to binarize the predicted (linear) probabilities such that:

0 : audio label is absent
1: audio label is present

If you have any suggestions for this binarization issue, it would be great to hear them.

And one more question: as I understood from the paper, the probability value for each label indicates the presence of that audio label in the input audio, and the value doesn't depend on the duration for which the label occurs, i.e. whether it happens over a very short or a long duration. Am I right?

Another question: is there any difference between feeding in audio data that is typically 20-90 seconds long (and not monophonic) versus slicing it into chunks or running second-by-second predictions? I would like to know whether it is a good idea to run second-by-second predictions with PaSST.

It would be great for me to get your answers to the above-mentioned questions.

Anar Sultani

The loop in the diagram

This is an amazing job! But I have a question: what does the loop in the diagram mean? In fact, I didn't find the loop operation in the paper or the code. Thanks!

setup.py

Could you add a setup.py so that the package is pip-installable?

.net and .net_swa parameters in .ckpt file

We have finetuned the passt_s_swa_p16_s16_128_ap473 model on the DCASE 2020 dataset for scene classification. Now we are trying to use the finetuned model by loading the params from the ckpt file via its state dictionary, but it has two types of params, .net and .net_swa. Which params are we supposed to use for the architecture?

Mismatched versions of pytorch-lightning and sacred

Hello, after running the following code:
[screenshot]
I then run this code:
[screenshot]
but I encountered this issue:
[screenshot]
[screenshot]
Is this due to wrong versions of pytorch-lightning and sacred? When I upgrade pytorch-lightning to the latest version the issue is solved, but the issue with sacred has not been solved.
Could you please provide some help? Thank you very much!

test my own model

Hello, I would like to ask a few questions
I see that the pre-trained models are all .pt files, while the model I trained without changing the default parameters is saved as a .ckpt file. That aside, when I use "passt_s_swa_p16_128_ap476" as a pre-trained model to verify my fine-tuned model, some problems arise:
First of all, the checkpoint saves another set of parameters prefixed with net_swa., which may be related to the use of SWA in the code. But the pre-trained model described in the introduction also uses SWA, so why is there no net_swa. parameter when printing the pre-trained model? As a result, when I load my own model, I get an "Unexpected key(s) in state_dict" problem. I think it may be caused by this part of the code. How can I solve this problem?
[screenshot]
[screenshot]
In addition, I would like to ask: if I want to verify my own model on a single piece of audio, how should the script be written?

OpenMic2018

Hello author,

Thank you for the great job!
After calling this line of code:
python ex_openmic.py with trainer.precision=16 -p -m mongodb_server:27000:audioset21_balanced -c "OpenMIC PaSST base"

I'm running into an error with the openMic classification.

Traceback (most recent call last):
File "/home/user/Desktop/PaSST/src/sacred/sacred/observers/mongo.py", line 511, in parse_mongo_db_arg
g = re.match(get_pattern(), mongo_db).groupdict()
AttributeError: 'NoneType' object has no attribute 'groupdict'

Please, what am I missing or doing wrong?

No module named 'ba3l.ingredients'

Hi, I want to train PaSST on Audioset.
But when I ran ex_audioset.py, I got the error: "No module named 'ba3l.ingredients'".
I have already finished setting up the environment by following the Readme.
How can I fix it?

FSD50K - validating on eval data

Hi! First off, excellent work with the module. It's showing great results so far in my project.
I'm having trouble, however, with an experiment. I am trying to fine-tune and train the model on subsets (3k samples for training and validation) and have created hdf5 files for that. The paths in config.basedatasets were adjusted accordingly.

The problem that I run into is that when I run the command:
python ex_fsd50k.py evaluate_only with passt_s_swa_p16_s16_128_ap473
the program uses the evaluation data for validation. I confirmed this by making a change in fsd50k/dataset.py:

def __len__(self):
    if self.hdf5_file == "audioset_hdf5s/mp3/FSD50K.eval_mp3.hdf":
        return 300
    return self.length

which affects the number of validation batches.

I really don't understand what is going on. Isn't the model supposed to validate on the validation data?

Kindest regards, Ludvig.

Fixing weights for fine-tuning?

Hi Khaled,

Do you fix weights of embeddings and attention blocks after loading pretrained checkpoints for finetuning, or is it just an initialization and they are further updated through finetuning?
I can't really find the answer in your code.

Many thanks.

Installation issues

Hi, I am trying to install and run the PaSST-S method on my own data but I get this error when I run python ex_audioset.py help

File "ex_audioset.py", line 16, in <module>
    from helpers.mixup import my_mixup
ModuleNotFoundError: No module named 'helpers.mixup'

EOF (End Of File) Error on num_workers>0

I am trying to finetune the model on the DCASE2020 dataset. I have prepared the ex_dcase.py and dataset.py files, inspired by the ESC-50 dataset files, but whenever I increase num_workers in the train or test dataloader, I receive an EOF error. Basically, 2 errors arise, namely:
Traceback (most recent call last):
File "", line 1, in
File "path\venv\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "path\venv\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
ERROR - passt_Dcase2020 - Failed after 0:00:12!

Also the following error :
Traceback (most recent calls WITHOUT Sacred internals):
File "ex_dcase.py", line 436, in default_command
return main()
File "ex_dcase.py", line 275, in main
trainer.fit(
File "path\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "path\venv\lib\multiprocessing\popen_spawn_win32.py", line 93, in init
reduction.dump(process_obj, to_child)
File "path\venv\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'get_roll_func..roll_func'

Can you help me fix the above errors, or suggest any changes that could work?

Could not solve for environment specs

I cloned the repo. As per the README:

conda install mamba -n base -c conda-forge
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/miniconda3

  added / updated specs:
    - mamba


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-22.11.1              |   py39h2804cbe_1         873 KB  conda-forge
    fmt-9.1.0                  |       hffc8910_0         171 KB  conda-forge
    krb5-1.20.1                |       h127bd45_0         1.0 MB  conda-forge
    libarchive-3.5.2           |       h69ec738_3         1.5 MB  conda-forge
    libcurl-7.87.0             |       hbe9bab4_0         304 KB  conda-forge
    libedit-3.1.20191231       |       hc8eb9b7_2          94 KB  conda-forge
    libev-4.33                 |       h642e427_1          98 KB  conda-forge
    libmamba-1.1.0             |       h1254013_2         1.0 MB  conda-forge
    libmambapy-1.1.0           |   py39h8f82c16_2         214 KB  conda-forge
    libnghttp2-1.47.0          |       h232270b_1         816 KB  conda-forge
    libsolv-0.7.23             |       hb5ab8b9_0         373 KB  conda-forge
    libssh2-1.10.0             |       hb80f160_3         218 KB  conda-forge
    libxml2-2.9.14             |       h9d8dfc2_4         656 KB  conda-forge
    lz4-c-1.9.3                |       hbdafb3b_1         147 KB  conda-forge
    lzo-2.10                   |    h642e427_1000         154 KB  conda-forge
    mamba-1.1.0                |   py39hde45b87_2          48 KB  conda-forge
    openssl-1.1.1s             |       h03a7124_1         1.5 MB  conda-forge
    pybind11-abi-4             |       hd8ed1ab_3          10 KB  conda-forge
    reproc-14.2.4              |       h1a8c8d9_0          27 KB  conda-forge
    reproc-cpp-14.2.4          |       hb7217d7_0          20 KB  conda-forge
    yaml-cpp-0.7.0             |       hb7217d7_2         133 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         9.4 MB

The following NEW packages will be INSTALLED:

  fmt                conda-forge/osx-arm64::fmt-9.1.0-hffc8910_0 
  icu                conda-forge/osx-arm64::icu-70.1-h6b3803e_0 
  krb5               conda-forge/osx-arm64::krb5-1.20.1-h127bd45_0 
  libarchive         conda-forge/osx-arm64::libarchive-3.5.2-h69ec738_3 
  libcurl            conda-forge/osx-arm64::libcurl-7.87.0-hbe9bab4_0 
  libedit            conda-forge/osx-arm64::libedit-3.1.20191231-hc8eb9b7_2 
  libev              conda-forge/osx-arm64::libev-4.33-h642e427_1 
  libiconv           conda-forge/osx-arm64::libiconv-1.17-he4db4b2_0 
  libmamba           conda-forge/osx-arm64::libmamba-1.1.0-h1254013_2 
  libmambapy         conda-forge/osx-arm64::libmambapy-1.1.0-py39h8f82c16_2 
  libnghttp2         conda-forge/osx-arm64::libnghttp2-1.47.0-h232270b_1 
  libsolv            conda-forge/osx-arm64::libsolv-0.7.23-hb5ab8b9_0 
  libssh2            conda-forge/osx-arm64::libssh2-1.10.0-hb80f160_3 
  libxml2            conda-forge/osx-arm64::libxml2-2.9.14-h9d8dfc2_4 
  lz4-c              conda-forge/osx-arm64::lz4-c-1.9.3-hbdafb3b_1 
  lzo                conda-forge/osx-arm64::lzo-2.10-h642e427_1000 
  mamba              conda-forge/osx-arm64::mamba-1.1.0-py39hde45b87_2 
  pybind11-abi       conda-forge/noarch::pybind11-abi-4-hd8ed1ab_3 
  reproc             conda-forge/osx-arm64::reproc-14.2.4-h1a8c8d9_0 
  reproc-cpp         conda-forge/osx-arm64::reproc-cpp-14.2.4-hb7217d7_0 
  yaml-cpp           conda-forge/osx-arm64::yaml-cpp-0.7.0-hb7217d7_2 
  zstd               conda-forge/osx-arm64::zstd-1.5.2-h8128057_4 

The following packages will be UPDATED:

  ca-certificates    pkgs/main::ca-certificates-2022.10.11~ --> conda-forge::ca-certificates-2022.12.7-h4653dfc_0 
  libcxx                pkgs/main::libcxx-12.0.0-hf6beb65_1 --> conda-forge::libcxx-14.0.6-h2692d47_0 
  libzlib                                 1.2.12-ha287fd2_2 --> 1.2.13-h03a7124_4 
  openssl              pkgs/main::openssl-1.1.1s-h1a28f6b_0 --> conda-forge::openssl-1.1.1s-h03a7124_1 
  zlib                    pkgs/main::zlib-1.2.12-h5a0b063_2 --> conda-forge::zlib-1.2.13-h03a7124_4 

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            pkgs/main/osx-arm64::certifi-2022.12.~ --> conda-forge/noarch::certifi-2022.12.7-pyhd8ed1ab_0 
  conda              pkgs/main::conda-22.11.1-py39hca03da5~ --> conda-forge::conda-22.11.1-py39h2804cbe_1 


Proceed ([y]/n)? 


Downloading and Extracting Packages
                                                                                                                                     
Preparing transaction: done                                                                                                          
Verifying transaction: done                                                                                                          
Executing transaction: done                                                                                                          

But then the next mamba command fails :\

mamba env create -f environment.yml   

with

pkgs/r/osx-arm64                                              No change
pkgs/main/osx-arm64                                           No change
pkgs/main/noarch                                              No change
pkgs/r/noarch                                                 No change
conda-forge/osx-arm64                                4.7MB @ 351.1kB/s 13.6s
conda-forge/noarch                                  10.7MB @ 566.8kB/s 19.2s

                                                                                                                                     
Looking for: ['_libgcc_mutex==0.1=conda_forge', '_openmp_mutex==4.5=2_gnu', '_pytorch_select==0.1=cpu_0', 'appdirs==1.4.4=pyh9f0ad1d_0', 'audioread==2.1.9=py37h89c1867_4', 'blas==1.0=mkl', 'brotlipy==0.7.0=py37h5e8e339_1001', 'bzip2==1.0.8=h7f98852_4', 'c-ares==1.17.1=h7f98852_1', 'ca-certificates==2020.12.5=ha878542_0', 'cached-property==1.5.2=hd8ed1ab_1', 'cached_property==1.5.2=pyha770c72_1', 'certifi==2020.12.5=py37h89c1867_1', 'cffi==1.14.5=py37hc58025e_0', 'chardet==4.0.0=py37h89c1867_3', 'colorama==0.4.4=pyh9f0ad1d_0', 'cryptography==3.4.6=py37h5d9358c_0', 'cycler==0.10.0=py_2', 'decorator==4.4.2=py_0', 'docopt==0.6.2=py_1', 'ffmpeg==4.3.1=hca11adc_2', 'freetype==2.10.4=h0708190_1', 'gettext==0.19.8.1=h0b5b191_1005', 'gitdb==4.0.5=pyhd8ed1ab_1', 'gitpython==3.1.14=pyhd8ed1ab_0', 'gmp==6.2.1=h58526e2_0', 'gnutls==3.6.13=h85f3911_1', 'h5py==3.1.0=nompi_py37h1e651dc_100', 'hdf5==1.10.6=nompi_h6a2412b_1114', 'idna==2.10=pyh9f0ad1d_0', 'importlib-metadata==3.7.3=py37h89c1867_0', 'importlib_metadata==3.7.3=hd8ed1ab_0', 'intel-openmp==2020.2=254', 'joblib==1.0.1=pyhd8ed1ab_0', 'jpeg==9d=h36c2ea0_0', 'jsonpickle==1.4.1=pyh9f0ad1d_0', 'kiwisolver==1.3.1=py37h2527ec5_1', 'krb5==1.17.2=h926e7f8_0', 'lame==3.100=h7f98852_1001', 'lcms2==2.12=hddcbb42_0', 'ld_impl_linux-64==2.35.1=hea4e1c9_2', 'libblas==3.9.0=1_h86c2bf4_netlib', 'libcblas==3.9.0=5_h92ddd45_netlib', 'libcurl==7.75.0=hc4aaa36_0', 'libedit==3.1.20191231=he28a2e2_2', 'libev==4.33=h516909a_1', 'libffi==3.3=h58526e2_2', 'libflac==1.3.3=h9c3ff4c_1', 'libgcc-ng==9.3.0=h2828fa1_19', 'libgfortran-ng==9.3.0=hff62375_19', 'libgfortran5==9.3.0=hff62375_19', 'libgomp==9.3.0=h2828fa1_19', 'liblapack==3.9.0=5_h92ddd45_netlib', 'libllvm10==10.0.1=he513fc3_3', 'libnghttp2==1.43.0=h812cca2_0', 'libogg==1.3.4=h7f98852_1', 'libopenblas==0.3.12=pthreads_h4812303_1', 'libopus==1.3.1=h7f98852_1', 'libpng==1.6.37=h21135ba_2', 'librosa==0.8.0=pyh9f0ad1d_0', 'libsndfile==1.0.31=h9c3ff4c_1', 'libssh2==1.9.0=ha56f1ee_6', 'libstdcxx-ng==9.3.0=h6de172a_19', 'libtiff==4.2.0=hbd63e13_2', 'libvorbis==1.3.7=h9c3ff4c_0', 'libwebp-base==1.2.0=h7f98852_2', 'libzlib==1.2.11=h36c2ea0_1013', 'llvm-openmp==11.1.0=h4bd325d_1', 'llvmlite==0.36.0=py37h9d7f4d0_0', 'lz4-c==1.9.3=h9c3ff4c_1', 'matplotlib-base==3.3.4=py37h0c9df89_0', 'mkl==2020.2=256', 'mkl-service==2.3.0=py37h8f50634_2', 'munch==2.5.0=py_0', 'ncurses==6.2=h58526e2_4', 'nettle==3.6=he412f7d_0', 'ninja==1.10.2=h4bd325d_0', 'numba==0.53.0=py37h7dd73a4_1', 'numpy==1.20.1=py37haa41c4c_0', 'olefile==0.46=pyh9f0ad1d_1', 'openblas==0.3.12=pthreads_h04b7a96_1', 'openh264==2.1.1=h780b84a_0', 'openjpeg==2.4.0=hb52868f_1', 'openssl==1.1.1k=h7f98852_0', 'packaging==20.9=pyh44b312d_0', 'pandas==1.2.3=py37hdc94413_0', 'pillow==8.1.2=py37h4600e1f_1', 'pip==21.0.1=pyhd8ed1ab_0', 'pooch==1.3.0=pyhd8ed1ab_0', 'py-cpuinfo==7.0.0=pyh9f0ad1d_0', 'pycparser==2.20=pyh9f0ad1d_2', 'pyopenssl==20.0.1=pyhd8ed1ab_0', 'pyparsing==2.4.7=pyhd8ed1ab_1', 'pysocks==1.7.1=py37h89c1867_5', 'pysoundfile==0.10.3.post1=pyhd3deb0d_0', 'python==3.7.10=hffdb5ce_100_cpython', 'python-dateutil==2.8.1=py_0', 'python_abi==3.7=3_cp37m', 'pytz==2021.1=pyhd8ed1ab_0', 'readline==8.0=he28a2e2_2', 'requests==2.25.1=pyhd3deb0d_0', 'resampy==0.2.2=py_0', 'scikit-learn==0.24.1=py37h69acf81_0', 'scipy==1.6.1=py37h14a347d_0', 'setuptools==49.6.0=py37h89c1867_3', 'six==1.15.0=pyh9f0ad1d_0', 'smmap==3.0.5=pyh44b312d_0', 'sqlite==3.34.0=h74cdb3f_0', 'threadpoolctl==2.1.0=pyh5ca1d4c_0', 'tk==8.6.10=h21135ba_1', 'tornado==6.1=py37h5e8e339_1', 
'typing_extensions==3.7.4.3=py_0', 'urllib3==1.26.4=pyhd8ed1ab_0', 'wrapt==1.12.1=py37h5e8e339_3', 'x264==1!161.3030=h7f98852_1', 'xz==5.2.5=h516909a_1', 'zipp==3.4.1=pyhd8ed1ab_0', 'zlib==1.2.11=h36c2ea0_1013', 'zstd==1.4.9=ha95c52a_0']


Could not solve for environment specs
Encountered problems while solving:
  - nothing provides requested _libgcc_mutex ==0.1 conda_forge
  - nothing provides requested _openmp_mutex ==4.5 2_gnu
  - nothing provides requested audioread ==2.1.9 py37h89c1867_4
  - nothing provides requested blas ==1.0 mkl
  - nothing provides requested brotlipy ==0.7.0 py37h5e8e339_1001
  - nothing provides requested bzip2 ==1.0.8 h7f98852_4
  - nothing provides requested c-ares ==1.17.1 h7f98852_1
  - nothing provides requested ca-certificates ==2020.12.5 ha878542_0
  - nothing provides requested certifi ==2020.12.5 py37h89c1867_1
  - nothing provides requested cffi ==1.14.5 py37hc58025e_0
  - nothing provides requested chardet ==4.0.0 py37h89c1867_3
  - nothing provides requested cryptography ==3.4.6 py37h5d9358c_0
  - nothing provides requested ffmpeg ==4.3.1 hca11adc_2
  - nothing provides requested freetype ==2.10.4 h0708190_1
  - nothing provides requested gettext ==0.19.8.1 h0b5b191_1005
  - nothing provides requested gmp ==6.2.1 h58526e2_0
  - nothing provides requested gnutls ==3.6.13 h85f3911_1
  - nothing provides requested h5py ==3.1.0 nompi_py37h1e651dc_100
  - nothing provides requested hdf5 ==1.10.6 nompi_h6a2412b_1114
  - nothing provides requested importlib-metadata ==3.7.3 py37h89c1867_0
  - nothing provides requested intel-openmp ==2020.2 254
  - nothing provides requested jpeg ==9d h36c2ea0_0
  - nothing provides requested kiwisolver ==1.3.1 py37h2527ec5_1
  - nothing provides requested krb5 ==1.17.2 h926e7f8_0
  - nothing provides requested lame ==3.100 h7f98852_1001
  - nothing provides requested lcms2 ==2.12 hddcbb42_0
  - nothing provides requested ld_impl_linux-64 ==2.35.1 hea4e1c9_2
  - nothing provides requested libblas ==3.9.0 1_h86c2bf4_netlib
  - nothing provides requested libcblas ==3.9.0 5_h92ddd45_netlib
  - nothing provides requested libcurl ==7.75.0 hc4aaa36_0
  - nothing provides requested libedit ==3.1.20191231 he28a2e2_2
  - nothing provides requested libev ==4.33 h516909a_1
  - nothing provides requested libffi ==3.3 h58526e2_2
  - nothing provides requested libflac ==1.3.3 h9c3ff4c_1
  - nothing provides requested libgcc-ng ==9.3.0 h2828fa1_19
  - nothing provides requested libgfortran-ng ==9.3.0 hff62375_19
  - nothing provides requested libgfortran5 ==9.3.0 hff62375_19
  - nothing provides requested libgomp ==9.3.0 h2828fa1_19
  - nothing provides requested liblapack ==3.9.0 5_h92ddd45_netlib
  - nothing provides requested libllvm10 ==10.0.1 he513fc3_3
  - nothing provides requested libnghttp2 ==1.43.0 h812cca2_0
  - nothing provides requested libogg ==1.3.4 h7f98852_1
  - nothing provides requested libopenblas ==0.3.12 pthreads_h4812303_1
  - nothing provides requested libopus ==1.3.1 h7f98852_1
  - nothing provides requested libpng ==1.6.37 h21135ba_2
  - nothing provides requested libsndfile ==1.0.31 h9c3ff4c_1
  - nothing provides requested libssh2 ==1.9.0 ha56f1ee_6
  - nothing provides requested libstdcxx-ng ==9.3.0 h6de172a_19
  - nothing provides requested libtiff ==4.2.0 hbd63e13_2
  - nothing provides requested libvorbis ==1.3.7 h9c3ff4c_0
  - nothing provides requested libwebp-base ==1.2.0 h7f98852_2
  - nothing provides requested libzlib ==1.2.11 h36c2ea0_1013
  - nothing provides requested llvm-openmp ==11.1.0 h4bd325d_1
  - nothing provides requested llvmlite ==0.36.0 py37h9d7f4d0_0
  - nothing provides requested lz4-c ==1.9.3 h9c3ff4c_1
  - nothing provides requested matplotlib-base ==3.3.4 py37h0c9df89_0
  - nothing provides requested mkl ==2020.2 256
  - nothing provides requested mkl-service ==2.3.0 py37h8f50634_2
  - nothing provides requested ncurses ==6.2 h58526e2_4
  - nothing provides requested nettle ==3.6 he412f7d_0
  - nothing provides requested ninja ==1.10.2 h4bd325d_0
  - nothing provides requested numba ==0.53.0 py37h7dd73a4_1
  - nothing provides requested numpy ==1.20.1 py37haa41c4c_0
  - nothing provides requested openblas ==0.3.12 pthreads_h04b7a96_1
  - nothing provides requested openh264 ==2.1.1 h780b84a_0
  - nothing provides requested openjpeg ==2.4.0 hb52868f_1
  - nothing provides requested openssl ==1.1.1k h7f98852_0
  - nothing provides requested pandas ==1.2.3 py37hdc94413_0
  - nothing provides requested pillow ==8.1.2 py37h4600e1f_1
  - nothing provides requested pysocks ==1.7.1 py37h89c1867_5
  - nothing provides requested python ==3.7.10 hffdb5ce_100_cpython
  - nothing provides requested readline ==8.0 he28a2e2_2
  - nothing provides requested scikit-learn ==0.24.1 py37h69acf81_0
  - nothing provides requested scipy ==1.6.1 py37h14a347d_0
  - nothing provides requested setuptools ==49.6.0 py37h89c1867_3
  - nothing provides requested sqlite ==3.34.0 h74cdb3f_0
  - nothing provides requested tk ==8.6.10 h21135ba_1
  - nothing provides requested tornado ==6.1 py37h5e8e339_1
  - nothing provides requested wrapt ==1.12.1 py37h5e8e339_3
  - nothing provides requested x264 ==1!161.3030 h7f98852_1
  - nothing provides requested xz ==5.2.5 h516909a_1
  - nothing provides requested zlib ==1.2.11 h36c2ea0_1013
  - nothing provides requested zstd ==1.4.9 ha95c52a_0
  - package pytz-2021.1-pyhd8ed1ab_0 requires python >=3, but none of the providers can be installed

The environment can't be solved, aborting the operation

This is on an OSX Apple Silicon machine

Inference ESC-50 fine-tuned model

Hello, authors.
Thank you for sharing the great work.

I tried to fine-tune the AudioSet pretrained model passt-s-f128-p16-s10-ap.476-swa.pt on the ESC-50 dataset using ex_esc50.py.
I got checkpoints saved in output/esc50/_None/checkpoints/epoch=4-step=2669.ckpt.
I want to load the checkpoint and run inference on an audio file. I tried to load the checkpoint model and use passt_hear21 for inference, but I kind of lost track of the process.

Could you please share how to run inference with the saved checkpoints on an audio file?

Openmic2018

Hi authors!

Will you release the code for OpenMIC-2018?

Thanks a lot.

Is it possible to install the passt with python=3.6?

Hi, thanks so much for sharing the great work! I'd like to use PaSST for downstream tasks and integrate it into an existing conda environment with python=3.6 (it's kind of painful to upgrade python from 3.6 to 3.7/3.8 due to many inconsistent packages). I know that python>=3.7 is required to install PaSST, but I'm wondering if it's possible to install it with python=3.6?

ImportError: cannot import name 'F1' from 'torchmetrics' (/app/anaconda3/lib/python3.7/site-packages/torchmetrics/__init__.py)

python ex_openmic.py
Traceback (most recent call last):
File "ex_openmic.py", line 5, in <module>
from pytorch_lightning.callbacks import ModelCheckpoint
File "/root/work_project_2021/project_music2video/PaSST/src/pytorch-lightning/pytorch_lightning/__init__.py", line 65, in <module>
from pytorch_lightning import metrics
File "/root/work_project_2021/project_music2video/PaSST/src/pytorch-lightning/pytorch_lightning/metrics/__init__.py", line 16, in <module>
from pytorch_lightning.metrics.classification import ( # noqa: F401
File "/root/work_project_2021/project_music2video/PaSST/src/pytorch-lightning/pytorch_lightning/metrics/classification/__init__.py", line 19, in <module>
from pytorch_lightning.metrics.classification.f_beta import F1, FBeta # noqa: F401
File "/root/work_project_2021/project_music2video/PaSST/src/pytorch-lightning/pytorch_lightning/metrics/classification/f_beta.py", line 16, in <module>
from torchmetrics import F1 as _F1
ImportError: cannot import name 'F1' from 'torchmetrics' (/app/anaconda3/lib/python3.7/site-packages/torchmetrics/__init__.py)

envs:
Name: torch
Version: 1.12.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /app/anaconda3/lib/python3.7/site-packages
Requires: typing-extensions
Required-by: torchvision, torchmetrics, torchaudio, timm, test-tube, Ba3l, pytorch-lightning

Wavmix for the ESC50 dataset

Hello, thanks a lot for your amazing work and for publishing the code!

I was trying to run the ex_esc50.py with wavmix=True but got the error:

RuntimeError: "nll_loss_forward_no_reduce_cuda_kernel_index" not implemented for 'Double'

since when using wavmix the ground truth is not an integer anymore.

Would it not be more appropriate to use the KL-divergence as the loss function instead of cross-entropy?

Error when trying to pip install repo

Hi @kkoutini,

I get an error when running the following line after my conda env creation:
[image]

Any idea?

This line was working a couple of months ago when I had created my first environment, but not anymore, it seems.

Many thanks

Antoine
