adapter-wavlm

Abstract

Fine-tuning of self-supervised models is a powerful transfer learning method in a variety of fields, including speech processing, since it can utilize generic feature representations obtained from large amounts of unlabeled data. Fine-tuning, however, requires a new parameter set for each downstream task, which is parameter inefficient. The adapter architecture was proposed to partially solve this issue by inserting lightweight learnable modules into a frozen pre-trained model. However, existing adapter architectures fail to adaptively leverage low- to high-level features stored in different layers, which is necessary for solving various kinds of speech processing tasks. Thus, we propose a new adapter architecture to acquire feature representations more flexibly for various speech tasks. In experiments, we applied this adapter to WavLM on four speech tasks. It performed on par with or better than naïve fine-tuning with only 11% of learnable parameters, and it also outperformed an existing adapter architecture.

Adapter Architecture

(Figure 1: the proposed adapter architecture, with L-adapters (a) and E-adapters (b).)

The proposed adapter architecture incorporates two types of adapters, namely Layer adapters (L-adapters) and Encoder adapters (E-adapters), into a frozen backbone. The L-adapters bridge each intermediate layer and the top layer, as shown in Figure 1a. They help the model quickly adapt speech representations to various downstream tasks and also reduce dependency on the initialization of adapter parameters. The E-adapters are inserted into each encoder layer in a similar way to previous work (https://arxiv.org/pdf/2202.03218.pdf), as shown in Figure 1b. In contrast to the previous work, our architecture does not have adapters after the multi-head self-attention (MHSA) modules, and instead has L-adapters. We use wavlm-base-plus as the model backbone.
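
Below is a minimal PyTorch sketch of how the L-adapters might be realized, assuming a standard bottleneck adapter (LayerNorm, down-projection, GELU, up-projection) and a learnable weighted sum over the adapted layer outputs. The class names, dimensions, and the weighted-sum combination are illustrative assumptions, not the repository's actual implementation; the E-adapters would reuse the same bottleneck module inside each encoder layer.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel


class BottleneckAdapter(nn.Module):
    """LayerNorm -> down-projection -> GELU -> up-projection, with a residual connection."""

    def __init__(self, dim: int = 768, bottleneck: int = 128):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(self.norm(x))))


class LAdapters(nn.Module):
    """One adapter per encoder-layer output, combined by a learnable weighted sum."""

    def __init__(self, num_layers: int = 12, dim: int = 768, bottleneck: int = 128):
        super().__init__()
        self.adapters = nn.ModuleList(
            BottleneckAdapter(dim, bottleneck) for _ in range(num_layers)
        )
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: sequence of per-layer outputs, each of shape (batch, time, dim)
        adapted = torch.stack([a(h) for a, h in zip(self.adapters, hidden_states)])
        w = torch.softmax(self.weights, dim=0)
        return (w[:, None, None, None] * adapted).sum(dim=0)


# Frozen backbone; the L-adapters read its intermediate layer outputs.
backbone = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
backbone.requires_grad_(False)
l_adapters = LAdapters()

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
out = backbone(waveform, output_hidden_states=True)
features = l_adapters(out.hidden_states[1:])  # skip the CNN front-end output
```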

Installation and Running experiments

You need to install the packages necessary for running the experiments. Please run the following command.

pip install -r requirement.txt

The following command provides an example of training with the proposed method. Please select task_name from ASV, ER, ASR, or IC.

# ./run.sh task_name
./run.sh ASR

Experiment and Results

We demonstrate the effectiveness of the proposed method on four downstream tasks: automatic speaker verification (ASV), emotion recognition (ER), automatic speech recognition (ASR), and intent classification (IC). We conduct experiments to compare the performance of five different training methods on the four tasks. All experiments in this work were conducted with four GPUs, each with 16 GB of memory.

We run experiments on the following five training methods; a short sketch of freezing the backbone for the adapter-based settings is given after the list.

  • Fine-tuning the top $l$ layers for $l = 1, 2, \dots, 12$.
  • Conventional method: Adapters are inserted after the MHSA and feed-forward modules in the top $l$ layers for $l = 1, 2, \dots, 12$.
  • Proposed method: L-adapters are attached to the top $k$ layers for $k = 1, 2, \dots, 12$, and E-adapters are inserted into the $l$ layers counting down from the second layer from the top, for $l = 1, 2, \dots, 11$.
  • L-adapters-only: L-adapters are attached to all layers without E-adapters.
  • E-adapters-only: E-adapters are inserted into all layers without L-adapters.
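
As a rough sketch of how the trainable-parameter budget of the adapter-based settings can be inspected, the snippet below freezes the wavlm-base-plus backbone and counts only what remains trainable. The stand-in adapters module and its bottleneck sizes are illustrative assumptions, not the repository's actual modules.

```python
import torch.nn as nn
from transformers import WavLMModel

backbone = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

# Freeze the entire backbone; in the adapter-based settings only the inserted
# adapters, layer norms, and the downstream head receive gradient updates.
for p in backbone.parameters():
    p.requires_grad = False

# Stand-in for the inserted adapter modules, used here only to illustrate how
# the parameter count is obtained (the real modules live inside the encoder).
adapters = nn.ModuleList(
    nn.Sequential(nn.LayerNorm(768), nn.Linear(768, 128), nn.GELU(), nn.Linear(128, 768))
    for _ in range(12)
)

def n_trainable(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print(f"backbone parameters (frozen):   {sum(p.numel() for p in backbone.parameters()):,}")
print(f"adapter parameters (trainable): {n_trainable(adapters):,}")
```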

The performance comparison is shown in the figure and the table below. The table shows the error rates at the right ends of the curves in the figure.

(Figure: performance comparison of the five training methods.)

| Method | # Params | ASV | ER | ASR | IC |
|---|---|---|---|---|---|
| Fine-tuning | 85.1 M | $4.42 \pm 0.25$ | $21.0 \pm 0.62$ | $\boldsymbol{7.87} \pm 0.08$ | $0.35 \pm 0.08$ |
| Conventional method | 9.53 M | $3.95 \pm 0.29$ | $20.8 \pm 0.44$ | $8.92 \pm 0.13$ | $0.39 \pm 0.04$ |
| Proposed method | 9.13 M | $\boldsymbol{2.63} \pm 0.09$ | $\boldsymbol{20.0} \pm 0.31$ | $7.90 \pm 0.06$ | $\boldsymbol{0.33} \pm 0.04$ |
| L-adapters-only | 4.74 M | $2.74 \pm 0.09$ | $21.1 \pm 0.52$ | $9.50 \pm 0.08$ | $\boldsymbol{0.33} \pm 0.04$ |
| E-adapters-only | 4.79 M | $4.82 \pm 0.02$ | $23.1 \pm 0.48$ | $9.00 \pm 0.16$ | $0.34 \pm 0.04$ |

Optimal learning rates

For ASV, IC, and ASR (all methods except the conventional method), we used a scheduler that warms up to the maximum learning rate and then decays. For the other settings, we used a scheduler that decays the learning rate from its initial value every fixed number of steps. We chose the best maximum or initial learning rate from {1e-3, 5e-4, 1e-4, 5e-5, 1e-5} for each architecture, except for the downstream head. The chosen rates are listed below, and a sketch of wiring up such per-module learning rates follows the table.

| Method | Module | ASV | ER | ASR | IC |
|---|---|---|---|---|---|
| Fine-tuning | Downstream head | 5e-4 | 5e-4 | 1e-2 | 5e-4 |
| Fine-tuning | Encoder | 1e-4 | 5e-5 | 1e-4 | 1e-4 |
| Conventional method | Downstream head | 5e-4 | 5e-4 | 1e-3 | 5e-4 |
| Conventional method | Adapters | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
| Conventional method | Layernorm layer | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
| Proposed method | Downstream head | 5e-4 | 5e-4 | 2e-3 | 5e-4 |
| Proposed method | L-adapters | 1e-4 | 1e-4 | 1e-3 | 1e-5 |
| Proposed method | E-adapters | 1e-5 | 5e-5 | 1e-3 | 1e-5 |
| Proposed method | Layernorm layer | 1e-5 | 5e-5 | 1e-3 | 1e-5 |
| L-adapters-only | Downstream head | 5e-4 | 5e-4 | 2e-3 | 5e-4 |
| L-adapters-only | L-adapters | 5e-4 | 5e-4 | 1e-3 | 1e-4 |
| L-adapters-only | Layernorm layer | 5e-4 | 5e-4 | 1e-3 | 1e-4 |
| E-adapters-only | Downstream head | 5e-4 | 5e-4 | 2e-3 | 5e-4 |
| E-adapters-only | E-adapters | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
| E-adapters-only | Layernorm layer | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
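
As an illustration of how such per-module learning rates and the warmup-then-decay schedule could be wired up, here is a small PyTorch sketch. The placeholder modules, optimizer choice, and step counts are assumptions for illustration only, not the repository's training code; the learning-rate values follow the Proposed method / ASV column of the table above.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Placeholder modules standing in for the real ones.
downstream_head = torch.nn.Linear(768, 256)
l_adapters = torch.nn.ModuleList(torch.nn.Linear(768, 768) for _ in range(12))
e_adapters = torch.nn.ModuleList(torch.nn.Linear(768, 768) for _ in range(11))
layernorms = torch.nn.ModuleList(torch.nn.LayerNorm(768) for _ in range(12))

# One optimizer parameter group per module type, each with its own learning rate.
optimizer = torch.optim.Adam([
    {"params": downstream_head.parameters(), "lr": 5e-4},
    {"params": l_adapters.parameters(), "lr": 1e-4},
    {"params": e_adapters.parameters(), "lr": 1e-5},
    {"params": layernorms.parameters(), "lr": 1e-5},
])

# Warm up linearly to each group's maximum learning rate, then decay linearly.
warmup_steps, total_steps = 1_000, 20_000  # illustrative values

def warmup_then_decay(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_decay)

# The step-decay schedule used for the other settings corresponds to
# torch.optim.lr_scheduler.StepLR with an appropriate step_size and gamma.
```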

Citation

Please use the following citation for this work:

@inproceedings{otake2023parameter,
  title = {Parameter Efficient Transfer Learning for Various Speech Processing Tasks},
  author = {S. Otake and R. Kawakami and N. Inoue},
  booktitle = {Proc. ICASSP},
  year = {2023},
}
    

Note

The paper was uploaded to arXiv on 6 Dec 2022.

The paper was accepted for ICASSP 2023.
