This repo aims to explore efficient fine-tuning techniques for deep learning models, focusing on reducing the computational and environmental costs associated with training large-scale models. We emphasize methods like Low-Rank Adaptation (LoRA), VeRA, and Quantization, which aim to enhance the fine-tuning process for foundational models.
Training deep learning models typically requires significant memory and sophisticated infrastructure. As models become increasingly parametrized — for instance, GPT-3 contains approximately 175 billion parameters — the challenge of efficiently training and adapting these models to new tasks grows. Traditional fine-tuning approaches, which re-train all model weights for each new task, are not only resource-intensive but also impractical in terms of storage and computational power requirements.
Introduced in 2021 (LoRA: Low-Rank Adaptation of Large Language Models) by Edward J. Hu and al., LoRA presents a method that requires training only a fraction of a model's weights. This approach significantly reduces the number of weights to train and the storage space needed, while maintaining inference time.
Key Features of LoRA:
- Reduces the number of trainable weights to a fraction of the total weights.
- Saves the base model once and the parallel architecture multiple times to create different models for different tasks.
- Does not increase inference time.
- Simplifies implementation and understanding.
- Applicable to any layer involving large-scale matrix products, such as transformer attention modules and linear layers.
Here is a schematic illustration of the LoRA architecture here to visually explain how LoRA adds new matrices in parallel to existing ones, and how these interact with the intrinsic low-rank structure of the model.
Impressive results on GPT-3 have been showcased in the original paper. We want to reproduce these results on LayoutLM for a Token Classification task on the FUNSD dataset.
This study aims to reproduce the impressive results previously showcased for GPT-3, focusing now on LayoutLM applied to the FUNSD dataset. LayoutLM, introduced 2020 (LayoutLM: Pre-training of Text and Layout for Document Image Understanding) by Xu and al., is a model that uniquely integrates text and layout information from scanned document images. Building on the BERT architecture, LayoutLM incorporates two novel input embeddings:
- 2D Position Embedding: Captures the spatial position of elements in document images.
- Image Embedding: Enhances the linguistic representation with visual features.
We utilize the FUNSD dataset, which includes annotated samples of scanned documents. Each document is processed using the LayoutLM tokenizer available from Microsoft's repository on HuggingFace. The model used is LayoutLMForTokenClassification
, also sourced from HuggingFace.
Training involves the use of a typical optimizer like AdamW with a learning rate initially set to 2e-5. We employ early stopping based on validation loss to prevent overfitting and ensure optimal performance. Each model configuration is evaluated multiple times (typically 10 to 20 runs) to generate confidence intervals for the reported metrics.
For evaluation, we employ common metrics such as precision, recall, and F1 score, calculated across multiple test splits to ensure the robustness and reproducibility of our results.
We aim to demonstrate that with LoRA, LayoutLM can achieve or even surpass its original performance on token classification tasks while requiring fewer computational resources. Results will be presented comparing the efficiency and effectiveness of various fine-tuning techniques.
Our ablation study, as documented in the accompanying tables, provides a comprehensive view of how Low-Rank Adaptation (LoRA) can be efficiently utilized for fine-tuning models. We also evaluated the impact of varying the rank
This first table delves into the effect of altering the target matrices and the rank
Here, we compared the performance impact of different values of
![]() |
![]() |
Eventually, we compared different configurations applied to LoRA fine-tuning. The LoHaConfig
outperformed others with an F1 Score of 0.691 ± 0.001, for the best combination of
Building on the efficiencies introduced by the Low-Rank Adaptation (LoRA) algorithm, which dramatically reduces memory usage and enhances training efficiency in fine-tuning large language models, the VeRA algorithm offers further innovations. Introduced on October 2023, VeRA: Vector-based Random Matrix Adaptation by Dawid J. Kopiczko and al. significantly streamlines the parameter tuning process. For instance, while LoRA might train between 4 million to 50 million parameters in scenarios like fine-tuning the Llama2 7B model, VeRA reduces the number of trainable parameters to a tenth of what is required by LoRA, enhancing efficiency even further.
In the VeRA architecture depicted here, the typical freezing of pre-trained weights
Comparison of Parameter Count Formulas:
-
LoRA:
$| \Theta | = 2 \times L_{tuned} \times d_{model} \times r$ -
VeRA:
$| \Theta | = L_{tuned} \times (d_{model} + r)$
The parameter count for LoRA and VeRA highlights a fundamental difference in scalability. LoRA's parameter requirements scale multiplicatively with the model's dimension and the rank, potentially leading to higher computational costs, especially for larger models or higher ranks. In contrast, VeRA's parameter increase is additive, offering potentially better scalability for applications where maintaining a low rank suffices, thus conserving computational resources.
Below is a table displaying the VeRA model's performance across different ranks
VeRA Results with Various Values of r
This table highlights that
Introduced in May 2023, QLoRA: Efficient Finetuning of Quantized LLMs by Tim Dettmers and al., QLoRA extends the Low-Rank Adaptation (LoRA) algorithm by introducing quantization to the precision of weight parameters in pre-trained Large Language Models (LLMs). Typically, trained model parameters are stored in a 32-bit format. QLoRA, however, compresses these parameters to 4-bit precision using the NF4 data type, which is specifically designed for AI applications. This quantization significantly reduces the memory footprint, enabling fine-tuning of LLMs on single GPUs and making it feasible to run large models on less powerful hardware, including consumer-grade GPUs.
Comparison of QLoRA with Other Methods
NF4 is optimally designed for data with a normal distribution, a common characteristic of neural network weights. This allows it to represent these weights more accurately within a constrained 4-bit format compared to the standard 4-bit float. Standard 4-bit floats, being more general-purpose, are less suited for AI applications due to their limited precision and range, which can be a significant drawback in tasks requiring high precision.
QLoRA Results with Various Values of r
The F1 score results indicate the efficacy of each adaptation strategy in maintaining high precision and recall balances. The graph shows that as
![]() |
![]() |
The training time comparison highlights the computational efficiency of each model. LoRA generally shows lower training times compared to QLoRA, which suggests that the quantization process in QLoRA, while reducing memory usage, adds to the computational burden.
Memory usage is a crucial factor, especially in resource-constrained environments. The comparison clearly shows that QLoRA significantly reduces memory usage across all ranks compared to LoRA. This reduction is vital for deploying large models on hardware with limited memory capacity.
In summary, the choice between LoRA and QLoRA should be guided by specific application needs:
- For accuracy: Choose QLoRA.
- For faster training times: Opt for LoRA.
- For lower memory usage: QLoRA is the best choice.
Our findings using LayoutLM model on the FUNSD dataset demonstrate LoRA and parameter efficient fine-tuning techniques in general present unique advantages for accuracy-critical applications with hardware limitations. LoRA provides a faster training experience, suitable for scenarios where training speed is a priority. This comprehensive analysis aids in making informed decisions about which model adaptation technique to deploy based on specific needs and constraints.
- Low-Rank Adaptation of Large Language Models
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding
- Introduction to Vector-based Random Matrix Adaptation
- QLoRA: Quantized Low-Rank Adaptation for Efficient Model Tuning
- Nicolas Dop - [email protected]
- Raphaël Ferroni - [email protected]
- Amine Baccar - [email protected]
- Marco Hakim - [email protected]