
LLaMA fine-tuning

1. Self-instruct Framework

The Self-instruct Framework relies on two main components for generating effective training data: prompts and manual seed tasks.

For a detailed understanding and background of the Self-instruct Framework, refer to the paper available in the repository: Self-Instruct Framework Paper.

Prompt and Manual Seed Tasks

  • Prompts: These serve as guiding directives for the model, ensuring that the generated outputs are relevant, diverse, and aligned with specific criteria. Prompts instruct and control ChatGPT during the instruction generation process, setting clear parameters and expectations that dictate the nature and scope of the model's output across varied scenarios and tasks.

For practical examples of how prompts are utilized in specific contexts, explore the different prompts used for BGP-LLaMA. These include prompts for general BGP knowledge, use of the PyBGPStream library, and PyBGPStream real-time analysis.

  • Manual Seed Tasks: These are manually crafted tasks that supply the model with examples of the desired output. They are crucial for instructing the model on how to respond appropriately to the prompts. Just like prompts, seed tasks are created with a particular function in mind. In the case of BGP-LLaMA, you can find seed tasks created for different BGP knowledge areas and functionalities in dataset; an illustrative seed-task record is sketched below.
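
Each seed task is stored as a small JSON record that pairs an instruction with one or more example instances. The snippet below is only an illustration in the common self-instruct layout; the exact field names and contents of the files in the dataset directory may differ:

import json

# Hypothetical seed-task record; the field names follow the usual
# self-instruct layout and may differ from the repository's files.
seed_task = {
    "id": "seed_task_0",
    "name": "bgp_basic_definition",
    "instruction": "Explain what the Border Gateway Protocol (BGP) is used for.",
    "instances": [
        {
            "input": "",
            "output": "BGP exchanges reachability information between "
                      "autonomous systems to build Internet routing tables.",
        }
    ],
    "is_classification": False,
}

# Seed tasks are typically stored one JSON object per line (JSONL).
with open("seed_tasks.jsonl", "w") as f:
    f.write(json.dumps(seed_task) + "\n")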

Running the Self-instruct Code

  • utils.py: This script manages the connection to OpenAI's GPT models and the communication process: decoding arguments, rate limiting with sleep intervals, batching of prompts for efficiency, and handling of API responses. IMPORTANT: after the OpenAI API updates in 2024, older completion models are deprecated, so utils.py must be updated to align with the current API standards and model availability (see the sketch after the command below).

  • generate_instruction.py: The main script for the self-instruct framework. Before running it, modify the script to specify the directory paths for the prompts and seed tasks. The script then loads and encodes these components and generates new sets of instructions. To execute the script, use the following command line example, replacing --model_name with the current model you intend to use and adjusting --num_instructions_to_generate as needed:

python -m generate_instruction generate_instruction_following_data --num_instructions_to_generate=1000 --model_name="text-davinci-003"
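
As noted above, the request logic in utils.py has to be ported to the current OpenAI Python SDK, where legacy completion models such as text-davinci-003 are replaced by chat models. The sketch below shows one possible shape of that call; the model name, decoding arguments, and retry policy are placeholders rather than the repository's actual implementation:

import time
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(prompt, model="gpt-4o-mini", max_retries=3, sleep_s=5):
    # Minimal chat-completion wrapper with a naive backoff for rate limits.
    # The model name, decoding arguments, and retry policy are placeholders.
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=1.0,
                max_tokens=1024,
            )
            return response.choices[0].message.content
        except Exception:
            time.sleep(sleep_s * (attempt + 1))
    raise RuntimeError("OpenAI request failed after retries")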

2. Instruction Fine-tuning

The script for instruction fine-tuning of the LLaMA model for specific tasks such as 5G data analysis and BGP routing analysis is located in this directory.

Prerequisites:

  1. Hugging Face Account: To access model weights and tokenizers from Hugging Face, you need to register for an account on Hugging Face. Once registered, generate a personal access token from your account settings under the Access Tokens section.

  2. Environment Setup: Ensure Python and the necessary libraries (transformers, torch, etc.) are installed in your environment. Use a virtual environment for a cleaner setup.

Installing Dependencies

Start by installing all necessary dependencies listed in requirements.txt to ensure your environment is properly set up for the fine-tuning process.

pip install -r requirements.txt

Model & Tokenizer Loading

import transformers

# model_config, bnb_config, and hf_auth are created earlier in the script.
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
    use_auth_token=hf_auth
)
  • The script starts by specifying a model_id that identifies which model to load from Hugging Face. For the fine-tuning tasks mentioned, we use meta-llama/Llama-2-13b-chat-hf.
  • Quantization is enabled via BitsAndBytesConfig to reduce GPU memory usage, which is crucial for accommodating large models; a hedged configuration sketch follows this list. More details on this technique can be found in the research paper.
  • Authentication with Hugging Face is required to access private or restricted models and features. Set your Hugging Face authentication token as an environment variable (hf_token).
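
The model_id, model_config, bnb_config, and hf_auth objects referenced in the loading code are defined earlier in the script. A minimal sketch of how they could be constructed, assuming 4-bit NF4 quantization (the exact quantization settings are an assumption, not necessarily what the repository uses):

import os
import torch
import transformers
from transformers import BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"

# Hugging Face token read from an environment variable (see Prerequisites).
hf_auth = os.environ["hf_token"]

# Example 4-bit quantization settings; tune these to your GPU memory budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth,
)
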
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

The tokenizer is essential for converting input text into a format that the model can understand, and vice versa. It ensures that the text input is appropriately preprocessed (padding, tokenization) for the model. Setting tokenizer.pad_token = tokenizer.eos_token ensures that padding is handled correctly by using the end-of-sequence token as the padding token.

Data Loading and Processing:

In this part, we load the training data from a JSON file that specifies the dataset's location. The training data for the different fine-tuning tasks is organized under the finetune_main directory.

from datasets import load_dataset

data = load_dataset("json", data_files="/home/hb/LLM-research/dataset/5G/Mobile_LLaMA_1.json")

train_val = data["train"].train_test_split(
    test_size=1300, shuffle=True, seed=42
)
train_data = (
    train_val["train"].map(generate_and_tokenize_prompt)
)
val_data = (
    train_val["test"].map(generate_and_tokenize_prompt)
)

The dataset undergoes processing to format it into a structure that is conducive to model training. This involves generating prompts from the data points, which are then tokenized. The process is as follows:

  • Prompt Generation: The function generate_prompt combines instructions and outputs into a format that mirrors the input the model expects during fine-tuning.
  • Tokenization: The function tokenize converts the generated prompts into a sequence of tokens, making them understandable and usable by the model. This includes truncation and the addition of end-of-sequence tokens as necessary.
  • Applying Processing: The processing is applied to both the training and validation datasets, ensuring that all data fed into the model during fine-tuning is in the correct format for efficient and effective training. A hedged sketch of these helper functions follows this list.
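
A minimal sketch of what generate_prompt and tokenize might look like, following the description above; the prompt template, cutoff length, and label handling are assumptions rather than the repository's exact code (the tokenizer is the one loaded earlier):

CUTOFF_LEN = 512  # assumed maximum sequence length

def generate_prompt(data_point):
    # Combine the instruction and output into a single training prompt.
    return (
        "### Instruction:\n"
        f"{data_point['instruction']}\n\n"
        "### Response:\n"
        f"{data_point['output']}"
    )

def tokenize(prompt):
    # Tokenize with truncation and append the end-of-sequence token if missing.
    result = tokenizer(prompt, truncation=True, max_length=CUTOFF_LEN, padding=False)
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < CUTOFF_LEN
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)
    result["labels"] = result["input_ids"].copy()
    return result

def generate_and_tokenize_prompt(data_point):
    return tokenize(generate_prompt(data_point))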

Finetuning - Low-Rank Adaptation (LoRA)

LoRA is a method designed to update only a small portion of the model parameters during the fine-tuning process. This approach significantly reduces the computational cost and memory footprint, making it feasible to fine-tune large models on limited resources.

The key parameters in the LoRA configuration (a hedged sketch of the corresponding peft config follows this list):

  • lora_alpha: Controls the scaling of the LoRA parameters. A value of 16 is a good starting point, balancing between the model's adaptability and its original knowledge retention.
  • lora_dropout: Specifies the dropout rate for the LoRA layers, set at 0.1 to prevent overfitting.
  • lora_r: Defines the rank of the adaptation, set at 64, indicating the size of the low-rank matrices.
  • bias: Set to "none" to indicate that no bias is used in the LoRA adaptation.
  • task_type: Specifies the type of task, in this case, "CAUSAL_LM" for causal language modeling.
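
Taken together, these parameters correspond to a peft LoraConfig along the following lines; target_modules is an assumption here, since the list above does not specify which projection layers are adapted:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    lora_alpha=16,         # scaling of the LoRA updates
    lora_dropout=0.1,      # dropout on the LoRA layers
    r=64,                  # rank of the low-rank matrices
    bias="none",           # no bias terms are adapted
    task_type="CAUSAL_LM",
    # target_modules is an assumption; adjust to the layers you want to adapt.
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()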

Training Configuration

  • output_dir: The directory where the output files will be saved.
  • per_device_train_batch_size: Batch size per device, set to 4 to balance between training speed and memory usage.
  • gradient_accumulation_steps: Number of steps to accumulate gradients before updating model parameters.
  • optim: Specifies the optimizer, using "paged_adamw_32bit" for efficient memory usage.
  • logging_steps: Determines how often to log training information.
  • learning_rate: The initial learning rate for training, with 1e-4 as a starting point.
  • fp16: Enables mixed-precision training to reduce memory consumption.
  • max_grad_norm: The maximum gradient norm for gradient clipping, preventing exploding gradients.
  • max_steps: The total number of training steps.
  • warmup_ratio: The proportion of training to perform linear learning rate warmup.
  • group_by_length: Enables grouping of training data by length for more efficient padding.
  • lr_scheduler_type: The type of learning rate scheduler to use, with options like "cosine" for cosine learning rate decay. A hedged sketch of these arguments follows this list.
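
A minimal sketch of these settings as transformers.TrainingArguments; any value not fixed in the list above is a placeholder:

import transformers

training_args = transformers.TrainingArguments(
    output_dir="./results",          # placeholder output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # placeholder value
    optim="paged_adamw_32bit",
    logging_steps=10,                # placeholder value
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,               # placeholder value
    max_steps=1000,                  # placeholder value
    warmup_ratio=0.03,               # placeholder value
    group_by_length=True,
    lr_scheduler_type="cosine",
)

These arguments are then handed to the trainer (for example trl's SFTTrainer or transformers.Trainer) together with the model, the LoRA configuration, and the processed datasets.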

Key Considerations for Fine-Tuning

  • Adjusting hyperparameters such as learning_rate and lr_scheduler_type based on initial training outcomes can further optimize the fine-tuning process for better model performance.
  • The choice between "cosine" and "constant" learning rate scheduler can impact the model's learning trajectory and final performance. Experimenting with both can help identify the most effective approach for your specific task and dataset.

Saving the Model

model.push_to_hub('yourHF/model_name')
tokenizer.push_to_hub('yourHF/model_name')

The final part of the fine-tuning script pushes the fine-tuned model and tokenizer to the Hugging Face Hub; this requires authentication with the access token set up in the Prerequisites.

4. Tabular data processing

This directory contains scripts for fine-tuning the LLaMA model to process tabular data for anomaly detection tasks. The focus is on training the model to accurately identify and report anomalies within structured datasets.

The BGP features were extracted from a different repository: BGP_data_analysis. The scripts are intended to enhance the model's capabilities in handling tabular data, making it proficient in tasks such as anomaly detection. The tabular data are preprocessed into the following format:

[TAB] col: | timestamp | asn | num_routes | num_new_routes | num_withdrawals | num_origin_changes | num_route_changes | max_path_length | avg_path_length | max_edit_distance | avg_edit_distance | num_announcements | num_unique_prefixes_announced | row 1: | 2022-03-28 09:35:00 | 8342.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
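
One way to serialize a feature table into this linearized text format, sketched with pandas; the function name and exact spacing are assumptions, not the repository's preprocessing code:

import pandas as pd

def linearize_row(df: pd.DataFrame, row_idx: int) -> str:
    # Serialize one DataFrame row into the "[TAB] col: ... row N: ..." format above.
    header = " | ".join(df.columns)
    values = " | ".join(str(v) for v in df.iloc[row_idx])
    return f"[TAB] col: | {header} | row {row_idx + 1}: | {values} |"

# Small example with a subset of the BGP feature columns shown above.
features = pd.DataFrame([{
    "timestamp": "2022-03-28 09:35:00",
    "asn": 8342.0,
    "num_routes": 0.0,
    "num_new_routes": 0.0,
    "num_withdrawals": 0.0,
}])
print(linearize_row(features, 0))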

5. Evaluation

The Evaluation directory contains the evaluation scripts. The script begins by loading the fine-tuned LLaMA model and preparing it for evaluation. It includes procedures to feed prompts to the model and collect its responses. In addition, the script contains specialized code to evaluate the model's knowledge of BGP, focusing on the accuracy of its responses to BGP-related prompts.
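
A rough sketch of the core evaluation step: load the fine-tuned model from the Hub, feed it a BGP prompt, and decode the response. The model repository name, prompt template, and generation settings below are placeholders:

import transformers

# Placeholder repository name; use the model you pushed in the previous step.
model_id = "yourHF/model_name"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "### Instruction:\nExplain what a BGP route withdrawal is.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generation settings are illustrative; adjust for your evaluation protocol.
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))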
