
Code Llama fill-in-the-middle fine-tuning

This repository allows you to fine-tune the Code Llama model for fill-in-the-middle (FIM) on your own dataset, mirroring the process described in the original Code Llama paper. Infilling (fill-in-the-middle) models are well suited to code completion tasks, where the model is given a prefix and a suffix and asked to fill in the middle. In the "Optimizing Large Language Models for OpenAPI Code Completion" paper, we improved Code Llama's performance in OpenAPI completion by 28.6%, outperforming GitHub Copilot by 55.2%.
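For reference, an infilling prompt is built from the model's special tokens, with the prefix and suffix arranged in either PSM (prefix-suffix-middle) or SPM (suffix-prefix-middle) order. The sketch below shows the two formats as plain strings; exact whitespace handling may differ slightly from the tokenizer's own encoding.

```python
# Sketch of Code Llama's two infilling prompt formats, built manually
# from the model's special tokens. The model generates the "middle"
# after the <MID> token.

def psm_prompt(prefix: str, suffix: str) -> str:
    # Prefix-Suffix-Middle ordering
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

def spm_prompt(prefix: str, suffix: str) -> str:
    # Suffix-Prefix-Middle ordering: generation directly continues the prefix
    return f"<PRE> <SUF>{suffix} <MID> {prefix}"

prefix = "def add(a, b):\n    return "
suffix = "\n\nprint(add(2, 3))\n"
print(psm_prompt(prefix, suffix))
```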

How to use

Prepare a dataset and upload it to Hugging Face Hub.

The dataset must contain a column "content" with the code files you want to train the model on.

An example dataset with OpenAPI definitions is available here.
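As a sketch of the expected schema (the file name and snippets are illustrative), each training example is a row with a "content" field. A JSONL file like the one below can be loaded with the `datasets` library and uploaded with `push_to_hub`:

```python
import json

# Illustrative example: write code files into the one-column schema
# ("content") the training expects. The documents here are made up.
examples = [
    {"content": "openapi: 3.0.0\ninfo:\n  title: Petstore\n  version: 1.0.0\n"},
    {"content": "openapi: 3.0.0\ninfo:\n  title: Orders API\n  version: 2.1.0\n"},
]

with open("train.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")

# The resulting file can then be loaded and pushed, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("json", data_files="train.jsonl")
#   ds.push_to_hub("your-username/your-dataset")  # hypothetical repo id
```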

Train the model using Google Colab

You can train the Code Llama 7B model using Google Colab with an A100 GPU. Use this notebook to train the model: Open In Colab

The notebook saves the trained adapter to your Hugging Face account. The adapter can be used for inference with the Transformers and PEFT Python libraries (see docs). To create a standalone model, you can merge the adapter into the original model; the merged model can then be served as an API with Hugging Face Inference Endpoints.
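Loading the base model together with the fine-tuned adapter might look like the following sketch (the model and adapter repo ids are placeholders; requires `transformers` and `peft`, and downloads several GB of weights when actually run):

```python
def load_finetuned(base_id: str = "codellama/CodeLlama-7b-hf",
                   adapter_id: str = "your-username/your-adapter"):
    """Load the base Code Llama model and apply a LoRA adapter.

    `adapter_id` is a placeholder; substitute the repo the training
    notebook pushed to your Hugging Face account.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    model = AutoModelForCausalLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(model, adapter_id)
    return model, tokenizer

# Usage (not executed here to avoid the large download):
#   model, tokenizer = load_finetuned()
#   inputs = tokenizer("<PRE> def add(a, b):\n <SUF>\n <MID>", return_tensors="pt")
#   out = model.generate(**inputs, max_new_tokens=32)
```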

Obtain a standalone model

This Google Colab notebook can be used to merge the fine-tuned adapter with the Code Llama 7B model on a free Tesla T4 GPU, but it requires a high-RAM runtime: Open In Colab
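The merge step the notebook performs can be sketched as follows (repo ids and the output directory are placeholders): `merge_and_unload()` folds the LoRA deltas into the base weights so the result no longer depends on the `peft` library.

```python
def merge_adapter(base_id: str = "codellama/CodeLlama-7b-hf",
                  adapter_id: str = "your-username/your-adapter",
                  out_dir: str = "merged-model"):
    """Merge a LoRA adapter into the base weights for a standalone model.

    All ids are placeholders; substitute your own repos and paths.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    model = AutoModelForCausalLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(model, adapter_id)
    merged = model.merge_and_unload()   # fold LoRA weights into the base model
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)

# Usage (not executed here; loads the full 7B model into RAM):
#   merge_adapter()
```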

Serve the model as an API

The merged model can be served as an API with Hugging Face Inference Endpoints. The Code Llama 7B model requires a single Nvidia A10G instance, which costs $1.00 per hour at the time of writing.
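Querying a deployed endpoint amounts to POSTing a text-generation payload whose prompt uses the infilling token format. A minimal sketch, assuming a deployed endpoint (the URL and token below are placeholders):

```python
import json

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder
API_TOKEN = "hf_..."  # your Hugging Face token (placeholder)

def build_request(prefix: str, suffix: str, max_new_tokens: int = 64) -> dict:
    # Text-generation payload; the prompt uses Code Llama's infilling format.
    return {
        "inputs": f"<PRE> {prefix} <SUF>{suffix} <MID>",
        "parameters": {"max_new_tokens": max_new_tokens},
    }

# Usage (requires a deployed endpoint, so not executed here):
#   import urllib.request
#   req = urllib.request.Request(
#       ENDPOINT_URL,
#       data=json.dumps(build_request("def add(a, b):\n    return ", "\n")).encode(),
#       headers={"Authorization": f"Bearer {API_TOKEN}",
#                "Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```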

Related Publication

This repository contains the code and data for the paper "Optimizing Large Language Models for OpenAPI Code Completion" by Bohdan Petryshyn and Mantas Lukoševičius.

Abstract

Recent advancements in Large Language Models (LLMs) and their utilization in code generation tasks have significantly reshaped the field of software development. Despite the remarkable efficacy of code completion solutions in mainstream programming languages, their performance lags when applied to less ubiquitous formats such as OpenAPI definitions. This study evaluates the OpenAPI completion performance of GitHub Copilot, a prevalent commercial code completion tool, and proposes a set of task-specific optimizations leveraging Meta's open-source model Code Llama. A semantics-aware OpenAPI completion benchmark proposed in this research is used to perform a series of experiments through which the impact of various prompt-engineering and fine-tuning techniques on the Code Llama model's performance is analyzed. The fine-tuned Code Llama model reaches a peak correctness improvement of 55.2% over GitHub Copilot despite utilizing 25 times fewer parameters than the commercial solution's underlying Codex model. Additionally, this research proposes an enhancement to a widely used code infilling training technique, addressing the issue of underperformance when the model is prompted with context sizes smaller than those used during training.

Citation

If you found the dataset or the fine-tuning code helpful, please reference the original paper:

BibTeX:

@misc{petryshyn2024optimizing,
      title={Optimizing Large Language Models for OpenAPI Code Completion}, 
      author={Bohdan Petryshyn and Mantas Lukoševičius},
      year={2024},
      eprint={2405.15729},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

APA:

Petryshyn, B., & Lukoševičius, M. (2024). Optimizing Large Language Models for OpenAPI Code Completion. arXiv preprint arXiv:2405.15729.

Acknowledgements

This repository is adapted from https://github.com/pacman100/LLM-Workshop, which supports fine-tuning a number of models, including Code Llama. However, we encountered several problems when using the original repository with Code Llama. This repository adds improvements such as context-level infilling (instead of document-level infilling) and the use of the correct Code Llama special tokens, among others.


code-llama-fim-fine-tuning's Issues

Failure to reproduce results from the published paper

I tried to use your hyperparameters to reproduce the results from your paper, but my results differ from yours. I used these hyperparameters:

--model_name_or_path "..."
--dataset_file_name "..."
--splits "train"
--max_seq_len 4096
--max_steps 1000
--save_steps 100
--eval_steps 100
--logging_steps 1
--log_level "info"
--logging_strategy "steps"
--evaluation_strategy "steps"
--save_strategy "steps"
--push_to_hub False
--bf16 False
--fp16 True
--learning_rate 2e-4
--lr_scheduler_type "cosine"
--weight_decay 0.1
--warmup_ratio 0.05
--max_grad_norm 1.0
--output_dir "fine-tune-19-54"
--per_device_train_batch_size 1
--per_device_eval_batch_size 1
--gradient_accumulation_steps 8
--gradient_checkpointing True
--use_reentrant True
--dataset_text_field "content"
--test_size 0.1
--fim_rate 0.9
--fim_spm_rate 0.5
--use_peft_lora True
--lora_r 32
--lora_alpha 64
--lora_dropout 0.1
--lora_target_modules "all-linear"
--use_4bit_quantization True
--use_nested_quant True
--bnb_4bit_compute_dtype "bfloat16"
--use_flash_attn False

If you don't mind, could you share the hyperparameters used to fine-tune the original Code Llama model?
