Git Product home page Git Product logo

marcusm117 / identitychain Goto Github PK

View Code? Open in Web Editor NEW
6.0 3.0 0.0 1.71 MB

[ICLR 2024] Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

Home Page: https://arxiv.org/abs/2310.14053

License: Apache License 2.0

Makefile 2.20% Python 97.74% Shell 0.06%
code-generation code-summarization deep-learning evaluation large-language-models model-evaluation iclr iclr2024

identitychain's Introduction

IdentityChain

PyPI CI License Issues

The IdentityChain Framework for Code Large Language Models (Code LLMs) Evaluation. Official implementation of the ICLR 2024 paper Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain.

The IdentityChain Framework evaluates the NL-to-PL (Code Generation) Accuracy, PL-to-NL (Code Summurization) Accuracy, and the Self-Consistency across the two tasks. It also provides a fine-grained analysis of the model's performance so that you can pinpoint the exact step and problem where the model makes a self-inconsistency violation.

image

Installation

Create and Activate a Conda Environment.

conda create -n idchain python=3.10
conda activate idchain

Install from PyPI with all Dependencies.

pip3 install identitychain
pip3 install -r requirements.txt

Install from Source with all Dependencies.

git clone https://github.com/marcusm117/IdentityChain.git
cd IdentityChain
make develop

Usage

Before the self-consistency evaluation, you need to make sure that one of the followings is satisfied:

  1. Your model is an Instruction-tuned Code LLM, and it's trained on both NL-to-PL and PL-to-NL tasks.
  2. Your model is a Foundation Code LLM, and it's trained on both completion and fill-in-the-middle tasks.

To evaluate your model using IdentityChain, you need to prepare the followings:

  1. An evaluation dataset from one of the followings (or one of your own in the same format):
  2. An NL-to-PL prompt for your model
  3. A PL-to-NL prompt for your model
  4. An NL-to-PL generation function for your model
  5. A PL-to-NL generation function for your model

See run_identity_chain_openai.py for an example of how to use IdentityChain to evaluate OpenAI models.

See run_identity_chain_google.py for an example of how to use IdentityChain to evaluate Google models.

See run_identity_chain_huggingface.py for an example of how to use IdentityChain to evaluate HuggingFace open-source models. This example script already includes the following models:

  1. CodeLlama-Instruct-hf (7B, 13B, 34B, 70B)
  2. CodeLlama-hf (7B, 13B, 34B, 70B)
  3. StarChat-Beta
  4. StarCoder
  5. StarCoderPlus
  6. StarCoderBase (1B, 3B, 7B, 15B)
  7. DeepSeekCoder-Instruct (1.3B, 6.7B, 33B, 7B-v1.5)
  8. DeepSeekCoder (1.3B, 6.7B, 33B, 7B-v1.5)

Example

Use run_identity_chain.sh to execute scripts run_identity_chain_openai.py or run_identity_chain_huggingface.py, which conducts several IdentityChain evaluation in a batch. Make sure that you modify the followings before running the script:

  1. export CUDA_VISIBLE_DEVICES=0 to specify the local GPU device you want to use
  2. export HF_HOME=YOUR_OWN_PATH/huggingface to specify your own huggingface home path, where the model checkpoints will be cached
  3. export IDENTITY_CHAIN_HOME=YOUR_OWN_PATH/IdentityChain to your own IdentityChain home path
  4. other parameters in the script for your own needs

Then run the script:

cd examples
bash run_identity_chain.sh

This script will create a temporary folder tmp under your IdentityChain home path, and store the results of IdentityChain evaluation in this folder, which will be a jsonl file. For example, tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl.

Use analyze_results.py to analyze the results of IdentityChain evaluation. It will geneartes an xlsx file, which contains the following information:

  1. The SC (Self-Consistency) and SSC (Strong Self-Consistency) scores of the model at each self-iteration step. Note that SSC_0 is just Pass@1
  2. The aggregated TOM score (also BLEU and CodeBLEU) information at each step for the following 4 types of resulsts: Pass-Pass, Pass-Fail, Fail-Fail, Fail-Pass
  3. The TOM score (also BLEU and CodeBLEU) trajectory at each self-iteration step for each sample in the eavluation set.
  4. The raw test case outputs at each self-iteration step
cd ../scripts
python analyze_results.py --input_path ../tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl --chain_length 5

The analyzed results will give you a sense of the model's overall performance, and the TOM score trajectory will help you pinpoint the exact step where the model makes a mistake.

Use browse_results.py to browse the results of IdentityChain evaluation. You can use this script to manually examine and study the mistakes made by the model for specific samples.

cd ../scripts
python browse_results.py --input_path ../tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl --chain_length 5 --start 0

Linting & Testing

We use a Makefile as a command registry:

  • make format: autoformat this library with black
  • make lint: perform static analysis of this library with black and flake8
  • make annotate: run type checking using mypy
  • make test: run automated tests
  • make check: check assets for packaging

Make sure that make lint, make test, and make check all pass locally before submitting a Pull Request.

identitychain's People

Contributors

marcusm117 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.