
code_eval_wtm

Run watermarking using detector_main.py

Installation

This repo builds on code-eval (https://github.com/abacaj/code-eval) and evalplus (https://github.com/evalplus/evalplus). Install the dependencies:

pip install -r requirements.txt

pip install -r evalplus/requirements.txt

pip install -r evalplus/requirements-llm.txt

pip install -r evalplus/requirements-tools.txt

pip install -r evalplus/requirements-tsr.txt

Docker:

The Docker image llm-watermark is available on the server 10.0.104.137:

docker run --name giang-llm-watermark -it --rm --gpus all -v /mnt/hdd2/gtnguyen:/work llm-watermark

Running non-watermarked code generation:

Three models are currently supported: bigcode/santacoder, NinedayWang/PolyCoder-2.7B, and codellama/CodeLlama-7b-hf.

Example commands:

mkdir tmp

python3 evalplus_santacoder.py --use_watermark False --out_path tmp/santacoder_no_watermark

python3 evalplus_polycoder.py --use_watermark False --out_path tmp/polycoder_no_watermark

python3 eval_codellama.py --use_watermark False --out_path tmp/code_lamma_no_watermark

Each run produces a JSON file containing a list of task_id and completion entries.
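
For illustration only, such a file might look like the following (HumanEval-style task IDs; the completion values here are invented placeholders):

[
  {"task_id": "HumanEval/0", "completion": "    return False"},
  {"task_id": "HumanEval/1", "completion": "    return []"}
]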

Implementation notes:

  • The core code-generation workflow lives in core/evaluation.py, in the function run_eval(). Each of evalplus_santacoder.py, evalplus_polycoder.py, and eval_codellama.py calls run_eval(). Each model generates code in its own way, with its own arguments, implemented in a model-specific generate_batch_completion() function that is passed to run_eval() (a sketch of this pattern follows the list).

  • For SantaCoder and PolyCoder, the implementation is in codegen/model.py, in the classes SantaCoder and HFTorchDecoder.
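
A minimal sketch of this dispatch pattern (signatures simplified; apart from the run_eval() and generate_batch_completion() names, everything here is hypothetical):

import json

def run_eval(generate_batch_completion, tasks, out_path, n_samples=10):
    # Shared workflow: iterate over tasks, delegate decoding to the
    # model-specific generator, collect task_id/completion pairs.
    results = []
    for task_id, prompt in tasks.items():
        for completion in generate_batch_completion(prompt, n_samples):
            results.append({"task_id": task_id, "completion": completion})
    with open(out_path, "w") as f:
        json.dump(results, f)

def santacoder_generate_batch_completion(prompt, n_samples):
    # Stand-in for the SantaCoder-specific decoding in codegen/model.py.
    return ["    return 0"] * n_samples

run_eval(santacoder_generate_batch_completion,
         {"HumanEval/0": "def f():\n"}, "tmp/sketch.json")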

Running watermarked code generation, detector, and accuracy calculation:

The main implementation is in detector_main.py.

Command example: python detector_main.py --model_name bigcode/santacoder --gamma 0.5 --delta 10 --gpu 3 --no_wtm_path results/evalplus_santacoder_no_watermark --dataset_name human_eval

  • model_name: one of bigcode/santacoder, NinedayWang/PolyCoder-2.7B, or codellama/CodeLlama-7b-hf
  • gamma, delta: parameters of the watermark algorithm
  • gpu (optional): which GPU to run on
  • no_wtm_path: path to the folder containing the results file (eval.jsonl) for the non-watermarked generated code
  • dataset_name: either human_eval or mbpp

Some details:

  • The command generates the watermarked code for the pair (gamma, delta) and stores it in results/result_{model_name}_dataset_{dataset_name}_watermark_pass_10_{gamma}_{delta}. If this folder already exists, detector_main.py skips the generation step; to re-run generation, delete the result folder or move it outside results.

  • After watermarked code generation, the next step is watermark detection. In this step, WatermarkDetector (from extended_watermark_processor) detects watermarks in three sources: the watermarked generated code, the non-watermarked generated code, and the ground truth (code given in the dataset). We use accuracy and false positive rate (FPR) for evaluation; a sketch of this step follows.
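
A minimal sketch of that evaluation step, assuming detect() returns a dict with a boolean prediction field (as in the upstream lm-watermarking API; verify against extended_watermark_processor):

def evaluate_detector(detector, watermarked, non_watermarked, groundtruth):
    # Watermarked samples are the positives; non-watermarked generations
    # and the dataset ground truth are the negatives.
    negatives = non_watermarked + groundtruth
    tp = sum(detector.detect(code)["prediction"] for code in watermarked)
    fp = sum(detector.detect(code)["prediction"] for code in negatives)
    tn = len(negatives) - fp
    accuracy = (tp + tn) / (len(watermarked) + len(negatives))
    fpr = fp / len(negatives)
    return accuracy, fpr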

A few notes on parameters:

  • For code generation:

  • top_p: 0.95

  • temperature: 0.2

  • do_sample: True

  • For watermark detector:

  • seeding_scheme: the watermark detector supports several seeding schemes for the detection algorithm (see alternative_prf_schemes.py). Different schemes require different context widths; we currently use lefthash (also known as simple_1) because the generated code is sometimes very short. A sketch combining these settings follows.
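
Putting these settings together, a generation-plus-detection call might look like the following sketch. Argument names follow the upstream lm-watermarking README; verify them against the local extended_watermark_processor before relying on this:

from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList
from extended_watermark_processor import WatermarkDetector, WatermarkLogitsProcessor

tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("bigcode/santacoder", trust_remote_code=True)
vocab = list(tokenizer.get_vocab().values())

# lefthash (simple_1) seeds the green-list PRF from the single previous
# token, so even very short completions can still be scored.
processor = WatermarkLogitsProcessor(vocab=vocab, gamma=0.5, delta=10.0,
                                     seeding_scheme="lefthash")
inputs = tokenizer("def fibonacci(n):\n", return_tensors="pt")
out = model.generate(**inputs, do_sample=True, top_p=0.95, temperature=0.2,
                     max_new_tokens=128,
                     logits_processor=LogitsProcessorList([processor]))
code = tokenizer.decode(out[0], skip_special_tokens=True)

detector = WatermarkDetector(vocab=vocab, gamma=0.5, seeding_scheme="lefthash",
                             device="cpu", tokenizer=tokenizer, z_threshold=4.0)
print(detector.detect(code))  # z-score, green-token fraction, prediction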

To calculate pass@k, refer to the code-eval repo (for the HumanEval dataset) and evalplus (for the MBPP dataset) for the commands.
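
For reference, the metric both pipelines report is the unbiased pass@k estimator from the HumanEval paper, which for n samples with c correct can be computed as:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples drawn without
    # replacement from n total (c of which pass) is correct:
    # 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3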

Example and notes:

Current options for the model_name argument: bigcode/santacoder, NinedayWang/PolyCoder-2.7B, codellama/CodeLlama-7b-hf

Note the value of no_wtm_path for each specific LLM:

For santacoder: results/evalplus_santacoder_no_watermark_3

For polycoder: results/evalplus_polycoder_no_watermark

For codellama-7B: results/eval_codellama_no_watermark_4122023
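
Putting it together, a full PolyCoder run on MBPP could look like this (gamma and delta copied from the example above; adjust as needed):

python detector_main.py --model_name NinedayWang/PolyCoder-2.7B --gamma 0.5 --delta 10 --no_wtm_path results/evalplus_polycoder_no_watermark --dataset_name mbpp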

Baselines:

There are three baselines (DetectGPT, GLTR, GPT2-OutputDetector), whose implementations come from the paper Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education (ICSE 2024). Data and experimental results are stored in the folder aigc_data.

The baseline implementations are available from the paper. Because the files are heavy and the experimental results can easily be replicated with the provided code, they are not uploaded to this repo.

code-eval

What

This is a repo I use to run human-eval on code models; adjust as needed. Some scripts were adapted from the WizardCoder repo (process_eval.py). The evaluation code is duplicated across several files, mostly to handle edge cases around model tokenization and loading (to be cleaned up).

Results

The table is sorted by pass@1 score.

model size pass@1 pass@10
sahil2801/replit-code-instruct-glaive 3B 63.5% 67%
WizardCoder-15B-V1.0 15B 57% 68.9%
bigcode/starcoder 15B 34.6% 48.7%
openchat/opencoderplus 15B 27.3% 43.9%
teknium/Replit-v1-CodeInstruct-3B 3B 25.8% 42.6%
teknium/Replit-v2-CodeInstruct-3B 3B 21.5% 31%
replit-code-v1-3b 3B 17.1% 29.8%
mpt-7b 7B 15.9% 23.7%
xgen-7b-8k-base 7B 14.9% 22.5%
openllama-7b-v2 7B 14% 23.1%
llama-2-7b 7B 13.1% 21.9%
llama-7b 7B 12.1% 18.9%
mpt-30b 30B pending pending

FAQ

Why do some scores differ from the official published numbers?

Because the prompt and post-processing the official evaluations used on this benchmark are not obvious or published. The goal here is to reproduce those numbers as closely as possible, and in many cases it is possible to get very close to the published figures.

All of the scores here were run independently of any published numbers and are reproducible by cloning the repo and following the setup.

Why do some models have a filter_code post-generation step?

Base models can in many cases repeat outputs, breaking the benchmark scores. Instruct models don't have this problem, so you won't see this step for them; they tend to emit an end-of-sequence token. A sketch of such a truncation step follows.
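
The stop markers below are illustrative; the repo's actual filter_code may use a different list:

def filter_code(completion: str) -> str:
    # Base models often keep generating past the target function and
    # start repeating; cut the completion at the first sign of a new
    # top-level definition or script-level code.
    for marker in ("\ndef ", "\nclass ", "\nif __name__", "\nprint("):
        idx = completion.find(marker)
        if idx != -1:
            completion = completion[:idx]
    return completion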

Setup

Create python environment

python -m venv env && source env/bin/activate

Install dependencies

pip install -r requirements.txt

Run the eval script

# replace script file name for various models:
# eval_wizard.py
# eval_opencode.py
# eval_mpt.py
# eval_starcoder.py
# eval_replit.py
# eval_replit_glaive.py
# eval_replit_instruct.py

python eval_wizard.py

Process the jsonl file to extract code samples from model completions.

Note: only wizard & opencoder require this; they return markdown output that wraps the code.

# replace args for various models:
# --path results/wizard --out_path results/wizard/eval.jsonl
# --path results/opencode --out_path results/opencode/eval.jsonl

python process_eval.py --path results/wizard --out_path results/wizard/processed.jsonl --add_prompt
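
Conceptually, this step pulls the fenced code out of each markdown completion; a minimal version of the idea (not the repo's exact logic):

import re

def extract_code(markdown: str) -> str:
    # Take the first fenced code block from a markdown completion;
    # fall back to the raw text if the model emitted bare code.
    match = re.search(r"```(?:python)?\n(.*?)```", markdown, re.DOTALL)
    return match.group(1) if match else markdown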

Then get the results

# replace args for various models:
# results/wizard/processed.jsonl
# results/starcoder/eval.jsonl
# results/mpt/eval.jsonl
# results/opencode/processed.jsonl
# results/replit_instruct/eval.jsonl
# results/replit_glaive/eval.jsonl
# results/replit/eval.jsonl

evaluate_functional_correctness results/wizard/processed.jsonl
