
csmat's Introduction

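A Source Code Migration Model Based on a Code-Statement Masked Attention Mechanism

This work proposes CSMAT (Code-Statement Masked Attention Transformer).

The paper has been accepted by 《计算机系统应用》 (Computer Systems & Applications).
Paper title: 基于代码语句掩码注意力机制的源代码迁移模型 (A Source Code Migration Model Based on a Code-Statement Masked Attention Mechanism)
Authors: Xu Mingrui (徐明瑞), Li Zheng (李征), Liu Yong (刘勇), Wu Yonghao (吴永豪)
The citation format will be updated once the paper is formally published.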

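Connection pattern and mask matrix of the CSMAT encoder attention: (figure)

Connection pattern and mask matrix of the CSMAT decoder self-attention: (figure)

Connection pattern and mask matrix of the CSMAT decoder cross-attention: (figure)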

Runtime environment for this project:

python 3.9
pytorch 1.13.1
transformers 4.25.1
tree-sitter 0.20.1
scipy 1.9.3

tree-sitter is used to measure the lexical correctness rate. Generate the my-languages.so file by following the Python tree-sitter tutorial.
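As a rough sketch of that setup, assuming tree-sitter 0.20.1 (which provides `Language.build_library`) and grammar checkouts under hypothetical `vendor/` paths, the shared library can be built once and then used to flag outputs whose parse tree contains `ERROR` or missing nodes. The helper names and the "no error nodes" criterion are assumptions made for illustration; the project's own check is the one in detect_problem_output.py.

```python
def build_library(out_path="build/my-languages.so",
                  grammar_dirs=("vendor/tree-sitter-java",
                                "vendor/tree-sitter-c-sharp")):
    """Compile the grammars into one shared library (paths are hypothetical)."""
    from tree_sitter import Language  # imported lazily; needs tree-sitter 0.20.x
    Language.build_library(out_path, list(grammar_dirs))
    return out_path


def make_parser(lib_path, lang_name):
    """Create a parser for one language, e.g. "java" or "c_sharp"."""
    from tree_sitter import Language, Parser
    parser = Parser()
    parser.set_language(Language(lib_path, lang_name))
    return parser


def contains_error(node):
    """Return True if the subtree holds an ERROR node or a missing node."""
    stack = [node]
    while stack:
        current = stack.pop()
        if current.type == "ERROR" or getattr(current, "is_missing", False):
            return True
        stack.extend(current.children)
    return False


def correctness_rate(snippets, parser):
    """Percentage of snippets that parse without error nodes (assumed criterion)."""
    snippets = list(snippets)
    if not snippets:
        return 0.0
    ok = sum(
        not contains_error(parser.parse(s.encode("utf8")).root_node)
        for s in snippets
    )
    return 100.0 * ok / len(snippets)
```

A typical call sequence would be `build_library()`, then `make_parser("build/my-languages.so", "java")`, then `correctness_rate(outputs, parser)` on the list of generated snippets.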

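The project contents are still being adjusted.

This project provides the code and results for the following models:

Test set reference code

Note that code formatted with the Astyle tool differs from the original source code in its whitespace.
To keep the outputs consistent, the references folder provides the reference code used to evaluate the results.

The test set reference code is in the references folder:

  • C#: output.cs
  • Java: output.java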

Experimental models

| Folder name | Model name |
|---|---|
| transformer_large | Transformer |
| transformer_large_loc_tok | LOC-Transformer |
| transformer_large_loc_enc | LOC-Transformerenc |
| transformer_large_loc_enc_qua_dec | LOC-Transformerenc-1/4dec |
| transformer_large_loc_enc_half_dec | CSMAT(LOC-Transformerenc-1/2dec) |
| transformer_large_loc_enc_3qua_dec | LOC-Transformerenc-3/4dec |

Improved models

| Folder name | Model name |
|---|---|
| codebert | CodeBERT |
| codebert_loc_enc | LOC-CodeBERTenc |
| codebert_loc_enc_half_dec | LOC-CodeBERTenc-1/2dec |
| graphcodebert | GraphCodeBERT |
| graphcodebert_loc_enc | LOC-GraphCodeBERTenc |
| graphcodebert_loc_enc_half_dec | LOC-GraphCodeBERTenc-1/2dec |

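Contents of a model project and how to run it

| File/folder | Description |
|---|---|
| dataset | Dataset (train.json, valid.json, test.json) |
| model | Saved models and their output results |
| tokenizer | RoBERTa tokenizer extended with the <loc> token |
| model.py | Model definition |
| run.py | Entry-point script |
| train.sh | Shell script |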
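For readers who want to see how a token such as `<loc>` can be registered with a RoBERTa tokenizer, here is a minimal sketch using the Hugging Face transformers API (`add_special_tokens`, `resize_token_embeddings`). The function names and the `roberta-base` checkpoint are assumptions for illustration; the project already ships its prepared tokenizer in the tokenizer folder, and how it was built is not stated here.

```python
LOC_TOKEN = "<loc>"


def add_loc_token(tokenizer, model=None, token=LOC_TOKEN):
    """Register `token` as a special token and return its id.

    If a model is passed and the vocabulary grew, its embedding matrix is
    resized so that the new id has an embedding row.
    """
    added = tokenizer.add_special_tokens({"additional_special_tokens": [token]})
    if model is not None and added > 0:
        model.resize_token_embeddings(len(tokenizer))
    return tokenizer.convert_tokens_to_ids(token)


def load_roberta_with_loc(name="roberta-base"):
    """Load a RoBERTa tokenizer (checkpoint name is an assumption) and add <loc>."""
    from transformers import RobertaTokenizer  # imported lazily; downloads files
    tokenizer = RobertaTokenizer.from_pretrained(name)
    add_loc_token(tokenizer)
    return tokenizer
```

Registering `<loc>` as a special token keeps it from being split into sub-word pieces, so it is always encoded as a single id.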

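To customize and run a project, edit the model parameters in the train.sh script and then run it:

1. Create a new environment (conda)

conda create -n CSMAT python=3.9

conda activate CSMAT

pip install torch==1.13.1 transformers==4.25.1 tree-sitter==0.20.1 scipy==1.9.3

2. Choose a model and edit its parameters

Using CodeBERT as an example:

cd codebert

vim train.sh

3. Run the model

source train.sh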

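Evaluation metrics

For BLEU, exact match rate (EM), and CodeBLEU, see the CodeTrans project. For the lexical correctness rate, see the reference code in detect_problem_output.py.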
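To make the exact match metric concrete, the sketch below computes the percentage of predictions that equal their reference. Comparing whitespace-split tokens is an assumption made here so that spacing differences alone do not count as mismatches; the function names are hypothetical, and the authoritative implementation is the one in the CodeTrans project.

```python
from pathlib import Path


def read_lines(path):
    """Read one code snippet per line from a UTF-8 text file."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def exact_match_rate(predictions, references):
    """Percentage of predictions whose tokens equal the reference tokens."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

For example, `exact_match_rate(read_lines(prediction_file), read_lines("references/output.cs"))` would score a model's C# output, where `prediction_file` stands for whatever output file the model wrote.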

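Experimental results (models without pre-training)

Java -> C#

| Model | BLEU | Exact match rate | CodeBLEU | Lexical correctness rate |
|---|---|---|---|---|
| Naive | 18.54 | 0 | - | - |
| Pointer-Generator | 26.18 | 13.8 | 43.87 | 48.5 |
| Tree-to-tree | 36.34 | 3.4 | 42.13 | 45.6 |
| Transformer | 60.99 | 37.9 | 66.88 | 89.0 |
| LOC-Transformer | 60.39 | 37.5 | 66.49 | 88.7 |
| LOC-Transformerenc | 62.53 | 38.8 | 68.42 | 89.2 |
| LOC-Transformerenc-1/4dec | 62.22 | 38.9 | 68.09 | 90.1 |
| CSMAT(LOC-Transformerenc-1/2dec) | 62.74 | 39.5 | 67.82 | 90.2 |
| LOC-Transformerenc-3/4dec | 61.81 | 38.1 | 67.86 | 90.1 |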

C# -> Java

| Model | BLEU | Exact match rate | CodeBLEU | Lexical correctness rate |
|---|---|---|---|---|
| Naive | 18.69 | 0 | - | - |
| Pointer-Generator | 27.84 | 20.5 | 44.88 | 48.6 |
| Tree-to-tree | 32.09 | 4.4 | 43.86 | 65.2 |
| Transformer | 55.41 | 40.6 | 62.20 | 89.2 |
| LOC-Transformer | 55.84 | 40.8 | 62.31 | 89.0 |
| LOC-Transformerenc | 58.80 | 42.4 | 64.64 | 89.3 |
| LOC-Transformerenc-1/4dec | 58.64 | 41.4 | 64.78 | 88.4 |
| CSMAT(LOC-Transformerenc-1/2dec) | 59.09 | 41.5 | 65.14 | 89.4 |
| LOC-Transformerenc-3/4dec | 57.59 | 39.4 | 63.71 | 88.4 |

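Experimental results (pre-trained models)

Java -> C#

| Model | BLEU | Exact match rate | CodeBLEU | Lexical correctness rate |
|---|---|---|---|---|
| CodeBERT | 77.55 | 52.7 | 80.69 | 94.7 |
| LOC-CodeBERTenc | 76.73 | 54.2 | 80.22 | 93.5 |
| LOC-CodeBERTenc-1/2dec | 77.46 | 53.2 | 80.51 | 95.4 |
| GraphCodeBERT | 78.84 | 55.1 | 81.16 | 94.0 |
| LOC-GraphCodeBERTenc | 78.85 | 55.0 | 81.84 | 95.7 |
| LOC-GraphCodeBERTenc-1/2dec | 77.90 | 54.1 | 80.75 | 94.8 |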

C# -> Java

| Model | BLEU | Exact match rate | CodeBLEU | Lexical correctness rate |
|---|---|---|---|---|
| CodeBERT | 73.57 | 55.5 | 77.67 | 95.4 |
| LOC-CodeBERTenc | 73.25 | 57.7 | 77.28 | 94.9 |
| LOC-CodeBERTenc-1/2dec | 74.47 | 57.3 | 78.48 | 95.4 |
| GraphCodeBERT | 75.25 | 58.3 | 78.42 | 94.4 |
| LOC-GraphCodeBERTenc | 75.83 | 60.0 | 79.58 | 94.9 |
| LOC-GraphCodeBERTenc-1/2dec | 73.32 | 57.1 | 77.45 | 94.0 |
