Git Product home page Git Product logo

t5-pegasus-textsummary's Introduction

t5-pegasus-textsummary

使用谷歌2020pegasus模型进行中文文档摘要

谷歌于去年年底发布了一个精简型的机器语义分析项目:飞马(PEGASUS):预先机器学习及训练后的自动文章摘要项目。近期这个项目迎来的新的版本,这个小型项目可以非常精准的自动提取出文章中的摘要,并且只用一千个训练模型就可以生成媲美人类的摘要内容。 利用提取的间隙句进行摘要概括的预训练模型(Pre-training with Extracted Gap-sentences for Abstractive Summarization)。就是设计一种间隙句生成的自监督预训练目标,来改进生成摘要的微调性能。 本repo参考开源论坛对于中文版本的实现,以mT5为基础架构和初始权重,通过类似PEGASUS的方式进行预训练。

image

##预训练任务

预训练任务模仿了PEGASUS的摘要式预训练。具体来说,假设一个文档有n个句子,我们从中挑出大约n/4个句子(可以不连续),使得这n/4个句子拼起来的文本,跟剩下的3n/4个句子拼起来的文本,最长公共子序列尽可能长,然后我们将3n/4个句子拼起来的文本视为原文,n/4个句子拼起来的文本视为摘要,通过这样的方式构成一个“(原文, 摘要)”的伪摘要数据对。

模型下载

备注: 下面提供的下载链接需要登陆作者公司的vpn才可以使用

  • chinese_t5_pegasus_base.zip 目前开源的T5 PEGASUS是base版,总参数量为2.75亿,训练时最大长度为512,batch_size为96,学习率为10-4,使用6张3090训练了100万步,训练时间约13天,数据是30多G的精处理通用语料,训练acc约47%,训练loss约2.97。模型使用bert4keras进行编写、训练和测试。

运行环境:tensorflow 1.15 + keras 2.3.1 + bert4keras 0.10.0

quick start

train

source activate tensorflow_p36
pip install tensorflow==1.15 keras==2.3.1 bert4keras==0.10.0 jieba tqdm rouge
python finetune.py

训练结束会产生一个keras结构的模型文件 - best_model.weights

预测

下载模型文件,目录结构为

chinese_t5_pegasus_base/ best_model.weights

python test.py

评估

运行 python evaluatiion.py

得到结果 {'rouge-1': 0.885041123153444, 'rouge-2': 0.8795828353099052, 'rouge-l': 0.9046418758557804, 'bleu': 0.8239310846742561}

部署

todo

reference

https://github.com/ZhuiyiTechnology/t5-pegasus

https://github.com/google-research/pegasus

t5-pegasus-textsummary's People

Contributors

jackie930 avatar haoranlv avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.