thudm / glm Goto Github PK

GLM (General Language Model)

License: MIT License

Python 92.63% Shell 4.89% Perl 0.48% Dockerfile 2.01%

glm's Introduction

GLM

GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.

Please refer to our paper for a detailed description of GLM:

GLM: General Language Model Pretraining with Autoregressive Blank Infilling (ACL 2022)

Zhengxiao Du*, Yujie Qian*, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, Jie Tang (*: equal contribution)

News: We release ChatGLM-6B, an open pre-trained language model with 6 billion parameters optimized for Chinese QA and dialogue based on the GLM framework.

Pretrained Models

You can download the pretrained models used in the paper from OneDrive or Tsinghua-Cloud.

Name	Params	Language	Corpus	Objective	File	Config
GLM-Base	110M	English	Wiki+Book	Token	glm-base-blank.tar.bz2	model_blocklm_base.sh
GLM-Large	335M	English	Wiki+Book	Token	glm-large-blank.tar.bz2	model_blocklm_large.sh
GLM-Large-Chinese	335M	Chinese	WuDaoCorpora	Token+Sent+Doc	glm-large-chinese.tar.bz2	model_blocklm_large_chinese.sh
GLM-Doc	335M	English	Wiki+Book	Token+Doc	glm-large-generation.tar.bz2	model_blocklm_large_generation.sh
GLM-410M	410M	English	Wiki+Book	Token+Doc	glm-1.25-generation.tar.bz2	model_blocklm_1.25_generation.sh
GLM-515M	515M	English	Wiki+Book	Token+Doc	glm-1.5-generation.tar.bz2	model_blocklm_1.5_generation.sh
GLM-RoBERTa	335M	English	RoBERTa	Token	glm-roberta-large-blank.tar.bz2	model_blocklm_roberta_large.sh
GLM-2B	2B	English	Pile	Token+Sent+Doc	glm-2b.tar.bz2	model_blocklm_2B.sh
GLM-10B	10B	English	Pile	Token+Sent+Doc	Download	model_blocklm_10B.sh
GLM-10B-Chinese	10B	Chinese	WuDaoCorpora	Token+Sent+Doc	Download	model_blocklm_10B_chinese.sh

Unzip the downloaded file into a local folder and set CHECKPOINT_PATH in the corresponding scripts to the folder path.

Results

SuperGLUE

dev set, single model, single-task finetuning

Model	COPA	WSC	RTE	WiC	CB	MultiRC	BoolQ	ReCoRD
GLM-10B	98.0	95.2	93.1	75.7	98.7/98.2	88.1/63.3	88.7	94.4/94.0
DeBERTa-XXLarge-v2	97.0	-	93.5	-	-	87.8/63.6	88.3	94.1/93.7

Seq2Seq

CNN/Daily Mail (test set, no additional data used)

Model	ROUGE-1	ROUGE-2	ROUGE-L
GLM-10B	44.7	21.4	41.4
T5-11B	43.5	21.6	40.7
PEGASUS-Large	44.2	21.5	41.4
BART-Large	44.2	21.3	40.9

XSum (test set, no additional data used)

Model	ROUGE-1	ROUGE-2	ROUGE-L
GLM-10B	48.9	25.7	40.4
PEGASUS-Large	47.2	24.6	39.3
BART-Large	45.1	22.3	37.3

Language Modeling

test set, zero-shot

Model	LAMBADA (accuracy)	Wikitext103 (perplexity)
GLM-10B (bi)	72.35	11.33
GLM-10B (uni)	67.18	12.22
GPT-2	52.66	17.48
Megatron-LM (8.3B)	66.51	10.81
Turing-NLG	67.98	10.21

Get Started

Hugging Face Hub

You can access GLM models via HuggingFace Hub. Please install transformers>=4.23.1 and find all the available models here.

Generation

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = model.half().cuda()
model.eval()

# Inference
inputs = tokenizer("Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.", return_tensors="pt")
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=512)
inputs = inputs.to('cuda')
outputs = model.generate(**inputs, max_length=512, eos_token_id=tokenizer.eop_token_id)
print(tokenizer.decode(outputs[0].tolist()))

# Training
inputs = tokenizer(
    ["Tsinghua University is located in [MASK].", "One minus one equals zero, is it correct? Answer: [MASK]"],
    return_tensors="pt", padding=True)
inputs = tokenizer.build_inputs_for_generation(inputs, targets=["Beijing", "No"], max_gen_length=8, padding=False)
inputs = inputs.to('cuda')
outputs = model(**inputs)
loss = outputs.loss
logits = outputs.logits

Classification

from transformers import AutoTokenizer, AutoModelForMultipleChoice
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = AutoModelForMultipleChoice.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = model.half().cuda()
model.eval()

inputs = tokenizer(["Tsinghua University is located in [MASK].",
                    "One minus one equals zero, is it correct? Answer: [MASK]"], return_tensors="pt", padding=True)
choices = [["Beijing", "Shanghai"], ["Yes", "No"]]
inputs = tokenizer.build_inputs_for_multiple_choice(inputs, choices)
inputs = inputs.to('cuda')
outputs = model(**inputs)
logits = outputs.logits

You can also convert the finetuned checkpoints with scripts/convert_glm_checkpoint_to_transformers.py.

Docker Image

We prepare two docker images based on CUDA 10.2 and CUDA 11.2. You can pull the pre-built images from Docker Hub and run with docker v19.03+

docker run --gpus all --rm -it --ipc=host zxdu20/glm-cuda102

or replace glm-cuda102 with glm-cuda112.

You can also modify the image according to your requirements in docker/cuda102.dockerfile and build the image yourself

  docker build -f cuda102.dockerfile . -t glm-cuda102

Manual Installation

Please first install PyTorch (we use 1.7.0) and apex, and then install other dependencies by pip install -r requirements.txt

Clone this repo

git clone https://github.com/THUDM/GLM
cd GLM

Model Parallelism

If your encounter the CUDA out of memory error, which means you GPU memory is limited, you can try the model parallelism to divide the parameters into multiple GPUs. Take the two-way model parallelism as an example. First run change_mp.py to divide the checkpoint:

python change_mp.py path_to_the_checkpoint 2

Then update the checkpoint path in the model config file (such as config_tasks/model_blocklm_10B.sh) and change MP_SIZE in the script (such as scripts/ds_finetune_superglue.sh) to 2.

Usage

We provide scripts for finetuning GLM on some downstream tasks.

Left-to-Right Generation / Blank Filling (Interactive)

Change CHECKPOINT_PATH to your local path. Run the following script

bash scripts/generate_block.sh \
     config_tasks/model_blocklm_10B_chinese.sh

Some models (GLM-2B, GLM-10B, and GLM-10B-Chinese) use three different mask tokens: [MASK] for short blank filling, [sMASK] for sentence filling, and [gMASK] for left-to-right generation.

Examples

Usage of `[MASK]` (Entity Prediction):

Example1

Context: Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.

GLM: the stanford university

Example2 (Chinese)

Context: 凯旋门位于意大利米兰市古城堡旁。1807年为纪念[MASK]而建，门高25米，顶上矗立两武士青铜古兵车铸像。

GLM:拿破仑军队攻克米兰城

Usage of `[sMASK]` (Sentence Prediction)

Example3

Context: There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). [sMASK] We propose a General Language Model ( GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25× parameters of BERT Large, demonstrating its generalizability to different downstream tasks.

GLM: However, there is a growing need to develop a single pretraining model that is not only good at natural language understanding (NLU) or dialog generation/generation (dialog), but is also able to predict other tasks such as sentiment analysis, conditional generation, or machine translation (MT).

Example4 (Chinese)

Context: 工业互联网（Industrial Internet）是新一代信息通信技术与工业经济深度融合的新型基础设施、应用模式和工业生态，通过对人、机、物、系统等的全面连接，构建起覆盖全产业链、全价值链的全新制造和服务体系，为工业乃至产业数字化、网络化、智能化发展提供了实现途径，是第四次工业革命的重要基石。[sMASK] 它以网络为基础、平台为中枢、数据为要素、安全为保障，既是工业数字化、网络化、智能化转型的基础设施，也是互联网、大数据、人工智能与实体经济深度融合的应用模式，同时也是一种新业态、新产业，将重塑企业形态、供应链和产业链。当前，工业互联网融合应用向国民经济重点行业广泛拓展，形成平台化设计、智能化制造、网络化协同、个性化定制、服务化延伸、数字化管理六大新模式，赋能、赋智、赋值作用不断显现，有力的促进了实体经济提质、增效、降本、绿色、安全发展。

GLM: 工业互联网是制造业技术、管理、模式的重大变革,是推动互联网、大数据、人工智能和实体经济深度融合的重要载体,是建设制造强国和网络强国的重要基础。

Usage of `[gMASK]` (Long Text Generation)

Example5 (Chinese)

Context: 问题：冬天，**哪座城市最适合避寒？问题描述：能推荐一些国内适合冬天避寒的城市吗？回答用户：旅游爱好者回答： [gMASK]

GLM: 海南三亚,三亚的冬天是三亚的旺季,因为冬天国内大部分地方都有雾霾,而三亚却是没有雾霾的地方,所以三亚是冬天最惬意不过的地方了。在东北长大的我觉得三亚简直就是一个天堂,冬天去海口三亚旅游,享受冬天,享受阳光沙滩。但是海口却是很干燥,需要多喝水。三亚冬天最热门的玩法就是晒太阳,在海边晒晒太阳,在沙滩上晒晒太阳,感觉整个人都得到了解放。三亚还有一个特色项目,就是海上冲浪,在三亚的沙滩上冲浪也是一件非常刺激的事情。海口,海口冬季的阳光十分温暖,海南的冬季也是属于冬季旅游的旺季。冬季的海口最棒的是去海南的热带野生动植物园,那里有数之不尽的热带小动物,在这里可以近距离的和它们接触,海南的热带野生动植物园也是海南的天然氧吧。还可以在海口观澜湖公园里感受海口美丽的海景。贵阳,贵州的冬天也是十分温暖的,贵阳也是冬季避寒很好的城市之一。冬季去贵阳玩一定要去黔灵山,黔灵山是贵州香火很旺盛的一个寺庙,寺庙的冬季香火鼎盛,在冬季去寺庙游玩也是一个很好的体验。除了黔灵山,贵阳在冬季还有花溪公园可以去玩,花溪公园也是去当地公园玩最好的选择。青岛,青岛的冬天是青岛最舒服的时候,青岛有很多海滨浴场,冬天去海边泡一泡温泉,然后晒晒太阳是一件十分惬意的事情。青岛也有沙滩,冬天在沙滩上晒晒太阳,看看海,再玩玩沙滩游戏,感觉十分快乐的事。

You can also add multiple [MASK] and [sMASK] in a single example. The model will fill the blanks one by one from left to right. The answer to each blank always begins with a special <|startofpiece|>.

Examples

Example1

Context: There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and [MASK] (e.g., T5). [sMASK] We propose a General Language Model ( GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over [MASK] on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and [MASK], GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25× parameters of BERT Large , demonstrating its generalizability to different downstream tasks.

Example2 (Chinese)

Context: 工业互联网（Industrial Internet）是新一代[MASK]与[MASK]深度融合的新型基础设施、应用模式和工业生态，通过对人、机、物、系统等的全面连接，构建起覆盖全产业链、全价值链的全新制造和服务体系，为工业乃至产业数字化、网络化、智能化发展提供了实现途径，是第四次工业革命的重要基石。[sMASK] 它以网络为基础、平台为中枢、数据为要素、安全为保障，既是工业数字化、网络化、智能化转型的基础设施，也是互联网、大数据、人工智能与实体经济深度融合的应用模式，同时也是一种新业态、新产业，将重塑企业形态、供应链和产业链。当前，工业互联网融合应用向国民经济重点行业广泛拓展，形成[MASK]、智能化制造、[MASK]、个性化定制、服务化延伸、数字化管理六大新模式，赋能、赋智、赋值作用不断显现，有力的促进了实体经济提质、增效、降本、绿色、安全发展。

SuperGLUE

Download the SuperGlue data and check the experiment setup in scripts/ds_finetune_superglue.sh. Note that DATA_ROOT, CHECKPOINT_PATH, SAVE_PATH need to be changed to your local path. You may also change the batch-size and nproc_per_node according to your available hardware.
Run the following script (use the COPA dataset as an example)

bash scripts/ds_finetune_superglue.sh \
     config_tasks/model_blocklm_10B.sh \
     config_tasks/task_copa.sh

We also implement P-Tuning in our code. Run the following script to integrate p-tuning:

bash scripts/ds_finetune_superglue_prompt.sh \
     config_tasks/model_blocklm_10B.sh \
     config_tasks/task_copa.sh

To apply GLM to a new NLU dataset with cloze-filling finetuning, implement a DataProcessor in tasks/superglue/dataset.py for data loading and add a PVP in tasks/superglue/pvp.py for the cloze question. More details can be found here.

Seq2Seq

Download the Gigaword , CNN/Daily Mail or XSum dataset and check the experiment setup in scripts/ds_finetune_seq2seq.sh. Change DATA_ROOT, CHECKPOINT_PATH, SAVE_PATH to your local path.

Run the following script (use the CNN/Daily Mail dataset as an example)

bash scripts/ds_finetune_seq2seq.sh \ 
   config_tasks/model_blocklm_10B.sh \ 
   config_tasks/seq_cnndm_org.sh

The summaries are written into ./runs/experiment_name/test.jsonl.hyps. The references are written into test.jsonl.refs in the same directory. For calculating rouge, install file2rouge and download Stanford CoreNLP from here. Run the following script
```
bash scripts/evaluate_seq2seq.sh \
 ./runs/experiment_name/test.jsonl.hyps ./runs/experiment_name/test.jsonl.refs
```

Train with your own data

Process your seq2seq data into {split}.source and {split}.target, with each line being the context or the target of a sample, and split being train, val, and test.

Run the following script

bash scripts/ds_finetune_seq2seq.sh \ 
   config_tasks/model_blocklm_10B.sh \ 
   config_tasks/seq_customization.sh

You can specify the hyperparameters in config_tasks/seq_customization.sh and config_tasks/config_blocklm_10B_cnndm.json

Multiple Choice (Zero-shot)

bash scripts/evaluate_multichoice.sh config_tasks/model_blocklm_10B.sh

Note that CHECKPOINT_PATH and DATA_PATH need to be changed to your local path.

The format of each line of the data file should be

{"inputs_pretokenized": "Context and question here", "choices_pretokenized": ["Choice 1", "Choice 2", "Choice 3"], "label": int}

Language Modeling

LAMBADA Cloze Accuracy

Download the LAMBADA data and change DATA_ROOT, CHECKPOINT_PATH in scripts/evaluate_lm.sh
Run the following script

bash scripts/evaluate_lm.sh \ 
     config_tasks/model_blocklm_large_generation.sh \
     config_tasks/zero_lambada.sh

LM Perplexity

Download our test set of wikibook or Wikitext103 dataset and change DATA_ROOT, CHECKPOINT_PATH in scripts/evaluate_lm.sh

Run the following script

bash scripts/evaluate_lm.sh \ 
   config_tasks/model_blocklm_large_generation.sh \
   config_tasks/zero_wikitext.sh

Text Infilling

Download the Yahoo dataset and check the experiment setup in scripts/finetune_blank.sh. Change DATA_ROOT, CHECKPOINT_PATH, SAVE_PATH to your local path.
Run the following script

bash scripts/finetune_blank.sh \ 
     config_tasks/model_blocklm_large.sh \ 
     config_tasks/seq_blank.sh

Pretrain

Run the following script to pre-train the GLM-Large model

bash scripts/ds_pretrain_nvidia.sh config/ds_block_large.sh

The script scripts/ds_pretrain_nvidia.sh launches the training program with DeepSpeed. You should change NUM_WORKERS and NUM_GPUS_PER_WORKER to the number of workers and the number of gpus per worker. Also change HOST_FILE_PATH to the path to an OpenMPI-style hostfile. More details about DeepSpeed launcher can be found here.

The file config/ds_block_large.sh defines the hyperparameters for pretraining. Most of the arguments are fairly self-explanatory. Specifically, --train-data can be multiple keywords defined in NAMED_CORPORA in data_utils/corpora.py. The hyperparameters of the optimizer are defined in the corresponding json file under config. The semantics of the json file can be found here.

Citation

Part of the code is based on Megatron-LM and PET.

Please cite our paper if you find this code useful for your research:

@article{DBLP:conf/acl/DuQLDQY022,
  author    = {Zhengxiao Du and
               Yujie Qian and
               Xiao Liu and
               Ming Ding and
               Jiezhong Qiu and
               Zhilin Yang and
               Jie Tang},
  title     = {{GLM:} General Language Model Pretraining with Autoregressive Blank Infilling},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational
               Linguistics (Volume 1: Long Papers), {ACL} 2022, Dublin, Ireland,
               May 22-27, 2022},
  pages     = {320--335},
  publisher = {Association for Computational Linguistics},
  year      = {2022},
}

glm's People

Contributors

Stargazers

Watchers

Forkers

dumpmemory puraminy zxhhh97 abcyzj huangyf530 xiao9905 zhangjiekui wxj630 cappeta spatil6 reactivetype baai-wudao trendingtechnology haoyunhong ybz79 xiaoanshi maxylee hhhercules wwy2020211 comydream jinlmsft garfieldgzhh paddlelaw wangguojim fazarafi merouone davisli2013 hoangphuc0611 shunxing1234 suparklingmin marscrazy ntaylorox hdt91 kunlun-zhu jianqiang xizil antecede engrmhanif aitsc tianjianl qfxlcyc shangqingtu mandoxzhang kolakows shawn-guo-cn wangyc-99 wrran judelee19 xi-studio zhao9797 xv44586 larrylawl goooooooooooooooogle trellixvulnteam biandh ccssu rosssong chiang97912 wakafengfan henry-zeng zetimente macguyversmusic atfortes yuwenmichael hadryan huangyanhui haiqizhang reign12 liku-amare xialuxi imahamat fengzhilei liuyongs1 hepingtechan moomoofarm1 autogyro enockipp hit-computer pandagst derekliu-hz lincgcg xujing1022 fpzh2011 qiwang321 imbing phellonchen wangchangh shanghaiqiguang qyxing kingqn0321 andy-hhh-hub eternalimmortal jangocheng watchings 13390778668 jaedukseo co-simulation ai-framwork felixonmars goswamig

glm's Issues

Hi, i got two points.

the first is using GLM as constrained text generation, such as [CLS] [M1][x3][M2][x5][M3][S][x6][S][x1, x2][S][x4]. i recommend {Lexically Constrained Neural Machine Translationwith Levenshtein Transformer} AND {POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training} AS a contrast.
第一点就是用于硬约束文本生成, 仅通过少量限制词, 去预测大量序列片段, 生成全局通顺的句子.
to promote the model's performance, I suggest using [NOI] as nothing to insert to join in training. eg [CLS][x1][x2][M1][x3][x4][S][NOI]
通过在连续片段中插入对应[NOI]的[M] 让模型学习到什么情况下这个片段是恰当的,不需要插入的. 因为这个model很可能面临过度生成的问题, 即太多相似的输入对应不同的收敛条件(条件生成模型的通病) 所以不知道怎样去停止.
Expand the application in NMT
我觉得这个模型在机器翻译领域还是有可能去应用的不过我还没想好怎么做 =.=
如果仅仅是 [cls][src][s][tgt]感觉又和MASS之类的有点重合了想听下大佬的看法.

配置问题

what do these parameters mean？？？？

Google Colab error

Basic code:

!git clone https://github.com/THUDM/GLM
%cd GLM
!pip install -r requirements.txt
!pip install apex

modify model_path inside generate_block.sh here ; I'm using glm-1.5-generation.tar.bz2

!chmod 755 scripts/generate_block.sh
!scripts/generate_block.sh config_tasks/model_blocklm_10B_chinese.sh

Error Log:

Traceback (most recent call last):
File "generate_samples.py", line 23, in
from arguments import get_args
File "/content/GLM/arguments.py", line 23, in
from utils import get_hostname
File "/content/GLM/utils.py", line 26, in
from fp16 import FP16_Optimizer
File "/content/GLM/fp16/init.py", line 15, in
from .fp16util import (
File "/content/GLM/fp16/fp16util.py", line 21, in
import mpu
File "/content/GLM/mpu/init.py", line 35, in
from .layers import ColumnParallelLinear
File "/content/GLM/mpu/layers.py", line 28, in
from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm
File "/usr/local/lib/python3.7/dist-packages/apex/init.py", line 13, in
from pyramid.session import UnencryptedCookieSessionFactoryConfig
ImportError: cannot import name 'UnencryptedCookieSessionFactoryConfig' from 'pyramid.session' (unknown location)

EOFError: Ran out of input

My GPU is RTX3090 * 2, memory is 256g, and CPU is Intel (R) Xeon (R) gold 6230 CPU @ 2.10GHz. When I use the docker image glm-cuda112 provided to run GLM to train SuperGLUE-COPA, the following error occurs,

WARNING: could not find the metadata file /root/data/checkpoints/blocklm-base-blank/latest_checkpointed_iteration.txt
Try to directly load the checkpoint from the directory
global rank 0 is loading pretrained model /root/data/checkpoints/blocklm-base-blank/mp_rank_00_model_states.pt
Traceback (most recent call last):
File "finetune_glm.py", line 469, in
main(args)
File "/workspace/GLM-main/tasks/superglue/finetune.py", line 100, in main
finetune(args, train_valid_datasets_provider, model_kwargs,
File "/workspace/GLM-main/finetune_glm.py", line 379, in finetune
load_pretrained(model, args.load_pretrained, args, task_tokens=task_tokens)
File "/workspace/GLM-main/train_utils.py", line 23, in load_pretrained
sd = torch.load(checkpoint_name, map_location='cpu')
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 762, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
issue.txt
EOFError: Ran out of input
Killing subprocess 28691
Killing subprocess 28692
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in
main()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'finetune_glm.py', '--local_rank=1', '--deepspeed', '--deepspeed_config', 'config_tasks/config_blocklm_10B.json', '--finetune', '--cloze-eval', '--experiment-name', 'blank-base-copa_04-17-02-31', '--task', 'COPA', '--data-dir', '/root/data/superglue/COPA', '--save', '/root/data/checkpoints', '--seq-length', '256', '--checkpoint-activations', '--eval-batch-size', '16', '--save-epoch', '100000', '--num-workers', '1', '--no-load-optim', '--no-load-lr-scheduler', '--block-lm', '--num-layers', '12', '--hidden-size', '768', '--num-attention-heads', '12', '--max-position-embeddings', '512', '--tokenizer-model-type', 'bert-base-uncased', '--tokenizer-type', 'BertWordPieceTokenizer', '--load-pretrained', '/root/data/checkpoints/blocklm-base-blank', '--lr-decay-style', 'linear', '--warmup', '0.1', '--weight-decay', '1.0e-1', '--pattern-id', '0', '--save-interval', '10000', '--log-interval', '20', '--eval-interval', '1000', '--eval-iters', '100', '--fp16', '--model-parallel-size', '1', '--continuous-prompt', '--num-prompt-tokens', '3', '--epochs', '100', '--overwrite']' returned non-zero exit status 1

Simple questions on GLM pretraining mechanism

For GLM: General Language Model Pretraining with Autoregressive Blank Infilling ,

May I ask how is the sampling for input division in step (b) being done ?
why in step (c), the green x3 is moved to the end ? why is the maximum value in Position 1 limited to 5 instead of 6 ?
why Part A tokens cannot attend to Part B tokens ? but Part B tokens can attend to A ?

继续预训练如何加载模型？

我在pretrain_glm.py继续预训练加载下载下来的glm-large-chinese/mp_rank_00_model_states.pt时报错：

WARNING: could not find the metadata file /root/Data/zz/GitHub/GLM/blocklm-large-chinese/latest_checkpointed_iteration.txt 
Try to directly load the checkpoint from the directory
Traceback (most recent call last):
  File "pretrain_glm.py", line 663, in <module>
    main()
  File "pretrain_glm.py", line 580, in main
    args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
  File "/root/Data/zz/GitHub/GLM/utils.py", line 337, in load_checkpoint
    checkpoint_name, sd = model.load_checkpoint(load_dir, tag,
  File "/root/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/root/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2671, in _load_checkpoint
    client_state['optimizer'] = optim_checkpoint['optimizer']
KeyError: 'optimizer'

提供的模型文件要怎么才能正确加载呢？

Can't run models : RuntimeError: CUDA error: invalid device ordinal

When I tried to run finetune_superglue.sh, I encounted the following error.

bash scripts/finetune_superglue.sh      config_tasks/mo
del_blocklm_base.sh      config_tasks/task_atomic.sh
mkdir: cannot create directory ‘logs’: File exists
Traceback (most recent call last):
  File "finetune_glm.py", line 324, in <module>
    initialize_distributed(args)
  File "/home/pouramini/GLM/pretrain_glm.py", line 438, in initialize_distributed
using world size: 2 and model-parallel size: 1 
 > using dynamic loss scaling
    torch.cuda.set_device(device)
  File "/home/pouramini/miniconda3/envs/glm/lib/python3.7/site-packages/torch/cuda/__init__.
py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
File "/home/pouramini/miniconda3/envs/glm/lib/python3.7/runpy.py", line 193, in _run_modul
e_as_main
    "__main__", mod_spec)
  File "/home/pouramini/miniconda3/envs/glm/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/pouramini/miniconda3/envs/glm/lib/python3.7/site-packages/torch/distributed/la
unch.py", line 340, in <module>
    main()
  File "/home/pouramini/miniconda3/envs/glm/lib/python3.7/site-packages/torch/distributed/la
unch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/pouramini/miniconda3/envs/glm/lib/python3.7/site-packages/torch/distributed/la
unch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/pouramini/miniconda3/envs/glm/bin/python', '
-u', 'finetune_glm.py', '--local_rank=1', '--finetune', '--cloze-eval', '--experiment-name', 
-save', '/root/data/finetune_checkpoints', '--seq-length', '256', '--checkpoint-activations'
, '--batch-size', '8', '--eval-batch-size', '16', '--save-epoch', '50', '--seed', '1234', '-
-block-lm', '--num-layers', '12', '--hidden-size', '768', '--num-attention-heads', '12', '--
max-position-embeddings', '512', '--tokenizer-model-type', 'bert-base-uncased', '--tokenizer
-type', 'BertWordPieceTokenizer', '--load-pretrained', '/root/data/checkpoints/block-lm-blan
k-cls12-18-12-50', '--lr-decay-style', 'linear', '--warmup', '0.1', '--weight-decay', '1.0e-
1', '--save-interval', '10000', '--log-interval', '50', '--eval-interval', '1000', '--eval-i
ters', '100', '--epochs', '50', '--lr', '1e-5', '--overwrite']' returned non-zero exit statu
s 1.
Killing subprocess 13782
Killing subprocess 13783

有没有模型压缩量化等方法提高模型的推理性能

你好，感谢开源GLM大模型，目前在V100上文本生成推理相应在1-2min间，有没有模型压缩等方法提高推理性能？

GLM-2B-Chinese Pretrained Model please

I think it is suitable for Apple silicon M2 24G memory

Why vocabulary is divided by GPU number and how to load it?

When I train a pretraining model, the vocabulary is divided by the number of GPU. So I can't directly load it with origin model in downstream tasks.
How should I do?Thanks!

run infer failed

I use A100 40G * 8 to run the huggingface hub code and failed.
I try to add device_map='auto' at AutoModelForSeq2SeqLM.from_pretrained,but not support.
how to run this code?

does the GLM perform well than bert on text similarity task and ner task?

hi, i want to know how GLM can be adopted for ner task. By the way, does it perform well on text similarity task than bert? thx!

Can you provide the test.json file mentioned in TestDataset in data_utils/corpora.py？

I think this file(test.json) is small and is very important for running pretraining code. Thank you very much!!!

Consider using an different acronym than "GLM"

In your recent paper, you introduce a new method, GLM (General Language Model), and refer to this algorithm by the name "GLM" in your paper.

I wanted to offer the comment that using a name like "GLM" will likely lead to a lot of confusion since "GLM" has long referred to "Generalized Linear Model" in both the statistics and machine learning communities. There's nearly a 50 year history of using this term and it's probably the most widely-used and referred-to machine learning algorithm in existence.

Some alternative ideas:

GenLM: "General Language Model" (same name, different acronym)
GPF: "General Pretraining Framework" (in your title)
GPTLM or GPLM: "General Pretraining/Pretrained Language Model"

In your abstract: Empirically, GLM substantially outperforms BERT on the SuperGLUE natural language understanding benchmark with the same amount of pre-training data.

This type of sentence will require constant clarification and disambiguation on your part and on the part of the community, so I hope you will consider a new name for your method that's not already in use. Thank you for the consideration.

Hardware requirements for GLM-chinese-10B

I had an 4 * V100 (4*32G) server, but OOM when I tried to finetune the GLM-chinese-10B model.
What's the minimal hardware requirements?

模型权重加载问题

在进一步预训练时执行模型加载的时候报错：

RuntimeError: Error(s) in loading state_dict for GLMModel:
	Missing key(s) in state_dict: "word_embeddings.weight", "transformer.block_position_embeddings.weight". 
	Unexpected key(s) in state_dict: "mixins.block_position_embedding.block_position_embeddings.weight", "transformer.word_embeddings.weight".

模型是在经过change_mp.py切分为4片之后的模型

convert pretrained pt to huggingface

thanks your work.
I fine-tune it at my own dataset and now I want to convert it to your huggingcace format.
can you share the script that you used to convert pretrained pt to the huggingface ?

attention mask between spans

When predicting x3, can the model see [x5, x6] ?
The attention mask figure (d) seems to say yes.

continue pretrain的时候遇到loss scale的问题，怎么解决？

2022-09-14 22:57:53,096] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
[2022-09-14 22:57:53,096] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
[2022-09-14 22:57:53,096] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
[2022-09-14 22:57:53,096] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
[2022-09-14 22:57:53,096] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
[2022-09-14 22:57:53,096] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
[2022-09-14 22:57:53,096] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
[2022-09-14 22:57:53,097] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
[2022-09-14 22:57:55,597] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-09-14 22:57:55,597] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-09-14 22:57:55,597] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-09-14 22:57:55,597] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-09-14 22:57:55,597] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-09-14 22:57:55,597] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-09-14 22:57:55,597] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-09-14 22:57:55,597] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-09-14 22:57:57,418] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-09-14 22:57:57,418] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-09-14 22:57:57,418] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-09-14 22:57:57,418] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-09-14 22:57:57,419] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-09-14 22:57:57,419] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-09-14 22:57:57,419] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-09-14 22:57:57,419] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-09-14 22:57:59,286] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-09-14 22:57:59,286] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-09-14 22:57:59,286] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-09-14 22:57:59,287] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-09-14 22:57:59,287] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-09-14 22:57:59,287] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-09-14 22:57:59,287] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-09-14 22:57:59,287] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-09-14 22:58:01,089] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-09-14 22:58:01,089] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-09-14 22:58:01,089] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-09-14 22:58:01,089] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-09-14 22:58:01,089] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-09-14 22:58:01,089] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-09-14 22:58:01,089] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-09-14 22:58:01,089] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-09-14 22:58:03,610] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-09-14 22:58:03,610] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-09-14 22:58:03,610] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-09-14 22:58:03,610] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-09-14 22:58:03,610] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-09-14 22:58:03,610] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-09-14 22:58:03,610] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-09-14 22:58:03,610] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-09-14 22:58:04,962] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-09-14 22:58:04,963] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-09-14 22:58:04,963] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-09-14 22:58:04,963] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-09-14 22:58:04,963] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-09-14 22:58:04,963] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-09-14 22:58:04,963] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-09-14 22:58:04,963] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-09-14 22:58:07,493] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-09-14 22:58:07,493] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-09-14 22:58:07,493] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-09-14 22:58:07,493] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-09-14 22:58:07,494] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-09-14 22:58:07,494] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-09-14 22:58:07,494] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-09-14 22:58:07,494] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-09-14 22:58:10,152] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-09-14 22:58:10,152] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-09-14 22:58:10,152] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-09-14 22:58:10,152] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-09-14 22:58:10,152] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-09-14 22:58:10,152] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-09-14 22:58:10,152] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-09-14 22:58:10,152] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-09-14 22:58:11,943] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-09-14 22:58:11,943] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-09-14 22:58:11,943] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-09-14 22:58:11,943] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-09-14 22:58:11,943] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-09-14 22:58:11,943] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-09-14 22:58:11,943] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-09-14 22:58:11,943] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-09-14 22:58:13,765] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-09-14 22:58:13,765] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-09-14 22:58:13,765] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-09-14 22:58:13,765] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-09-14 22:58:13,765] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-09-14 22:58:13,765] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-09-14 22:58:13,765] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-09-14 22:58:13,765] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-09-14 22:58:16,346] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-09-14 22:58:16,346] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-09-14 22:58:16,346] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-09-14 22:58:16,346] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-09-14 22:58:16,346] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-09-14 22:58:16,346] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-09-14 22:58:16,346] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-09-14 22:58:16,346] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-09-14 22:58:18,958] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-09-14 22:58:18,958] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-09-14 22:58:18,958] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-09-14 22:58:18,958] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-09-14 22:58:18,958] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-09-14 22:58:18,958] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-09-14 22:58:18,958] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-09-14 22:58:18,959] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-09-14 22:58:20,798] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-09-14 22:58:20,798] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-09-14 22:58:20,798] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-09-14 22:58:20,799] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-09-14 22:58:20,799] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-09-14 22:58:20,799] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-09-14 22:58:20,799] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-09-14 22:58:20,799] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-09-14 22:58:25,414] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-09-14 22:58:25,414] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-09-14 22:58:25,414] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-09-14 22:58:25,414] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-09-14 22:58:25,414] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-09-14 22:58:25,414] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-09-14 22:58:25,414] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-09-14 22:58:25,414] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-09-14 22:58:28,171] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 4 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-09-14 22:58:28,171] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 6 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-09-14 22:58:28,171] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 1 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-09-14 22:58:28,172] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 3 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-09-14 22:58:28,172] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 2 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-09-14 22:58:28,172] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-09-14 22:58:28,172] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 5 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-09-14 22:58:28,172] [INFO] [stage2.py:1387:step] [deepspeed] fp16 dynamic loss scale overflow! Rank 7 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
Traceback (most recent call last):

RuntimeError: CUDA out of memory. Tried to allocate 4.63 GiB (GPU 6; 31.75 GiB total capacity; 20.85 GiB already allocated; 4.46 GiB free; 25.71 GiB reserved in total by PyTorch)

Optimizer state when changing MP(Model Parallelism) SIZE

Can you tell me why change_mp.py does not preserve 'optimizer'??

Not preserving optimizer states matters me because there is a case where I need to change the number of GPUs during pretraining.

But current change_mp.py doesn't preserve 'optimizer states', so it makes error(Nonetype object is not subscriptable at this line)) when I load GLM checkpoint after decrease MP_SIZE of current checkpoint.

Otherwise, is there any way to change the number of GPUs during pretraining?
* I used the torch distributed package instead of DeepSpeed.

Thanks

WARNING: could not find the metadata file /root/data/checkpoints/blocklm-large-chinese/latest_checkpointed_iteration.txt

If the data of a row exceeds 512, will the excess part be discarded?

Excuse me, I read the function def get_item() in class BlockDatast and found it only intercept 512 tokens. So is the rest of it discarded?

(I doubt this statement, because there are two data segments from the same input data)

Thanks for your solution!

运行(bash scripts/generate_block.sh config_tasks/model_blocklm_10B_chinese.sh)代码时生成的文本与示例中的不一致

运行环境
# ============
使用的main分支下的代码
运行脚本：bash scripts/generate_block.sh config_tasks/model_blocklm_10B_chinese.sh
checkpoint file: GLM-XXLarge-Chinese 10B
# ============

# 测试时候的输出
Context: 问题：冬天，**哪座城市最适合避寒？问题描述：能推荐一些国内适合冬天避寒的城市吗？回答用户：旅游爱好者 回答：[gMASK]

GLM: [CLS]问题:冬天,**哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者 回答:[gMASK]<|startofpiece|>避寒的话当然是成都啦!成都的冬天真的十分美好,天气暖和,还有不少小吃,来成都避寒真的再合适不过了。 成都的冬天没有雾霾,蓝天白云,温度也很适宜,不像东北那样冷得刺骨。成都是一个生活节奏很慢的城市,在快节奏的生活下,成都的冬天可以让你慢下来,享受生活带来的美好。 下面安利一波来成都避寒冷门的游玩线路吧! 首先推荐一条冬天来成都必打卡的线路!那就是去西岭雪山!! 西岭雪山位于成都市大邑县境内。由大邑县境内终年积雪的西岭而得名。景区面积375平方公里,为邛崃山脉的尾端。西岭雪山常年积雪。有“日出成都、大雪纷飞”的美誉。 冬天去西岭雪山简直是享受雪地的乐趣啊!而且西岭雪山距离成都市区不到90公里,开车1个多小时就能到啦!去之前最好查好天气预报,去之前一定要穿厚实暖和的羽绒服哦! 西岭雪山景区海拔约2500米,是一个大型高山滑雪场。雪山气势宏伟,终年积雪,山上有终年不化的积雪,银光灿灿,是大自然赐与成都平原地区最壮观瑰丽的自然景象。雪场目前有初级道、中级道、高级道共三条。初级道,中高级道的滑雪区域都是对外开放的。 成都的冬天非常阴冷,如果去西岭雪山的话一定不要忘记带上羽绒服哦! 除了冬季玩雪,成都在其他季节也是非常值得游玩的,尤其是春天,漫山的桃花、油菜花还有梨花,简直美翻了! 成都的春天的美景太多啦,就不一一介绍啦!如果有机会的话,大家还可以去成都周边走走,感受不一样的成都,大家也可以去看看大冰的小屋哦,里面有很多成都周边游玩的攻略。

# 示例
Context: 问题：冬天，**哪座城市最适合避寒？问题描述：能推荐一些国内适合冬天避寒的城市吗？回答用户：旅游爱好者 回答： [gMASK]

GLM:海南三亚,三亚的冬天是三亚的旺季,因为冬天国内大部分地方都有雾霾,而三亚却是没有雾霾的地方,所以三亚是冬天最惬意不过的地方了。在东北长大的我觉得三亚简直就是一个天堂,冬天去海口三亚旅游,享受冬天,享受阳光沙滩。但是海口却是很干燥,需要多喝水。 三亚冬天最热门的玩法就是晒太阳,在海边晒晒太阳,在沙滩上晒晒太阳,感觉整个人都得到了解放。三亚还有一个特色项目,就是海上冲浪,在三亚的沙滩上冲浪也是一件非常刺激的事情。 海口,海口冬季的阳光十分温暖,海南的冬季也是属于冬季旅游的旺季。冬季的海口最棒的是去海南的热带野生动植物园,那里有数之不尽的热带小动物,在这里可以近距离的和它们接触,海南的热带野生动植物园也是海南的天然氧吧。还可以在海口观澜湖公园里感受海口美丽的海景。 贵阳,贵州的冬天也是十分温暖的,贵阳也是冬季避寒很好的城市之一。冬季去贵阳玩一定要去黔灵山,黔灵山是贵州香火很旺盛的一个寺庙,寺庙的冬季香火鼎盛,在冬季去寺庙游玩也是一个很好的体验。除了黔灵山,贵阳在冬季还有花溪公园可以去玩,花溪公园也是去当地公园玩最好的选择。 青岛,青岛的冬天是青岛最舒服的时候,青岛有很多海滨浴场,冬天去海边泡一泡温泉,然后晒晒太阳是一件十分惬意的事情。青岛也有沙滩,冬天在沙滩上晒晒太阳,看看海,再玩玩沙滩游戏,感觉十分快乐的事。

10B 中文模型有下游数据集的效果么？

RT: 10B 中文模型有下游数据集的效果么？

Unable to use `AutoModelForSeq2SeqLM`

Hi, I tried to use AutoModelForSeq2SeqLM as described in your README. However, I faced this error. Can you please advice? Thank you!

GLM-chinese compare to nezha and roformer-v2 ?

I'm curious about the comparison of the three models on downstream tasks. Has anyone tried it?

Will you release this code in the future?

Does it support multilingual ?

Can the model be used with languages other than Enlgish? I mean a multi-lingual version?

How many cards do you need to fine-tune this model?

How many cards do you need to fine-tune this model? Can 4 V100s run it? Are the hardware resources consumed by DeepSpeed version and HuggingFace version the same?

请问有没有什么办法可以快速的使用GLM得到词向量呢？

In `GLM-10B-Chinese`, token id for `[gMASK]` and `[eop]` is the same. Is it a designed behavior?

It seems that the tokenizer used in GLM-10B-Chinese assign same token id for the [gMASK] and [eop] token.

Specifically, I launch the pre-training process with the following command:

bash scripts/ds_pretrain_nvidia.sh config/ds_block_10B_chinese.sh

As can be seen from the log that both eop and gMASK are assigned with the same token id 50007.

    {'pad': 50000, 'eos': 50000, 'sep': 50001, 'ENC': 50002, 'MASK': 50003, 'unk': 50004, 'sop': 50006, 'eop': 50007, 'gMASK': 50007, 'sMASK': 50008, 'dBLOCK': 50009}

I have the following questions:

Is it true that in the pre-training phase, you use the same id for [eop] and [gMASK]?
Is this a designed behavior? If so, why use the same id for [gMASK] equal to [eop]? If not, is there any plan to fix it?

Moreover, it seems that in your huggingface version. The [gMASK] token is assigned with an id of 50009, and you manually copied the embedding of 50007 to 50009 in the embedding layer (see convert_glm_checkpoint_to_transformers.py). Then why not use different id for these two tokens in the first place?

model.backward在iter == arg.eval-interval时卡住无法继续

您好，我在尝试运行GLM finetune代码时遇到了这样一个问题：我在其它iteration时都可以正常进行forward, backward和optimize step，但是当iteration index == arg.eval-iterval时，代码会在train_utils.py里的backward_step方法中的model.backward(loss)步骤卡住。希望您能帮我解答这个问题，谢谢～

deepspeed参数：
{
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 1,
"steps_per_print": 50,
"gradient_clipping": 1.0,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e7,
"allgather_bucket_size": 5e7,
"cpu_offload": true
},
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 5e-6,
"betas": [
0.9,
0.95
],
"eps": 1e-8,
"weight_decay": 1e-2
}
},
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}

model config：
使用GLM-10B-zh，参数为
MODEL_TYPE="blocklm-10B"
MODEL_ARGS="--block-lm
--cloze-eval
--task-mask
--num-layers 48
--hidden-size 4096
--num-attention-heads 64
--max-position-embeddings 1024
--tokenizer-type ChineseSPTokenizer
--load-pretrained ${CHECKPOINT_PATH}"

任务参数：
TRAIN_ARGS="--epochs 1
--batch-size 64
--lr 1e-5
--lr-decay-style linear
--warmup 0.06
--weight-decay 1.0e-1
--label-smoothing 0.1"

COMMON_ARGS="--save-interval 10000
--log-interval 50
--eval-interval 5
--eval-iters 100
--eval-epoch 2"

TASK_ARGS="--src-seq-length 608
--tgt-seq-length 15
--min-tgt-length 4
--length-penalty 0.7
--no-repeat-ngram-size 3
--num-beams 2
--select-topk
--eval-batch-size 1"

gpu：16G DCU卡

其中，每次代码在运行到iteration_ == 5，也就是第6个iter的时候会卡住

HuggingFace module

I read your paper with great interest. You seem to have a lot of novel ideas about how to improve the pretraining. Some of the scores are really impressive. I would like to test some of these ideas on other corpuses.

Have you considered making the code available as a HuggingFace module (TensorFlow/PyTorch/Flax)? I think this would lead to a lot more people looking into your ideas.

how to choose the finetuning script for question-answering task

I have a task ,which given a question, using the [sMASK] token to generate the answer sentence, which finetuning script should I use?
Should I use the text-summarization finetuning script ?

Which config is used to pretrain the released `GLM-10B-Chinese` model? is `ds_block_10B_chinese_longer.sh` or `ds_block_10B_chinese.sh`

Hi, I am wondering which config did you use to pretrain the released GLM-10B-Chinese model?

It seems that you released two configs: ds_block_10B_chinese_longer.sh and ds_block_10B_chinese.sh.

Which one is used to produce the released GLM-10B-Chinese ckpt?

bash scripts/ds_pretrain_nvidia.sh config/ds_block_10B_chinese_longer.sh

bash scripts/ds_pretrain_nvidia.sh config/ds_block_10B_chinese.sh

When is ds_block_10B_chinese_longer.sh used?

generate有没有并行的方法

观察到generate_samples没有提供批生成的接口，只能作为demo使用，不知道有没有较为快捷的方法？谢谢！

Unrecognized configuration class

When I load glm-large-chinese model, I got an error:

Model type should be one of BartConfig, BigBirdPegasusConfig, BlenderbotConfig, BlenderbotSmallConfig, EncoderDecoderConfig, FSMTConfig, LEDConfig, LongT5Config, M2M100Config, MarianConfig, MBartConfig, MT5Config, MvpConfig, PegasusConfig, PegasusXConfig, PLBartConfig, ProphetNetConfig, SwitchTransformersConfig, T5Config, XLMProphetNetConfig.
  File "/data3/xingyum/models/AntiFraudChatBot-main/finetuning/test_glm.py", line 6, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained("BAAI/glm-large-chinese", trust_remote_code=True)

Here is my code:

from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
tokenizer = AutoTokenizer.from_pretrained("BAAI/glm-large-chinese", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("BAAI/glm-large-chinese", trust_remote_code=True)

[quesiton] block_position_ids的含义

请教一个问题，模型的一个输入是position_ids，它的shape是[batch_size, 2, seq_len]。在模型后续的处理中将postion_ids拆分成了两个shape为[batch_size, seq_len]的tensor(position_ids和block_position_ids), position_ids是一个0到seq_len的递增序列，很好理解。请问一下block_position_ids的含义是什么？

About "Text Summarization"

似乎您还没有开放在 Text Summarization 任务上的测试代码，在论文中您提供的关于seq2seq的描述也不是很详细。所以有一些问题想请教您。
您是采用的encoder-decoder 的结构吗？需要另外添加一个decoder, 并在下游数据集上进行微调？

自定义tokenizer

请问GLM-10B-Chinese的tokenizer是否支持添加自定义的token？如果支持的话大概的方式是什么？非常感谢！

How to pre-train? has anyone start pre-train successful?

the repo contains pretrain_glm.py and scripts/ds_pretrain_nvidia.sh, and it got an args $1, so what is this $1？ is it the config_tasks/seq_blank.sh?
I want to know if the model arch is GPT-2 for all tasks? just use the mask matrix to cover bi-di and uni-di attention? and if so, how can the model know the cloze's orders? for example, if gpt-2 and two clozes like "A [MASK1] B [MASK2] C" and model predict "[s] x1 x2 [\s] [s] x3[\s]" , with filling back , how to know its [A][x1][x2]B[x3][C], not [A][x3]B[x1][x2][C]?

thanks a lot if someone answers.

Hardware requirements

I was trying to find hardware requirements for serving, and maybe also fine tuning, a monolingual version for English. Where can I find it? Is it something that can also be potentially added to the README.md?

Are the requirements comparable to those needed to serve Meta AI OPT models?

Are pretraining codes released?

I found that the code in pretrain_glm.py is different from the description in your paper GLM, I don't know if I've misunderstood your code because there aren't many comments?

Information about those new released multi-task model

Hi there,

I wonder whether there is documentation about the newly released multi-task model checkpoint. For example, explaining whether it is trained on the multi-lingual dataset and how those models are trained (in terms of multi-task).

Thanks in advance.

Evaluation on SQuAD

Hello.
I want to ask you 2 things.

Evaluate GLM on SQuAD task for EM(Exact Match)/F1 to reproduce Table 10 results in your paper .
Following your code, which task (seq2seq/supergue) you evaluate is ambiguous.
If you evaluated on seq2seq task, there's no source code for metric(EM/F1).
And if you evaluated on superglue task, It doesn't make sense to measure EM/F1 because your code's metric for 'squad' is on accuracy_metric.
So, how did you get SQuAD task results (EM/F1) in Table 10 in your paper? (what command or path the code is executed?)
How to implement blank-infilling task with no label/candidates for superglue task.
(1) Based on your code, when the task is multi_token=True, you just implemented get_labels() returns only ['0'].
I wonder what the part(get_labels() returns ['0']) means.
(2) I checked you implemented SquadPVP, CMRCPVP(tasks/superglue/pvp.py) which have no labels/candidates.
But when I actually run Squad task in superglue directory, the model predicts only label(0) not things like token IDs.
So I don't know how to implement such a case for getting token IDs by blank-infilling task in the superglue code.

Thanks

Training and inference issue

Dear authors, thanks for your great work.

I have a little question. The paper says that the spans are shuffled in the training time.
I was wondering whether the predicted spans are in order when inference, and how to make them orderly?

Look forward to hearing from you.

Accelerate the model inference of GLM-10B

Hi, I want to accelerate the model inference of GLM-10B, do you have any suggestion? Or have some attempts.

The multi-task learning setting is different from the original paper

According to the GLM paper, multi-task learning has two way, one is a mixture of the blank infilling object and the sentence-level objective, another is a mix of the blank infilling object and the document-level objective.
But when I read the pre-training code (/config/ds_block_chinese.sh), I found that multi-task learning takes 40% blank infilling object, 30% sentence-level objective, 30% document-level objective.
Am I understanding it wrong?

...
gpt_options=" \
       --block-lm \
       --task-mask \
       --bert-prob 0.4 \
       --gap-sentence-prob 0.3 \
...

Do we have a GLM model of Chinese version?

Hi,
Thanks for releasing code and model so nicely.
But it seems the released GLM model is only for English, do we have a Chinese version?

Thanks a lot.

text infilling cases

Thanks for your wonderful work!

I tried

bash scripts/generate_block.sh \
     config_tasks/model_blocklm_large.sh

with many context inputs, but got unsatisfied predictions (weird tokens or not consistently with local context), for example:

#1
Context: Ng is a good teacher at [MASK] .

GLM: [CLS] ng is a good teacher at [MASK] . [PAD] <|startofpiece|> are . e

#2
Context: Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a [MASK] in online education

GLM: [CLS] ng is an adjunct professor at [MASK] ( formerly associate professor and director of its stanford ai lab or sail ) . also a [MASK] in online education [PAD] <|startofpiece|> the university of arizona <|startofpiece|> researcher at the

#3
Context: Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a [MASK] in online education, Ng co-founded Coursera and deeplearning.ai.

GLM: [CLS] ng is an adjunct professor at [MASK] ( formerly associate professor and director of its stanford ai lab or sail ) . also a [MASK] in online education , ng co - founded coursera and deeplearning . ai . [PAD] <|startofpiece|> the university of michigan <|startofpiece|> senior associate at the university of

I want to generate multiple spans for a given context, I was wondering whether I have mistakes in using the scripts.

thudm / glm Goto Github PK

glm's Introduction

GLM

Pretrained Models

Results

Seq2Seq

Language Modeling

Get Started

Hugging Face Hub

Generation

Classification

Docker Image

Manual Installation

Clone this repo

Model Parallelism

Usage

Left-to-Right Generation / Blank Filling (Interactive)

Usage of [MASK] (Entity Prediction):

Example1

Example2 (Chinese)

Usage of [sMASK] (Sentence Prediction)

Example3

Example4 (Chinese)

Usage of [gMASK] (Long Text Generation)

Example5 (Chinese)

Example1

Example2 (Chinese)

SuperGLUE

Seq2Seq

Train with your own data

Multiple Choice (Zero-shot)

Language Modeling

LAMBADA Cloze Accuracy

LM Perplexity

Text Infilling

Pretrain

Citation

glm's People

Contributors

Stargazers

Watchers

Forkers

glm's Issues

modify model_path inside generate_block.sh here ; I'm using glm-1.5-generation.tar.bz2

Recommend Projects

Recommend Topics

Recommend Org

Usage of `[MASK]` (Entity Prediction):

Usage of `[sMASK]` (Sentence Prediction)

Usage of `[gMASK]` (Long Text Generation)