microsoft / lmops Goto Github PK

General technology for enabling AI capabilities w/ LLMs and MLLMs

License: MIT License

Makefile 0.01% Python 96.66% Batchfile 0.01% Shell 0.88% C++ 0.11% C 0.01% Cuda 0.43% Perl 0.01% Cython 0.04% Lua 0.01% Dockerfile 0.04% Jsonnet 0.01% Jupyter Notebook 1.45% MDX 0.33%

nlp agi gpt llm lm pretraining prompt lmops promptist x-prompt

lmops's Introduction

LMOps

LMOps is a research initiative on fundamental research and technology for building AI products w/ foundation models, especially on the general technology for enabling AI capabilities w/ LLMs and Generative AI models.

Better Prompts: Automatic Prompt Optimization, Promptist, Extensible prompts, Universal prompt retrieval, LLM Retriever, In-Context Demonstration Selection
Longer Context: Structured prompting, Length-Extrapolatable Transformers
LLM Alignment: Alignment via LLM feedback
LLM Accelerator (Faster Inference): Lossless Acceleration of LLMs
LLM Customization: Adapt LLM to domains
Fundamentals: Understanding In-Context Learning

News

[Paper Release] Nov, 2023: In-Context Demonstration Selection with Cross Entropy Difference (EMNLP 2023)
[Paper Release] Oct, 2023: Tuna: Instruction Tuning using Feedback from Large Language Models (EMNLP 2023)
[Paper Release] Oct, 2023: Automatic Prompt Optimization with "Gradient Descent" and Beam Search (EMNLP 2023)
[Paper Release] Oct, 2023: UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation (EMNLP 2023)
[Paper Release] July, 2023: Learning to Retrieve In-Context Examples for Large Language Models
[Paper Release] April, 2023: Inference with Reference: Lossless Acceleration of Large Language Models
[Paper Release] Dec, 2022: Why Can GPT Learn In-Context? Language Models Secretly Perform Finetuning as Meta Optimizers
[Paper & Model & Demo Release] Dec, 2022: Optimizing Prompts for Text-to-Image Generation
[Paper & Code Release] Dec, 2022: Structured Prompting: Scaling In-Context Learning to 1,000 Examples
[Paper Release] Nov, 2022: Extensible Prompts for Language Models

Prompt Intelligence

Advanced technologies facilitating prompting language models.

Promptist: reinforcement learning for automatic prompt optimization

[Paper] Optimizing Prompts for Text-to-Image Generation

Language models serve as a prompt interface that optimizes user input into model-preferred prompts.

Learn a language model for automatic prompt optimization via reinforcement learning.

Structured Prompting: consume long-sequence prompts in an efficient way

[Paper] Structured Prompting: Scaling In-Context Learning to 1,000 Examples

Example use cases:

Prepend (many) retrieved (long) documents as context in GPT.

Scale in-context learning to many demonstration examples.

X-Prompt: extensible prompts beyond NL for descriptive instructions

[Paper] Extensible Prompts for Language Models

Extensible interface allowing prompting LLMs beyond natural language for fine-grain specifications

Context-guided imaginary word learning for general usability

LLMA: LLM Accelerators

Accelerate LLM Inference with References

[Paper] Inference with Reference: Lossless Acceleration of Large Language Models

Outputs of LLMs often have significant overlaps with some references (e.g., retrieved documents).

LLMA losslessly accelerate the inference of LLMs by copying and verifying text spans from references into the LLM inputs.

Applicable to important LLM scenarios such as retrieval-augmented generation and multi-turn conversations.

Achieves 2~3 times speed-up without additional models.

Fundamental Understanding of LLMs

Understanding In-Context Learning

[Paper] Why Can GPT Learn In-Context? Language Models Secretly Perform Finetuning as Meta Optimizers

According to the demonstration examples, GPT produces meta gradients for In-Context Learning (ICL) through forward computation. ICL works by applying these meta gradients to the model through attention.

The meta optimization process of ICL shares a dual view with finetuning that explicitly updates the model parameters with back-propagated gradients.

We can translate optimization algorithms (such as SGD with Momentum) to their corresponding Transformer architectures.

Hiring: aka.ms/GeneralAI

We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (aka large-scale pre-trained models) and AGI, NLP, MT, Speech, Document AI and Multimodal AI, please send your resume to [email protected].

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using the pre-trained models, please submit a GitHub issue. For other communications, please contact Furu Wei ([email protected]).

lmops's People

Contributors

Stargazers

Watchers

Forkers

yrdddream techthiyanes eltociear syaikhipin mbrukman alirezabayatmk rosssong jasondotparse kinddevil jay-cryptic-road c00renut nimsala1234 markhng525 nashid kamelkaouech dmater01 ewvue spursy rpatil524 benjamin-ky deltavml skp80 deanofthewebb jxzhangjhu stjordanis timfoote carlosurteaga yani-abdesselam sguard-inc hertera1 shivajid vishcodex yongzx muharremokutan caiogasparine ai-hub-deep-learning-fundamental ukaserge ssanchez-sonatype jcmo-tec21 techventurebuilder ailabteam amirulandalib 19861826 standardgalactic fernando83mat adai-5090 isabelafaraujo eycab balakreshnan ustcwhy dan255 sleekss jeonsworld c6ai dhanushkanadeeshan lilujunai herisai maniyantingliu etherx-dev michaelsourc buaahsh cdxeve nguyenducnhaty smksyj r3dfruitrollup vcip2015 iali61 mittalpusa sikkha 00mjk evdcush zzmjohn contropist yubobo aicodehunt sadjadasghari hunter-ddm guangyangsjc18 onemersa magicpwz feng-huang apollohuang1 qxr20-academy nyanyanya saibaldasprivate zixind haorand chunlingpeng ai-natural-language-processing-lab t1101675 pengfeiwu1999 leejodie trentbrucegithub ai-ar4s-dev miriam980729 kiming-ng lzhgrla mldk-tech zishu1992 adrianwedd

lmops's Issues

Is there source code of LLMA？

[miniLLM] model convert on gpt_2 series

Hi, are there any bugs with the splitting and merging code of the GPT2 series model?
In file 'transformers/src/transformers/models/gpt2_parallel/utils_gpt2.py', the model splitting function 'increase_mp_gpt2()' considered different situations thus appearing a little bit more complicated than that of llama and opt. I think that's because the way weights of layers stored in gpt2 is different from that of llama and opt. For example, Q/K/V of llama and opt are stored as different matrices individually with size of [hidden_dim, hidden_dim], while QKV of gpt2 are stored together as one single matrix with size of [hidden_dim, 3*hidden_dim].
However, the gpt2's model merging function 'decrease_mp_gpt2()' does not match the 'increase_mp_gpt2()'. GPT2's 'decrease_mp_gpt2()' is totally same as that of llama and opt, without considering the difference mentioned above. Is that correct?
Thank you for your reply!

experiment question about [Why Can GPT Learn In-Context? Language Models Secretly Perform Finetuning as Meta Optimizers]

Much thanks for your excellent work about:
Why Can GPT Learn In-Context? Language Models Secretly Perform Finetuning as Meta Optimizers
I want to know if you have verified that is W_{ICL} similar to W_{FT}, which may more directly verify the relationship between In-context learning and fine-tuning, and further demonstrate the motivation in the article.

Can I apply minillm to phoenix model

Hello, I would like to ask Can I apply minillm to phoenix model.
What code do I need to modify and how to modify it?

Thanks for your help

[miniLLM] how to use lora training when the model mp_size > 1

Sorry to bother you guys again.

With the lora script, I can use lora to speed up the training of llama-7b and llama-13b if I do not use model parallel and MP=4 arugument ). The training is much faster with lora.

However, the peft library cannot be directly applied to The ParallelLlamaForCausalLM class because some subclasses used by ParallelLlamaForCausalLM is not supported by peft.

The following is the error message if I apply peft when I set model parallel and MP=4 arugument.

ValueError: Target module ColumnParallelLinear() is not supported. 
Currently, only `torch.nn.Linear` and `Conv1D` are supported.

However, when I try to use scripts/llama/minillm/train_7B_13B_lora.sh to use the minillm method on 2 x 8 v100 gpus. The student model llama-7b and teacher llama-13b are both trained by lora. I got Out of Memory error because by default model parallel is not used. So I guess MP > 1 is needed.

Running structured prompting with Fairseq fails to start due to import error

Hey, I'm following the guide in the readme to run the manyshots structured prompting example, but it doesn't seem to start up properly with an import error within fairseq.

I'm running in a local (rather than docker) environment, the relevant packages set up with conda, tried it with Python 3.8.15 an 3.9.15.

This is the error message I get:

Traceback (most recent call last):
  File "validate.py", line 9, in <module>
    import struprompting
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/struprompting/__init__.py", line 1, in <module>
    import struprompting.models
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/struprompting/models/__init__.py", line 3, in <module>
    from fairseq.models import import_models
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/__init__.py", line 235, in <module>
    import_models(models_dir, "fairseq.models")
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/__init__.py", line 217, in import_models
    importlib.import_module(namespace + "." + model_name)
  File "/opt/anaconda/envs/fairseq/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/hubert/__init__.py", line 6, in <module>
    from .hubert import *  # noqa
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/hubert/hubert.py", line 20, in <module>
    from fairseq.models.wav2vec.wav2vec2 import (
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/wav2vec/__init__.py", line 6, in <module>
    from .wav2vec import *  # noqa
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/wav2vec/wav2vec.py", line 25, in <module>
    from fairseq.tasks import FairseqTask
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/tasks/__init__.py", line 15, in <module>
    from .fairseq_task import FairseqTask, LegacyFairseqTask  # noqa
  File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/tasks/fairseq_task.py", line 13, in <module>
    from fairseq import metrics, search, tokenizer, utils
ImportError: cannot import name 'metrics' from 'fairseq' (unknown location)

Looks like this happens because when running the script, Python is trying to load the requested fairseq lib from the local folder within that folder where the whole example is (structured_prompting/fairseq-version/), and not using the version that was installed (with the README's pip install --user -e fairseq/ line). That is, the installed library and the local folder are clashing during script import time.

The script also cannot be run from another folder, as it wants to use the validate.py file,and thus the above clash cannot be resolved by just moving into another folder.

I managed to make this work by renaming the fairseq folder and reinstalling the module from the new, non-clashing name. Would this be the ultimate solution? Or am I missing something / doing something incorrectly so this import confusion happens?

Codes about the work: 'Why Can GPT Learn In-Context? '

Much thanks for your excellent work about:
Why Can GPT Learn In-Context? Language Models Secretly Perform Finetuning as Meta Optimizers

I think this work will help us have a better understanding about ICL & LMs model designing. Therefore, I'm expect to see the codes about this work, especially for the part of momentum attention.

Open source will lead to positive effect, thx!

Distillation between two different LLM

Hi, just wonder if it is possible to do distillation between two model with different tokenizers. the two tokenizers can be different in vocabulary size or have tokens in different position.

MiniLLM: BOS token is missing in training, but present during evaluation

As the title says, the bos_token_id is omitted during dataset preprocessing (process_data_dolly.py#L40 has add_special_tokens=False), but is not removed during evaluation (prompt_datasets.py#L66, for example, lacks add_special_tokens).

Indeed, datapoints from the prerocessed dolly dataset you've provided have no tokens 1, however, input_ids tensors in the eval script do have it.

So which one is the correct way to use the model?

The LLMA method seems not support batch decoding.

We know that when the batch size increases, it benefits from the powerful parallel capabilities of the GPU, and the speedup is often significantly larger than the 2-3X acceleration mentioned in the llma readme. So, I am quite curious whether I have misunderstood something about this source code.

It seems that if we apply the llma algorithm in batch form, the actual step length for each sentence at each time step varies, making it impossible to form a tensor properly.

Tasks

Beta Give feedback

No tasks being tracked yet.

Options

How to define the action space in RL?

Is it possible to share the actions in the RL? For example, how to make variation of the prompt in order to improve the aesthetic score? Thank you very much.

modeling_xgml.py

can you please modify modeling_xgml.py . In such way, we can you your code with huggingface. Thank you

how to import the customized transformers

i wonder how i can import from the local transformers folder. i seems the installed transformers will be considered priorly as i can not import module like ParallelLlamaForCausalLM.

Feature request/Idea

Hi,

You could potentially enable multiple trainer models (say a great coding model and a great conversational model) by using a simple sum of two reverse KLD's?

θ = arg min J (θ) = arg min KL[qθ||p_1] + KL[qθ||p_2]

It would potentially counter the effect of ignoring important minor modes (if and when those exist) the the probability vector. This could arguable speed up base model training if you already have various solid base models with partial world knowledge (a specific language, or programming language).

I dont know enough math to know whether it would then complicate the gradient derivation in equation 2.

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/HardDisk/b00373/LMOps/minillm/checkpoints/opt-1.3B'. Use `repo_type` argument if needed.

When I run this instruction bash scripts/opt/tools/process_data_dolly.sh /PATH/TO/MiniLLM # Process Dolly Train / Validation Data，it has some error messages like 'huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/HardDisk/b00373/LMOps/minillm/checkpoints/opt-1.3B'. Use repo_type argument if needed.'

how can I solve it，thanks a lot for youe help

SFT validation loss of minillm

I did sft (supervised fine-tune) on minillm with the polly dataset by the following order: bash scripts/gpt2/sft/sft_large.sh, and I got the running logs as below:

============================== EXP at 2023-07-24 11:19:35 ==============================
dev | avg_loss: 2.50592041015625 | {'exact_match': 0.0, 'rougeL': 7.2973}
......
[2023-07-24 11:53:25] train | epoch   0 | Iter:   1428/ 14290 | global iter:   1428/ 14290 | loss: 1.8361 | ds_loss: 0.0000 | lr: 4.8781e-05 | scale:  2048.0000 | micro time: 1.180 | step time: 1.187
dev | avg_loss: 2.17413330078125 | {'exact_match': 2.6, 'rougeL': 19.8483}
......
[2023-07-24 12:26:31] train | epoch   1 | Iter:   2856/ 14290 | global iter:   2856/ 14290 | loss: 1.5378 | ds_loss: 0.0000 | lr: 4.5241e-05 | scale:  4096.0000 | micro time: 1.203 | step time: 1.194
dev | avg_loss: 2.283782958984375 | {'exact_match': 3.7, 'rougeL': 23.8785}
......
[2023-07-24 12:58:48] train | epoch   2 | Iter:   4284/ 14290 | global iter:   4284/ 14290 | loss: 0.6872 | ds_loss: 0.0000 | lr: 3.9729e-05 | scale:  8192.0000 | micro time: 1.192 | step time: 1.194
dev | avg_loss: 2.521240234375 | {'exact_match': 3.6, 'rougeL': 25.4109}
......
[2023-07-24 13:32:01] train | epoch   3 | Iter:   5716/ 14290 | global iter:   5716/ 14290 | loss: 0.4638 | ds_loss: 0.0000 | lr: 3.2760e-05 | scale:  8192.0000 | micro time: 1.182 | step time: 1.182
dev | avg_loss: 2.7791748046875 | {'exact_match': 3.9, 'rougeL': 26.4599}
......
[2023-07-24 14:04:43] train | epoch   4 | Iter:   7144/ 14290 | global iter:   7144/ 14290 | loss: 0.2177 | ds_loss: 0.0000 | lr: 2.5055e-05 | scale: 16384.0000 | micro time: 1.193 | step time: 1.190
dev | avg_loss: 2.98834228515625 | {'exact_match': 4.1, 'rougeL': 27.4338}
......
[2023-07-24 14:37:37] train | epoch   5 | Iter:   8572/ 14290 | global iter:   8572/ 14290 | loss: 0.1473 | ds_loss: 0.0000 | lr: 1.7361e-05 | scale: 16384.0000 | micro time: 1.183 | step time: 1.183
dev | avg_loss: 3.14642333984375 | {'exact_match': 4.0, 'rougeL': 27.7023}
......
[2023-07-24 15:10:29] train | epoch   6 | Iter:  10000/ 14290 | global iter:  10000/ 14290 | loss: 0.0627 | ds_loss: 0.0000 | lr: 1.0407e-05 | scale: 32768.0000 | micro time: 1.189 | step time: 1.188
dev | avg_loss: 3.27813720703125 | {'exact_match': 4.2, 'rougeL': 28.5247}
......
[2023-07-24 15:43:35] train | epoch   7 | Iter:  11432/ 14290 | global iter:  11432/ 14290 | loss: 0.0307 | ds_loss: 0.0000 | lr: 4.8715e-06 | scale: 32768.0000 | micro time: 1.194 | step time: 1.192
dev | avg_loss: 3.38690185546875 | {'exact_match': 4.0, 'rougeL': 28.6034}
......
[2023-07-24 16:16:19] train | epoch   8 | Iter:  12860/ 14290 | global iter:  12860/ 14290 | loss: 0.0184 | ds_loss: 0.0000 | lr: 1.3279e-06 | scale: 32768.0000 | micro time: 1.197 | step time: 1.192
dev | avg_loss: 3.46453857421875 | {'exact_match': 4.7, 'rougeL': 28.4784}
......
[2023-07-24 16:49:04] train | epoch   9 | Iter:  14288/ 14290 | global iter:  14288/ 14290 | loss: 0.0097 | ds_loss: 0.0000 | lr: 1.0002e-07 | scale: 65536.0000 | micro time: 1.194 | step time: 1.194
dev | avg_loss: 3.4901123046875 | {'exact_match': 4.2, 'rougeL': 29.0872}

The experiment records above show that as the training progresses, both the exact_match and rougeL scores increase overall, which is consistent with expectations. However, it is puzzling that the validation set loss (avg_loss) is gradually increasing as well. According to common knowledge in deep learning, an increase in the loss function contradicts an increase in recognition/detection accuracy. How can we explain this phenomenon?

minillm: Run DeepSpeed. After initialization, some parameters of the model are missing, such as model_ori.transformer.h[0].attn.c_attn.wight.

def setup_model_and_optimizer(args, ds_config, device, set_optim=True):
# get the model
model = get_model(args, device)
# get the optimizer and lr_scheduler
if set_optim:
optimizer = get_optimizer(args, model)
lr_scheduler = get_learning_rate_scheduler(args, optimizer)
else:
optimizer, lr_scheduler = None, None
model_ori = model
model, optimizer, _, lr_scheduler = deepspeed.initialize(
model=model,
optimizer=optimizer,
args=args,
lr_scheduler=lr_scheduler,
mpu=mpu if args.model_parallel else None,
config_params=ds_config

    Run DeepSpeed. After initialization, some parameters of the model are missing, such as model_ori.transformer.h[0].attn.c_attn.wight.

embed_positions of hf version of Structured Prompting ()

For the hf version of Structured Prompting (modeling_opt.py):
let:

config.max_position_embeddings=2048
config. hidden_size=768

self.embed_positions = OPTLearnedPositionalEmbedding(config.max_position_embeddings, config.hidden_size)

past_key_values_length = 3000
attention_mask.size() is (1, 3400)

pos_embeds = self.embed_positions(attention_mask, past_key_values_length)

Now both past_key_values_length and attention_mask exceed the max_position_embeddings. In this situation, what is the solution thank you

Issue with using minillm for distilling models with different d_model.

Hello!

Thank you for open sourcing this amazing work. We are trying to distill encoder-decoder architectures like flan-t5 using minillm. However, we are facing the following issue when distilling flan-t5-xl to flan-t5-large.

Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/train_minillm.py", line 109, in <module>
    main()
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/train_minillm.py", line 95, in main
    train(
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/__init__.py", line 37, in train
    sampler.run_sample(args.num_rollouts)
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/sampler.py", line 71, in run_sample
    gen_out = self.trainer.generate(**batch, return_dict_in_generate=True, mode=mode, teacher_mixed_sample=(self.args.teacher_mixed_alpha is not None), output_scores=True)
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/trainer.py", line 620, in generate
    gen = model.generate(
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/model.py", line 29, in generate
    return self.base_model.generate(**x)
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/generation/utils.py", line 1580, in generate
    return self.sample(
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/generation/utils.py", line 2704, in sample
    m_outputs = mix_in_model(
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 1720, in forward
    decoder_outputs = self.decoder(
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 1090, in forward
    layer_outputs = layer_module(
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 723, in forward
    cross_attention_outputs = self.layer[1](
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 634, in forward
    attention_output = self.EncDecAttention(
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 522, in forward
    key_states = project(
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 500, in project
    hidden_states = shape(proj_layer(key_value_states))
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x1024 and 2048x2048)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11072) of binary: /home/ec2-user/SageMaker/tanayn/kd/bin/python3.10
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/tanayn/kd/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I believe this is happening since the d_model size for both the models differ (1024 for flan-t5-large and 2048 for flan-t5-xl).

So, I have two questions.

Can we distill models from the same family but different d_model sizes? If so, what would be a way to fix this error?
Does the code support distilling encoder-decoder models using minillm? Or are there some challenges that the authors foresee in doing so?

try to employ the deepspeed-zero2

When the following code is executed, an error occurs

model, optimizer, _, lr_scheduler = deepspeed.initialize(
model=model,
optimizer=optimizer,
args=args,
lr_scheduler=lr_scheduler,
mpu=None,
config_params=ds_config
)

Uprise: load persistent id instruction was encountered

I'm trying to run the generate_dense_embeddings script with the following command

python DPR/generate_dense_embeddings.py  model_file=/root/LMOps/uprise/archive/data.pkl  ctx_src=dpr_uprise shard_id=0 num_shards=1  out_file=$PWD/my_data/experiment/uprise/dpr_enc_index  ctx_sources.dpr_uprise.prompt_pool_path=${PROMPT_POOL}  ctx_sources.dpr_uprise.prompt_setup_type=qa  encoder.cache_dir=${CACHE_DIR}  hydra.run.dir=$PWD/my_data/experiment/uprise

But I'm getting the below error.

[2023-10-17 13:48:09,195][root][INFO] - Reading saved model from /root/LMOps/uprise/archive/data.pkl
Error executing job with overrides: ['model_file=/root/LMOps/uprise/archive/data.pkl', 'ctx_src=dpr_uprise', 'shard_id=0', 'num_shards=1', 'out_file=/root/LMOps/uprise/my_data/experiment/uprise/dpr_enc_index', 'ctx_sources.dpr_uprise.prompt_pool_path=prompt_pool.json', 'ctx_sources.dpr_uprise.prompt_setup_type=qa', 'encoder.cache_dir=cache/']
Traceback (most recent call last):
  File "DPR/generate_dense_embeddings.py", line 106, in main
    saved_state = load_states_from_checkpoint(cfg.model_file)
  File "/root/LMOps/uprise/DPR/dpr/utils/model_utils.py", line 170, in load_states_from_checkpoint
    state_dict = torch.load(
  File "/root/LMOps/uprise/.env/lib/python3.8/site-packages/torch/serialization.py", line 1028, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/root/LMOps/uprise/.env/lib/python3.8/site-packages/torch/serialization.py", line 1246, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.

Could you provide the ROBERTA training corpus in MiniLLM?

Could you please let me know if it is possible to share the dataset? If so, I would greatly appreciate it if you could provide me with information on where I can find it. Currently it is missing in the data.tar archive. Thank you!

TypeError: 'Loss.pt_loss() missing 1 required positional argument: 'logits'

It appears there is a bug in the source code related to a TypeError: 'Loss.pt_loss() missing 1 required positional argument: 'logits'' within the 'evaluate_pt' function in 'trainer.py'. Would it be possible to have any suggestion on that? Thank you!

I am getting "NameError: name 'overall_cls' is not defined" error when I run python raw2read.py

Hello all, when I run python raw2read.py I am getting "NameError: name 'overall_cls' error. Here I am providing part log.
Help me in fixing in this issue.

PS C:\Users\rajas\Desktop\AI_Research\LMOps-main\LMOps-main\adaptllm> python raw2read.py
max_workers: 12
loading raw texts in the input folder...
paths: ['./data_samples/input-raw-texts\0.txt', './data_samples/input-raw-texts\1.txt', './data_samples/input-raw-texts\10.txt', './data_samples/input-raw-texts\11.txt', './data_samples/input-raw-texts\2.txt', './data_samples/input-raw-texts\3.txt', './data_samples/input-raw-texts\4.txt', './data_samples/input-raw-texts\5.txt', './data_samples/input-raw-texts\6.txt', './data_samples/input-raw-texts\7.txt', './data_samples/input-raw-texts\8.txt', './data_samples/input-raw-texts\9.txt']
12it [00:00, ?it/s]
transferring raw texts into reading comprehension...
0%| | 0/12 [00:00<?, ?it/s]
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\rajas\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\process.py", line 256, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rajas\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\process.py", line 205, in _process_chunk
return [fn(*args) for args in chunk]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rajas\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\process.py", line 205, in
return [fn(*args) for args in chunk]
^^^^^^^^^
File "C:\Users\rajas\Desktop\AI_Research\LMOps-main\LMOps-main\adaptllm\raw2read.py", line 19, in search
context_wo_title = overall_cls.truncate_sentence(context_wo_title, max_len=overall_cls.max_seq_len-200)
^^^^^^^^^^^
NameError: name 'overall_cls' is not defined

Thanks in advance

Facing issue with pt_loss computation while evaluation in minillm trainer.

Thank you very much for sharing the code.

I am trying to run the scripts when I am facing the following issue in minillm/trainer.py script.

Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/train_minillm.py", line 99, in <module>
    main()
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/train_minillm.py", line 85, in main
    train(
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/__init__.py", line 50, in train
    trainer.train()
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/trainer.py", line 306, in train
    self.evaluate()
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/trainer.py", line 408, in evaluate
    eval_pt_results = self.evaluate_pt()
  File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/trainer.py", line 527, in evaluate_pt
    _, stats = self.losses.pt_loss(batch)
TypeError: Loss.pt_loss() missing 1 required positional argument: 'logits'

It seems that the Loss.pt_loss() method requires logits as well. What would be the right way to fix this error?

[minillm] apply minillm to other LLMs like Baichuan/Qianwen

Thank you for your awesome work minillm that explored the knowledge distillation for LLMs. I noticed that minillm supports the gpt2/gptj/opt and llama series models only, my question is how should I do if I want to extend it to more recently released LLMs like Baichuan-7B and Qwen-7B whose model architectures are different from llama/opt/gpt2?
I noticed that for big student/teacher models , it's necessary to split them into 4 parts to fit the A100 GPUs, so you provide folders like:
transformers/src/transformers/models/opt_parallel/
transformers/src/transformers/models/gpt2_parallel/
transformers/src/transformers/models/llama_parallel/
with necessary files inside, mainly 'modeling_xxx_parallel.py' and 'utils_xxx.py', to implement model split and parallel modeling,
However, when I checked the latest huggingface/transformers repository and didn't find those 'xxx_parallel/' folders. Is that mean those 'xxx_parallel/' folders and files inside are developed by yourself? And if I want to extend minillm to Baichuan or Qwen, do I need to development corresponding codes by myself, like, let's say 'baichuan_parallel/' and 'qwen_parallel/'?
Thank you!

code releasing of llm_retriever

Looking forward to your code releasing of llm_retriever :)

Code issues on running MiniLLM

First of all, thank you for sharing your codes.

When I tried to run sft and kd on gpt2, it works.
However, when I tried to run minillm, I encounter two problems.

The first problem is that

  File "/home/work/kd/minillm/minillm/pipelines.py", line 82, in collate
    no_model_batch["full_ids"][:len(full_ids)-1] = torch.tensor(full_ids[:-1], dtype=torch.long)
RuntimeError: The expanded size of the tensor (512) must match the existing size (60) at non-singleton dimension 1.  Target sizes: [16, 512].  Tensor sizes: [60]

In my thought the line 82-84 should be change from

no_model_batch["full_ids"][:len(full_ids)-1] = torch.tensor(full_ids[:-1], dtype=torch.long)
no_model_batch["full_attention_mask"][:len(full_ids)-1] = 1.0
no_model_batch["full_label_ids"][len(prompt)-1:len(full_ids)-1] = torch.tensor(response, dtype=torch.long)

no_model_batch["full_ids"][i][:len(full_ids)-1] = torch.tensor(full_ids[:-1], dtype=torch.long)
no_model_batch["full_attention_mask"][i][:len(full_ids)-1] = 1.0
no_model_batch["full_label_ids"][i][len(prompt)-1:len(full_ids)-1] = torch.tensor(response, dtype=torch.long)

The second problem is that

It seems there are two type of preprocessed dolly datasets, full and propmt.
Full dataset is used for sft and kd and prompt dataset is used for minillm.

However, when I run

bash scripts/gpt2/tools/process_data_dolly.sh

It only returns one type of preprocessed dolly.
therefore, when I run the minillm, I have another errors that

File ".//train_minillm.py", line 85, in main
    assert len(data) <= self.max_prompt_length
AssertionError
    train(
  File "/home/work/kd/minillm/minillm/__init__.py", line 37, in train
    sampler.run_sample(args.num_rollouts)
  File "/home/work/kd/minillm/minillm/sampler.py", line 47, in run_sample
    batch: PromptBatch = next(self.pipeline_iterator)

could you please check the problem and give some solution?

Thank you.

Distillation loss produces NaNs

For the following distillation loss that's being used in the repo, do you ever face NaN issues?

teacher_probs = F.softmax(teacher_logits, dim=-1, dtype=torch.float32)
inf_mask = torch.isinf(logits)
logprobs = F.log_softmax(logits, dim=-1, dtype=torch.float32)
prod_probs = torch.masked_fill(teacher_probs * logprobs, inf_mask, 0)
x = torch.sum(prod_probs, dim=-1).view(-1)
mask = (no_model_batch["label"] != -100).int()
distil_loss = -torch.sum(x * mask.view(-1), dim=0) / torch.sum(mask.view(-1), dim=0)

Although it works fine for the first couple of hundreds iteration but then starts producing NaNs

Is LLMA can improve model effectiveness in semantic retrieval?

there is 500 class of output .give model a sentence, and infere what class it is. Can I use LLMA in chatglm to accelerate and imporve accuracy?

[minillm] hard to reproduce the result

I try to distill gpt2-1.5B -> gpt2-120M
As I use 4 A100, so I change the GPUS_PER_NODE to ${3-4}

Batch size remains the same

Directory of "structured_prompting/fairseq-version/fairseq" is not installable.

A bug of
"structured_prompting/fairseq-version/fairseq does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found." accused when I run
"pip install --user -e fairseq/"

[minillm] how to resume training when it occurs some bugs during running file 'train_minillm.py'?

I ran 'train_minillm.py' successfully under the guidance of the README.md file. However, due to some uncontrollable factors, the GPU will interrupt approximately every 6-8 hours. At this time, the locally saved files are shown in the following figure. How should I continue the interrupted training? Thanks a lot!

BreadcrumbsLMOps/prompt_optimization /tasks.py EthosBinaryTask question

module: EthosBinaryTask

two questions:

df = df[(df[1] <= 0) | (df[1] >= 0.7)]: why using this condition to filter the data, the condition not appear in paper
exs = [{'id': x['index'], 'text': x[0], 'label': 1 if x[1] > 0.4 else 0} for x in exs[:200]]: the test count is not consistent with paper 150?

A Environment Bug in miniLLM

In minillm, there is something wrong with the given environment.
The deepspeed==0.8.0 will make pydantic==2.0+ installed in the same time.
But in this combination of version, a naive code 'import deepspeed' fails.
I must change the version of pydantic to 1.10.11, then it seems ok.

Uprise: Error while running inference

I'm running the inference script with bash inference_hf.sh. But I'm getting some error related to path.

[2023-10-17 18:06:41,654][root][INFO] - Total encoded queries tensor torch.Size([277, 768])
[2023-10-17 18:06:41,655][dpr.data.retriever_data][INFO] - prompt files: 
Error executing job with overrides: ['model_file=/root/LMOps/uprise/retriever_ckpt', 'qa_dataset=qa_uprise', 'ctx_datatsets=[dpr_uprise]', 'encoded_ctx_files=[/root/LMOps/uprise/my_data/experiment/uprise/dpr_enc_index_*]', 'out_file=/root/LMOps/uprise/my_data/experiment/uprise/rte_prompts.json', 'datasets.qa_uprise.task_name=rte', 'datasets.qa_uprise.cache_dir=', 'n_docs=3', 'ctx_sources.dpr_uprise.prompt_pool_path=', 'ctx_sources.dpr_uprise.prompt_setup_type=qa', 'encoder.cache_dir=']
Error in call to target 'dpr.data.retriever_data.UpriseCtxSrc':
FileNotFoundError(2, 'No such file or directory')
full_key: ctx_sources.dpr_uprise

I noticed that there is no file /root/LMOps/uprise/my_data/experiment/uprise/rte_prompts.json. How can I create a sample file for that?

[minillm] typo in bash

https://github.com/microsoft/LMOps/blob/80d7d4a0ba8d61ca7be6cae72d06cf71dda3e9e0/minillm/scripts/gpt2/eval/eval_main_self_inst.sh#L18C32-L18C32

CKPT="${BASE_PATH}/results/gpt2/${CKPT_NAME}/"

It should be CKPT="${BASE_PATH}/results/gpt2/train/${CKPT_NAME}/"
The train is missed

[MiniLLM] sft training loss of llama-7b did not decrease with multi nodes.

I tried to use a simlar dataset alpaca-zh to sft the llama-7b on 16 x 32G v100 gpus. gpu_per_node=8 ,node_num=2.
The script I use is scripts/llama/sft/sft_7B.sh.
But the training loss did not decrease if I use --deepspeed_config ${BASE_PATH}/configs/deepspeed/ds_config_zero2.json".
Even if I switch the learning rate and weight_decay, there is no difference. The train loss did not decrease, and the val rougeL score decrease with training.

So I switch to use only 8 gpu (one node) to sft llama-7b.
I have to change the deepspeed config to train llama-7b on a node(8 gpus) because it will run out of memory if I still use the above deepspeed config. The new config I use to reduce memory is as follows:

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "zero_force_ds_cpu_optimizer": false,
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 11,
        "loss_scale_window": 5000,
        "hysteresis": 4
    },
    "wall_clock_breakdown": false
}

when I use only one single node (8 v100 gpus) to run this script. The training loss of llama-7b decrease normally.

Besides, the sft of gpt-base/ gpt-xl/opt-1.5b(trained on 8 gpus) are normal, but the sft of opt-13b(trained on 16 gpus) faced the same problem as llama-7b(trained on 16 gpus) .

So I guess this has something to do with the the multi node training.

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers

Hi,

Thanks for the outstanding paper "Why Can GPT Learn In-Context?".

I don't know if this is the right place to talk about this, but I might've found a typo in the paper and want to let you know about it.

In equation (8), you have the transpose as in ${x'_i}^\top$.

However, you have the outer product symbol $\otimes$, which by definition, handles the transpose.

So, the transpose operation is only necessary if you mean the outer product symbol is matrix multiplication. Otherwise, removing the outer product symbol may be better, as it is eventually done in equation (9).

I hope it's helpful to you.

[minillm] is it possible to split a huge model into 3 or 5 parts instead of 4?

I noticed that the minillm algorithm usually splits a whole checkpoint into 4 equal parts using tools/convert_mp.py when dealing with larger LLM models like llama13B. During loading, each of the 4 A100 GPUs is assigned to load one of these parts using DeepSpeed. However, I currently only have 3 A100 GPUs, and it's clear that llama's vocab_size=32k cannot be evenly divided by 3, and some intermediate variables like hidden_size may also not be divisible by 3. My question is, can I split the whole checkpoint into uneven parts for the 3 GPUs (e.g., making 2 parts equal and the third part smaller), so that training can still be accomplished on the 3 GPUs (assuming sufficient memory)? If so, would the training code need to be modified in certain places? Thank you!

Question of Equation 11

Thanks for your great work,There are two questions that I don't understand. I want to ask you for advice
(1)As Mentioned in the paper [X‘; X]denotes the matrix concatenation, I want to know how are they connected, Is it in the channel dimension or more like the batch dimension？
(2)How is this step in Equation 11 derived？thanks

Dataset and training code ?

Hi, thanks for you great work about "Promptist: reinforcement learning for automatic prompt optimization".

Will you release the dataset the paper mentioned in the paper and the training code?

Structured Prompting: GPT_neo_modeling.py

causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]

but in Structured Prompting the key_length exceeds the max_positions.

How to address this issue. Thank you.

Can you provide OpenLLaMA weights?

Hi,

Can you provide OpenLLaMA weights with the same settings that this paper has ran on the original LLaMA. Asking because LLaMA has a restrictive license usage.

Thanks.

[miniLLM] About KL divergence

hi, thanks for your nice paper.

For KL, the code is

def get_rev_kl(log_p, log_q, mask):
log_ratio = (log_p - log_q) * mask
kl = log_ratio.float().exp() - 1 - log_ratio
return kl

may I know why log_ratio.float().exp() - 1 - log_ratio ? Thanks

Models about the work: 'Why Can GPT Learn In-Context? '

Much thanks for your excellent work about:
Why Can GPT Learn In-Context? Language Models Secretly Perform Finetuning as Meta Optimizers

I tried to find the model used by the paper in the fairseq written in the paper. But I can't find any models in this repository that contain the word 'gpt'. Can you tell me more details about the models (like the official website)?

Thanks！

[miniLLM] The evaluation might be wrong when using dp_size > 1

At the evaluation phase of llama-7b/gpt2-xlarge whose MP_size=1, I try to use 8 gpus to accelerate the evaluation phase. The code is scripts/gpt2/eval/run_eval.sh.
I simplify this code to only evaluate on one task. The gpu_num=8 which by default is 1.

base_path=${1-"/home/MiniLLM"}
port=2040

ckpt_base_path=/xx/LMOps/minillm/results/gpt2/train/

for data in alpaca_zh
do
    # Evaluate SFT
    for seed in 10
    do
        ckpt="sft/gpt2-base"
        ckpt=$ckpt_base_path"/"$ckpt
     
        gpu_num=8 # this is wrong
        gpu_num=1 # this is normal
        bash ${base_path}/scripts/gpt2/eval/eval_main_${data}.sh ${base_path} ${port} ${gpu_num} ${ckpt} --seed $seed  --eval-batch-size 8
    done

If I use gpu_num=1, the evaluation is fine. The final rouge value is normal. But for gpu_num=8, the rouge is much lower than ecpected. And the former rouge is consistent with that of the training-time evaluation rouge.

I check results/gpt2/eval_main/alpaca_zh-512/xxx/answers.jsonl for more details.
And I found that there are only 63 lines of responses for the 8 gpu-evaluation setting. But for the 1gpu settting, the line number is 500, which is the exact number of valid set. I think the dp_size >1 might be the cause of this problem.

For llama-13b whose MP_size=4, if I use gpu_num=4, the validation is normal, but wrong if gpu_num=8.
My evaluation code of alpaca_zh is very similar to that of dolly. I guess this problem might exist for other dataset like dolly too.

when I run the hf-version on the fairseq-dense-125M ,there is error ""TypeError: forward() got an unexpected keyword argument 'prefix_parallel'

when I run the hf-version on the fairseq-dense-125M ,there is error ""model has no attribution of parralel"",when I remove the param --parallel, there is another error ""TypeError: forward() got an unexpected keyword argument 'prefix_parallel'.

[minillm] how to eval sft/llama-13B with 1 A100 GPU or 4 A10 GPUs?

I want to run scripts/llama/eval/eval_main_dolly.sh to evaluate sft/llama-13B, I have access to 1 A100 gpu OR 4 A10 gpus, how should I modify the scripts/llama/eval/eval_main_dolly.sh file to get it work?
I tried the following order on 1 A100 gpu:

python evaluate.py --base-path /data/LMOps/minillm --model-path checkpoints/llama/train/sft/llama-13B/ --ckpt-name sft/llama-13B --n-gpu 1 --model-type llama --data-dir /data/LMOps/minillm/data/dolly --data-names dolly --num-workers 0 --dev-num -1 --data-process-workers -1 --json-data --eval-batch-size 8 --max-length 512 --max-prompt-length 256 --do-eval --save /data/LMOps/minillm/checkpoints/llama/eval_main/ --seed 10 --deepspeed --deepspeed_config /data/LMOps/minillm/configs/deepspeed/ds_config.json --type eval_main --do-sample --top-k 0 --top-p 1.0 --temperature 1.0

and files in checkpoints/llama/train/sft/llama-13B/ are:

where pytorch_model.bin is converted from mp4/ using the released file tools/convert_mp.py
however, it gives bugs as following:

Traceback (most recent call last):
  File "/nlp_data/work/chentianyang/minillm/evaluate.py", line 145, in <module>
    main()
  File "/nlp_data/work/chentianyang/minillm/evaluate.py", line 138, in main
    evaluate_main(args, tokenizer, model, dataset["test"], "test", 0, device)       # eval core code
  File "/nlp_data/work/chentianyang/minillm/evaluate_main.py", line 161, in evaluate_main
    lm_loss, query_ids, response_ids, t_used_avg = run_model(args, tokenizer, model, dataset, epoch, device)    # lm_loss: 整个test集500个句子的average loss
  File "/nlp_data/work/chentianyang/minillm/evaluate_main.py", line 118, in run_model
    gen_out = model.generate(
  File "/root/anaconda3/envs/lmops/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/generation/utils.py", line 1454, in generate
    return self.sample(
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/generation/utils.py", line 2500, in sample
    world_size=mpu.get_model_parallel_world_size(),
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/mpu/initialize.py", line 92, in get_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_model_parallel_group())
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/mpu/initialize.py", line 78, in get_model_parallel_group
    assert _MODEL_PARALLEL_GROUP is not None, \
AssertionError: model parallel group is not initialized

How to slove the problem?

modeling_xglm for Structured Prompting

If it is possible to update the scaled attention of transformers.models.xglm.modeling_xglm for Structured Prompting: https://github.com/microsoft/LMOps/tree/main/structured_prompting/hf-version
thank you