microsoft / LMOps
General technology for enabling AI capabilities w/ LLMs and MLLMs
Home Page: https://aka.ms/GeneralAI
License: MIT License
For the hf version of Structured Prompting (modeling_opt.py):
let:
config.max_position_embeddings=2048
config.hidden_size=768
self.embed_positions = OPTLearnedPositionalEmbedding(config.max_position_embeddings, config.hidden_size)
past_key_values_length = 3000
attention_mask.size() is (1, 3400)
pos_embeds = self.embed_positions(attention_mask, past_key_values_length)
Now both past_key_values_length and the attention_mask length exceed max_position_embeddings. What is the solution in this situation? Thank you.
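To see why the lookup overflows, the requested position indices can be worked out by hand. The sketch below mirrors in plain Python how Hugging Face's OPTLearnedPositionalEmbedding is understood to derive positions from the attention mask (cumulative sum minus one, drop the cached prefix, add an offset of 2). The offset, the all-ones mask, and the helper name are assumptions made for illustration.

```python
def opt_position_range(mask_len, past_len, offset=2):
    """Return (lowest, highest) embedding-table index requested for an all-ones mask.

    Assumed to mirror OPTLearnedPositionalEmbedding: positions = cumsum(mask) - 1,
    sliced from past_len onward, then shifted by `offset`.
    """
    positions = list(range(mask_len))[past_len:]   # cumsum(all-ones) - 1 gives 0..mask_len-1
    return positions[0] + offset, positions[-1] + offset


max_position_embeddings = 2048
table_rows = max_position_embeddings + 2          # assumed: the table is allocated with the same offset
low, high = opt_position_range(mask_len=3400, past_len=3000)
print(low, high, table_rows)                      # indices requested vs. rows available
print("overflow:", high >= table_rows)
```

Under these assumptions every requested index lies past the end of the table, so either fewer positions must be requested or the table must be larger. Which option fits Structured Prompting is a question for the authors.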
Hi, I wonder whether it is possible to do distillation between two models with different tokenizers. The two tokenizers may differ in vocabulary size or assign tokens to different positions.
In the evaluation phase of llama-7b/gpt2-xlarge, whose MP_size=1, I try to use 8 GPUs to speed up evaluation. The code is scripts/gpt2/eval/run_eval.sh.
I simplified this script to evaluate only one task and set gpu_num=8, which is 1 by default.
base_path=${1-"/home/MiniLLM"}
port=2040
ckpt_base_path=/xx/LMOps/minillm/results/gpt2/train/
for data in alpaca_zh
do
    # Evaluate SFT
    for seed in 10
    do
        ckpt="sft/gpt2-base"
        ckpt=$ckpt_base_path"/"$ckpt
        gpu_num=8 # this setting gives wrong results
        gpu_num=1 # this setting works normally
        bash ${base_path}/scripts/gpt2/eval/eval_main_${data}.sh ${base_path} ${port} ${gpu_num} ${ckpt} --seed $seed --eval-batch-size 8
    done
done
With gpu_num=1, the evaluation is fine and the final rouge value is normal, consistent with the rouge from training-time evaluation. But with gpu_num=8, the rouge is much lower than expected.
I checked results/gpt2/eval_main/alpaca_zh-512/xxx/answers.jsonl for more details.
I found only 63 lines of responses in the 8-GPU evaluation setting, whereas the 1-GPU setting gives 500 lines, which is exactly the size of the validation set. I think dp_size > 1 might be the cause of this problem.
For llama-13b, whose MP_size=4, validation is normal with gpu_num=4 but wrong with gpu_num=8.
My evaluation code for alpaca_zh is very similar to that for dolly, so I guess this problem might exist for other datasets such as dolly too.
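The 63 lines are consistent with only one data-parallel rank's shard reaching answers.jsonl: 500 examples spread over 8 ranks is 62 or 63 per rank. A minimal sketch of that arithmetic follows. The helper and its even-split policy are assumptions, not the repository's actual sampler.

```python
import math


def shard_sizes(num_examples, dp_size):
    """Split num_examples across dp_size ranks as evenly as possible (hypothetical helper)."""
    base, extra = divmod(num_examples, dp_size)
    return [base + (1 if rank < extra else 0) for rank in range(dp_size)]


sizes = shard_sizes(500, 8)
print(sizes)                    # per-rank example counts
print(sum(sizes))               # all ranks together cover the validation set
print(math.ceil(500 / 8))       # size of the largest shard
```

If that reading is right, the answers from every rank would need to be gathered before writing the file and computing rouge.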
Hello all, when I run python raw2read.py I get the error "NameError: name 'overall_cls' is not defined". Part of the log is below.
Please help me fix this issue.
PS C:\Users\rajas\Desktop\AI_Research\LMOps-main\LMOps-main\adaptllm> python raw2read.py
max_workers: 12
loading raw texts in the input folder...
paths: ['./data_samples/input-raw-texts\0.txt', './data_samples/input-raw-texts\1.txt', './data_samples/input-raw-texts\10.txt', './data_samples/input-raw-texts\11.txt', './data_samples/input-raw-texts\2.txt', './data_samples/input-raw-texts\3.txt', './data_samples/input-raw-texts\4.txt', './data_samples/input-raw-texts\5.txt', './data_samples/input-raw-texts\6.txt', './data_samples/input-raw-texts\7.txt', './data_samples/input-raw-texts\8.txt', './data_samples/input-raw-texts\9.txt']
12it [00:00, ?it/s]
transferring raw texts into reading comprehension...
0%| | 0/12 [00:00<?, ?it/s]
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\rajas\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\process.py", line 256, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rajas\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\process.py", line 205, in _process_chunk
return [fn(*args) for args in chunk]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rajas\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\process.py", line 205, in <listcomp>
return [fn(*args) for args in chunk]
^^^^^^^^^
File "C:\Users\rajas\Desktop\AI_Research\LMOps-main\LMOps-main\adaptllm\raw2read.py", line 19, in search
context_wo_title = overall_cls.truncate_sentence(context_wo_title, max_len=overall_cls.max_seq_len-200)
^^^^^^^^^^^
NameError: name 'overall_cls' is not defined
Thanks in advance
First of all, thank you for sharing your code.
When I run sft and kd on gpt2, it works.
However, when I try to run minillm, I encounter two problems.
File "/home/work/kd/minillm/minillm/pipelines.py", line 82, in collate
no_model_batch["full_ids"][:len(full_ids)-1] = torch.tensor(full_ids[:-1], dtype=torch.long)
RuntimeError: The expanded size of the tensor (512) must match the existing size (60) at non-singleton dimension 1. Target sizes: [16, 512]. Tensor sizes: [60]
I think lines 82-84 should be changed from
no_model_batch["full_ids"][:len(full_ids)-1] = torch.tensor(full_ids[:-1], dtype=torch.long)
no_model_batch["full_attention_mask"][:len(full_ids)-1] = 1.0
no_model_batch["full_label_ids"][len(prompt)-1:len(full_ids)-1] = torch.tensor(response, dtype=torch.long)
to
no_model_batch["full_ids"][i][:len(full_ids)-1] = torch.tensor(full_ids[:-1], dtype=torch.long)
no_model_batch["full_attention_mask"][i][:len(full_ids)-1] = 1.0
no_model_batch["full_label_ids"][i][len(prompt)-1:len(full_ids)-1] = torch.tensor(response, dtype=torch.long)
.
It seems there are two types of preprocessed dolly datasets, full and prompt.
The full dataset is used for sft and kd, and the prompt dataset is used for minillm.
However, when I run
bash scripts/gpt2/tools/process_data_dolly.sh
it produces only one type of preprocessed dolly data.
Therefore, when I run minillm, I get another error:
File ".//train_minillm.py", line 85, in main
assert len(data) <= self.max_prompt_length
AssertionError
train(
File "/home/work/kd/minillm/minillm/__init__.py", line 37, in train
sampler.run_sample(args.num_rollouts)
File "/home/work/kd/minillm/minillm/sampler.py", line 47, in run_sample
batch: PromptBatch = next(self.pipeline_iterator)
Could you please check the problem and suggest a solution?
Thank you.
I ran 'train_minillm.py' successfully following the README.md file. However, due to factors outside my control, the GPU job is interrupted approximately every 6-8 hours. The locally saved files at that point are shown in the following figure. How should I resume the interrupted training? Thanks a lot!
I'm running the inference script with bash inference_hf.sh, but I'm getting a path-related error.
[2023-10-17 18:06:41,654][root][INFO] - Total encoded queries tensor torch.Size([277, 768])
[2023-10-17 18:06:41,655][dpr.data.retriever_data][INFO] - prompt files:
Error executing job with overrides: ['model_file=/root/LMOps/uprise/retriever_ckpt', 'qa_dataset=qa_uprise', 'ctx_datatsets=[dpr_uprise]', 'encoded_ctx_files=[/root/LMOps/uprise/my_data/experiment/uprise/dpr_enc_index_*]', 'out_file=/root/LMOps/uprise/my_data/experiment/uprise/rte_prompts.json', 'datasets.qa_uprise.task_name=rte', 'datasets.qa_uprise.cache_dir=', 'n_docs=3', 'ctx_sources.dpr_uprise.prompt_pool_path=', 'ctx_sources.dpr_uprise.prompt_setup_type=qa', 'encoder.cache_dir=']
Error in call to target 'dpr.data.retriever_data.UpriseCtxSrc':
FileNotFoundError(2, 'No such file or directory')
full_key: ctx_sources.dpr_uprise
I noticed that the file /root/LMOps/uprise/my_data/experiment/uprise/rte_prompts.json does not exist. How can I create a sample file for it?
Much thanks for your excellent work about:
Why Can GPT Learn In-Context? Language Models Secretly Perform Finetuning as Meta Optimizers
I would like to know whether you have verified that W_{ICL} is similar to W_{FT}, which would more directly verify the relationship between in-context learning and fine-tuning and further support the motivation of the article.
Hello!
Thank you for open sourcing this amazing work. We are trying to distill encoder-decoder architectures like flan-t5 using minillm. However, we are facing the following issue when distilling flan-t5-xl to flan-t5-large.
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/train_minillm.py", line 109, in <module>
main()
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/train_minillm.py", line 95, in main
train(
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/__init__.py", line 37, in train
sampler.run_sample(args.num_rollouts)
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/sampler.py", line 71, in run_sample
gen_out = self.trainer.generate(**batch, return_dict_in_generate=True, mode=mode, teacher_mixed_sample=(self.args.teacher_mixed_alpha is not None), output_scores=True)
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/trainer.py", line 620, in generate
gen = model.generate(
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/model.py", line 29, in generate
return self.base_model.generate(**x)
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/generation/utils.py", line 1580, in generate
return self.sample(
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/generation/utils.py", line 2704, in sample
m_outputs = mix_in_model(
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 1720, in forward
decoder_outputs = self.decoder(
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 1090, in forward
layer_outputs = layer_module(
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 723, in forward
cross_attention_outputs = self.layer[1](
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 634, in forward
attention_output = self.EncDecAttention(
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 522, in forward
key_states = project(
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/transformers/src/transformers/models/t5/modeling_t5.py", line 500, in project
hidden_states = shape(proj_layer(key_value_states))
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x1024 and 2048x2048)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11072) of binary: /home/ec2-user/SageMaker/tanayn/kd/bin/python3.10
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/tanayn/kd/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ec2-user/SageMaker/tanayn/kd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I believe this is happening since the d_model size for both the models differ (1024 for flan-t5-large and 2048 for flan-t5-xl).
So, I have two questions.
In minillm, there is a problem with the specified environment.
Installing deepspeed==0.8.0 also pulls in pydantic 2.0+.
With this combination of versions, even a plain 'import deepspeed' fails.
I had to change the pydantic version to 1.10.11, after which it seems to work.
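A quick way to confirm which side of the 2.0 boundary an environment is on before importing deepspeed. This is a sketch using only the standard library, and the helper names are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '1.10.11'."""
    return int(version_string.split(".")[0])


def pydantic_is_v1():
    """True if the installed pydantic is older than 2.0, None if it is not installed."""
    try:
        return major_version(version("pydantic")) < 2
    except PackageNotFoundError:
        return None


print("pydantic < 2:", pydantic_is_v1())
```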
Can you please modify modeling_xgml.py? That way, we can use your code with huggingface. Thank you.
Looking forward to the code release of llm_retriever :)
The error
"structured_prompting/fairseq-version/fairseq does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found." occurs when I run
"pip install --user -e fairseq/"
Thank you for your awesome work on minillm, which explores knowledge distillation for LLMs. I noticed that minillm supports only the gpt2/gptj/opt and llama model series. What should I do if I want to extend it to more recently released LLMs such as Baichuan-7B and Qwen-7B, whose architectures differ from llama/opt/gpt2?
I noticed that big student/teacher models need to be split into 4 parts to fit on A100 GPUs, so you provide folders like:
transformers/src/transformers/models/opt_parallel/
transformers/src/transformers/models/gpt2_parallel/
transformers/src/transformers/models/llama_parallel/
with the necessary files inside, mainly 'modeling_xxx_parallel.py' and 'utils_xxx.py', to implement model splitting and parallel modeling.
However, when I checked the latest huggingface/transformers repository, I didn't find those 'xxx_parallel/' folders. Does that mean those 'xxx_parallel/' folders and the files inside were developed by you? And if I want to extend minillm to Baichuan or Qwen, do I need to develop the corresponding code myself, for example 'baichuan_parallel/' and 'qwen_parallel/'?
Thank you!
I want to run scripts/llama/eval/eval_main_dolly.sh to evaluate sft/llama-13B. I have access to 1 A100 GPU or 4 A10 GPUs. How should I modify scripts/llama/eval/eval_main_dolly.sh to get it to work?
I tried the following command on 1 A100 GPU:
python evaluate.py --base-path /data/LMOps/minillm --model-path checkpoints/llama/train/sft/llama-13B/ --ckpt-name sft/llama-13B --n-gpu 1 --model-type llama --data-dir /data/LMOps/minillm/data/dolly --data-names dolly --num-workers 0 --dev-num -1 --data-process-workers -1 --json-data --eval-batch-size 8 --max-length 512 --max-prompt-length 256 --do-eval --save /data/LMOps/minillm/checkpoints/llama/eval_main/ --seed 10 --deepspeed --deepspeed_config /data/LMOps/minillm/configs/deepspeed/ds_config.json --type eval_main --do-sample --top-k 0 --top-p 1.0 --temperature 1.0
and the files in checkpoints/llama/train/sft/llama-13B/ are:
where pytorch_model.bin is converted from mp4/ using the released tools/convert_mp.py.
However, it fails with the following error:
Traceback (most recent call last):
File "/nlp_data/work/chentianyang/minillm/evaluate.py", line 145, in <module>
main()
File "/nlp_data/work/chentianyang/minillm/evaluate.py", line 138, in main
evaluate_main(args, tokenizer, model, dataset["test"], "test", 0, device) # eval core code
File "/nlp_data/work/chentianyang/minillm/evaluate_main.py", line 161, in evaluate_main
lm_loss, query_ids, response_ids, t_used_avg = run_model(args, tokenizer, model, dataset, epoch, device) # lm_loss: average loss over all 500 sentences in the test set
File "/nlp_data/work/chentianyang/minillm/evaluate_main.py", line 118, in run_model
gen_out = model.generate(
File "/root/anaconda3/envs/lmops/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/generation/utils.py", line 1454, in generate
return self.sample(
File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/generation/utils.py", line 2500, in sample
world_size=mpu.get_model_parallel_world_size(),
File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/mpu/initialize.py", line 92, in get_model_parallel_world_size
return torch.distributed.get_world_size(group=get_model_parallel_group())
File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/mpu/initialize.py", line 78, in get_model_parallel_group
assert _MODEL_PARALLEL_GROUP is not None, \
AssertionError: model parallel group is not initialized
How can I solve this problem?
Hey, I'm following the guide in the README to run the manyshots structured prompting example, but it fails to start with an import error inside fairseq.
I'm running in a local (rather than Docker) environment with the relevant packages set up with conda. I tried Python 3.8.15 and 3.9.15.
This is the error message I get:
Traceback (most recent call last):
File "validate.py", line 9, in <module>
import struprompting
File "/project/gergely/LMOps/structured_prompting/fairseq-version/struprompting/__init__.py", line 1, in <module>
import struprompting.models
File "/project/gergely/LMOps/structured_prompting/fairseq-version/struprompting/models/__init__.py", line 3, in <module>
from fairseq.models import import_models
File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/__init__.py", line 235, in <module>
import_models(models_dir, "fairseq.models")
File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/__init__.py", line 217, in import_models
importlib.import_module(namespace + "." + model_name)
File "/opt/anaconda/envs/fairseq/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/hubert/__init__.py", line 6, in <module>
from .hubert import * # noqa
File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/hubert/hubert.py", line 20, in <module>
from fairseq.models.wav2vec.wav2vec2 import (
File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/wav2vec/__init__.py", line 6, in <module>
from .wav2vec import * # noqa
File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/models/wav2vec/wav2vec.py", line 25, in <module>
from fairseq.tasks import FairseqTask
File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/tasks/__init__.py", line 15, in <module>
from .fairseq_task import FairseqTask, LegacyFairseqTask # noqa
File "/project/gergely/LMOps/structured_prompting/fairseq-version/fairseq/fairseq/tasks/fairseq_task.py", line 13, in <module>
from fairseq import metrics, search, tokenizer, utils
ImportError: cannot import name 'metrics' from 'fairseq' (unknown location)
It looks like this happens because, when running the script, Python tries to load fairseq from the local folder inside the directory containing the example (structured_prompting/fairseq-version/) rather than using the installed version (from the README's pip install --user -e fairseq/ line). That is, the installed library and the local folder clash at import time.
The script also cannot be run from another folder, since it needs the validate.py file, so the clash cannot be resolved simply by changing directories.
I managed to make this work by renaming the fairseq folder and reinstalling the module from the new, non-clashing name. Is this the proper solution? Or am I missing something or doing something incorrectly that causes this import confusion?
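The "(unknown location)" in the ImportError is what Python prints for a module without a file of its own, which is typical of a directory picked up as a namespace package (a folder with no __init__.py). One way to check how a name resolves without importing it is to inspect its import spec. The helper name below is made up, and the demo uses a throwaway directory rather than fairseq itself.

```python
import importlib.util
import os
import sys
import tempfile


def describe_import(name):
    """Report how `name` would be resolved without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "not found"
    # A namespace package has search locations but no concrete origin file.
    if spec.submodule_search_locations is not None and spec.origin in (None, "namespace"):
        return "namespace package at " + ", ".join(spec.submodule_search_locations)
    return "regular module at " + str(spec.origin)


# Demo: a directory named like a package but lacking __init__.py.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "demo_pkg_without_init"))
    sys.path.insert(0, root)
    try:
        print(describe_import("demo_pkg_without_init"))
    finally:
        sys.path.remove(root)

print(describe_import("json"))
```

Running describe_import("fairseq") from structured_prompting/fairseq-version/ would show whether the local folder or the installed package is being picked up.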
Much thanks for your excellent work about:
Why Can GPT Learn In-Context? Language Models Secretly Perform Finetuning as Meta Optimizers
I tried to find the model used in the paper in the fairseq version the paper mentions, but I can't find any model in this repository whose name contains 'gpt'. Can you give me more details about the models (such as the official website)?
Thanks!
Hi, thanks for your great work on "Promptist: reinforcement learning for automatic prompt optimization".
Will you release the dataset mentioned in the paper and the training code?
Could you please let me know whether it is possible to share the dataset? If so, I would greatly appreciate information on where I can find it. It is currently missing from the data.tar archive. Thank you!
Hi,
Can you provide OpenLLaMA weights trained with the same settings that this paper used for the original LLaMA? I am asking because LLaMA has a restrictive license.
Thanks.
Sorry to bother you guys again.
With the lora script, I can use LoRA to speed up the training of llama-7b and llama-13b if I do not use model parallelism (the model parallel and MP=4 arguments). Training is much faster with LoRA.
However, the peft library cannot be directly applied to the ParallelLlamaForCausalLM class because some module classes used by ParallelLlamaForCausalLM are not supported by peft.
The following is the error message when I apply peft with the model parallel and MP=4 arguments set.
ValueError: Target module ColumnParallelLinear() is not supported.
Currently, only `torch.nn.Linear` and `Conv1D` are supported.
However, when I try to use scripts/llama/minillm/train_7B_13B_lora.sh to run the minillm method on 2 x 8 V100 GPUs, with both the student llama-7b and the teacher llama-13b trained with LoRA, I get an out-of-memory error because model parallelism is not used by default. So I guess MP > 1 is needed.
I wonder how I can import from the local transformers folder. It seems the installed transformers package takes priority, as I cannot import modules like ParallelLlamaForCausalLM.
When I run the hf-version on fairseq-dense-125M, I get the error "model has no attribution of parralel". When I remove the --parallel parameter, there is another error: "TypeError: forward() got an unexpected keyword argument 'prefix_parallel'".
module: EthosBinaryTask
two questions:
It appears there is a bug in the source code related to a TypeError: 'Loss.pt_loss() missing 1 required positional argument: 'logits'' within the 'evaluate_pt' function in 'trainer.py'. Would it be possible to have any suggestion on that? Thank you!
Thanks for your great work. There are two points I don't understand and would like to ask about.
(1) As mentioned in the paper, [X'; X] denotes matrix concatenation. How are the two concatenated: along the channel dimension, or along something more like the batch dimension?
(2) How is the step in Equation 11 derived? Thanks.
CKPT="${BASE_PATH}/results/gpt2/${CKPT_NAME}/"
It should be CKPT="${BASE_PATH}/results/gpt2/train/${CKPT_NAME}/"
The train/ directory is missing from the path.
I am trying to distill gpt2-1.5B -> gpt2-120M.
Since I use 4 A100 GPUs, I changed GPUS_PER_NODE to ${3-4}.
The batch size remains the same.
When I run the command bash scripts/opt/tools/process_data_dolly.sh /PATH/TO/MiniLLM # Process Dolly Train / Validation Data, I get an error like 'huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/HardDisk/b00373/LMOps/minillm/checkpoints/opt-1.3B'. Use repo_type argument if needed.'
How can I solve it? Thanks a lot for your help.
I noticed that the minillm algorithm usually splits a whole checkpoint into 4 equal parts using tools/convert_mp.py when dealing with larger LLM models like llama13B. During loading, each of the 4 A100 GPUs is assigned to load one of these parts using DeepSpeed. However, I currently only have 3 A100 GPUs, and it's clear that llama's vocab_size=32k cannot be evenly divided by 3, and some intermediate variables like hidden_size may also not be divisible by 3. My question is, can I split the whole checkpoint into uneven parts for the 3 GPUs (e.g., making 2 parts equal and the third part smaller), so that training can still be accomplished on the 3 GPUs (assuming sufficient memory)? If so, would the training code need to be modified in certain places? Thank you!
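The divisibility concern can be checked directly. In the sketch below, vocab_size comes from the question above, while hidden_size and num_heads are the commonly cited LLaMA-13B values and are assumptions here. The helper is illustrative and not part of tools/convert_mp.py.

```python
def indivisible_dims(dims, mp_size):
    """Return {name: remainder} for dimensions that do not split evenly across mp_size ranks."""
    return {name: size % mp_size for name, size in dims.items() if size % mp_size != 0}


# vocab_size is from the question; the other two values are assumed.
llama_13b = {"vocab_size": 32000, "hidden_size": 5120, "num_heads": 40}

for mp_size in (3, 4):
    print(mp_size, indivisible_dims(llama_13b, mp_size))
```

With 3 ranks every listed dimension leaves a remainder, so an uneven split would have to be handled wherever these dimensions are partitioned, not only when converting the checkpoint.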
There are 500 output classes. Given a sentence, the model infers which class it belongs to. Can I use LLMA with chatglm to accelerate inference and improve accuracy?
Hi, are there any bugs with the splitting and merging code of the GPT2 series model?
In file 'transformers/src/transformers/models/gpt2_parallel/utils_gpt2.py', the model splitting function 'increase_mp_gpt2()' considered different situations thus appearing a little bit more complicated than that of llama and opt. I think that's because the way weights of layers stored in gpt2 is different from that of llama and opt. For example, Q/K/V of llama and opt are stored as different matrices individually with size of [hidden_dim, hidden_dim], while QKV of gpt2 are stored together as one single matrix with size of [hidden_dim, 3*hidden_dim].
However, GPT-2's model merging function 'decrease_mp_gpt2()' does not mirror 'increase_mp_gpt2()'. It is exactly the same as the merging functions for llama and opt and does not account for the difference mentioned above. Is that correct?
Thank you for your reply!
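To make the concern concrete, here is a toy sketch on column labels instead of real weights. It assumes each model-parallel rank holds its own slice of Q, K and V taken from the fused [hidden_dim, 3*hidden_dim] matrix. These functions are illustrative and are not the repository's increase_mp_gpt2() or decrease_mp_gpt2().

```python
def split_fused_qkv(columns, hidden, mp_size):
    """Shard a fused [Q | K | V] column list so each rank gets its slice of Q, K and V."""
    part = hidden // mp_size
    q, k, v = columns[:hidden], columns[hidden:2 * hidden], columns[2 * hidden:]
    return [
        q[r * part:(r + 1) * part] + k[r * part:(r + 1) * part] + v[r * part:(r + 1) * part]
        for r in range(mp_size)
    ]


def merge_fused_qkv(shards, hidden):
    """Inverse of split_fused_qkv: regroup all Q parts, then all K parts, then all V parts."""
    part = hidden // len(shards)
    merged = []
    for block in range(3):                      # 0 = Q, 1 = K, 2 = V
        for shard in shards:
            merged += shard[block * part:(block + 1) * part]
    return merged


hidden, mp_size = 4, 2
columns = [f"{name}{i}" for name in "qkv" for i in range(hidden)]
shards = split_fused_qkv(columns, hidden, mp_size)
naive = [col for shard in shards for col in shard]   # plain concatenation of rank shards

print(shards)
print(merge_fused_qkv(shards, hidden) == columns)    # block-aware merge restores the layout
print(naive == columns)                              # plain concatenation does not
```

Under that assumed sharding, a merge that simply concatenates the rank shards would interleave Q, K and V, which is the mismatch being asked about.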
Thank you very much for sharing the code.
I am trying to run the scripts and am facing the following issue in the minillm/trainer.py script.
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/train_minillm.py", line 99, in <module>
main()
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/train_minillm.py", line 85, in main
train(
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/__init__.py", line 50, in train
trainer.train()
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/trainer.py", line 306, in train
self.evaluate()
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/trainer.py", line 408, in evaluate
eval_pt_results = self.evaluate_pt()
File "/home/ec2-user/SageMaker/tanayn/llm-exploration/knowledge-distillation/minillm/minillm/trainer.py", line 527, in evaluate_pt
_, stats = self.losses.pt_loss(batch)
TypeError: Loss.pt_loss() missing 1 required positional argument: 'logits'
It seems that the Loss.pt_loss() method requires logits as well. What would be the right way to fix this error?
Hi, thanks for your nice paper.
For KL, the code is
def get_rev_kl(log_p, log_q, mask):
    log_ratio = (log_p - log_q) * mask
    kl = log_ratio.float().exp() - 1 - log_ratio
    return kl
May I know why it uses log_ratio.float().exp() - 1 - log_ratio? Thanks.
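One reading of this expression: it has the form of the estimator (r - 1) - log r with r = p/q, often called k3 after John Schulman's note on approximating KL divergence. Every term is non-negative, and when x is drawn from q its expectation equals KL(q || p), because E_q[r - 1] = 0. Whether the samples in get_rev_kl are in fact drawn from q is an assumption here. The sketch below checks the identity exactly on a small discrete example with made-up distributions.

```python
import math


def k3_expectation(p, q):
    """E_{x~q}[exp(d) - 1 - d] with d = log p(x) - log q(x), computed exactly."""
    total = 0.0
    for px, qx in zip(p, q):
        d = math.log(px) - math.log(qx)        # plays the role of log_ratio in get_rev_kl
        total += qx * (math.exp(d) - 1.0 - d)
    return total


def kl(q, p):
    """Exact KL(q || p) for discrete distributions."""
    return sum(qx * math.log(qx / px) for px, qx in zip(p, q))


p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(k3_expectation(p, q), kl(q, p))          # the two values agree up to rounding
```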
I ran SFT (supervised fine-tuning) with minillm on the dolly dataset using the following command: bash scripts/gpt2/sft/sft_large.sh, and got the running logs below:
============================== EXP at 2023-07-24 11:19:35 ==============================
dev | avg_loss: 2.50592041015625 | {'exact_match': 0.0, 'rougeL': 7.2973}
......
[2023-07-24 11:53:25] train | epoch 0 | Iter: 1428/ 14290 | global iter: 1428/ 14290 | loss: 1.8361 | ds_loss: 0.0000 | lr: 4.8781e-05 | scale: 2048.0000 | micro time: 1.180 | step time: 1.187
dev | avg_loss: 2.17413330078125 | {'exact_match': 2.6, 'rougeL': 19.8483}
......
[2023-07-24 12:26:31] train | epoch 1 | Iter: 2856/ 14290 | global iter: 2856/ 14290 | loss: 1.5378 | ds_loss: 0.0000 | lr: 4.5241e-05 | scale: 4096.0000 | micro time: 1.203 | step time: 1.194
dev | avg_loss: 2.283782958984375 | {'exact_match': 3.7, 'rougeL': 23.8785}
......
[2023-07-24 12:58:48] train | epoch 2 | Iter: 4284/ 14290 | global iter: 4284/ 14290 | loss: 0.6872 | ds_loss: 0.0000 | lr: 3.9729e-05 | scale: 8192.0000 | micro time: 1.192 | step time: 1.194
dev | avg_loss: 2.521240234375 | {'exact_match': 3.6, 'rougeL': 25.4109}
......
[2023-07-24 13:32:01] train | epoch 3 | Iter: 5716/ 14290 | global iter: 5716/ 14290 | loss: 0.4638 | ds_loss: 0.0000 | lr: 3.2760e-05 | scale: 8192.0000 | micro time: 1.182 | step time: 1.182
dev | avg_loss: 2.7791748046875 | {'exact_match': 3.9, 'rougeL': 26.4599}
......
[2023-07-24 14:04:43] train | epoch 4 | Iter: 7144/ 14290 | global iter: 7144/ 14290 | loss: 0.2177 | ds_loss: 0.0000 | lr: 2.5055e-05 | scale: 16384.0000 | micro time: 1.193 | step time: 1.190
dev | avg_loss: 2.98834228515625 | {'exact_match': 4.1, 'rougeL': 27.4338}
......
[2023-07-24 14:37:37] train | epoch 5 | Iter: 8572/ 14290 | global iter: 8572/ 14290 | loss: 0.1473 | ds_loss: 0.0000 | lr: 1.7361e-05 | scale: 16384.0000 | micro time: 1.183 | step time: 1.183
dev | avg_loss: 3.14642333984375 | {'exact_match': 4.0, 'rougeL': 27.7023}
......
[2023-07-24 15:10:29] train | epoch 6 | Iter: 10000/ 14290 | global iter: 10000/ 14290 | loss: 0.0627 | ds_loss: 0.0000 | lr: 1.0407e-05 | scale: 32768.0000 | micro time: 1.189 | step time: 1.188
dev | avg_loss: 3.27813720703125 | {'exact_match': 4.2, 'rougeL': 28.5247}
......
[2023-07-24 15:43:35] train | epoch 7 | Iter: 11432/ 14290 | global iter: 11432/ 14290 | loss: 0.0307 | ds_loss: 0.0000 | lr: 4.8715e-06 | scale: 32768.0000 | micro time: 1.194 | step time: 1.192
dev | avg_loss: 3.38690185546875 | {'exact_match': 4.0, 'rougeL': 28.6034}
......
[2023-07-24 16:16:19] train | epoch 8 | Iter: 12860/ 14290 | global iter: 12860/ 14290 | loss: 0.0184 | ds_loss: 0.0000 | lr: 1.3279e-06 | scale: 32768.0000 | micro time: 1.197 | step time: 1.192
dev | avg_loss: 3.46453857421875 | {'exact_match': 4.7, 'rougeL': 28.4784}
......
[2023-07-24 16:49:04] train | epoch 9 | Iter: 14288/ 14290 | global iter: 14288/ 14290 | loss: 0.0097 | ds_loss: 0.0000 | lr: 1.0002e-07 | scale: 65536.0000 | micro time: 1.194 | step time: 1.194
dev | avg_loss: 3.4901123046875 | {'exact_match': 4.2, 'rougeL': 29.0872}
The experiment records above show that as the training progresses, both the exact_match and rougeL scores increase overall, which is consistent with expectations. However, it is puzzling that the validation set loss (avg_loss) is gradually increasing as well. According to common knowledge in deep learning, an increase in the loss function contradicts an increase in recognition/detection accuracy. How can we explain this phenomenon?
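One common explanation, not confirmed by the authors here, is that avg_loss is a token-level likelihood while exact_match and rougeL score the generated text. A model that grows more confident can improve the latter while a few confidently wrong predictions dominate the former. A toy illustration with made-up probabilities for a binary case:

```python
import math


def mean_nll(probs_of_correct):
    """Average negative log-likelihood assigned to the correct answer."""
    return -sum(math.log(p) for p in probs_of_correct) / len(probs_of_correct)


def accuracy(probs_of_correct):
    """Binary toy case: a prediction counts as correct when p(correct) > 0.5."""
    return sum(p > 0.5 for p in probs_of_correct) / len(probs_of_correct)


early = [0.60, 0.60, 0.45, 0.45]     # hesitant model (made-up numbers)
late = [0.99, 0.99, 0.99, 0.001]     # confident model with one confident mistake

for name, probs in (("early", early), ("late", late)):
    print(name, round(mean_nll(probs), 4), accuracy(probs))
```

In this toy case the later model is right more often yet has the higher average loss, the same pattern as in the logs above.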
Is it possible to share the actions used in the RL? For example, how is the prompt varied in order to improve the aesthetic score? Thank you very much.
I tried to use a similar dataset, alpaca-zh, to SFT llama-7b on 16 x 32G V100 GPUs (gpu_per_node=8, node_num=2).
The script I use is scripts/llama/sft/sft_7B.sh.
However, the training loss does not decrease when I use --deepspeed_config ${BASE_PATH}/configs/deepspeed/ds_config_zero2.json.
Changing the learning rate and weight_decay makes no difference: the training loss does not decrease, and the validation rougeL score drops as training proceeds.
So I switched to using only 8 GPUs (one node) to SFT llama-7b.
I had to change the DeepSpeed config to train llama-7b on one node (8 GPUs), because it runs out of memory with the config above. The new config I use to reduce memory is as follows:
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "zero_force_ds_cpu_optimizer": false,
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 11,
    "loss_scale_window": 5000,
    "hysteresis": 4
  },
  "wall_clock_breakdown": false
}
With this config on a single node (8 V100 GPUs), the training loss of llama-7b decreases normally.
Besides, SFT of gpt-base, gpt-xl and opt-1.5b (trained on 8 GPUs) is normal, but SFT of opt-13b (trained on 16 GPUs) shows the same problem as llama-7b (trained on 16 GPUs).
So I guess this has something to do with multi-node training.
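One quantity that changes between the 8-GPU and 16-GPU runs even if the per-GPU settings stay the same is the global batch size, which DeepSpeed computes as micro batch per GPU x gradient accumulation steps x number of data-parallel GPUs. A minimal sketch, assuming pure data parallelism and the same micro-batch and accumulation values in both runs (the function name is illustrative and not part of the repo):

```python
def global_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus, model_parallel_size=1):
    """Effective samples per optimizer step across all data-parallel ranks."""
    data_parallel_size = num_gpus // model_parallel_size
    return micro_batch_per_gpu * grad_accum_steps * data_parallel_size

# Values taken from the config above; the GPU counts are the two setups described.
one_node = global_batch_size(micro_batch_per_gpu=1, grad_accum_steps=1, num_gpus=8)
two_nodes = global_batch_size(micro_batch_per_gpu=1, grad_accum_steps=1, num_gpus=16)

print(one_node, two_nodes)
```

This alone would not explain a loss that stays completely flat, but it is one difference worth ruling out before looking at inter-node communication.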
Hi,
Thanks for the outstanding paper "Why Can GPT Learn In-Context?".
I don't know if this is the right place to talk about this, but I might've found a typo in the paper and want to let you know about it.
In equation (8), you have the transpose as in
However, you have the outer product symbol
So, the transpose operation is only necessary if you mean the outer product symbol is matrix multiplication. Otherwise, removing the outer product symbol may be better, as it is eventually done in equation (9).
I hope it's helpful to you.
def setup_model_and_optimizer(args, ds_config, device, set_optim=True):
    # get the model
    model = get_model(args, device)
    # get the optimizer and lr_scheduler
    if set_optim:
        optimizer = get_optimizer(args, model)
        lr_scheduler = get_learning_rate_scheduler(args, optimizer)
    else:
        optimizer, lr_scheduler = None, None
    # keep a reference to the model as it was before DeepSpeed wraps it
    model_ori = model
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        args=args,
        lr_scheduler=lr_scheduler,
        mpu=mpu if args.model_parallel else None,
        config_params=ds_config
    )
After deepspeed.initialize runs, some of the model's parameters are missing, such as model_ori.transformer.h[0].attn.c_attn.weight.
For the following distillation loss that's being used in the repo, do you ever face NaN issues?
teacher_probs = F.softmax(teacher_logits, dim=-1, dtype=torch.float32)
inf_mask = torch.isinf(logits)
logprobs = F.log_softmax(logits, dim=-1, dtype=torch.float32)
prod_probs = torch.masked_fill(teacher_probs * logprobs, inf_mask, 0)
x = torch.sum(prod_probs, dim=-1).view(-1)
mask = (no_model_batch["label"] != -100).int()
distil_loss = -torch.sum(x * mask.view(-1), dim=0) / torch.sum(mask.view(-1), dim=0)
It works fine for the first few hundred iterations, but then it starts producing NaNs.
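Two arithmetic facts may be relevant here, shown framework-free below. First, 0 * -inf is NaN, which is what inf_mask guards against. Second, NaN * 0 is also NaN, so multiplying x by a 0/1 mask does not remove a NaN that sits at an ignored position. Dividing by a mask sum of zero is a third candidate. Whether any of these is what actually happens in the failing run is an assumption. The helper name below is hypothetical. In PyTorch the analogous steps would be selecting the unmasked entries instead of multiplying, and checking values with torch.isfinite.

```python
import math

def masked_mean_neg(x, mask):
    """-(mean of x over positions where mask is nonzero); 0.0 if nothing is unmasked.

    Entries are selected rather than multiplied by the mask, so a NaN at an
    ignored position cannot leak into the result.
    """
    kept = [xi for xi, mi in zip(x, mask) if mi]
    if not kept:
        return 0.0
    return -sum(kept) / len(kept)

# IEEE 754 behaviour behind the two failure modes described above.
zero_times_neg_inf = 0.0 * float("-inf")   # NaN
nan_times_zero = float("nan") * 0          # NaN: masking by multiplication does not help

# A NaN at an ignored position (mask 0) is dropped by selection.
x = [float("nan"), -1.0, -3.0]
mask = [0, 1, 1]
loss = masked_mean_neg(x, mask)
print(loss)
```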
As the title says, the bos_token_id is omitted during dataset preprocessing (process_data_dolly.py#L40 has add_special_tokens=False), but is not removed during evaluation (prompt_datasets.py#L66, for example, lacks add_special_tokens).
Indeed, datapoints from the preprocessed dolly dataset you've provided contain no token 1, whereas the input_ids tensors in the eval script do contain it.
So which one is the correct way to use the model?
I'm trying to run the generate_dense_embeddings script with the following command
python DPR/generate_dense_embeddings.py model_file=/root/LMOps/uprise/archive/data.pkl ctx_src=dpr_uprise shard_id=0 num_shards=1 out_file=$PWD/my_data/experiment/uprise/dpr_enc_index ctx_sources.dpr_uprise.prompt_pool_path=${PROMPT_POOL} ctx_sources.dpr_uprise.prompt_setup_type=qa encoder.cache_dir=${CACHE_DIR} hydra.run.dir=$PWD/my_data/experiment/uprise
But I'm getting the below error.
[2023-10-17 13:48:09,195][root][INFO] - Reading saved model from /root/LMOps/uprise/archive/data.pkl
Error executing job with overrides: ['model_file=/root/LMOps/uprise/archive/data.pkl', 'ctx_src=dpr_uprise', 'shard_id=0', 'num_shards=1', 'out_file=/root/LMOps/uprise/my_data/experiment/uprise/dpr_enc_index', 'ctx_sources.dpr_uprise.prompt_pool_path=prompt_pool.json', 'ctx_sources.dpr_uprise.prompt_setup_type=qa', 'encoder.cache_dir=cache/']
Traceback (most recent call last):
  File "DPR/generate_dense_embeddings.py", line 106, in main
    saved_state = load_states_from_checkpoint(cfg.model_file)
  File "/root/LMOps/uprise/DPR/dpr/utils/model_utils.py", line 170, in load_states_from_checkpoint
    state_dict = torch.load(
  File "/root/LMOps/uprise/.env/lib/python3.8/site-packages/torch/serialization.py", line 1028, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/root/LMOps/uprise/.env/lib/python3.8/site-packages/torch/serialization.py", line 1246, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.
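A possible cause, judging only from the path in the traceback: checkpoints written by recent versions of torch.save are zip archives whose members include archive/data.pkl. If model_file points at that extracted data.pkl rather than at the original checkpoint file, torch.load falls back to its legacy loader and fails with this message. A small stdlib check, with a hypothetical helper name, that tells the two cases apart:

```python
import zipfile

def is_zip_checkpoint(path):
    """True if `path` is a zip archive that contains a data.pkl member.

    This matches the layout of torch's zip-based checkpoints; an extracted
    data.pkl on its own is not a zip file and returns False.
    """
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        return any(name.endswith("data.pkl") for name in zf.namelist())
```

If this returns False for the file passed as model_file, it may be worth passing the original, un-extracted checkpoint instead.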
Hi,
Could you potentially support multiple teacher models (say, a great coding model and a great conversational model) by using a simple sum of two reverse KLDs?
θ = arg min_θ J(θ) = arg min_θ ( KL[q_θ || p_1] + KL[q_θ || p_2] )
This could potentially counter the effect of ignoring important minor modes (if and when they exist) in the probability vector. It could arguably also speed up base-model training if you already have several solid base models with partial world knowledge (a specific natural language, or a programming language).
I don't know enough math to tell whether this would complicate the gradient derivation in equation 2.
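On the math side, two observations may help, offered as a sketch rather than a statement about the repo. The gradient of a sum is the sum of the gradients, so each term can be differentiated as in the single-teacher case. In addition, the summed objective equals, up to a constant, twice the reverse KL to the normalized geometric mean of the two teachers: KL[q||p_1] + KL[q||p_2] = 2 KL[q||p_gm] - 2 log Z, with p_gm = sqrt(p_1 p_2) / Z. Since p_gm puts mass mainly where both teachers do, this objective may emphasize shared modes rather than recover the minor modes of either teacher. The distributions below are arbitrary toy values used only to check the identity numerically.

```python
import math

def kl(q, p):
    """KL[q || p] for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def normalized_geometric_mean(p1, p2):
    """Return (p_gm, Z) with p_gm proportional to sqrt(p1 * p2) and Z its normalizer."""
    unnorm = [math.sqrt(a * b) for a, b in zip(p1, p2)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

# Toy teachers with different dominant modes, and an arbitrary student q.
p1 = [0.70, 0.20, 0.10]
p2 = [0.10, 0.20, 0.70]
q = [0.30, 0.50, 0.20]

p_gm, z = normalized_geometric_mean(p1, p2)
lhs = kl(q, p1) + kl(q, p2)
rhs = 2 * kl(q, p_gm) - 2 * math.log(z)
print(abs(lhs - rhs) < 1e-12)
```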
Hello, I would like to ask whether I can apply MiniLLM to the phoenix model.
Which code would I need to modify, and how?
Thanks for your help.
Is it possible to update the scaled attention of transformers.models.xglm.modeling_xglm for Structured Prompting (https://github.com/microsoft/LMOps/tree/main/structured_prompting/hf-version)?
Thank you.
causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]
In Structured Prompting, however, key_length exceeds max_positions.
How can this issue be addressed? Thank you.
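One direction, sketched under the assumption that self.bias is a precomputed lower-triangular buffer of size max_positions x max_positions: the same slice can be computed on the fly from query_length and key_length, which removes the size cap. The function below is a plain-Python illustration with a hypothetical name. It does not address position embeddings, and it does not say whether attention across prompt groups in Structured Prompting should be purely causal.

```python
def causal_mask(query_length, key_length):
    """Equivalent of tril[key_length - query_length : key_length, :key_length].

    Row i corresponds to absolute position key_length - query_length + i and
    may attend to every key at or before that position.
    """
    offset = key_length - query_length
    return [[k <= offset + i for k in range(key_length)] for i in range(query_length)]

# Cross-check against slicing an explicit lower-triangular matrix.
n = 6
tril = [[j <= i for j in range(n)] for i in range(n)]
sliced = [row[:4] for row in tril[2:4]]
mask = causal_mask(query_length=2, key_length=4)
print(mask == sliced)
```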
Many thanks for your excellent work:
Why Can GPT Learn In-Context? Language Models Secretly Perform Finetuning as Meta Optimizers
I think this work will help us better understand ICL and the design of LMs. I am therefore looking forward to seeing the code for this work, especially the momentum attention part.
Open-sourcing it would have a positive effect. Thanks!
We know that larger batch sizes benefit from the GPU's parallelism, and the resulting speedup is often significantly larger than the 2-3x acceleration mentioned in the llma README. So I am curious whether I have misunderstood something about this source code.
It seems that if the llma algorithm is applied in batch form, the actual step length of each sentence varies at every time step, so the outputs cannot be stacked into a regular tensor.
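A common workaround for ragged per-step outputs in general, not something taken from the llma code, is to pad each sequence's accepted tokens to the longest one in the batch and carry a mask that marks the real tokens. A minimal sketch with a hypothetical helper name and made-up token ids:

```python
def pad_ragged(step_tokens, pad_id=0):
    """Right-pad per-sequence token lists to a common width; return (padded, mask)."""
    width = max((len(t) for t in step_tokens), default=0)
    padded = [t + [pad_id] * (width - len(t)) for t in step_tokens]
    mask = [[1] * len(t) + [0] * (width - len(t)) for t in step_tokens]
    return padded, mask

# Three sequences that accepted 3, 1 and 2 tokens in the same decoding step.
padded, mask = pad_ragged([[11, 12, 13], [21], [31, 32]])
print(padded)
print(mask)
```

The cost is that the padded positions still consume compute, which may reduce the speedup for batches whose step lengths differ a lot.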