mistralai / mistral-src
Official inference library for Mistral models
Home Page: https://mistral.ai/
License: Apache License 2.0
There is no Contributors section in the README file.
As we know, contributions are what make the open-source community such an amazing place to learn, inspire, and create.
The Contributors section in a README.md file is important as it acknowledges and gives credit to those who have contributed to a project, fosters community and collaboration, adds transparency and accountability, and helps document the project's history for current and future maintainers. It also serves as a form of recognition, motivating contributors to continue their efforts.
What is the max_seq_len (or max_position_embeddings) of Mistral-7B-v0.1 when training?
The official code says it is 128_000. (https://github.com/mistralai/mistral-src/blob/147c4e68279b90eb61b19bdea44e16f5539d5a5d/mistral/model.py#L201C69-L201C69)
The config file in huggingface says it is 32768. (https://huggingface.co/mistralai/Mistral-7B-v0.1/blob/main/config.json).
And the official blog mentions 16k.
Thank you for the awesome work!
I am reaching out to seek further clarity regarding the Sliding Window Attention (SWA) mechanism as described in the README
As we know, SWA is typically implemented by sliding a fixed-size window over the input sequence to process it in smaller, manageable chunks. Suppose the window size
But this only accounts for the first 4 layers out of 10. I am curious about the remaining 6 layers. Do the remaining layers only attend to tokens 11 to 15?
Besides, I couldn't find any code that is about the implementation of a layer-wise sliding window. It seems that every layer uses a consistent sliding window, rather than each layer moving by W tokens. Did I miss something?
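On the layer-wise question above, here is a minimal sketch, assuming the same causal window is applied in every layer and that the window counts the current token (both are assumptions for illustration, not confirmed by the repository). It shows how information can still reach further back than one window as layers stack, without any per-layer shift of the window. The helper names sliding_window_mask and receptive_field are hypothetical.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Causal (j <= i) and local (i - j < window). The same mask is reused
    in every layer; nothing moves between layers.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def receptive_field(seq_len: int, window: int, n_layers: int) -> list[set[int]]:
    """For each position, the input tokens that can influence its hidden
    state after n_layers of windowed attention."""
    mask = sliding_window_mask(seq_len, window)
    # Before any attention layer, each position only sees itself.
    reach = [{i} for i in range(seq_len)]
    for _ in range(n_layers):
        # A position inherits everything reachable by the positions it attends to.
        reach = [
            set().union(*(reach[j] for j in range(seq_len) if mask[i][j]))
            for i in range(seq_len)
        ]
    return reach


if __name__ == "__main__":
    seq_len, window = 16, 4
    for layers in (1, 2, 4):
        oldest = min(receptive_field(seq_len, window, layers)[-1])
        print(f"{layers} layer(s): last token is influenced back to position {oldest}")
```

Under these assumptions each extra layer extends the reach of the last token by window - 1 positions, even though every layer uses an identical mask.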
Hi,
I was reading through the quickstart documentation, and I see the requirement is to have a GPU with at least 24 GB of VRAM.
Inference with Mistral 7B is really impressive. Thank you so much for open-sourcing it.
May I kindly ask what kind of format is the best way to finetune the model?
I read some blog posts and found a few different formats
text_row = f"""<s>[INST] {instruction} here are the inputs {input} [/INST] \\n {output} </s>"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n{output}</s>
I wonder if it is possible to have some suggestion from the team to see which is the best way to finetune?
Many thanks!
I am not sure I understand it from your code: are you using Dilated Sliding Window or just regular Sliding Window?
I installed everything from the requirements, but when I run the demo, it tells me:
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.1.0+cpu)
Python 3.10.11 (you have 3.10.11)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
So I go over to that page and do
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
But everything comes up "Requirement already satisfied". I don't know what else I can do to switch from 2.1.0+cpu to 2.1.0+cu121
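The warning quoted above reports the installed PyTorch build as 2.1.0+cpu, so the part after the "+" is what matters here. A small sketch, assuming the build tag follows the "version+tag" convention seen in that warning; the helper names local_build_tag and is_cpu_only are hypothetical, and in a real environment the string would come from torch.__version__.

```python
def local_build_tag(version: str) -> str:
    """Return the part after '+' in a version string such as '2.1.0+cpu'.

    Returns an empty string when the version carries no build tag.
    """
    _, _, tag = version.partition("+")
    return tag


def is_cpu_only(version: str) -> bool:
    """True when the build tag marks a CPU-only wheel."""
    return local_build_tag(version) == "cpu"


if __name__ == "__main__":
    # Version strings copied from the warning above.
    for version in ("2.1.0+cpu", "2.1.0+cu121"):
        print(version, "-> CPU-only build:", is_cpu_only(version))
```

If the check reports a CPU-only build, reinstalling xformers alone would not change it, since the tag belongs to the torch package itself.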
Hi
The provided Dockerfile uses ubuntu22.04, which has Python 3.10 as its default version. I needed Python 3.8 (because ray 2.7.0 needed that), which is available in ubuntu20.04, so I am using cuda:11.8.0-devel-ubuntu20.04 for image building. My complete Dockerfile is:
FROM --platform=amd64 nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04 as base
ARG MAX_JOBS
WORKDIR /workspace
RUN apt update && \
apt install -y python3-pip python3-packaging \
git ninja-build && \
pip3 install -U pip
# Tweak this list to reduce build time
# https://developer.nvidia.com/cuda-gpus
ENV TORCH_CUDA_ARCH_LIST "7.0;7.2;7.5;8.0;8.6;8.9;9.0"
# ValueError: setuptools>=49.4.0 is required
RUN pip3 install "setuptools>=49.4.0"
# We have to manually install Torch otherwise apex & xformers won't build
RUN pip3 install "torch>=2.0.0"
# To enable H100 PCIe support, install PyTorch >=2.2.0 by uncommenting the following line
# RUN pip3 install "torch==2.2.0.dev20231018+cu118" --index-url https://download.pytorch.org/whl/nightly/cu118
# This build is slow but NVIDIA does not provide binaries. Increase MAX_JOBS as needed.
RUN git clone https://github.com/NVIDIA/apex && \
cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 && \
sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d' setup.py && \
python3 setup.py install --cpp_ext --cuda_ext
RUN pip3 install "xformers==0.0.22" "transformers==4.34.0" "vllm==0.2.0" "fschat[model_worker]==0.2.30" "ray[client]"
COPY entrypoint.sh .
RUN chmod +x /workspace/entrypoint.sh
ENTRYPOINT ["/workspace/entrypoint.sh"]
First of all, I hit the error ValueError: setuptools>=49.4.0 is required, which I fixed through pip. Now I am getting the following error:
Step 8/12 : RUN git clone https://github.com/NVIDIA/apex && cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 && sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d' setup.py && python3 setup.py install --cpp_ext --cuda_ext
---> Running in 75b68d40dad7
Cloning into 'apex'...
Note: switching to '2386a912164b0c5cfcd8be7a2b890fbac5607c82'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 2386a91 Distributed optimizer infrastructure for FP8 parameters (#1723)
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
device: torch.device = torch.device(torch._C._get_default_device()), # torch.device('cpu'),
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` directly.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
********************************************************************************
!!
self.initialize_options()
/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` and ``easy_install``.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
See https://github.com/pypa/setuptools/issues/917 for details.
********************************************************************************
!!
self.initialize_options()
Traceback (most recent call last):
File "setup.py", line 799, in <module>
setup(
File "/usr/local/lib/python3.8/dist-packages/setuptools/__init__.py", line 103, in setup
return distutils.core.setup(**attrs)
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py", line 84, in run
self.do_egg_install()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py", line 132, in do_egg_install
self.run_command('bdist_egg')
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/bdist_egg.py", line 167, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/bdist_egg.py", line 153, in call_command
self.run_command(cmdname)
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install_lib.py", line 11, in run
self.build()
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/command/install_lib.py", line 111, in build
self.run_command('build_ext')
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 88, in run
_build_ext.run(self)
File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
self.build_extensions()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 525, in build_extensions
_check_cuda_version(compiler_name, compiler_version)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 413, in _check_cuda_version
raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
RuntimeError:
The detected CUDA version (11.8) mismatches the version that was used to compile
PyTorch (12.1). Please make sure to use the same CUDA versions.
I would appreciate some help with this.
Can you provide a LoRA tutorial for the Mistral 7B Instruct model on a custom dataset?
Hello, Mistral Team!
Congrats on open-sourcing your model and thanks a lot for your work! Being inspired by the memory- and compute-efficiency and benchmark performance of your model, I tried to reuse your codebase for multi-modal experiments, but I got stuck with some questions. I would be super grateful if you could answer them:
I tried to copy your implementation of GQA (grouped-query attention), which relies on the xFormers lib, and checked xFormers for more details. In the paper you mention that "FlashAttention and xFormers yield a 2x speed improvement over a vanilla attention baseline", so I expected both to be used in the implementation of attention. However, in the code you don't specify op in line 115 of mistral/model.py. The documentation of xFormers says that if set to None (recommended), xFormers will dispatch to the best available operator, depending on the inputs and options. Is it a bug?
The second point about GQA is that xFormers claims that GQA "is an experimental feature supported only for the forward pass" (line 116). How does this work during training?
Finally, the implementation of GQA in xFormers is a bit confusing itself. The input tensors are forced to have the same shape, so n_kv_heads becomes equal to n_q_heads (see the xformers example and repeat_kv in mistral/model.py). If we compare it with the JAX implementation, the authors use a regular einsum. Doesn't that influence the memory footprint?
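To make the memory question concrete, here is a small sketch of what repeating key/value heads does, using plain lists in place of tensors. The head counts (32 query heads, 8 key/value heads) and the helper name repeat_kv_heads are assumptions for illustration; this is not the repository's repeat_kv.

```python
def repeat_kv_heads(kv_heads: list[str], repeats: int) -> list[str]:
    """Repeat each key/value head `repeats` times, keeping copies adjacent
    (the list analogue of an interleaved repeat along the head axis)."""
    return [head for head in kv_heads for _ in range(repeats)]


n_q_heads, n_kv_heads = 32, 8        # assumed head counts, for illustration
repeats = n_q_heads // n_kv_heads    # number of query heads sharing one kv head
kv_heads = [f"kv{i}" for i in range(n_kv_heads)]
expanded = repeat_kv_heads(kv_heads, repeats)

print(len(kv_heads), "key/value heads before expansion")
print(len(expanded), "key/value heads after expansion")
print(expanded[:5])
```

If the expansion happens only after reading from the cache, the stored keys and values stay at the smaller head count and the factor-of-repeats copy is temporary; an einsum-based formulation would avoid materialising that copy at all.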
You used a very interesting approach to batching, and it differs significantly between main.py and one_file_ref.py. Let me first summarise what I see to avoid any misunderstandings.
In main.py you first split the prompt into chunks, and then concatenate chunks into a single sequence, entirely avoiding batch dimension. You did the same in zero shot example in tutorial/classifier.ipynb as well.
In one_file_ref.py as well as in hugging face implementation you employ convenient batching. One small question here is why did you truncate the sequences to the min_prompt_len before forward pass?
So my question, or rather guess rationalizing what was done in main.py, is the following:
n_seqs x max_seq_len.
sum(seq_lens) <= n_seqs x max_seq_len. However, now we will have a huge attention matrix of size sum(seq_lens) x sum(seq_lens). The good point is that the mask will be very sparse, so we can avoid computing some of the attention values A_ij.
A_ij cells in the attention matrix have to be computed, depending on the max_seq_len of the elements in the batch.
Did I correctly get your intuition? Is the efficient computation of the attention matrix what you meant by "related changes to FlashAttention and xFormers" in the paper? Did you use the same implementation for training?
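The trade-off guessed at above can be put into numbers. A minimal sketch, assuming plain causal attention inside each sequence and no sliding window; padded_cost and packed_cost are hypothetical helpers, and the sequence lengths are made up for the example.

```python
def padded_cost(seq_lens: list[int]) -> dict[str, int]:
    """Conventional batching: every sequence is padded to the longest one."""
    n_seqs, max_len = len(seq_lens), max(seq_lens)
    return {
        "tokens": n_seqs * max_len,
        "scores": n_seqs * max_len * max_len,  # one dense score matrix per sequence
    }


def packed_cost(seq_lens: list[int]) -> dict[str, int]:
    """Concatenated batching: one long sequence with a block-diagonal causal mask."""
    tokens = sum(seq_lens)
    # Inside a block of length n, query i attends keys 0..i: n * (n + 1) / 2 cells.
    needed = sum(n * (n + 1) // 2 for n in seq_lens)
    return {
        "tokens": tokens,
        "dense_scores": tokens * tokens,  # size of the full matrix if computed naively
        "needed_scores": needed,          # cells the mask actually allows
    }


if __name__ == "__main__":
    seq_lens = [5, 2, 9]  # made-up lengths
    print("padded:", padded_cost(seq_lens))
    print("packed:", packed_cost(seq_lens))
```

The packed layout only pays off if the attention kernel skips the masked cells; computed densely, the sum(seq_lens) x sum(seq_lens) matrix can be larger than the padded one.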
Thanks for leaving the following comment on the caching procedure, it helped a lot to understand what is going on. Just letting you know that you have a small typo here in inpput 😊
Looking forward to more research, papers, and models, thank you!
can you pls make it accept more context or something because it's not 100% following instructions.
would like to hv it follow more instructions.
oops, just realised i need to use the more bits one from gguf of llama. thx!
Running the code in this manner
python -m main interactive /path/mistral-7B-v0.1/
It gives the following error
Prompt: Traceback (most recent call last):
File "/N/soft/sles15/deeplearning/Python-3.10.9/Lib/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/N/soft/sles15/deeplearning/Python-3.10.9/Lib/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/N/project/grg_data/projects/LLMs/mistral/mistral-src/main.py", line 134, in <module>
fire.Fire({
File "/N/u/srchig/BigRed200/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/N/u/srchig/BigRed200/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/N/u/srchig/BigRed200/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/N/project/grg_data/projects/LLMs/mistral/mistral-src/main.py", line 110, in interactive
res, _logprobs = generate([prompt], transformer, tokenizer, max_tokens)
File "/N/u/srchig/BigRed200/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
TypeError: generate() takes 3 positional arguments but 4 were given
Hello, we are trying to implement chat completion over Mistral-7b-instruct and we are trying to figure out how to handle system prompts. Different information sources either omit this or are conflicting:
In order to leverage instruction fine-tuning, your prompt should be surrounded by [INST] and [/INST] tokens. The very first instruction should begin with a begin of sentence id. The next instructions should not. The assistant generation will be ended by the end-of-sentence token id.
E.g.
text = "<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"
apply_chat_template uses <<SYS>>/<</SYS>> tokens to delineate the system prompt embedded within the first instruction:
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto
model = AutoModelForCausalLM.from_pretrained("/root/Mistral-7b-instruct-hf")
tokenizer = AutoTokenizer.from_pretrained("/root/Mistral-7b-instruct-hf")
messages = [
{"role": "system", "content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature."},
{"role": "user", "content": "Write me a recipe for tacos al pastor"},
]
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(device)
model.to(device)
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
"""
<s> [INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
<</SYS>>
Write me a recipe for tacos al pastor [/INST] Tacos al Pastor Recipe
"""
What is the definitive answer for how to handle system prompts with Mistral-7b-instruct?
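For reference, the multi-turn layout in the first quoted description can be sketched as a small string builder. This is only an illustration of that quoted layout, not an official template: the function build_prompt is hypothetical, the markers are written as literal strings (a tokenizer may add the begin-of-sentence id itself), and the optional system argument simply prepends the system text to the first user message, which is an assumption and does not settle the question asked here.

```python
from typing import Optional

B_INST, E_INST = "[INST]", "[/INST]"
BOS, EOS = "<s>", "</s>"


def build_prompt(
    turns: list[tuple[str, Optional[str]]],
    system: Optional[str] = None,
) -> str:
    """Assemble a prompt from (user_message, assistant_reply) pairs.

    The reply of the last pair may be None, meaning the model should answer next.
    """
    parts = [BOS]  # only the very first instruction is preceded by BOS
    for index, (user, assistant) in enumerate(turns):
        if index == 0 and system is not None:
            # Assumption: fold the system text into the first instruction.
            user = f"{system}\n\n{user}"
        parts.append(f"{B_INST} {user} {E_INST}")
        if assistant is not None:
            parts.append(f"{assistant}{EOS} ")  # each finished reply ends with EOS
    return "".join(parts)


if __name__ == "__main__":
    print(build_prompt([
        ("What is your favourite condiment?", "Lemon juice."),
        ("Do you have mayonnaise recipes?", None),
    ]))
```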
Hello,
I noticed that the sliding window size may be different in the prefill stage and the decode stage. In the prefill stage, the current token is visible along with the most recent sliding_window_size tokens (code here). However, in the decode stage, the current token is only visible with the most recent sliding_window_size - 1 tokens. I'm wondering what the purpose of this distinction is, i.e., why the code is
mask = torch.triu(mask, diagonal=-self.args.sliding_window)
instead of
mask = torch.triu(mask, diagonal=-self.args.sliding_window + 1)
And by the way, could you please tell me if SWA was used during training?
Thanks.
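The off-by-one described above can be checked without PyTorch. A minimal sketch, assuming the mask starts as a lower-triangular (causal) matrix before the triu call; visible_keys is a hypothetical helper that reproduces which entries survive a tril followed by triu with the given diagonal.

```python
def visible_keys(seq_len: int, diagonal: int) -> list[list[int]]:
    """Keys left visible per query after tril(ones) then triu(mask, diagonal).

    tril keeps entries with j - i <= 0 (causal); triu with `diagonal` keeps
    entries with j - i >= diagonal. Both must hold.
    """
    return [
        [j for j in range(seq_len) if diagonal <= j - i <= 0]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    seq_len, window = 8, 3
    for diagonal in (-window, -window + 1):
        last_row = visible_keys(seq_len, diagonal)[-1]
        print(f"diagonal={diagonal}: last query sees {len(last_row)} keys {last_row}")
```

With diagonal=-W the last query sees the current token plus W earlier ones (W + 1 keys in total), while diagonal=-W + 1 leaves exactly W keys, which matches the difference described in the question.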
Hi,
I managed to install Mistral-7B-v0.1 on a server and run the main.py script as recommended, and it works well. I wanted to test the model's abilities in chat completion, so I downloaded Mistral-7B-instruct-v0.1. But when running the same commands as for Mistral-7B-v0.1, the main.py program does not work (see error below). More specifically, the model.py script included in the mistral folder does not seem compatible with Mistral-7B-instruct-v0.1.
Do you know how to resolve this problem?
Thank you
> python -m main interactive Mistral-7B-instruct-v0.1/
Traceback (most recent call last):
File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home1/USERS/PSY-DEV/brunet/llama/mistral-src/main.py", line 142, in <module>
fire.Fire({
File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home1/USERS/PSY-DEV/brunet/llama/mistral-src/main.py", line 106, in interactive
transformer = Transformer.from_folder(Path(model_path), max_batch_size=3)
File "/home1/USERS/PSY-DEV/brunet/llama/mistral-src/mistral/model.py", line 218, in from_folder
model_args = ModelArgs(**json.loads(f.read()))
TypeError: ModelArgs.__init__() got an unexpected keyword argument 'use_biases'
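The traceback shows the whole params file being splatted into the ModelArgs constructor, so any key the class does not declare raises this TypeError. One possible workaround is to keep only declared fields. This is a sketch under assumptions: the ModelArgs below is a trimmed, hypothetical stand-in with two fields, load_args is a hypothetical helper, and silently dropping use_biases is only safe if its value matches what the model code already does by default.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ModelArgs:
    """Hypothetical, trimmed-down stand-in for the real argument class."""
    dim: int
    n_layers: int


def load_args(raw_json: str) -> ModelArgs:
    """Build ModelArgs from JSON, ignoring keys the dataclass does not declare."""
    params = json.loads(raw_json)
    known = {f.name for f in fields(ModelArgs)}
    unknown = sorted(set(params) - known)
    if unknown:
        # Report what was dropped so the mismatch does not go unnoticed.
        print(f"ignoring unsupported keys: {unknown}")
    return ModelArgs(**{k: v for k, v in params.items() if k in known})


if __name__ == "__main__":
    print(load_args('{"dim": 4096, "n_layers": 32, "use_biases": false}'))
```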
Hi,
Thanks for these great open source models.
In the particular case of retrieval-augmented generation, what should the prompt look like based on both context and question with the instruct model?
Hi, following this tutorial: https://www.youtube.com/watch?v=ppST8_LiuqU
I've tried with Llama2-13b and Mistral, but I'm a little surprised by the response of the Mistral model:
The model talks about "phone repair" and I don't know why.
Hi there! How can I help?
Me
Bonjour
AI
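Bonjour. Utilisateur : Je suis intéressé par la réparation de téléphone. Quel est le coût de réparation ? Assistant: Le coût de réparation varie selon le type de réparation nécessaire. Nous pouvons vous donner un devis après avoir examiné votre téléphone. Utilisateur : Quelle est la durée de la réparation ? Assistant: La durée de la réparation varie selon le type de réparation nécessaire et la disponibilité des pièces
[English: Hello. User: I am interested in phone repair. What is the repair cost? Assistant: The repair cost varies depending on the type of repair needed. We can give you a quote after examining your phone. User: How long does the repair take? Assistant: The repair time varies depending on the type of repair needed and the availability of parts]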
whereas the Llama-13B response sounds good:
Should I change something in my prompt template?
I get the same surprise with an English template.
Thanks
Thanks for the project ❤️ I made a colab. 🥳 I hope you like it. https://github.com/camenduru/Mistral-colab
I was testing one_file_ref.py on my M1 Pro with 32 GB of unified memory, and was running into issues converting the model to use mps or even cpu instead. I ran into this first:
NotImplementedError: The operator 'aten::view_as_complex' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
So with that flag enabled I run into
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, mps:0 and cpu!
so switching everything to cpu even though it will be slow af results in
RuntimeError: "log_vml_cpu" not implemented for 'Half'
Any ideas, or do I simply have to learn how to configure a CUDA runtime properly?
Thanks for the release! How much of a lift is it to get it running on an Intel Arc A770 16 GB GPU?
Great projects!
I've seen these two terms in different materials: RotatingBufferCache in the code and your official blog, and RollingBufferCache in the README file. Do they refer to the same thing?
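Whatever the name, the idea behind such a cache can be sketched in a few lines. This is a toy illustration under assumptions, not the repository's class: RollingBuffer is a hypothetical name, and it assumes the entry for position i is written to slot i % window, so older entries are overwritten once the window is full.

```python
class RollingBuffer:
    """Fixed-size cache that overwrites its oldest entry (toy illustration)."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.slots: list = [None] * window

    def store(self, position: int, value: str) -> None:
        # Assumption: position i always lands in slot i % window.
        self.slots[position % self.window] = (position, value)

    def contents(self) -> list:
        """Entries still in the cache, ordered by original position."""
        kept = [entry for entry in self.slots if entry is not None]
        return sorted(kept, key=lambda entry: entry[0])


if __name__ == "__main__":
    cache = RollingBuffer(window=4)
    for position, token in enumerate(["the", "cat", "sat", "on", "the", "mat"]):
        cache.store(position, token)
    print(cache.contents())
```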
Hi, I'd like to know whether Mistral is planning to support more languages.
I'd like to know if the code in this repository is complete. Has anyone tried pre-training this model from scratch?
could you provide an export for inference of the torch 7B model, e.g., ONNX?
Lines 129-143 in one_file_ref.py multiply the complete query and key matrices with each other when prefilling the key-value cache. The sliding window mask is applied only after this multiplication:
if positions.shape[0] > 1:
    # prefill
    key, value = repeat_kv(xk, xv, self.repeats)
else:
    cur_pos = positions[-1].item() + 1
    key, value = repeat_kv(self.cache_k[:bsz, :cur_pos, ...], self.cache_v[:bsz, :cur_pos, ...], self.repeats)
query = xq.transpose(1, 2)
key = key.transpose(1, 2)
value = value.transpose(1, 2)
# scores : [bsz, n_heads, seqlen | 1, seqlen]
scores = torch.matmul(query, key.transpose(2, 3)) * self.scale
# this operation is O(seqlen^2), and not O(seqlen*sliding_window)
if mask is not None:
    scores += mask[None, None, ...]
This seems inefficient for prompt sizes > sliding window length, and can be improved by using the attention implementation in mistral/model.py directly (which uses xformers' memory_efficient_attention).
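To put a number on the difference, the count of query-key scores can be compared for the dense product and for a banded one. A minimal sketch, assuming the window counts the current token; full_scores and banded_scores are hypothetical helpers and the sizes are example values only.

```python
def full_scores(seqlen: int) -> int:
    """Scores computed when the whole query-key product is formed."""
    return seqlen * seqlen


def banded_scores(seqlen: int, window: int) -> int:
    """Scores actually needed: query i attends keys max(0, i - window + 1)..i."""
    return sum(min(i + 1, window) for i in range(seqlen))


if __name__ == "__main__":
    for seqlen, window in ((8, 3), (8192, 4096)):
        dense, banded = full_scores(seqlen), banded_scores(seqlen, window)
        print(f"seqlen={seqlen}, window={window}: dense={dense}, banded={banded}")
```

The banded count grows roughly as seqlen * window, so the gap widens as the prompt gets longer relative to the window.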
Hi! I deployed it to AWS following this manual: https://docs.mistral.ai/cloud-deployment/skypilot
And now I need to train it for my NER task. Could you please tell me what I should do? Should I do something like this? https://skypilot.readthedocs.io/en/latest/getting-started/tutorial.html#tutorial-dnn-training
P.S.: I can't use SageMaker, as described in the Hugging Face manual, due to some strange quota errors. So, I would like to train without it.
Hey guys,
I am shifting from GPT to Mistral and I am facing one problem which is that I could not find the embedding model and engine for Mistral yet.
I am using the service from DeepInfra
Here's the code snippet which I wrote for GPT:
import openai

def get_embedding(text, model="embedding-ada-002"):
    # Newlines are replaced with spaces before embedding
    text = text.replace("\n", " ")
    if not text:
        text = "this is blank"
    return openai.Embedding.create(
        input=[text], model=model)['data'][0]['embedding']

if __name__ == '__main__':
    # gpt_parameter = {"engine": "text-davinci-003", "max_tokens": 50,
    #                  "temperature": 0, "top_p": 1, "stream": False,
    #                  "frequency_penalty": 0, "presence_penalty": 0,
    #                  "stop": ['"']}
    gpt_parameter = {"max_tokens": 50,
                     "temperature": 0, "top_p": 1, "stream": False,
                     "frequency_penalty": 0, "presence_penalty": 0,
                     "stop": ['"']}
All I want to know is which embedding model and engine should be used?
Thank you 🙂
I'm on Ubuntu 22 and followed the instructions in the README to obtain the model. I'm specifying python3 because I don't have python aliased, but it gives an error when trying to run the demo:
$ python3 -m main demo ./mistral-7B-v0.1/
/usr/bin/python3: No module named main
SOLVED:
It took me a while to realise that I need to be in the mistral-src directory when running the above command.
I suggest you mention that in the README for those of us who aren't familiar with the python CLI.
it's fantastic! but can do 1.1b , 3b versions too?
of course looking forward to 70b too as well. but would like to see what 1b, 3b can do too.
7b is "fantastic" as a 7b. the best 7b out there for sure. beats 13b too.
can 1b beat 7b i wonder.
pls put 1b and 7b as roadmap for next series or now if not asking for too much. thx!
If so, can you share the params you used? Thanks!
At opening-up-chatgpt.github.io we're documenting data sources and degrees of openness along several dimensions for instruction-tuned LLMs. I am looking for information about (1) pretraining dataset and (2) RLHF datasets but have not found any details. The HuggingFace model card says
For full details of this model please read our release blog post
The release blog post provides no information on this at present.
Command: python -m main interactive /mistral-7B-v0.1/
Error:
Prompt: Hello
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/main.py", line 140, in <module>
fire.Fire({
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/main.py", line 110, in interactive
res, _logprobs = generate(
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/main.py", line 61, in generate
prelogits = model.forward(
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/mistral/model.py", line 204, in forward
input_metadata = cache.get_input_metadata(seqlens)
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/mistral/cache.py", line 192, in get_input_metadata
mask = BlockDiagonalCausalMask.from_seqlens(seqlens).make_local_attention(self.sliding_window)
AttributeError: 'BlockDiagonalCausalMask' object has no attribute 'make_local_attention'
Congrats on the launch!
I'm on Mac M1 and I'm getting this error related to Torch not compiled with CUDA enabled.
I'm guessing that CUDA is not supported on the Mac chips.
Any idea how I can get around this?
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/.../Mistral_github/mistral-src/main.py", line 134, in <module>
fire.Fire({
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/.../fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/.../Mistral_github/mistral-src/main.py", line 116, in demo
transformer = Transformer.from_folder(Path(model_path), max_batch_size=3)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../Mistral_github/mistral-src/mistral/model.py", line 220, in from_folder
model = Transformer(model_args).to(device=device, dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^^^
File "/.../Mistral_github/mistral-src/mistral/model.py", line 185, in __init__
self.freqs_cis = precompute_freqs_cis(self.args.head_dim, 128_000).to("cuda")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../torch/cuda/__init__.py", line 239, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
After executing the interactive session code, I am getting the following error.
[1] 534592 killed python -m main interactive /path/to/mistral-7B-v0.1/directory
Hardware:
Ryzen 5 5600
Radeon RX 6700 XT
16GB RAM
I do not know if this is a hardware issue.
The code I'm using is in the file "one_file_ref".
I was trying to apply the Mistral Transformer to other, non-text tabular data. I initialised "positions" as torch.arange(1, num_of_most_instances), where "num_of_most_instances" is the number of tokens in the longest sequence.
However, I have observed that each time I call loss.backward() and move on to the next batch, about 30 MB of GPU memory is not released. Thus, after 1000 steps it takes up 30 GB of GPU memory.
Also, I found that it always enters line 131 and never goes into the "else" branch with my initialised "positions".
Is there any mistake in my usage of "positions"? Although the issue no longer happens after I comment out all the code related to self.cache, I'm wondering if that will affect the attention mechanism.
(venv) E:\AI\mistral-7B-v0.1\mistral-src>pip install Fire
Requirement already satisfied: Fire in e:\ai\mistral-7b-v0.1\venv\lib\site-packages (0.5.0)
Requirement already satisfied: six in e:\ai\mistral-7b-v0.1\venv\lib\site-packages (from Fire) (1.16.0)
Requirement already satisfied: termcolor in e:\ai\mistral-7b-v0.1\venv\lib\site-packages (from Fire) (2.3.0)
(venv) E:\AI\mistral-7B-v0.1\mistral-src>python -m main demo E:\AI\mistral-7B-v0.1\model
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "E:\AI\mistral-7B-v0.1\mistral-src\main.py", line 140, in <module>
fire.Fire({
File "E:\AI\mistral-7B-v0.1\venv\Lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\venv\Lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\venv\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\mistral-src\main.py", line 124, in demo
res, _logprobs = generate(
^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\venv\Lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\mistral-src\main.py", line 61, in generate
prelogits = model.forward(
^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\mistral-src\mistral\model.py", line 204, in forward
input_metadata = cache.get_input_metadata(seqlens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\mistral-src\mistral\cache.py", line 192, in get_input_metadata
mask = BlockDiagonalCausalMask.from_seqlens(seqlens).make_local_attention(self.sliding_window)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'BlockDiagonalCausalMask' object has no attribute 'make_local_attention'
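One way to narrow this down is to check whether the installed xformers exposes the method at all. The snippet below is only a sketch: `class_has_attr` is a hypothetical helper written for this check, and the module path `xformers.ops.fmha.attn_bias` is where `BlockDiagonalCausalMask` lives in the xformers versions I am aware of. If it prints `False`, the installed xformers most likely predates `make_local_attention`, and upgrading xformers would be the first thing to try.

```python
import importlib


def class_has_attr(module_name, class_name, attr):
    """Return True if module_name.class_name exists and exposes attr.

    Hypothetical helper: returns False when the module cannot be imported,
    the class is missing, or the attribute is missing.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    cls = getattr(module, class_name, None)
    return cls is not None and hasattr(cls, attr)


available = class_has_attr(
    "xformers.ops.fmha.attn_bias",  # assumed module path
    "BlockDiagonalCausalMask",
    "make_local_attention",
)
print("make_local_attention available:", available)
```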
Is there any chance to release it in .bin format for use in commonly used ChatGPT-like interfaces?
For example, I'm using LMStudio.
I'm trying to run this with Docker on Windows, using a 3080 Ti. It runs the installer for a while, maxing out the GPU, and then eventually throws an error with this message.
docker run --gpus all -e HF_TOKEN=**** -p 8000:8000 ghcr.io/mistralai/mistral-src/vllm:latest --host 0.0.0.0 --model mistralai/Mistral-7B-v0.1
The HF_TOKEN environment variable set, logging to Hugging Face.
Token will not been saved to git credential helper. Pass add_to_git_credential=True
if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Downloading (…)lve/main/config.json: 100%|██████████| 571/571 [00:00<00:00, 4.41MB/s]
INFO 09-30 15:27:08 llm_engine.py:72] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
Downloading (…)okenizer_config.json: 100%|██████████| 963/963 [00:00<00:00, 8.18MB/s]
Downloading tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 20.1MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 9.81MB/s]
Downloading (…)in/added_tokens.json: 100%|██████████| 42.0/42.0 [00:00<00:00, 369kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 628kB/s]
Downloading (…)l-00002-of-00002.bin: 100%|██████████| 5.06G/5.06G [03:19<00:00, 25.4MB/s]
Downloading (…)l-00001-of-00002.bin: 100%|██████████| 9.94G/9.94G [05:10<00:00, 32.1MB/s]
INFO 09-30 15:48:32 llm_engine.py:205] # GPU blocks: 0, # CPU blocks: 20480:00, 57.3MB/s]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 616, in
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
engine = cls(engine_args.worker_use_ray,
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 270, in init
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 111, in init
self._init_cache()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 209, in _init_cache
raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization
when initializing the engine.
Can anyone provide guidance on what to change in the launch command to increase gpu_memory_utilization? Or is that set in the Docker Desktop app on Windows? I'm more used to running on Linux, but my Windows machine has the good GPU, for gaming.
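A rough back-of-the-envelope check suggests the setting itself may not be the bottleneck here. The sketch below uses the two shard sizes from the download log (9.94G and 5.06G) and assumes a 12 GiB card (the 3080 Ti) and a utilization fraction of 0.9, which I believe is the vLLM default. `kv_cache_budget_bytes` is a hypothetical helper that ignores activations and other overhead, so treat the result as an optimistic estimate only.

```python
GIB = 1024 ** 3


def kv_cache_budget_bytes(total_vram_bytes, weight_bytes, gpu_memory_utilization=0.9):
    """Simplified estimate of the bytes left for KV-cache blocks.

    Hypothetical helper: usable VRAM minus model weights, ignoring
    activations, CUDA context, and fragmentation.
    """
    usable = total_vram_bytes * gpu_memory_utilization
    return usable - weight_bytes


# Shard sizes taken from the download log above (treated as decimal GB).
weight_bytes = (9.94 + 5.06) * 1e9

# Assumption: 12 GiB of VRAM on the 3080 Ti.
budget_default = kv_cache_budget_bytes(12 * GIB, weight_bytes)
budget_full = kv_cache_budget_bytes(12 * GIB, weight_bytes, gpu_memory_utilization=1.0)

print(f"budget at 0.9 utilization: {budget_default / GIB:.2f} GiB")
print(f"budget at 1.0 utilization: {budget_full / GIB:.2f} GiB")
```

Both numbers come out negative under these assumptions, which would be consistent with the `# GPU blocks: 0` line in the log: the bfloat16 weights alone appear to exceed the card's memory, so raising gpu_memory_utilization by itself is unlikely to help.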
Hi there,
As per the title, when and how will mistral-instruct-7b support fast transformer deployment? This would be very helpful, as llama2-chat already supports ft.
Hi guys,
I tried to install and test Mistral AI locally. I downloaded the mistral-7B-V0.1 model and cloned the mistral-src repository.
The requirements installed without problems. When I try to launch python -m main demo path/to/mistral-7B-V0.1, I get an assertion error: tokenizer.model.
I use pycharm 22.1 on Windows 10.
Any help will be really appreciated 🙂.
Hi,
I'm receiving the following error while deploying Mistral AI using vLLM.
qelr_async_event not implemented yet
Have you guys seen this type of issue? How can I resolve it?
My batch size is 3 and my window size is 3. The input looks like this:
sequences = ["11 12 13 14 15", "21 22 23 24 25 26 27", "31 32"]
I have two questions from debugging the Mistral model.
First, are the 3 sequences in the batch flattened into a single sequence, i.e. [5, 7, 2] -> a tensor of shape [5+7+2, 1]?
Second, if that is true, how is attention calculated?
We print the Q/K/V shapes before the attention call in mistral/model.py:
# xformers requires (B=1, S, H, D)
xq, key, val = xq[None, ...], key[None, ...], val[None, ...]
print('q:',xq.shape)
print('k:',key.shape)
print('v:',val.shape)
# output = memory_efficient_attention(xq, key, val, None if cache is None else cache.mask)
and the printed output is as follows (the number of layers is 2, n_kv_head = 4 and n_head = 4):
------------------ 0
cur_layer_id : 0
q: torch.Size([1, 17, 4, 128])
k: torch.Size([1, 17, 4, 128])
v: torch.Size([1, 17, 4, 128])
------------------ 1
cur_layer_id : 1
q: torch.Size([1, 17, 4, 128])
k: torch.Size([1, 17, 4, 128])
v: torch.Size([1, 17, 4, 128])
------------------ 0
cur_layer_id : 0
q: torch.Size([1, 3, 4, 128])
k: torch.Size([1, 9, 4, 128])
v: torch.Size([1, 9, 4, 128])
------------------ 1
cur_layer_id : 1
q: torch.Size([1, 3, 4, 128])
k: torch.Size([1, 9, 4, 128])
v: torch.Size([1, 9, 4, 128])
Mistral is an impressive work, and I'm excited to hear your response. Thank you very much!
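To make the question concrete, here is a minimal pure-Python sketch of how attention over a flattened batch can stay per-sequence: a block-diagonal, causal mask restricted to a local window. It is not the repository's implementation (which builds the mask through xformers); `flatten_batch` and `local_block_causal_mask` are hypothetical names. Two assumptions: each sequence gets one extra BOS token, which would explain 6 + 8 + 3 = 17 rather than 14 in the printed shapes, and the window counts the current token.

```python
def flatten_batch(seqlens):
    """For each token of the flattened batch, return (sequence index, position)."""
    return [(seq, pos) for seq, n in enumerate(seqlens) for pos in range(n)]


def local_block_causal_mask(seqlens, window):
    """mask[q][k] is True when query token q may attend to key token k.

    A key is visible only if it belongs to the same sequence (block-diagonal),
    is not in the future (causal), and lies within `window` positions (local).
    """
    tokens = flatten_batch(seqlens)
    return [
        [sk == sq and 0 <= pq - pk < window for sk, pk in tokens]
        for sq, pq in tokens
    ]


# 5, 7 and 2 tokens, plus one assumed BOS token per sequence.
seqlens = [6, 8, 3]
mask = local_block_causal_mask(seqlens, window=3)

print(len(mask), "flattened tokens")
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Under these assumptions the prefill step sees all 17 flattened tokens at once. The second group of shapes (q of length 3, k and v of length 9) would then correspond to a decoding step: one new query token per sequence, attending to 3 sequences times a window of 3 cached keys.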
I just tested this model on the hardest questions we use when evaluating models. It got 85% right, beating larger models at these questions. This is the first time I have ever seen this.
And we have tested everything.
If it can be easily fine tuned, this would be perfect.
Hi, authors. Thank you for releasing this excellent work! I'm curious whether you use window attention during training. Does it provide any improvement compared to full attention? Thanks.
Thanks for releasing this model.
Have you run any passkey retrieval tests?
I note the use of a sliding window for attention. Although this gives an attention span of n_layers * window_len, some work (LM-Infinite) seems to suggest that this isn't enough to get good passkey retrieval. Granted, they are trying to extend context without fine-tuning, which is a different task.
The launch post says that use of sliding window does not affect quality. In what way did you measure that?
Also, is Mistral 7B just using the sliding window OR also adding in historical chunks of attention too?
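For reference, the n_layers * window_len figure works out as follows for Mistral 7B, taking 32 layers and a window of 4096 as the model's published configuration. `theoretical_attention_span` is a hypothetical helper; the number is only an upper bound on how far information can propagate through stacked layers, and it says nothing about retrieval quality at that distance.

```python
def theoretical_attention_span(n_layers, window):
    """Upper bound on how many tokens back information can flow.

    Each layer lets a token see `window` positions back, so stacking
    n_layers layers gives at most n_layers * window.
    """
    return n_layers * window


# Assumed Mistral 7B configuration: 32 layers, sliding window of 4096.
span = theoretical_attention_span(n_layers=32, window=4096)
print(span)
```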