
Comments (5)

mgoin commented on June 10, 2024

Hi @mneedham, LLMs for text generation like Llama only support running in a "text-generation" pipeline, so please use that task name instead of sentiment-analysis. You can also use a TextGeneration object directly; see the documentation here: https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md
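For reference, a rough sketch of the corrected usage (the SparseZoo stub below is just an example; substitute your own stub or local deployment directory):

from deepsparse import TextGeneration

# "text-generation" is the supported task for Llama-style models; the
# TextGeneration helper wraps Pipeline.create("text_generation", ...) for you.
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"  # or a local deployment directory
pipeline = TextGeneration(model=model_path)
output = pipeline(prompt="Who is the president of the United States?")
print(output.generations[0].text)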


mneedham commented on June 10, 2024

Ah ok. I tried it like this:

docker run -it -v $PWD/downloads:/tmp deepsparse:0.0.3

This is what's in the downloads/llama2 directory:

$ tree downloads/llama2
downloads/llama2
├── deployment
│   ├── config.json
│   ├── model.data
│   ├── model.onnx
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── deployment.tar.gz
└── downloads

And then I ran it:

from deepsparse import TextGeneration

zoo_stub = "/tmp/llama2/deployment"  
pipeline = TextGeneration(model=zoo_stub)
2023-11-25 17:49:18 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
[nm_ort ffff987fed40 >ERROR< init src/libdeepsparse/ort_engine/ort_engine.cpp:538] std exception  Node (concat.past_key_values.0.value_transposed) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=32 Target=128 Dimension=2

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 4
      1 from deepsparse import TextGeneration
      3 zoo_stub = "/tmp/llama2/deployment"
----> 4 pipeline = TextGeneration(model=zoo_stub)

File /usr/local/lib/python3.11/site-packages/deepsparse/pipeline.py:814, in text_generation_pipeline(model, *args, **kwargs)
    809 """
    810 :return: text generation pipeline with the given args and
    811     kwargs passed to Pipeline.create
    812 """
    813 kwargs = _parse_model_arg(model, **kwargs)
--> 814 return Pipeline.create("text_generation", *args, **kwargs)

File /usr/local/lib/python3.11/site-packages/deepsparse/base_pipeline.py:210, in BasePipeline.create(task, **kwargs)
    204     buckets = pipeline_constructor.create_pipeline_buckets(
    205         task=task,
    206         **kwargs,
    207     )
    208     return BucketingPipeline(pipelines=buckets)
--> 210 return pipeline_constructor(**kwargs)

File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/pipelines/text_generation.py:281, in TextGenerationPipeline.__init__(self, sequence_length, prompt_sequence_length, force_max_tokens, internal_kv_cache, generation_config, **kwargs)
    278 if not self.tokenizer.pad_token:
    279     self.tokenizer.pad_token = self.tokenizer.eos_token
--> 281 self.engine, self.multitoken_engine = self.initialize_engines()
    283 # auxiliary flag for devs to enable debug mode for the pipeline
    284 self._debug = False

File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/pipelines/text_generation.py:361, in TextGenerationPipeline.initialize_engines(self)
    346 if (
    347     self.cache_support_enabled and self.enable_multitoken_prefill
    348 ) or not self.cache_support_enabled:
   (...)
    353     #   (the prompt is processed in a single pass, prompts length is fixed at
    354     #   sequence_length)
    355     input_ids_length = (
    356         self.prompt_sequence_length
    357         if self.cache_support_enabled
    358         else self.sequence_length
    359     )
--> 361     multitoken_engine = NLDecoderEngine(
    362         onnx_file_path=self.onnx_file_path,
    363         engine_type=self.engine_type,
    364         engine_args=self.engine_args,
    365         engine_context=self.context,
    366         sequence_length=self.sequence_length,
    367         input_ids_length=input_ids_length,
    368         internal_kv_cache=self.internal_kv_cache,
    369         timer_manager=self.timer_manager,
    370     )
    372 if self.cache_support_enabled:
    373     engine = NLDecoderEngine(
    374         onnx_file_path=self.onnx_file_path,
    375         engine_type=self.engine_type,
   (...)
    381         timer_manager=self.timer_manager,
    382     )

File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/engines/nl_decoder_engine.py:82, in NLDecoderEngine.__init__(self, onnx_file_path, engine_type, engine_args, sequence_length, input_ids_length, engine_context, internal_kv_cache, timer_manager)
     78     if internal_kv_cache and engine_type == DEEPSPARSE_ENGINE:
     79         # inform the engine, that are using the kv cache
     80         engine_args["cached_outputs"] = output_indices_to_be_cached
---> 82 self.engine = create_engine(
     83     onnx_file_path=onnx_file_path,
     84     engine_type=engine_type,
     85     engine_args=engine_args,
     86     context=engine_context,
     87 )
     88 self.timer_manager = timer_manager or TimerManager()
     89 self.sequence_length = sequence_length

File /usr/local/lib/python3.11/site-packages/deepsparse/pipeline.py:759, in create_engine(onnx_file_path, engine_type, engine_args, context)
    754         return MultiModelEngine(
    755             model=onnx_file_path,
    756             **engine_args,
    757         )
    758     engine_args.pop("cache_output_bools", None)
--> 759     return Engine(onnx_file_path, **engine_args)
    761 if engine_type == ORT_ENGINE:
    762     return ORTEngine(onnx_file_path, **engine_args)

File /usr/local/lib/python3.11/site-packages/deepsparse/engine.py:327, in Engine.__init__(self, model, batch_size, num_cores, num_streams, scheduler, input_shapes, cached_outputs)
    317         self._eng_net = LIB.deepsparse_engine(
    318             model_path,
    319             engine_batch_size,
   (...)
    324             cached_outputs,
    325         )
    326 else:
--> 327     self._eng_net = LIB.deepsparse_engine(
    328         self._model_path,
    329         engine_batch_size,
    330         self._num_cores,
    331         self._num_streams,
    332         self._scheduler.value,
    333         None,
    334         cached_outputs,
    335     )
    337 if self._batch_size is None:
    338     os.environ.pop("NM_DISABLE_BATCH_OVERRIDE", None)

RuntimeError: NM: error: Node (concat.past_key_values.0.value_transposed) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=32 Target=128 Dimension=2


dbogunowicz commented on June 10, 2024

Hey @mneedham

I'm having difficulty reproducing your error, so let's start from a minimal working example. I am afraid that once the llama2 model was downloaded to your disk, you may have unintentionally modified it while running the previous, incorrect commands.

Setup

Spin up your Docker container as you did before; as far as container initialization goes, I think you are doing everything correctly. Make sure that ROOT/.cache/sparsezoo/ is empty, so there is no lingering, potentially corrupted, llama2 model in your cache.
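As a rough sketch (assuming the cache lives in the default location under your home directory), you could inspect and clear it from Python like this:

import shutil
from pathlib import Path

# Assumed default SparseZoo cache location; adjust the path if yours differs
cache_dir = Path.home() / ".cache" / "sparsezoo"

if cache_dir.exists():
    print("Cache contents:", [p.name for p in cache_dir.iterdir()])
    shutil.rmtree(cache_dir)  # drop any lingering, possibly corrupted, downloads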

For completeness, my setup is:
ubuntu-20.04
deepsparse-nightly (fresh pip install -U deepsparse-nightly[llm])
python 3.10

Run minimal example

Now enter your docker container and execute:

from deepsparse import TextGeneration
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"
pipeline = TextGeneration(model=model_path)
generations = pipeline(prompt="Who is the president of the United States?")
print(generations)

You should see output similar to this:

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Downloading (…)ed/deployment.tar.gz: 100%|██████████| 3.92G/3.92G [05:44<00:00, 12.2MB/s]
2023-11-30 12:22:55 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231128 COMMUNITY | (46baca65) (release) (optimized) (system=avx2, binary=avx2)
[7fbcf691a700 >WARN<  operator() ./src/include/wand/utility/warnings.hpp:14] Generating emulated code for quantized (INT8) operations since no VNNI instructions were detected. Set NM_FAST_VNNI_EMULATION=1 to increase performance at the expense of accuracy.
created=datetime.datetime(2023, 11, 30, 12, 23, 50, 707904) prompts='Who is the president of the United States?' generations=[GeneratedText(text='The president of the United States is the person who is the most senior in the chain of command.\nThe chain of command is the set of people who are in charge of the different parts of the government.\nThe president is the most senior in the chain of command, so he is the 1st in the chain of command.\n#### 1', score=None, finished=True, finished_reason='stop')] input_tokens=None

Could you try following these instructions?


mneedham commented on June 10, 2024

Hey @dbogunowicz,

Sorry for the delayed reply - I only just saw your message now! The example that you provided works great, thanks!

In [8]: generations = pipeline(prompt="Who is the president of the United States?", streaming=True)

In [9]: %%time
   ...: for it in generations:
   ...:     print(it.generations[0].text, end=" ")
   ...:
<s> The president of the United States is the head of the executive branch of the government .
 The president is also the head of the government .
 The president is the head of the government and the head of the executive branch , so the president is also the head of the whole government .
 ####  1 </s> CPU times: user 48.1 s, sys: 17.5 ms, total: 48.1 s
Wall time: 8.19 s


dbogunowicz commented on June 10, 2024

Great to hear that @mneedham!

I will close this issue, as it is resolved. I hope that you will have fun working with NM products. If you happen to come across any problems, feel free to reach out to us!

