
Comments (5)

mgoin commented on June 10, 2024

Hi @mneedham, LLMs for text generation like Llama only support running in a "text-generation" pipeline, so please use that task name instead of sentiment-analysis. You can also use a TextGeneration object directly; see the documentation here: https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md
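For reference, a rough sketch of the corrected usage (the SparseZoo stub below is just an example; substitute your own stub or local deployment directory):

from deepsparse import TextGeneration

# "text-generation" is the supported task for Llama-style models; the
# TextGeneration helper wraps Pipeline.create("text_generation", ...) for you.
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"  # or a local deployment directory
pipeline = TextGeneration(model=model_path)
output = pipeline(prompt="Who is the president of the United States?")
print(output.generations[0].text)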


mneedham commented on June 10, 2024

Ah ok. I tried it like this:

docker run -it -v $PWD/downloads:/tmp deepsparse:0.0.3

This is what's in the downloads/llama2 directory:

$ tree downloads/llama2
downloads/llama2
├── deployment
│   ├── config.json
│   ├── model.data
│   ├── model.onnx
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── deployment.tar.gz
└── downloads

And then I ran it:

from deepsparse import TextGeneration

zoo_stub = "/tmp/llama2/deployment"  
pipeline = TextGeneration(model=zoo_stub)
2023-11-25 17:49:18 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
[nm_ort ffff987fed40 >ERROR< init src/libdeepsparse/ort_engine/ort_engine.cpp:538] std exception  Node (concat.past_key_values.0.value_transposed) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=32 Target=128 Dimension=2

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 4
      1 from deepsparse import TextGeneration
      3 zoo_stub = "/tmp/llama2/deployment"
----> 4 pipeline = TextGeneration(model=zoo_stub)

File /usr/local/lib/python3.11/site-packages/deepsparse/pipeline.py:814, in text_generation_pipeline(model, *args, **kwargs)
    809 """
    810 :return: text generation pipeline with the given args and
    811     kwargs passed to Pipeline.create
    812 """
    813 kwargs = _parse_model_arg(model, **kwargs)
--> 814 return Pipeline.create("text_generation", *args, **kwargs)

File /usr/local/lib/python3.11/site-packages/deepsparse/base_pipeline.py:210, in BasePipeline.create(task, **kwargs)
    204     buckets = pipeline_constructor.create_pipeline_buckets(
    205         task=task,
    206         **kwargs,
    207     )
    208     return BucketingPipeline(pipelines=buckets)
--> 210 return pipeline_constructor(**kwargs)

File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/pipelines/text_generation.py:281, in TextGenerationPipeline.__init__(self, sequence_length, prompt_sequence_length, force_max_tokens, internal_kv_cache, generation_config, **kwargs)
    278 if not self.tokenizer.pad_token:
    279     self.tokenizer.pad_token = self.tokenizer.eos_token
--> 281 self.engine, self.multitoken_engine = self.initialize_engines()
    283 # auxiliary flag for devs to enable debug mode for the pipeline
    284 self._debug = False

File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/pipelines/text_generation.py:361, in TextGenerationPipeline.initialize_engines(self)
    346 if (
    347     self.cache_support_enabled and self.enable_multitoken_prefill
    348 ) or not self.cache_support_enabled:
   (...)
    353     #   (the prompt is processed in a single pass, prompts length is fixed at
    354     #   sequence_length)
    355     input_ids_length = (
    356         self.prompt_sequence_length
    357         if self.cache_support_enabled
    358         else self.sequence_length
    359     )
--> 361     multitoken_engine = NLDecoderEngine(
    362         onnx_file_path=self.onnx_file_path,
    363         engine_type=self.engine_type,
    364         engine_args=self.engine_args,
    365         engine_context=self.context,
    366         sequence_length=self.sequence_length,
    367         input_ids_length=input_ids_length,
    368         internal_kv_cache=self.internal_kv_cache,
    369         timer_manager=self.timer_manager,
    370     )
    372 if self.cache_support_enabled:
    373     engine = NLDecoderEngine(
    374         onnx_file_path=self.onnx_file_path,
    375         engine_type=self.engine_type,
   (...)
    381         timer_manager=self.timer_manager,
    382     )

File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/engines/nl_decoder_engine.py:82, in NLDecoderEngine.__init__(self, onnx_file_path, engine_type, engine_args, sequence_length, input_ids_length, engine_context, internal_kv_cache, timer_manager)
     78     if internal_kv_cache and engine_type == DEEPSPARSE_ENGINE:
     79         # inform the engine, that are using the kv cache
     80         engine_args["cached_outputs"] = output_indices_to_be_cached
---> 82 self.engine = create_engine(
     83     onnx_file_path=onnx_file_path,
     84     engine_type=engine_type,
     85     engine_args=engine_args,
     86     context=engine_context,
     87 )
     88 self.timer_manager = timer_manager or TimerManager()
     89 self.sequence_length = sequence_length

File /usr/local/lib/python3.11/site-packages/deepsparse/pipeline.py:759, in create_engine(onnx_file_path, engine_type, engine_args, context)
    754         return MultiModelEngine(
    755             model=onnx_file_path,
    756             **engine_args,
    757         )
    758     engine_args.pop("cache_output_bools", None)
--> 759     return Engine(onnx_file_path, **engine_args)
    761 if engine_type == ORT_ENGINE:
    762     return ORTEngine(onnx_file_path, **engine_args)

File /usr/local/lib/python3.11/site-packages/deepsparse/engine.py:327, in Engine.__init__(self, model, batch_size, num_cores, num_streams, scheduler, input_shapes, cached_outputs)
    317         self._eng_net = LIB.deepsparse_engine(
    318             model_path,
    319             engine_batch_size,
   (...)
    324             cached_outputs,
    325         )
    326 else:
--> 327     self._eng_net = LIB.deepsparse_engine(
    328         self._model_path,
    329         engine_batch_size,
    330         self._num_cores,
    331         self._num_streams,
    332         self._scheduler.value,
    333         None,
    334         cached_outputs,
    335     )
    337 if self._batch_size is None:
    338     os.environ.pop("NM_DISABLE_BATCH_OVERRIDE", None)

RuntimeError: NM: error: Node (concat.past_key_values.0.value_transposed) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=32 Target=128 Dimension=2


dbogunowicz commented on June 10, 2024

Hey @mneedham

I'm having difficulty reproducing your error, so let's start from a minimal working example. I am afraid that once the llama2 model was downloaded to your disk, you may have unintentionally modified it while running the previous, incorrect commands.

Setup

Spin up your Docker container as you did before; as far as container initialization goes, I think you are doing everything correctly. Make sure that ROOT/.cache/sparsezoo/ is empty, so there is no lingering, potentially corrupted, llama2 model in your cache.
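As a rough sketch (assuming the cache lives in the default location under your home directory), you could inspect and clear it from Python like this:

import shutil
from pathlib import Path

# Assumed default SparseZoo cache location; adjust the path if yours differs
cache_dir = Path.home() / ".cache" / "sparsezoo"

if cache_dir.exists():
    print("Cache contents:", [p.name for p in cache_dir.iterdir()])
    shutil.rmtree(cache_dir)  # drop any lingering, possibly corrupted, downloads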

For completeness, my setup is:
ubuntu-20.04
deepsparse-nightly (fresh pip install -U deepsparse-nightly[llm])
python 3.10

Run minimal example

Now enter your docker container and execute:

from deepsparse import TextGeneration
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"
pipeline = TextGeneration(model=model_path)
generations = pipeline(prompt="Who is the president of the United States?")
print(generations)

You should see output similar to this:

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Downloading (…)ed/deployment.tar.gz: 100%|██████████| 3.92G/3.92G [05:44<00:00, 12.2MB/s]
2023-11-30 12:22:55 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231128 COMMUNITY | (46baca65) (release) (optimized) (system=avx2, binary=avx2)
[7fbcf691a700 >WARN<  operator() ./src/include/wand/utility/warnings.hpp:14] Generating emulated code for quantized (INT8) operations since no VNNI instructions were detected. Set NM_FAST_VNNI_EMULATION=1 to increase performance at the expense of accuracy.
created=datetime.datetime(2023, 11, 30, 12, 23, 50, 707904) prompts='Who is the president of the United States?' generations=[GeneratedText(text='The president of the United States is the person who is the most senior in the chain of command.\nThe chain of command is the set of people who are in charge of the different parts of the government.\nThe president is the most senior in the chain of command, so he is the 1st in the chain of command.\n#### 1', score=None, finished=True, finished_reason='stop')] input_tokens=None

Could you try following these instructions?


mneedham commented on June 10, 2024

Hey @dbogunowicz,

Sorry for the delayed reply - I only just saw your message now! The example that you provided works great, thanks!

In [8]: generations = pipeline(prompt="Who is the president of the United States?", streaming=True)

In [9]: %%time
   ...: for it in generations:
   ...:     print(it.generations[0].text, end=" ")
   ...:
<s> The president of the United States is the head of the executive branch of the government .
 The president is also the head of the government .
 The president is the head of the government and the head of the executive branch , so the president is also the head of the whole government .
 ####  1 </s> CPU times: user 48.1 s, sys: 17.5 ms, total: 48.1 s
Wall time: 8.19 s


dbogunowicz commented on June 10, 2024

Great to hear that @mneedham!

I will close this issue, as it is resolved. I hope that you will have fun working with NM products. If you happen to come across any problems, feel free to reach out to us!

