Comments (9)
I transferred the issue to our api-inference-community repo. I am not sure it will ever get fixed, as the focus is more on TGI-served models at the moment. Still pinging @Narsil just in case: would it be possible to retrieve the maximum input length when sending a sequence to a transformers.pipeline-powered model?
I saw that it's implemented here (private repo) but don't know how this value could be made easily accessible.
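For context, when running transformers locally the value can usually be read from the checkpoint itself; a minimal sketch (assuming the checkpoint ships a meaningful model_max_length on its tokenizer and a max_position_embeddings field in its config, neither of which is guaranteed for every model):

from transformers import AutoConfig, AutoTokenizer

model_id = "microsoft/biogpt"  # the model discussed later in this thread

# The tokenizer often records the maximum input length the model accepts.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(tokenizer.model_max_length)

# The config's max_position_embeddings is another common source for the limit.
config = AutoConfig.from_pretrained(model_id)
print(getattr(config, "max_position_embeddings", None))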
What model are you using? If you use a model powered by TGI or a TGI endpoint directly, this is the error you should get:
# for "bigcode/starcoder"
huggingface_hub.inference._text_generation.ValidationError: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 37500 `inputs` tokens and 20 `max_new_tokens`
(In any case, I think this should be handled server-side as the client cannot know this information before-hand.)
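For completeness, this is roughly what catching that error client-side looks like today; a sketch against the huggingface_hub release from the traceback above, noting that the ValidationError import path is private and may move between versions, and that the limits only live in the message string:

from huggingface_hub import InferenceClient
from huggingface_hub.inference._text_generation import ValidationError

client = InferenceClient()
too_long = "word " * 50_000  # deliberately exceeds the model's 8192-token window

try:
    client.text_generation(too_long, model="bigcode/starcoder", max_new_tokens=20)
except ValidationError as exc:
    # e.g. "Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. ..."
    print(exc)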
The model I was using is https://huggingface.co/microsoft/biogpt with text_generation. It's supported per InferenceClient list_deployed_models, but it's not powered by TGI, just transformers.
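For reference, that check looks roughly like this (a sketch; in the huggingface_hub version current at the time, list_deployed_models returns a mapping from task name to the ids of deployed models):

from huggingface_hub import InferenceClient

client = InferenceClient()
deployed = client.list_deployed_models()  # dict: task name -> list of model ids
print("microsoft/biogpt" in deployed.get("text-generation", []))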
I think this request is to add more info to that failure message, so one can better understand next steps. I am not sure where that failure message is generated; if you could link me to a line in a Hugging Face repo, I just want to use some f-strings in its message.
Thanks @Wauplin for putting this in the right place!
No, there isn't; truncating is the only simple option.
The reason is that the API for non-TGI models uses the pipeline, which raises an exception that would need to be parsed (and that parsing would break every time the message gets updated, which, while it shouldn't happen that often, is still very much possible).
Using truncation should solve most issues for most users, so I don't think we should change anything here.
@jamesbraza Any reason truncation doesn't work for you?
Thanks for the response @Narsil! Appreciate the details provided too.
which raises an exception that would need to be parsed (and that parsing would break every time the message gets updated, which, while it shouldn't happen that often, is still very much possible).
So there are a few techniques for parsing exceptions, one of which is parsing the error string, which I agree is brittle to change. A more maintainable route involves custom Exception subclasses:
class InputTooLongError(ValueError):
    def __init__(self, msg: str, actual_length: int, llm_limit: int) -> None:
        super().__init__(msg)
        self.actual_length = actual_length  # tokens actually sent
        self.llm_limit = llm_limit  # the model's maximum input length

# Illustrative placeholders for values the server already knows:
prompt_tokens = [0] * 37_500  # the tokenized input
llm_limit = 8_192             # the model's input limit

try:
    raise InputTooLongError(
        "Input is too long for this model (removed rest for brevity)",
        actual_length=len(prompt_tokens),
        llm_limit=llm_limit,
    )
except InputTooLongError as exc:
    # A BadRequestError raised here can carry exc.actual_length and
    # exc.llm_limit as metadata, independent of the message wording.
    pass
This approach is readable and independent of the message strings.
Any reason truncation doesn't work for you?
Fwiw, this request is about just giving more information in the error being faced. I think it would be good if the response included the actual tokens used and the LLM's token limit; then one can decide whether to use truncation. I see truncation as a downstream choice made after seeing a good error message, so whether or not to use truncation is tangential to this issue.
To answer your question, truncation seems like a bad idea to me, because it can lead to unexpected failures caused by truncating away important parts of the prompt. I use RAG, so prompts can become quite big. The actual prompt comes last, which could be truncated away.
The actual prompt comes last, which could be truncated away
It's always left-truncated for generative models, so only the initial part of a prompt is lost.
If you're using RAG with long prompts and really want to know what gets used, then just get a tokenizer to count your prompt lengths, no?
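For what it's worth, both the counting and the left-sided truncation only need the tokenizer files, which are small to download; a minimal sketch, relying on the checkpoint's advertised model_max_length being accurate:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
limit = tokenizer.model_max_length  # may be a placeholder value on some checkpoints

prompt = "retrieved context ... final question"  # stand-in for a long RAG prompt
input_ids = tokenizer(prompt)["input_ids"]
print(f"{len(input_ids)} tokens vs. limit of {limit}")

if len(input_ids) > limit:
    # Left-truncate so the final instruction survives, mirroring the API's behaviour.
    tokenizer.truncation_side = "left"
    kept = tokenizer(prompt, truncation=True, max_length=limit)["input_ids"]
    prompt = tokenizer.decode(kept, skip_special_tokens=True)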
Hello @Narsil, thanks for the response! Good to know that truncation is usually left-sided.
then just get a tokenizer to count your prompt lengths, no?
Yeah, this is a workaround, though it requires one to:
- Know the max input size of the Hugging Face model in advance
- Pick a tokenizer model and instantiate it locally (which may not be possible, depending on memory)
However, Hugging Face already knows the input size and the max allowable size, because it states "Input is too long for this model". So I think it's much easier for Hugging Face 'server side' to just include this information in the thrown Exception, either within the message or as an attribute.
Knowing the length in tokens is not really useful if you don't know how to modify the original prompt in order to modify those tokens, right?
Pick a tokenizer model and instantiate it locally (which may not be possible, depending on memory)
What kind of hardware are we talking about here? Tokenizers are extremely tiny.