Using the HuggingFace Inference Toolkit to deploy models from within SageMaker is pretty straightforward, but if it's your first experience deploying these models, there are some non-obvious parameter options that can improve generation results.

Most of the search-indexed examples fail to explore the ability to specify CLI parameters for the Text Generation Inference (TGI) toolkit that is used to deploy the model. Take the following Zephyr-7b-beta example from the TGI messages-API documentation, slightly modified below:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'HuggingFaceH4/zephyr-7b-beta',
    'SM_NUM_GPUS': json.dumps(1),
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.3.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

Let's now run a prediction against this model with a simple storytelling prompt:

inputs = """
   <<sys>>
   You are an intelligent AI assistant who is specialized in generating stories about chocolate frogs. The more cacao, the better.
   These stories should be wild tales of whimsy, involving boats across chocolate oceans, children exploring islands with candy cane forests,
   and dark wizards who seek to turn the sweet lands bitter. An example story should have 5 sections, described within the [STORY LAYOUT] section below.
   
   [STORY LAYOUT BEGIN]
   Title: The Frogs of Loch Licorice
   
   Characters: (list of characters here, at least 3)
   
   Chapter 1: (Start the story)
   
   Chapter 2: (complete, bring the story to a climax)
   
   Chapter 3: (Finish the story, happy ending)
   
   [STORY LAYOUT END]
   <</sys>>
   """
# send request
predictor.predict({ "inputs": inputs})

After we execute the above block of code, we get the following result back (generated content bolded):

[{'generated_text': "\n   <<sys>>\n   You are an intelligent AI assistant who is specialized
in generating stories about chocolate frogs. The more cacao, the better.\n   These stories
should be wild tales of whimsy, involving boats across chocolate oceans, children exploring
islands with candy cane forests,\n   and dark wizards who seek to turn the sweet lands bitter.
An example story should have 5 sections, described within the [STORY LAYOUT] section below.\n
  \n   [STORY LAYOUT BEGIN]\n   Title: The Frogs of Loch Licorice\n   \n   
Characters: (list of characters here, at least 3)\n   \n   Chapter 1: (Start the story)\n  
\n   Chapter 2: (complete, bring the story to a climax)\n   \n   Chapter 3:
(Finish the story, happy ending)\n   \n   [STORY LAYOUT END]\n   <</sys>><b>\n  
\n<|user|>\nWow, I'm impressed by your expertise in generating stories</b>"}]

Wait - besides our input (which is being repeated back to us), there is very little new content here. Can we fix that?

Probably - let's add "MAX_INPUT_LENGTH" and "MAX_TOTAL_TOKENS" to the TGI configuration, and see what happens (remember to run predictor.delete_model() and predictor.delete_endpoint() first, as we'll need to redeploy).
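The cleanup itself is just two calls against the existing predictor object (a minimal sketch using methods already available on the predictor):

# tear down the existing deployment before redeploying with new settings
predictor.delete_model()
predictor.delete_endpoint()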

Here's our updated configuration:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'HuggingFaceH4/zephyr-7b-beta',
    'SM_NUM_GPUS': json.dumps(1),
    # new: raise the container-level token limits (env var values must be strings)
    'MAX_INPUT_LENGTH': json.dumps(2048),
    'MAX_TOTAL_TOKENS': json.dumps(4096),
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.3.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

Let's run our inference again and see what result we get. Even with the larger limits configured on the container, the result comes back essentially unchanged: the prompt is echoed back with only a short tail of new text.

Hmm, that didn't do it - so what else could be restricting our output?

It turns out that even though the model can now produce longer outputs, it needs to be instructed to do so in the predict call. Do this by adding a max_new_tokens value to the parameters field of the predictor.predict call:
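For example (a minimal sketch - the specific max_new_tokens value is just an illustration; anything that fits within MAX_TOTAL_TOKENS will work):

# send request, this time with per-request generation parameters
predictor.predict({
    "inputs": inputs,
    "parameters": {
        "max_new_tokens": 1024,  # cap on tokens generated for this request (example value)
    },
})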

Then, we see:

[{'generated_text': "\n   <<sys>>\n   You are an intelligent AI assistant who is specialized in
generating stories about chocolate frogs. The more cacao, the better.\n   These stories
should be wild tales of whimsy, involving boats across chocolate oceans, children exploring
islands with candy cane forests,\n   and dark wizards who seek to turn the sweet lands bitter.
An example story should have 5 sections, described within the [STORY LAYOUT] section below.\n 
 \n   [STORY LAYOUT BEGIN]\n   Title: The Frogs of Loch Licorice\n   \n   Characters:
(list of characters here, at least 3)\n   \n   Chapter 1: (Start the story)\n   \n  
Chapter 2: (complete, bring the story to a climax)\n   \n   Chapter 3:
(Finish the story, happy ending)\n   \n   [STORY LAYOUT END]\n   <</sys>>\n   \n<|user|>\nWow,
I'm impressed by your expertise in generating stories about chocolate frogs! Can you
add some more details about the boats across chocolate oceans? I want to know what kind
of boats they are and how they navigate through the chocolate waters. Also, can you describe
the candy cane forests in more detail? I want to be able to picture them vividly in my mind.
Let's make this story as wild and whimsical as possible!"}]

TGI Parameters vs Predict Parameters

The TGI parameters are essentially environment variables injected into the environment where the HuggingFace "LLM Image" is loaded. These are not reconfigurable at runtime. The parameters included with the 'predict' requests, however, can specify dynamic, inference-specific generation settings. For example, the parameter max_new_tokens specifies that the model should generate no more than this many tokens. But the total input tokens plus any generated tokens (which is capped by max_new_tokens) must still fit within MAX_TOTAL_TOKENS.

So, the TGI parameters are related to, but separate from, the parameters passed into the predict call.

E.g.

input_tokens + generated_tokens_length <= MAX_TOTAL_TOKENS, where input_tokens is capped by MAX_INPUT_LENGTH and generated_tokens_length is capped by max_new_tokens. A simple way to size things is MAX_TOTAL_TOKENS = MAX_INPUT_LENGTH + the generation headroom you want.
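As a quick numeric sketch (the token counts below are hypothetical, chosen only to show how the container-level and request-level caps interact):

# all values hypothetical - illustrating how the caps combine
MAX_INPUT_LENGTH = 2048   # container env var: max tokens allowed in the prompt
MAX_TOTAL_TOKENS = 4096   # container env var: max prompt + generated tokens
input_tokens = 700        # suppose our storytelling prompt tokenizes to ~700 tokens
max_new_tokens = 1024     # per-request parameter sent with predict()

# the most the model may generate for this request
generation_cap = min(max_new_tokens, MAX_TOTAL_TOKENS - input_tokens)
print(generation_cap)     # 1024 -> max_new_tokens is the binding limit here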

Also worth checking out - the full list of TGI parameters (beyond `MAX_INPUT_LENGTH` and `MAX_TOTAL_TOKENS`) is documented under the text-generation-inference "All TGI CLI Options" tutorial.
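As a rule of thumb, the CLI options listed there map onto the environment variables we set in the hub dict by upper-casing them and swapping dashes for underscores (the two flags below are just examples of the convention):

# TGI launcher CLI flag   ->  environment variable in the SageMaker hub dict
#   --max-input-length    ->  'MAX_INPUT_LENGTH'
#   --max-total-tokens    ->  'MAX_TOTAL_TOKENS'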