Using the HuggingFace Inference Toolkit to deploy models from within SageMaker is pretty straightforward, but if it's your first experience deploying these models, there are some non-obvious parameter options that can improve generation results.
Most of the search-indexed examples fail to explore the ability to specify CLI parameters for the Text Generation Inference (TGI) toolkit that is used to deploy the model. Take the following Zephyr-7b-beta example from the TGI messages-API documentation, slightly modified below:
```python
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'HuggingFaceH4/zephyr-7b-beta',
    'SM_NUM_GPUS': json.dumps(1),
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.3.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)
```
Let's now run a prediction against this model with a simple storytelling prompt:
inputs = """ <<sys>> You are an intelligent AI assistant who is specialized in generating stories about chocolate frogs. The more cacao, the better. These stories should be wild tales of whimsy, involving boats across chocolate oceans, children exploring islands with candy cane forests, and dark wizards who seek to turn the sweet lands bitter. An example story should have 5 sections, described within the [STORY LAYOUT] section below. [STORY LAYOUT BEGIN] Title: The Frogs of Loch Licorice Characters: (list of characters here, at least 3) Chapter 1: (Start the story) Chapter 2: (complete, bring the story to a climax) Chapter 3: (Finish the story, happy ending) [STORY LAYOUT END] <</sys>> """ # send request predictor.predict({ "inputs": inputs})
After we execute the above block of code, we get the following result back (the newly generated content is only the short snippet after the closing `<</sys>>` tag):
```
[{'generated_text': "\n <<sys>>\n You are an intelligent AI assistant who is specialized in generating stories about chocolate frogs. The more cacao, the better.\n These stories should be wild tales of whimsy, involving boats across chocolate oceans, children exploring islands with candy cane forests,\n and dark wizards who seek to turn the sweet lands bitter. An example story should have 5 sections, described within the [STORY LAYOUT] section below.\n \n [STORY LAYOUT BEGIN]\n Title: The Frogs of Loch Licorice\n \n Characters: (list of characters here, at least 3)\n \n Chapter 1: (Start the story)\n \n Chapter 2: (complete, bring the story to a climax)\n \n Chapter 3: (Finish the story, happy ending)\n \n [STORY LAYOUT END]\n <</sys>>\n \n<|user|>\nWow, I'm impressed by your expertise in generating stories"}]
```
Wait - besides our input (which is being repeated back to us), there is very little new content here. Can we fix that?
Probably - let's add `MAX_INPUT_LENGTH` and `MAX_TOTAL_TOKENS` to the TGI configuration and see what happens (remember to run `predictor.delete_model()` and `predictor.delete_endpoint()` first, as we'll need to redeploy).
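For reference, that cleanup is just those two calls against the existing `predictor` object:

```python
# tear down the current endpoint and model so the new configuration can be redeployed
predictor.delete_model()
predictor.delete_endpoint()
```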
Here's our updated configuration:
```python
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'HuggingFaceH4/zephyr-7b-beta',
    'SM_NUM_GPUS': json.dumps(1),
    'MAX_INPUT_LENGTH': json.dumps(2048),  # new: maximum number of input tokens (env values must be strings)
    'MAX_TOTAL_TOKENS': json.dumps(4096),  # new: maximum of input tokens plus generated tokens
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.3.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)
```
Let's run our inference again and see what result we get. You might expect the result to now give you quite a bit more data - e.g., the full completion. Hmm, that didn't do it: the output comes back essentially unchanged. So what else could be restricting our output?
It turns out that even though the model can now produce longer outputs, it needs to be instructed to do so in the `predict` call. Do this by adding a `max_new_tokens` value to the `parameters` field of the `predictor.predict` call.
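For example, something like the following should work (the specific `max_new_tokens` value of 512 is an arbitrary illustrative choice; pick a cap that suits your use case):

```python
# send request, this time explicitly allowing the model to generate more tokens
predictor.predict({
    "inputs": inputs,
    "parameters": {
        "max_new_tokens": 512,
    },
})
```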
Then, we see:
```
[{'generated_text': "\n <<sys>>\n You are an intelligent AI assistant who is specialized in generating stories about chocolate frogs. The more cacao, the better.\n These stories should be wild tales of whimsy, involving boats across chocolate oceans, children exploring islands with candy cane forests,\n and dark wizards who seek to turn the sweet lands bitter. An example story should have 5 sections, described within the [STORY LAYOUT] section below.\n \n [STORY LAYOUT BEGIN]\n Title: The Frogs of Loch Licorice\n \n Characters: (list of characters here, at least 3)\n \n Chapter 1: (Start the story)\n \n Chapter 2: (complete, bring the story to a climax)\n \n Chapter 3: (Finish the story, happy ending)\n \n [STORY LAYOUT END]\n <</sys>>\n \n<|user|>\nWow, I'm impressed by your expertise in generating stories about chocolate frogs! Can you add some more details about the boats across chocolate oceans? I want to know what kind of boats they are and how they navigate through the chocolate waters. Also, can you describe the candy cane forests in more detail? I want to be able to picture them vividly in my mind. Let's make this story as wild and whimsical as possible!"}]
```
TGI Parameters vs Predict Parameters
The TGI parameters are essentially environment variables injected into the environment where the HuggingFace "LLM Image" is loaded. These are not reconfigurable at runtime. The parameters included with the `predict` requests, however, can specify dynamic, inference-specific generation settings. For example, the parameter `max_new_tokens` specifies that the model should generate no more than this many tokens. But the total input tokens plus any generated tokens (which are capped by `max_new_tokens`) must still fit within `MAX_TOTAL_TOKENS`. So, the TGI parameters are related to, but separate from, the parameters passed into the `predict` call.
E.g. `MAX_TOTAL_TOKENS = MAX_INPUT_LENGTH + generated_tokens_length`, where `generated_tokens_length` is capped by `max_new_tokens`.
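As a rough worked example using the values from the deployment above (the prompt-length number here is purely hypothetical):

```python
MAX_INPUT_LENGTH = 2048   # TGI: longest prompt, in tokens, the server will accept
MAX_TOTAL_TOKENS = 4096   # TGI: prompt tokens plus generated tokens must fit within this budget

prompt_tokens = 300       # hypothetical tokenized length of the story prompt
max_new_tokens = 512      # per-request generation cap passed in the predict call

# the request is valid as long as both limits hold
assert prompt_tokens <= MAX_INPUT_LENGTH
assert prompt_tokens + max_new_tokens <= MAX_TOTAL_TOKENS
```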
Also worth checking out - the full list of TGI parameters (beyond `MAX_INPUT_LENGTH` and `MAX_TOTAL_TOKENS`) is documented under the text-generation-inference "All TGI CLI Options" tutorial.
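As one illustration (the specific options and values below are just examples pulled from that list, not recommendations), other launcher options can be supplied through the same `env` dict used above:

```python
hub = {
    'HF_MODEL_ID': 'HuggingFaceH4/zephyr-7b-beta',
    'SM_NUM_GPUS': json.dumps(1),
    'MAX_INPUT_LENGTH': json.dumps(2048),
    'MAX_TOTAL_TOKENS': json.dumps(4096),
    # additional TGI launcher options, set the same way:
    'MAX_CONCURRENT_REQUESTS': json.dumps(64),     # how many requests TGI will accept concurrently
    'MAX_BATCH_PREFILL_TOKENS': json.dumps(4096),  # cap on tokens processed in a single prefill batch
}
```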