Large Language Models (LLMs) are immensely powerful and can help solve a variety of NLP tasks such as question answering, summarization, and entity extraction. As generative AI use cases continue to expand, real-world applications often need to solve several of these tasks at once. For instance, if you have a chatbot for users to interact with, a common request is to summarize the conversation with that chatbot. This is useful in many settings, such as doctor-patient transcripts, virtual phone calls and appointments, and more.
How can we build something that solves these types of problems? One option is to use multiple LLMs: one for question answering and another for summarization. Alternatively, we could take a single LLM and fine-tune it across the different domains, but we will focus on the multi-model approach for this use case. Hosting multiple LLMs, however, comes with certain challenges that must be addressed.
Hosting even a single model is computationally expensive and requires large GPU instances. With multiple LLMs, each one needs a persistent endpoint and its own hardware. This also means the overhead of managing multiple endpoints and paying for the infrastructure to serve each of them.
SageMaker Inference Components address this issue. Inference Components let you host multiple models on a single endpoint. Each model gets its own dedicated container, and you can allocate hardware and scale on a per-model basis. This lets us serve both models behind a single endpoint while optimizing cost and performance.
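To make the per-model allocation concrete, here is a minimal sketch of the configuration each Inference Component carries. The endpoint, component, and model names (`chatbot-endpoint`, `qa-component`, etc.) and the resource numbers are illustrative placeholders, not values from this article; the dict shape follows the boto3 `create_inference_component` API, with the actual AWS call left commented out since it requires live credentials and an existing endpoint.

```python
# Sketch (illustrative names): per-model compute allocation on one
# shared SageMaker endpoint via Inference Components.

def build_inference_component(name, endpoint_name, model_name,
                              accelerators, memory_mb, copies=1):
    """Build the kwargs for sagemaker.create_inference_component.

    Each component names the shared endpoint but carries its own
    ComputeResourceRequirements, which is how hardware is allocated
    and scaled per model rather than per endpoint.
    """
    return {
        "InferenceComponentName": name,
        "EndpointName": endpoint_name,
        "VariantName": "AllTraffic",
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators,
                "MinMemoryRequiredInMb": memory_mb,
            },
        },
        # CopyCount controls how many copies of this model are served,
        # enabling per-model scaling.
        "RuntimeConfig": {"CopyCount": copies},
    }

# Two models, one endpoint: a Q&A component and a summarization component.
qa_ic = build_inference_component(
    "qa-component", "chatbot-endpoint", "qa-llm-model",
    accelerators=1, memory_mb=8192)
summarize_ic = build_inference_component(
    "summarize-component", "chatbot-endpoint", "summarization-llm-model",
    accelerators=1, memory_mb=8192)

# Uncomment to create the components (requires AWS credentials and an
# existing endpoint plus registered models):
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_inference_component(**qa_ic)
# sm.create_inference_component(**summarize_ic)
```

Note that both components point at the same `EndpointName`, which is the whole point: one endpoint, two independently resourced models.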
For today’s article, we’ll take a look at how we can build a multi-purpose generative AI powered chatbot that supports both question answering and summarization. Let’s take a quick look at some of the tools we will use here:
- SageMaker Inference Components: For hosting our models, we will use SageMaker Real-Time Inference. Within Real-Time Inference, we will use the Inference Components feature to host multiple models while allocating hardware to each one. If you are new to Inference Components…
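Once the models are hosted this way, a request is routed to a specific model by naming its component at invocation time. A minimal sketch of that routing follows; the endpoint and component names are the same illustrative placeholders as above, and the request shape mirrors the SageMaker Runtime `invoke_endpoint` call, which accepts an `InferenceComponentName` field to select the target model. The live call is left commented out since it needs deployed components.

```python
# Sketch (illustrative names): routing a request to one model behind the
# shared endpoint by naming its Inference Component.
import json

def build_invoke_request(endpoint_name, component_name, prompt):
    """Build kwargs for sagemaker-runtime invoke_endpoint.

    The same endpoint serves both models; InferenceComponentName picks
    which one handles this request.
    """
    return {
        "EndpointName": endpoint_name,
        "InferenceComponentName": component_name,
        "ContentType": "application/json",
        "Body": json.dumps({"inputs": prompt}),
    }

# A summarization request targets the summarization component; a Q&A
# request would use the same endpoint with "qa-component" instead.
req = build_invoke_request(
    "chatbot-endpoint", "summarize-component",
    "Summarize the following conversation: ...")

# Uncomment to send (requires AWS credentials and deployed components):
# import boto3
# smr = boto3.client("sagemaker-runtime")
# resp = smr.invoke_endpoint(**req)
# print(resp["Body"].read().decode())
```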