Skip to main content

Build an automated large language model evaluation pipeline on AWS

Large Language Models (LLMs) have gained significant attention as the key tools for understanding, generating and manipulating text with unprecedented proficiency. Their potential applications span from conversational agents to content generation and information retrieval. However, maximizing LLM capabilities, while ensuring responsible and effective use of these models hinges on the critical process of LLM evaluation. Join us as we dive into the solution framework and demonstrate how you can efficiently evaluate different LLMs and prompt templates by temporarily launching endpoints and running test sets. We show how the evaluation process is automated by converting LLM evaluation into a classification problem, where a test LLM assesses the output of the first LLM, similar to human evaluators, thus saving significant costs and resources during the evaluation stage. Download slide »

Melanie Li, PhD, Senior AI/ML Specialist Technical Account Manager, AWS
Sam Edwards, Cloud Support Engineer, AWS
Rafa Xu, Senior Cloud Architect, AWS Professional Services