The rapid advancement of LLMs has led to widespread adoption across various domains, but it has also raised concerns about data security and privacy, particularly with publicly available and commercially operated platforms. Given the high computational demands of these models, cloud environments are the natural choice for deployment. As a result, organizations are increasingly deploying LLMs in confined cloud environments to protect sensitive data while leveraging scalable cloud resources. However, deploying LLMs in cloud environments remains a complex and time-consuming process that requires specialized skills and expertise in areas such as infrastructure management, resource allocation, and model setup. Testing and comparing LLMs to select the appropriate one is particularly challenging, as different models are trained for different purposes, making direct comparison nontrivial. Furthermore, differences in model architectures, training data, and fine-tuning strategies make objective evaluation difficult, limiting the effectiveness of traditional benchmarking approaches. To address these challenges, we present a cloud-native system that automates both the deployment and evaluation of LLMs. Our contributions are twofold: (i) we automate the provisioning and deployment of LLMs on various cloud platforms to streamline infrastructure setup, and (ii) we develop a lightweight evaluation framework that leverages the LLM-as-a-Judge approach, in which an independent LLM systematically assesses and compares different models against predefined evaluation criteria. Our ongoing work aims to optimize LLM deployment by selecting cost-efficient cloud resources. We are also enhancing the evaluation framework with diverse prompts, broader metrics, and cross-model validation for fair, reproducible benchmarking.
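To make the LLM-as-a-Judge idea concrete, the following Python sketch shows one possible way an independent judge model could score two candidate answers against a predefined rubric. It is only an illustrative sketch, not the system described in the paper: the OpenAI-compatible endpoint, the model name "judge-llm", the rubric wording, and the example question and answers are all assumptions introduced here.

```python
# Minimal LLM-as-a-Judge sketch (illustrative only).
# Assumes an OpenAI-compatible chat endpoint (e.g. a self-hosted inference server);
# the endpoint URL, judge model name, rubric, and example inputs are hypothetical.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

RUBRIC = (
    "Score each answer from 1 to 5 for correctness, relevance, and clarity. "
    'Return JSON: {"scores": {"model_a": int, "model_b": int}, "rationale": str}'
)

def judge(question: str, answer_a: str, answer_b: str, judge_model: str = "judge-llm") -> dict:
    """Ask an independent judge model to compare two candidate answers against the rubric."""
    prompt = (
        f"Question:\n{question}\n\n"
        f"Answer from model_a:\n{answer_a}\n\n"
        f"Answer from model_b:\n{answer_b}\n\n"
        f"{RUBRIC}"
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring to aid reproducibility
    )
    # Assumes the judge returns valid JSON; a real framework would validate this.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    verdict = judge(
        "Explain the difference between supervised and unsupervised learning.",
        "Supervised learning uses labeled data; unsupervised learning finds structure in unlabeled data.",
        "Both use labels, but unsupervised learning uses fewer of them.",
    )
    print(verdict)
```

In such a setup, repeating the comparison over a pool of prompts and aggregating the per-criterion scores would yield the kind of systematic, criteria-based ranking the abstract describes.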
Conference paper, 2025, pp. 448–450