AutoArena

AutoArena is an open-source tool that automates head-to-head evaluations using LLM judges to rank GenAI systems. Quickly and accurately generate leaderboards comparing different LLMs, RAG setups, or prompt variations, and fine-tune custom judges to fit your needs.

Product Description

AutoArena is an open-source tool that automates head-to-head evaluations, using LLM judges to rank generative AI systems. It computes Elo scores and confidence intervals from the verdicts of multiple judge models, which produces rankings quickly and reduces the bias a single judge would introduce. Users can fine-tune judges for domain-specific tasks and set up automations in their code repository so that evaluation runs as part of the development workflow.
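To make the Elo and confidence-interval idea concrete, here is a minimal sketch in Python. It is not AutoArena's implementation: the function names, the K-factor of 32, the starting rating of 1000, and the bootstrap resampling used for the intervals are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def elo_ratings(matches, k=32.0, base=1000.0):
    """Sequential Elo over (model_a, model_b, winner) tuples.

    winner is "a", "b", or "tie". k and base are illustrative defaults.
    """
    ratings = defaultdict(lambda: base)
    for a, b, winner in matches:
        ra, rb = ratings[a], ratings[b]
        # Standard Elo expected score for model A against model B.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta  # Elo updates are zero-sum
    return dict(ratings)


def bootstrap_ci(matches, n_rounds=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval per model (an assumed method)."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_rounds):
        # Resample matches with replacement and recompute ratings.
        resampled = [rng.choice(matches) for _ in matches]
        for model, rating in elo_ratings(resampled).items():
            samples[model].append(rating)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        low = values[int((alpha / 2) * (len(values) - 1))]
        high = values[int((1 - alpha / 2) * (len(values) - 1))]
        intervals[model] = (low, high)
    return intervals
```

Verdicts from several judge models can be pooled into the same `matches` list, so a single judge's preferences carry less weight in the final ranking.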

Core Features

  • Automated head-to-head evaluations using LLM judges
  • Generating leaderboards for comparing LLMs, RAG setups, or prompt variations
  • Fine-tuning custom judges for specific needs
  • Parallelization, randomization, and other features to enhance evaluation efficiency
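The parallelization and randomization mentioned above can be sketched as follows. This is an illustration, not AutoArena's code: `judge` is a hypothetical callable that takes a prompt and two responses and returns "first", "second", or "tie", and the thread-pool approach is an assumption. Shuffling which response is shown first is a common way to counter a judge's position bias.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def judge_pair(judge, prompt, response_a, response_b, rng):
    """Show the two responses in random order, then map the verdict back."""
    swapped = rng.random() < 0.5
    if swapped:
        first, second = response_b, response_a
    else:
        first, second = response_a, response_b
    verdict = judge(prompt, first, second)  # hypothetical judge interface
    if verdict == "tie":
        return "tie"
    first_won = verdict == "first"
    # When the order was swapped, "first" refers to response B.
    return "b" if first_won == swapped else "a"


def run_head_to_heads(judge, examples, max_workers=8, seed=0):
    """examples: list of (prompt, response_a, response_b) tuples.

    Returns one verdict ("a", "b", or "tie") per example, in input order.
    """
    # Create one RNG per example up front so the shuffling does not
    # depend on thread scheduling.
    master = random.Random(seed)
    rngs = [random.Random(master.getrandbits(32)) for _ in examples]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(judge_pair, judge, prompt, a, b, rng)
            for (prompt, a, b), rng in zip(examples, rngs)
        ]
        return [future.result() for future in futures]
```

Threads suit this workload when each judge call is a network request to a model API, since the time is spent waiting rather than computing.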

Use Cases

  • Evaluate generative AI systems in CI environments
  • Set up automations that block prompt changes and updates that degrade output quality
  • Collaborate on evaluations in cloud or on-premise settings
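As one way to picture the CI use case, the sketch below turns head-to-head verdicts into a pass/fail exit code. Everything here is hypothetical: the function names, the verdict labels, and the 0.45 threshold are assumptions for illustration and are not part of AutoArena.

```python
def candidate_win_rate(verdicts):
    """verdicts: list of "candidate", "baseline", or "tie".

    A tie counts as half a win for the candidate.
    """
    if not verdicts:
        raise ValueError("no verdicts to evaluate")
    score = sum(
        1.0 if v == "candidate" else 0.5 if v == "tie" else 0.0
        for v in verdicts
    )
    return score / len(verdicts)


def ci_gate(verdicts, min_win_rate=0.45):
    """Return a process exit code: 0 passes the build, 1 fails it."""
    return 0 if candidate_win_rate(verdicts) >= min_win_rate else 1
```

A CI job could call `sys.exit(ci_gate(verdicts))` after judging the changed prompt against the current one, so a change that loses too often fails the build.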