Kilo Launches KiloBench, a Real-World Benchmarking Tool for AI Coding Agents
Kilo, the open‑source AI coding agent that already supports more than 500 models, today released KiloBench to help DevOps teams decide which model to use for a given task. The tool tracks real‑world success rates, token usage, and cost, giving teams granular data on how often a model succeeds on the first attempt, how many retries are needed, and how many tokens are consumed.
Unlike generic benchmarks such as SWE‑bench Verified, which evaluate models on a curated set of coding problems, KiloBench focuses on production‑style tasks. According to Kilo’s CEO Scott Breitenother, the framework “measures the impact that frontier AI models are having on actual production workflows versus relying on a set of benchmarks.” Breitenother added that KiloBench “provides deeper insights into workflows that make it possible to determine which AI model lends itself best to a specific harness in terms of how quickly tasks are completed and at what cost.”
Built on the Terminal‑Bench framework, KiloBench tracks 89 realistic command‑line tasks as they are completed using the Kilo harness. This approach gives developers detailed metrics on task completion, retries, and token consumption. Breitenother noted that “some models are able to successfully complete a task on their first try, while others require three attempts. A model that’s cheaper per attempt but needs five tries is more expensive than a model that costs more per attempt but consumes fewer tokens.” The data can be used to automatically route tasks based on policies set by a DevOps team.
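To make Breitenother's arithmetic concrete, here is a minimal Python sketch of how cost per completed task, rather than cost per attempt, could drive a routing policy. It is an illustration only, not KiloBench's actual API or data format: the `ModelStats` class, the `cheapest_model` function, the `max_attempts` policy, the model names, and the dollar figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    """Hypothetical per-model summary; not a KiloBench data structure."""
    name: str
    cost_per_attempt: float      # USD per attempt (made-up figure)
    attempts_per_success: float  # average attempts needed to finish a task

    @property
    def cost_per_completed_task(self) -> float:
        # What a team actually pays for one finished task.
        return self.cost_per_attempt * self.attempts_per_success


def cheapest_model(stats: list[ModelStats], max_attempts: float) -> ModelStats:
    """Pick the lowest cost-per-completed-task model within a retry policy."""
    eligible = [s for s in stats if s.attempts_per_success <= max_attempts]
    if not eligible:
        raise ValueError("no model satisfies the retry policy")
    return min(eligible, key=lambda s: s.cost_per_completed_task)


if __name__ == "__main__":
    # Illustrative numbers only: a cheap model that needs five tries
    # versus a pricier model that succeeds on the first attempt.
    models = [
        ModelStats("budget-model", cost_per_attempt=0.10, attempts_per_success=5),
        ModelStats("premium-model", cost_per_attempt=0.30, attempts_per_success=1),
    ]
    for m in models:
        print(f"{m.name}: ${m.cost_per_completed_task:.2f} per completed task")
    print("route to:", cheapest_model(models, max_attempts=5).name)
```

With these made-up figures, the budget model costs $0.50 per completed task and the premium model $0.30, so the policy routes to the model with the higher per-attempt price.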
Token cost is a growing concern for organizations that rely on AI for code generation. The Kilo team stresses that “the cost of the token used to build and deploy software in many instances has become prohibitive.” By providing detailed cost and performance metrics, KiloBench enables teams to apply FinOps practices to AI infrastructure usage. Breitenother warned that without such governance, “more organizations will discover that application developers are regularly running out of allotted tokens that will need to be replenished.”
Kilo’s open‑source agent, which runs locally or in the cloud, already supports a wide range of models, including proprietary ones from OpenAI and Anthropic as well as open‑source alternatives. The company argues that “too many organizations are making a strategic mistake by adopting harnesses that are too closely coupled with the provider of an AI model,” a position that reflects a broader industry trend toward model‑agnostic tooling.
While AI coding tools are still in the early stages of adoption, the release of KiloBench signals a shift toward more data‑driven model selection. According to reports, most organizations now use multiple AI tools to generate code, but few have defined governance policies or frameworks to optimize that usage. KiloBench aims to fill that gap by offering a benchmark tied directly to production‑style coding tasks rather than curated problem sets.
In the coming months, Kilo plans to expand KiloBench’s capabilities to support additional harnesses and to integrate with cloud cost‑optimization platforms. The company has not announced a specific rollout schedule, but the framework is already available to developers who use Kilo Code in VS Code, JetBrains, or the command line.
For now, KiloBench represents an early step toward a more systematic approach to AI model evaluation in software engineering. Its adoption will likely grow as organizations seek to control token costs and improve the reliability of AI‑assisted development.