In spending time with developers building AI applications, one word I keep hearing time and time again is “evals”. Evals are structured tests that check how well an AI system performs across scenarios, measuring quality, reliability, and accuracy. The early analog I held in my mind was unit testing. While that is a helpful frame, stopping there would miss the stochastic, non-deterministic nature of an AI application.

Traditional software has a predefined set of actions a user can take and a finite, bounded output space. AI applications, on the other hand, take in natural language (which can take a seemingly infinite range of forms) and return emergent, open-ended outputs.

This demands a new frame: evals ensure the system doesn’t fail in production (playing defense / minimizing downside), while also acting as a mechanism to define “what good looks like” for an AI system and to iterate against that definition.

More tangibly, an eval boils down to three core components, with a rough sketch in code after the list:

  • Task: the AI feature or functionality you want to test; it can range from a single prompt to an agentic workflow
  • Dataset: the test cases the task runs over; you want these to be reflective of real production use cases! The best teams bring an engineering mindset to dataset curation
  • Scoring Functions: evaluate whether the model’s responses are good

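To make those three components concrete, here is a minimal sketch of an eval harness in Python. Everything in it is illustrative rather than any particular library’s API: the TestCase shape, the stubbed run_task, and the crude substring scorer are all assumptions. In practice the task would call your real application, and the scoring function is often richer (heuristics, LLM-as-judge, human review).

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    input: str      # what a real user might send
    expected: str   # what a "good" answer should contain

# Dataset: ideally curated from real production traces, not synthetic guesses.
DATASET = [
    TestCase("What's your refund policy?", "30 days"),
    TestCase("Can I cancel my subscription online?", "cancel"),
]

def run_task(user_input: str) -> str:
    """Task under test: a single prompt, a RAG pipeline, or a full agent.
    Stubbed with a canned reply here; in practice this calls your model/app."""
    return "You can request a refund within 30 days of purchase."

def score(output: str, case: TestCase) -> float:
    """Scoring function: a crude substring check, purely for illustration."""
    return 1.0 if case.expected.lower() in output.lower() else 0.0

def run_eval() -> float:
    """Run every test case through the task and return the mean score."""
    results = [score(run_task(case.input), case) for case in DATASET]
    return sum(results) / len(results)

if __name__ == "__main__":
    print(f"Mean score: {run_eval():.2f}")
```

Even with this toy dataset, run_eval collapses the whole suite into a single number you can track across prompt changes or model swaps.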
What excites me is how this emerging discipline of AI evaluation enables product teams to be more customer-centric. Customer conversations and domain experts can inform both the dataset (the test cases that reflect user behavior) and the scoring functions (what does success look like?). Teams can iterate along a CI/CE loop, constantly evaluating their applications and letting traces from real usage feed back into these systems.

With a robust evaluation system, teams should be able to update their product within a day of a new model’s release, with the evals giving a reasonable estimate of how the product will perform.

It will be fascinating to see how both shipping speed and product quality continue to be shaped by a team’s underlying work in evaluation, and what best practices arise at the frontier.
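To make the CI/CE loop a bit more concrete, here is a rough, hedged sketch of the trace-to-test-case feedback step. The Trace fields, the user_flagged signal, and the helper names are assumptions for illustration, not a reference to any specific tooling; in a real system a domain expert would typically write the expected behavior for each new case.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    user_input: str
    model_output: str
    user_flagged: bool  # e.g., a thumbs-down or an escalation in production

def trace_to_test_case(trace: Trace) -> dict:
    """Turn a flagged production interaction into a candidate regression test.
    The expected behavior is left for a domain expert to fill in."""
    return {
        "input": trace.user_input,
        "expected": "TODO: expert-written expected behavior",
    }

def grow_dataset(traces: list[Trace], dataset: list[dict]) -> list[dict]:
    """Append test cases derived from flagged traces so the eval suite
    covers yesterday's failures on every future run."""
    return dataset + [trace_to_test_case(t) for t in traces if t.user_flagged]
```

The same suite then doubles as the yardstick when a new model ships: re-run it with the model swapped in and compare the score against your current baseline before rolling out.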

👋 I’m a Researcher at Work-Bench, a Seed-stage, enterprise-focused VC fund based in New York City. Our sweet spot for investment at Seed aligns with a startup building out its early go-to-market motions.