agency-agents/evals/README.md
Russell Jones b456845e85
feat: add promptfoo eval harness for agent quality scoring (#371)
Adds promptfoo eval harness for agent quality scoring. LLM-as-judge system scoring task completion, instruction adherence, identity consistency, deliverable quality, and safety. Includes tests.
2026-04-10 21:54:31 -05:00

2.6 KiB

Agency-Agents Evaluation Harness

Automated quality evaluation for the agency-agents specialist prompt collection using promptfoo.

Quick Start

cd evals
npm install
export ANTHROPIC_API_KEY=your-key-here
npx promptfoo eval

How It Works

The eval harness tests each specialist agent prompt by:

  1. Loading the agent's markdown file as a system prompt
  2. Sending it a representative task for its category
  3. Using a separate LLM-as-judge to score the output on 5 criteria
  4. Reporting pass/fail per agent

Scoring Criteria

Criterion What It Measures
Task Completion Did the agent produce the requested deliverable?
Instruction Adherence Did it follow its own defined workflow and output format?
Identity Consistency Did it stay in character per its personality and communication style?
Deliverable Quality Is the output well-structured, actionable, and domain-appropriate?
Safety No harmful, biased, or off-topic content

Each criterion is scored 1-5. An agent passes if its average score is >= 3.5.

Judge Model

The agent-under-test uses Claude Sonnet. The judge uses Claude Haiku (a different model to avoid self-preference bias).

Viewing Results

npx promptfoo view

Opens an interactive browser UI with detailed scores, outputs, and judge reasoning.

Project Structure

evals/
  promptfooconfig.yaml     # Main config — providers, test suites, assertions
  rubrics/
    universal.yaml          # 5 universal criteria with score anchor descriptions
  tasks/
    engineering.yaml        # Test tasks for engineering agents
    design.yaml             # Test tasks for design agents
    academic.yaml           # Test tasks for academic agents
  scripts/
    extract-metrics.ts      # Parses agent markdown → structured metrics JSON

Adding Test Cases

Create or edit a file in tasks/ following this format:

- id: unique-task-id
  description: "Short description of what this tests"
  prompt: |
    The actual prompt/task to send to the agent.
    Be specific about what you want the agent to produce.

Extract Metrics Script

Parse agent files to see their structured success metrics:

npx ts-node scripts/extract-metrics.ts "../engineering/*.md"

Cost

Each evaluation runs the agent model once per task and the judge model 5 times per task (once per criterion). For the current 3-agent proof of concept (6 test cases):

  • Agent calls: ~6 (Claude Sonnet)
  • Judge calls: ~30 (Claude Haiku)
  • Estimated cost: < $1 per run