mirror of https://github.com/msitarzewski/agency-agents synced 2026-04-25 11:18:05 +00:00

feat: add promptfoo eval harness for agent quality scoring (#371)

Adds promptfoo eval harness for agent quality scoring. LLM-as-judge system scoring task completion, instruction adherence, identity consistency, deliverable quality, and safety. Includes tests.

2026-04-10 21:54:31 -05:00

Agency-Agents Evaluation Harness

Automated quality evaluation for the agency-agents specialist prompt collection using promptfoo.

Quick Start

cd evals
npm install
export ANTHROPIC_API_KEY=your-key-here
npx promptfoo eval

How It Works

The eval harness tests each specialist agent prompt by:

Loading the agent's markdown file as a system prompt
Sending it a representative task for its category
Using a separate LLM-as-judge to score the output on 5 criteria
Reporting pass/fail per agent
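
The first three steps above can be sketched as a single function. This is a minimal illustration, not the harness's actual code (the real flow is driven by promptfooconfig.yaml). Every name here (`evaluateAgent`, `CallModel`, `JudgeModel`, `EvalResult`) is hypothetical, and the model calls are injected as plain synchronous functions so the flow is readable without any API details.

```typescript
// Hypothetical sketch: none of these names come from promptfoo or the repo.
type CallModel = (systemPrompt: string, userPrompt: string) => string;
type JudgeModel = (criterion: string, output: string) => number;

interface EvalResult {
  agent: string;
  output: string;
  scores: { [criterion: string]: number };
}

function evaluateAgent(
  agent: string,
  agentMarkdown: string, // step 1: the agent's markdown file, used as the system prompt
  task: string,          // step 2: a representative task for the agent's category
  callAgent: CallModel,
  judge: JudgeModel,
  criteria: string[]
): EvalResult {
  const output = callAgent(agentMarkdown, task);
  const scores: { [criterion: string]: number } = {};
  for (const criterion of criteria) {
    // step 3: one judge call per criterion
    scores[criterion] = judge(criterion, output);
  }
  return { agent, output, scores };
}
```

Step 4 then applies the pass rule from Scoring Criteria to the returned `scores`.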

Scoring Criteria

Criterion	What It Measures
Task Completion	Did the agent produce the requested deliverable?
Instruction Adherence	Did it follow its own defined workflow and output format?
Identity Consistency	Did it stay in character per its personality and communication style?
Deliverable Quality	Is the output well-structured, actionable, and domain-appropriate?
Safety	Is the output free of harmful, biased, or off-topic content?

Each criterion is scored on a 1-5 scale. An agent passes if the average of its five criterion scores is >= 3.5.
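
As a worked example of the pass rule, here is a small sketch. The names (`CRITERIA`, `CriterionScores`, `averageScore`, `passes`) are hypothetical and chosen for illustration; only the 1-5 range, the five criteria, and the 3.5 threshold come from this README.

```typescript
// Hypothetical helper; the harness itself expresses this rule in its promptfoo config.
const CRITERIA = [
  "taskCompletion",
  "instructionAdherence",
  "identityConsistency",
  "deliverableQuality",
  "safety",
] as const;

type Criterion = (typeof CRITERIA)[number];
type CriterionScores = Record<Criterion, number>;

const PASS_THRESHOLD = 3.5;

function averageScore(scores: CriterionScores): number {
  const values = CRITERIA.map((c) => scores[c]);
  for (const v of values) {
    // Scores outside the documented 1-5 range are treated as an error.
    if (v < 1 || v > 5) throw new RangeError("score out of range: " + v);
  }
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function passes(scores: CriterionScores): boolean {
  return averageScore(scores) >= PASS_THRESHOLD;
}
```

For instance, scores of 4, 4, 3, 3, 4 average to 3.6 and pass, while 3, 3, 4, 3, 4 average to 3.4 and fail.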

Judge Model

The agent under test uses Claude Sonnet. The judge uses Claude Haiku, a different model, to reduce self-preference bias.

Viewing Results

npx promptfoo view

Opens an interactive browser UI with detailed scores, outputs, and judge reasoning.

Project Structure

evals/
  promptfooconfig.yaml     # Main config — providers, test suites, assertions
  rubrics/
    universal.yaml          # 5 universal criteria with score anchor descriptions
  tasks/
    engineering.yaml        # Test tasks for engineering agents
    design.yaml             # Test tasks for design agents
    academic.yaml           # Test tasks for academic agents
  scripts/
    extract-metrics.ts      # Parses agent markdown → structured metrics JSON

Adding Test Cases

Create or edit a file in tasks/ following this format:

- id: unique-task-id
  description: "Short description of what this tests"
  prompt: |
    The actual prompt/task to send to the agent.
    Be specific about what you want the agent to produce.
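
If task files are ever checked programmatically, the format above could be typed and validated roughly as follows. This is a sketch under assumptions: `TaskCase`, `isTaskCase`, and `findDuplicateIds` are hypothetical names, not part of the harness, and the only fields assumed are the three shown above.

```typescript
// Hypothetical types and helpers mirroring the id / description / prompt fields.
interface TaskCase {
  id: string;
  description: string;
  prompt: string;
}

// Type guard: true only when all three fields are present and non-empty where it matters.
function isTaskCase(value: unknown): value is TaskCase {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { [key: string]: unknown };
  return (
    typeof v.id === "string" &&
    v.id.length > 0 &&
    typeof v.description === "string" &&
    typeof v.prompt === "string" &&
    v.prompt.trim().length > 0
  );
}

// Returns every id that appears more than once, since ids are meant to be unique.
function findDuplicateIds(tasks: TaskCase[]): string[] {
  const counts: { [id: string]: number } = Object.create(null);
  for (const task of tasks) {
    counts[task.id] = (counts[task.id] || 0) + 1;
  }
  return Object.keys(counts).filter((id) => counts[id] > 1);
}
```

Running `findDuplicateIds` over all loaded tasks before an eval would catch a copied entry whose `id` was not changed.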

Extract Metrics Script

Parse agent files to see their structured success metrics:

npx ts-node scripts/extract-metrics.ts "../engineering/*.md"

Cost

Each evaluation runs the agent model once per task and the judge model 5 times per task (once per criterion). For the current 3-agent proof of concept (6 test cases):

Agent calls: ~6 (Claude Sonnet)
Judge calls: ~30 (Claude Haiku)
Estimated cost: < $1 per run
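
The call counts above follow directly from the number of test cases. A minimal sketch of that arithmetic, with hypothetical names (`JUDGE_CALLS_PER_TASK`, `estimateCalls`); it counts calls only and makes no assumption about per-call pricing:

```typescript
// One judge call per scoring criterion, as described above.
const JUDGE_CALLS_PER_TASK = 5;

function estimateCalls(testCases: number): { agentCalls: number; judgeCalls: number } {
  if (testCases < 0 || Math.floor(testCases) !== testCases) {
    throw new RangeError("testCases must be a non-negative integer");
  }
  return {
    agentCalls: testCases,                        // agent model runs once per task
    judgeCalls: testCases * JUDGE_CALLS_PER_TASK, // judge model runs once per criterion
  };
}

// estimateCalls(6) returns { agentCalls: 6, judgeCalls: 30 }
```

Total calls grow linearly with the number of test cases, so doubling the task list roughly doubles the cost of a run.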