This is part 3 of the AI Recruiting Pipeline Epic.
You've built a structured extractor that parses job descriptions into database fields. It works on your test cases. Then you ship it.
A week later, you discover:
- The model started returning "Unknown" for shifts it previously detected
- A prompt change broke credential extraction
- The new model version has different failure modes
Traditional testing doesn't catch this. Unit tests mock the LLM response, so they only verify your plumbing. Integration tests against the live API are too slow and expensive to run frequently.
You need an evaluation harness: a dataset of real inputs with expected outputs, and automated comparison.
## The Test Dataset
Create a YAML file of representative examples:
```yaml
# spec/fixtures/parser_evaluation.yml
- name: "Standard nursing position"
  description: |
    We are seeking an RN for our cardiac unit. Day shift, 36 hours per week.
    Full benefits eligible. BSN required, ACLS certification preferred.
  expected:
    shift: "1st"
    hours_per_week: 36
    benefit_eligible: true
    education_names:
      - "BSN"
    certification_names:
      - "ACLS"

- name: "Ambiguous shift information"
  description: |
    Flexible scheduling available. Must be available for rotating shifts.
    Part-time position, 20-30 hours per week.
  expected:
    shift: "Varies"
    hours_per_week: null
    benefit_eligible: false

- name: "No explicit requirements"
  description: |
    Join our team! Great work environment. Competitive pay.
  expected:
    shift: "Unknown"
    hours_per_week: null
    benefit_eligible: null
    education_names: []
    certification_names: []
```
The dataset covers:
- **Happy paths** — Clear information, straightforward extraction
- **Edge cases** — Ambiguous or missing data
- **Failure modes** — Inputs that previously caused problems
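Before the harness runs (and spends tokens), a cheap structural check on the fixture file catches typos early. A minimal sketch, assuming the fixture layout above; the `invalid_samples` helper is hypothetical, not part of the evaluator:

```ruby
require "yaml"

# Every sample needs the keys the evaluator will fetch.
REQUIRED_KEYS = %w[name description expected].freeze

# Hypothetical helper: return the names of malformed samples.
def invalid_samples(samples)
  samples.reject { |s| REQUIRED_KEYS.all? { |k| s.key?(k) } }
         .map { |s| s["name"] || "<unnamed>" }
end

# Inline YAML stands in for the fixture file here.
samples = YAML.safe_load(<<~YML)
  - name: "Standard nursing position"
    description: "We are seeking an RN for our cardiac unit..."
    expected:
      shift: "1st"
  - name: "Missing expected block"
    description: "Join our team!"
YML

invalid_samples(samples)  # => ["Missing expected block"]
```

Wiring this into the evaluator (or a plain spec) means a broken fixture fails fast instead of burning API calls.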
## The Evaluator
```ruby
class ParserEvaluator
  attr_reader :dataset_path, :model

  def initialize(dataset_path:, model: nil)
    @dataset_path = dataset_path
    @model = model || ENV.fetch("OPENAI_MODEL", "gpt-4o")
  end

  def call
    samples = YAML.safe_load_file(dataset_path)
    sample_reports = samples.map { |sample| evaluate_sample(sample) }

    {
      model:,
      passed_checks: sample_reports.sum { |r| r[:passed_checks] },
      total_checks: sample_reports.sum { |r| r[:total_checks] },
      pass_rate: calculate_pass_rate(sample_reports),
      samples: sample_reports
    }
  end

  private

  def evaluate_sample(sample)
    expected = sample.fetch("expected")
    result = Parser.new(sample.fetch("description"), model:).call
    # YAML keys are strings; stringify the parser output so lookups match.
    mismatches = compare_expected(expected, result.data.deep_stringify_keys)

    {
      name: sample.fetch("name"),
      parse_status: result.status,
      passed_checks: expected.size - mismatches.size,
      total_checks: expected.size,
      mismatches:
    }
  end

  def compare_expected(expected, actual)
    expected.each_with_object([]) do |(field, expected_value), mismatches|
      actual_value = actual[field]
      next if values_match?(expected_value, actual_value)

      mismatches << {
        field:,
        expected: expected_value,
        actual: actual_value
      }
    end
  end

  def values_match?(expected, actual)
    normalize(expected) == normalize(actual)
  end

  def normalize(value)
    case value
    when Array then value.map { |v| normalize(v) }.sort
    when Hash then value.transform_values { |v| normalize(v) }.sort.to_h
    else value
    end
  end

  def calculate_pass_rate(sample_reports)
    total = sample_reports.sum { |r| r[:total_checks] }
    return 0.0 if total.zero?

    (sample_reports.sum { |r| r[:passed_checks] }.fdiv(total) * 100).round(1)
  end
end
```
The evaluator:
1. Loads test cases from YAML
2. Runs each through the real parser
3. Compares extracted fields against expectations
4. Reports mismatches with full context
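The normalization step matters because the model may return list fields in any order across runs. Here is the evaluator's `normalize` extracted for illustration, with one caveat worth noting — it deliberately does no type coercion:

```ruby
# Same normalization as the evaluator: recursively sort arrays and
# hash keys so ordering differences don't count as mismatches.
def normalize(value)
  case value
  when Array then value.map { |v| normalize(v) }.sort
  when Hash then value.transform_values { |v| normalize(v) }.sort.to_h
  else value
  end
end

normalize(%w[ACLS BLS]) == normalize(%w[BLS ACLS])  # => true
normalize("36") == normalize(36)                    # => false (no type coercion)
```

Because `"36"` and `36` don't match, the fixture must use the same types the parser emits (integers for hours, booleans for eligibility).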
## Running Evaluations
Create a rake task for CI:
```ruby
namespace :ml do
  desc "Evaluate parser accuracy against test dataset"
  task evaluate: :environment do
    result = ParserEvaluator.new(
      dataset_path: Rails.root.join("spec/fixtures/parser_evaluation.yml")
    ).call

    puts "Pass rate: #{result[:pass_rate]}%"
    puts "Passed: #{result[:passed_checks]}/#{result[:total_checks]} checks"

    result[:samples].each do |sample|
      next if sample[:mismatches].empty?

      puts "\nFailed: #{sample[:name]}"
      sample[:mismatches].each do |mismatch|
        puts "  #{mismatch[:field]}: expected #{mismatch[:expected].inspect}, got #{mismatch[:actual].inspect}"
      end
    end

    exit(1) if result[:pass_rate] < 90
  end
end
```
Run before deploying prompt changes:
```bash
OPENAI_MODEL=gpt-4o-2024-08-06 bin/rails ml:evaluate
```
## Tracking Regressions
Store evaluation results for comparison:
```ruby
class EvaluationRun < ApplicationRecord
  # Columns: model, dataset_path, pass_rate, passed_checks, total_checks,
  # mismatches_json, created_at

  # current_result is the hash returned by ParserEvaluator#call.
  def self.compare_to_baseline(current_result)
    baseline = where(model: current_result[:model]).order(created_at: :desc).first
    return nil unless baseline

    failing_now = current_result[:samples]
      .select { |s| s[:mismatches].any? }
      .map { |s| s[:name] }
    baseline_failures = JSON.parse(baseline.mismatches_json).map { |m| m["name"] }

    {
      pass_rate_delta: current_result[:pass_rate] - baseline.pass_rate,
      new_failures: failing_now - baseline_failures
    }
  end
end
```
Now you can detect when a model update introduces new failures.
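The delta logic is easiest to see with plain hashes, outside ActiveRecord. A standalone sketch; the sample names come from the fixture above, the numbers are invented:

```ruby
# Sample names that were already failing on the last stored run.
baseline_failures = ["Ambiguous shift information"]

# Current run, shaped like the evaluator's output.
current = {
  pass_rate: 88.0,
  samples: [
    { name: "Standard nursing position",   mismatches: [{ field: "shift" }] },
    { name: "Ambiguous shift information", mismatches: [{ field: "hours_per_week" }] },
    { name: "No explicit requirements",    mismatches: [] }
  ]
}

failing_now  = current[:samples].select { |s| s[:mismatches].any? }.map { |s| s[:name] }
new_failures = failing_now - baseline_failures
# => ["Standard nursing position"], a regression the baseline run didn't have
```

Samples that were already failing don't count as regressions; only newly failing names surface, which keeps the signal clean when a known-hard sample stays hard.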
## When to Evaluate
- **Before prompt changes** — Run evaluation, compare to baseline
- **On model version changes** — `gpt-4o-2024-08-06` behaves differently than `gpt-4o`
- **In CI for PR reviews** — Block merges that drop pass rate below threshold
- **Weekly automated runs** — Catch silent model drift
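The weekly run can ride on whatever your stack already uses for cron. A sketch using the `whenever` gem's schedule DSL (an assumption; the gem is not part of the original setup, and any scheduler that can invoke a rake task works equally well):

```ruby
# config/schedule.rb (whenever gem, hypothetical schedule)
every :sunday, at: "6:00 am" do
  rake "ml:evaluate"
end
```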
## Building the Dataset
Start with failures:
```ruby
# Find production records that required manual correction
JobListing.where.not(manual_override_at: nil).limit(20).each do |listing|
  puts YAML.dump({
    "name" => listing.name.truncate(50),
    "description" => listing.raw_description,
    "expected" => {
      "shift" => listing.shift,
      "hours_per_week" => listing.hours_per_week,
      # ... corrected values
    }
  })
end
```
Add edge cases as you discover them. A good dataset has 50-100 samples covering your extraction fields comprehensively.
## The Payoff
With an evaluation harness:
- **Prompt experiments are measurable** — "This change improved certification extraction from 85% to 92%"
- **Model updates are safe** — Run evaluation before switching versions
- **Regressions are caught early** — Before users report incorrect data
- **Team knowledge accumulates** — The dataset documents edge cases
The harness costs a few hours to build and pays for itself on the first caught regression. Build it before your second prompt iteration, not after your first production incident.