Evaluation-Driven ML Development: Test Your Parser Before Production

This is part 3 of the AI Recruiting Pipeline Epic.

You've built a structured extractor that parses job descriptions into database fields. It works on your test cases. Then you ship it.

A week later, you discover:

  • The model started returning "Unknown" for shifts it previously detected
  • A prompt change broke credential extraction
  • The new model version has different failure modes

Traditional testing doesn't catch this. Unit tests mock the LLM response, so they never exercise the model's actual behavior. Integration tests that hit the real API are too slow and expensive to run on every change.

You need an evaluation harness: a dataset of real inputs with expected outputs, and automated comparison.

The Test Dataset

Create a YAML file of representative examples:

# spec/fixtures/parser_evaluation.yml
- name: "Standard nursing position"
  description: |
    We are seeking an RN for our cardiac unit. Day shift, 36 hours per week.
    Full benefits eligible. BSN required, ACLS certification preferred.
  expected:
    shift: "1st"
    hours_per_week: 36
    benefit_eligible: true
    education_names:
      - "BSN"
    certification_names:
      - "ACLS"

- name: "Ambiguous shift information"
  description: |
    Flexible scheduling available. Must be available for rotating shifts.
    Part-time position, 20-30 hours per week.
  expected:
    shift: "Varies"
    hours_per_week: null
    benefit_eligible: false

- name: "No explicit requirements"
  description: |
    Join our team! Great work environment. Competitive pay.
  expected:
    shift: "Unknown"
    hours_per_week: null
    benefit_eligible: null
    education_names: []
    certification_names: []

The dataset covers:
- Happy paths — Clear information, straightforward extraction
- Edge cases — Ambiguous or missing data
- Failure modes — Inputs that previously caused problems
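
Because the dataset is hand-edited YAML, it's easy to commit a malformed sample. A minimal sketch (plain Ruby, assuming the fixture format above) that checks each sample carries the keys the evaluator relies on:

```ruby
require "yaml"

# Validate that every sample has the keys the evaluator relies on.
# Returns an array of error strings; empty means the dataset is well-formed.
def validate_dataset(yaml_text)
  YAML.safe_load(yaml_text).each_with_object([]) do |sample, errors|
    %w[name description expected].each do |key|
      errors << "#{sample["name"].inspect}: missing #{key}" unless sample.key?(key)
    end
    unless sample["expected"].is_a?(Hash)
      errors << "#{sample["name"].inspect}: expected must be a mapping"
    end
  end
end

fixture = <<~YAML
  - name: "Standard nursing position"
    description: "RN for our cardiac unit. Day shift."
    expected:
      shift: "1st"
  - name: "Broken sample"
    description: "Missing its expected block."
YAML

errors = validate_dataset(fixture)
```

Running a check like this in a fast unit spec keeps fixture mistakes from masquerading as parser regressions.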

The Evaluator

class ParserEvaluator
  attr_reader :dataset_path, :model

  def initialize(dataset_path:, model: nil)
    @dataset_path = dataset_path
    @model = model || ENV.fetch("OPENAI_MODEL", "gpt-4o")
  end

  def call
    samples = YAML.safe_load_file(dataset_path)

    sample_reports = samples.map do |sample|
      evaluate_sample(sample)
    end

    {
      model:,
      passed_checks: sample_reports.sum { |r| r[:passed_checks] },
      total_checks: sample_reports.sum { |r| r[:total_checks] },
      pass_rate: calculate_pass_rate(sample_reports),
      samples: sample_reports
    }
  end

  private

  def evaluate_sample(sample)
    expected = sample.fetch("expected")
    result = Parser.new(sample.fetch("description"), model:).call

    mismatches = compare_expected(expected, result.data)

    {
      name: sample.fetch("name"),
      parse_status: result.status,
      passed_checks: expected.size - mismatches.size,
      total_checks: expected.size,
      mismatches:
    }
  end

  def compare_expected(expected, actual)
    expected.each_with_object([]) do |(field, expected_value), mismatches|
      # YAML keys are strings; fall back to a symbol lookup in case the
      # parser returns symbol-keyed data
      actual_value = actual.key?(field) ? actual[field] : actual[field.to_sym]
      next if values_match?(expected_value, actual_value)

      mismatches << {
        field:,
        expected: expected_value,
        actual: actual_value
      }
    end
  end

  def values_match?(expected, actual)
    normalize(expected) == normalize(actual)
  end

  def normalize(value)
    case value
    when Array then value.map { |v| normalize(v) }.sort
    when Hash then value.transform_values { |v| normalize(v) }.sort.to_h
    else value
    end
  end

  def calculate_pass_rate(reports)
    total = reports.sum { |r| r[:total_checks] }
    return 0.0 if total.zero?

    (reports.sum { |r| r[:passed_checks] }.fdiv(total) * 100).round(1)
  end
end

The evaluator:
1. Loads test cases from YAML
2. Runs each through the real parser
3. Compares extracted fields against expectations
4. Reports mismatches with full context
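
The order-insensitive comparison is the subtle part: YAML fixtures and model output won't always list certifications in the same order, and that shouldn't count as a failure. The same `normalize` idea, extracted as a standalone sketch:

```ruby
# Recursively sort arrays and hash keys so that semantically equal
# structures compare equal regardless of ordering.
def normalize(value)
  case value
  when Array then value.map { |v| normalize(v) }.sort
  when Hash  then value.transform_values { |v| normalize(v) }.sort.to_h
  else value
  end
end

def values_match?(expected, actual)
  normalize(expected) == normalize(actual)
end

values_match?(%w[ACLS BLS], %w[BLS ACLS]) # => true
```

Note that sorting assumes array elements are mutually comparable (strings in this dataset); nested hashes are sorted by key before comparison.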

Running Evaluations

Create a rake task for CI:

namespace :ml do
  desc "Evaluate parser accuracy against test dataset"
  task evaluate: :environment do
    result = ParserEvaluator.new(
      dataset_path: Rails.root.join("spec/fixtures/parser_evaluation.yml")
    ).call

    puts "Pass rate: #{result[:pass_rate]}%"
    puts "Passed: #{result[:passed_checks]}/#{result[:total_checks]} checks"

    result[:samples].each do |sample|
      next if sample[:mismatches].empty?

      puts "\nFailed: #{sample[:name]}"
      sample[:mismatches].each do |mismatch|
        puts "  #{mismatch[:field]}: expected #{mismatch[:expected].inspect}, got #{mismatch[:actual].inspect}"
      end
    end

    exit(1) if result[:pass_rate] < 90
  end
end

Run before deploying prompt changes:

OPENAI_MODEL=gpt-4o-2024-08-06 bin/rails ml:evaluate

Tracking Regressions

Store evaluation results for comparison:

class EvaluationRun < ApplicationRecord
  # model, dataset_path, pass_rate, passed_checks, total_checks,
  # mismatches_json, created_at

  def self.compare_to_baseline(current_result)
    baseline = where(model: current_result[:model]).order(created_at: :desc).first
    return nil unless baseline

    {
      pass_rate_delta: current_result[:pass_rate] - baseline.pass_rate,
      new_failures: current_result[:samples].select { |s| s[:mismatches].any? }
                                            .map { |s| s[:name] } -
                    JSON.parse(baseline.mismatches_json).map { |m| m["name"] }
    }
  end
end

Now you can detect when a model update introduces new failures.
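
Stripped of ActiveRecord, the comparison itself is simple. A sketch on plain result hashes (the sample shapes mirror the evaluator's report; the baseline values here are made up for illustration):

```ruby
# Compare a current evaluation result to a baseline run.
# Both arguments are plain hashes shaped like the evaluator's report.
def compare_to_baseline(current, baseline)
  failing = ->(result) do
    result[:samples].select { |s| s[:mismatches].any? }.map { |s| s[:name] }
  end

  {
    pass_rate_delta: (current[:pass_rate] - baseline[:pass_rate]).round(1),
    new_failures: failing.call(current) - failing.call(baseline)
  }
end

current = {
  pass_rate: 88.0,
  samples: [
    { name: "Standard nursing position", mismatches: [] },
    { name: "Ambiguous shift information", mismatches: [{ field: "shift" }] }
  ]
}
baseline = {
  pass_rate: 92.0,
  samples: [
    { name: "Standard nursing position", mismatches: [] },
    { name: "Ambiguous shift information", mismatches: [] }
  ]
}

report = compare_to_baseline(current, baseline)
# => { pass_rate_delta: -4.0, new_failures: ["Ambiguous shift information"] }
```

Subtracting the baseline's failing names from the current ones surfaces only the *new* breakage, which is what you actually want to triage after a model swap.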

When to Evaluate

  1. Before prompt changes — Run evaluation, compare to baseline
  2. On model version changes — gpt-4o-2024-08-06 behaves differently from gpt-4o
  3. In CI for PR reviews — Block merges that drop pass rate below threshold
  4. Weekly automated runs — Catch silent model drift

Building the Dataset

Start with failures:

# Find production records that required manual correction
JobListing.where.not(manual_override_at: nil).limit(20).each do |listing|
  puts YAML.dump({
    "name" => listing.name.truncate(50),
    "description" => listing.raw_description,
    "expected" => {
      "shift" => listing.shift,
      "hours_per_week" => listing.hours_per_week,
      # ... corrected values
    }
  })
end

Add edge cases as you discover them. A good dataset has 50-100 samples covering your extraction fields comprehensively.
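
As the dataset grows, duplicate names creep in when several people add the same failing example. A small sketch of a hypothetical `append_sample` helper that deduplicates by name (the on-disk shape matches the fixture above):

```ruby
require "yaml"

# Append a sample to the dataset YAML, skipping duplicates by name.
def append_sample(yaml_text, sample)
  samples = YAML.safe_load(yaml_text) || []
  return yaml_text if samples.any? { |s| s["name"] == sample["name"] }

  YAML.dump(samples + [sample])
end

dataset = YAML.dump([{ "name" => "Standard nursing position" }])
updated = append_sample(dataset, {
  "name" => "No explicit requirements",
  "description" => "Join our team! Great work environment.",
  "expected" => { "shift" => "Unknown" }
})
```

Keying on the sample name keeps the dataset append-only and idempotent, so re-running a collection script never bloats the file.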

The Payoff

With an evaluation harness:

  • Prompt experiments are measurable — "This change improved certification extraction from 85% to 92%"
  • Model updates are safe — Run evaluation before switching versions
  • Regressions are caught early — Before users report incorrect data
  • Team knowledge accumulates — The dataset documents edge cases

The harness costs a few hours to build and pays for itself on the first caught regression. Build it before your second prompt iteration, not after your first production incident.
