Human-in-the-Loop ML: Building Quality Control Into AI Pipelines

ML systems fail silently. The model extracts data, writes it to the database, and moves on. If the extraction is wrong, you don't know until a user complains — or worse, makes a decision based on bad data.

The fix isn't better models. It's human-in-the-loop systems that make oversight efficient, not optional.

The Problem with Full Automation

Consider a job scraping pipeline:

Scrape listings → Parse with LLM → Import to database → Show to users

This works until:
- The model misclassifies a night shift as day shift
- A listing for "CNA" gets tagged as "RN" due to text similarity
- A job from the wrong department slips through category filters

Each error erodes user trust. After a few bad recommendations, users stop trusting the system entirely.

The Human-in-the-Loop Pattern

Insert review points at critical junctions:

Scrape → Parse → Queue for Review → Human Decision → Import OR Exclude
                      ↓
                 Exclusion Ledger ← Feeds back to scraper

The key insight: review doesn't mean reviewing everything. It means building systems where humans can efficiently intervene on edge cases.

Building an Exclusion System

For job listings, the exclusion system looks like:

class JobListing < ApplicationRecord
  belongs_to :excluded_by_admin, class_name: "Admin", optional: true
  belongs_to :restored_by_admin, class_name: "Admin", optional: true

  scope :active, -> { where(excluded_at: nil) }
  scope :excluded, -> { where.not(excluded_at: nil) }

  def exclude!(admin:, reason:, notes: nil)
    update!(
      excluded_at: Time.current,
      excluded_by_admin: admin,
      exclusion_reason: reason,
      exclusion_notes: notes
    )
  end

  def restore!(admin:)
    update!(
      excluded_at: nil,
      excluded_by_admin: nil,
      exclusion_reason: nil,
      exclusion_notes: nil,
      restored_at: Time.current,
      restored_by_admin: admin
    )
  end
end

Exclusions are:
- Attributed — who excluded the listing, and why
- Reversible — restore! undoes a mistaken exclusion
- Auditable — exclusions and restorations are timestamped and attributed
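The columns this model assumes could be added with a migration along these lines (a hypothetical sketch — table and column names mirror the model above; adjust to your schema conventions):

```ruby
# Hypothetical migration adding the exclusion/restoration fields the
# JobListing model above expects.
class AddExclusionFieldsToJobListings < ActiveRecord::Migration[7.1]
  def change
    add_column :job_listings, :excluded_at, :datetime
    add_reference :job_listings, :excluded_by_admin, foreign_key: { to_table: :admins }
    add_column :job_listings, :exclusion_reason, :string
    add_column :job_listings, :exclusion_notes, :text
    add_column :job_listings, :restored_at, :datetime
    add_reference :job_listings, :restored_by_admin, foreign_key: { to_table: :admins }

    # Supports the active/excluded scopes, which filter on this column
    add_index :job_listings, :excluded_at
  end
end
```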

The Review Interface

Don't make humans review everything. Surface high-risk items:

class JobListingReviewQueue
  def items_needing_review
    JobListing
      .where(reviewed_at: nil)
      .where("confidence_score < ? OR flagged_for_review = ?", 0.8, true)
      .order(created_at: :desc)
  end

  def auto_approve_high_confidence
    JobListing
      .where(reviewed_at: nil)
      .where("confidence_score >= ?", 0.95)
      .update_all(reviewed_at: Time.current, auto_approved: true)
  end
end

High-confidence items auto-approve. Low-confidence items queue for human review. The threshold is tunable based on your error tolerance.
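The routing logic reduces to a small, testable function. A minimal plain-Ruby sketch, using the thresholds from the queries above (0.8 and 0.95 are illustrative starting points, not recommendations — and the handling of the middle band is one reasonable policy, not the only one):

```ruby
AUTO_APPROVE_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.80

# Routes a parsed item based on model confidence. Items the parser has
# explicitly flagged always go to a human, regardless of score.
def route(confidence_score:, flagged: false)
  return :human_review if flagged || confidence_score < REVIEW_THRESHOLD
  return :auto_approve if confidence_score >= AUTO_APPROVE_THRESHOLD

  # Middle band: confident enough to skip urgent review, not confident
  # enough to auto-approve. Holding for periodic batch review is one option.
  :hold_for_batch_review
end
```

Keeping this as a pure function makes the thresholds easy to tune and the routing easy to unit-test independently of the database.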

Feeding Back to the Scraper

Exclusions should prevent re-scraping the same bad data:

class ParkviewScraper
  def filter_excluded_listings!
    excluded_req_numbers = JobListing.excluded.pluck(:parkview_req_number).to_set

    @all_listings.reject! do |listing|
      excluded_req_numbers.include?(listing[:parkview_req_number])
    end
  end
end

Once excluded, a listing doesn't waste processing time or require repeated exclusion.
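In isolation, the filter is just set-membership rejection. A standalone sketch with hypothetical data (the listing hashes and req numbers here are invented for illustration):

```ruby
require "set"

# Hypothetical scraped listings; :parkview_req_number is the dedupe key
# the scraper above filters on.
listings = [
  { parkview_req_number: "R-1001", title: "RN - ICU, Day Shift" },
  { parkview_req_number: "R-1002", title: "CNA - Med/Surg, Nights" },
  { parkview_req_number: "R-1003", title: "RN - ER, Day Shift" }
]

# Previously excluded req numbers, as pulled from the exclusion ledger.
excluded_req_numbers = Set.new(["R-1002"])

# Drop excluded listings before they reach the LLM parsing step.
listings.reject! { |l| excluded_req_numbers.include?(l[:parkview_req_number]) }
```

Using a Set keeps each membership check O(1), which matters once the exclusion ledger grows past a few thousand entries.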

Measuring Effectiveness

Track exclusion rates to spot systemic issues:

class ExclusionMetrics
  def summary(since: 30.days.ago)
    listings = JobListing.where("created_at > ?", since)
    total = listings.count
    excluded = listings.excluded.count

    {
      total: total,
      excluded: excluded,
      exclusion_rate: total.zero? ? 0.0 : (excluded.to_f / total * 100).round(2),
      by_reason: listings.excluded.group(:exclusion_reason).count
    }
  end
end

A rising exclusion rate signals:
- Model degradation
- Source data quality issues
- Category drift in the scrape target
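A metrics summary is only useful if someone acts on it. One way to make the signal active is to compare the current window's rate against a baseline window — say, the last 7 days against the prior 30. A minimal sketch (the 2x tolerance is an illustrative starting point):

```ruby
# Flags a rising exclusion rate by comparing the current window to a
# baseline window. Both rates are percentages, e.g. from
# ExclusionMetrics#summary run over two different time ranges.
def exclusion_rate_rising?(current_rate, baseline_rate, tolerance: 2.0)
  return false if baseline_rate.zero? # no baseline yet; nothing to compare

  current_rate > baseline_rate * tolerance
end
```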

When to Use This Pattern

Human-in-the-loop is essential when:

  1. Errors have real consequences — Bad job recommendations waste candidate time
  2. Model confidence varies — Some extractions are certain, others aren't
  3. Edge cases are hard to enumerate — You can't write rules for everything
  4. Trust is important — Users need to believe the data

The Efficiency Question

"But review doesn't scale!"

True — if you review everything. The goal is efficient review:

Approach           Items Reviewed   Quality
No review          0%               Model accuracy (85-95%)
Full review        100%             Human accuracy (99%+)
Confidence-based   5-15%            Near-human (98%+)

Reviewing 10% of items catches most errors, and reviewer workload grows at a small fraction of total volume.

Implementation Checklist

  • [ ] Exclusion/approval fields on the model
  • [ ] Admin attribution for decisions
  • [ ] Confidence scoring from ML pipeline
  • [ ] Review queue prioritized by risk
  • [ ] Auto-approval for high-confidence items
  • [ ] Feedback loop to prevent re-processing
  • [ ] Metrics dashboard for monitoring

The Takeaway

Production ML isn't about removing humans. It's about leveraging humans efficiently:

  • Models handle volume (thousands of items per hour)
  • Humans handle judgment (is this really a nursing position?)
  • Systems route appropriately (low confidence → human review)

Build the human-in-the-loop infrastructure before you ship. Adding it later means untangling bad data that's already in production.
