ML systems fail silently. The model extracts data, writes it to the database, and moves on. If the extraction is wrong, you don't know until a user complains — or worse, makes a decision based on bad data.
The fix isn't better models. It's human-in-the-loop systems that make oversight efficient, not optional.
## The Problem with Full Automation
Consider a job scraping pipeline:
```
Scrape listings → Parse with LLM → Import to database → Show to users
```
This works until:
- The model misclassifies a night shift as day shift
- A listing for "CNA" gets tagged as "RN" due to text similarity
- A job from the wrong department slips through category filters
Each error erodes user trust. After a few bad recommendations, users stop relying on the system entirely.
## The Human-in-the-Loop Pattern
Insert review points at critical junctions:
```
Scrape → Parse → Queue for Review → Human Decision → Import OR Exclude
                                                            ↓
                              Exclusion Ledger ← Feeds back to scraper
```
The key insight: review doesn't mean reviewing everything. It means building systems where humans can efficiently intervene on edge cases.
## Building an Exclusion System
For job listings, the exclusion system looks like:
```ruby
class JobListing < ApplicationRecord
  belongs_to :excluded_by_admin, class_name: "Admin", optional: true
  belongs_to :restored_by_admin, class_name: "Admin", optional: true

  scope :active,   -> { where(excluded_at: nil) }
  scope :excluded, -> { where.not(excluded_at: nil) }

  def exclude!(admin:, reason:, notes: nil)
    update!(
      excluded_at: Time.current,
      excluded_by_admin: admin,
      exclusion_reason: reason,
      exclusion_notes: notes
    )
  end

  def restore!(admin:)
    update!(
      excluded_at: nil,
      excluded_by_admin: nil,
      exclusion_reason: nil,
      exclusion_notes: nil,
      restored_at: Time.current,
      restored_by_admin: admin
    )
  end
end
```
Exclusions are:
- Attributed — Who excluded it and why
- Reversible — Mistakes can be undone
- Auditable — The history is preserved
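The columns backing this model might be added with a migration along these lines. This is a sketch, not the article's actual schema: the table name, the `admins` table, and the index choices are assumptions inferred from the model above.

```ruby
class AddExclusionFieldsToJobListings < ActiveRecord::Migration[7.1]
  def change
    add_column :job_listings, :excluded_at, :datetime
    add_column :job_listings, :exclusion_reason, :string
    add_column :job_listings, :exclusion_notes, :text
    add_column :job_listings, :restored_at, :datetime

    # add_reference creates the *_id columns and indexes them;
    # to_table points both at the assumed admins table.
    add_reference :job_listings, :excluded_by_admin, foreign_key: { to_table: :admins }
    add_reference :job_listings, :restored_by_admin, foreign_key: { to_table: :admins }

    # Speeds up the active/excluded scopes, which filter on this column
    add_index :job_listings, :excluded_at
  end
end
```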
## The Review Interface
Don't make humans review everything. Surface high-risk items:
```ruby
class JobListingReviewQueue
  # Low-confidence or explicitly flagged items go to humans, newest first
  def items_needing_review
    JobListing
      .where(reviewed_at: nil)
      .where("confidence_score < ? OR flagged_for_review = ?", 0.8, true)
      .order(created_at: :desc)
  end

  # High-confidence items skip the queue entirely
  def auto_approve_high_confidence
    JobListing
      .where(reviewed_at: nil)
      .where("confidence_score >= ?", 0.95)
      .update_all(reviewed_at: Time.current, auto_approved: true)
  end
end
```
High-confidence items auto-approve. Low-confidence items queue for human review. The threshold is tunable based on your error tolerance.
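The routing decision itself is simple enough to express without ActiveRecord. A minimal sketch, assuming the same thresholds as above and a `flagged` attribute on each item (both thresholds are tunable, not fixed values):

```ruby
# Routes a parsed item by confidence: high → auto-approve,
# low or flagged → human review, middle band → hold for later.
class ConfidenceRouter
  def initialize(auto_approve_at: 0.95, review_below: 0.8)
    @auto_approve_at = auto_approve_at
    @review_below = review_below
  end

  # item is a Hash like { confidence: 0.91, flagged: false }
  def route(item)
    return :review if item[:flagged]
    return :auto_approve if item[:confidence] >= @auto_approve_at
    return :review if item[:confidence] < @review_below
    :hold # mid-confidence: neither auto-approved nor urgent
  end
end

router = ConfidenceRouter.new
router.route(confidence: 0.97, flagged: false) # => :auto_approve
router.route(confidence: 0.50, flagged: false) # => :review
router.route(confidence: 0.99, flagged: true)  # => :review
```

Note that a flag always wins over confidence: an extraction the pipeline marked suspicious goes to a human even if the model was certain.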
## Feeding Back to the Scraper
Exclusions should prevent re-scraping the same bad data:
```ruby
class ParkviewScraper
  # Drop listings whose requisition numbers were previously excluded,
  # so they never re-enter the parse/review pipeline
  def filter_excluded_listings!
    excluded_req_numbers = JobListing.excluded.pluck(:parkview_req_number).to_set
    @all_listings.reject! do |listing|
      excluded_req_numbers.include?(listing[:parkview_req_number])
    end
  end
end
```
Once excluded, a listing doesn't waste processing time or require repeated exclusion.
## Measuring Effectiveness
Track exclusion rates to spot systemic issues:
```ruby
class ExclusionMetrics
  def summary(since: 30.days.ago)
    listings = JobListing.where("created_at > ?", since)
    total    = listings.count
    excluded = listings.excluded.count
    {
      total: total,
      excluded: excluded,
      # Guard against division by zero on an empty window
      exclusion_rate: total.zero? ? 0.0 : (excluded.to_f / total * 100).round(2),
      by_reason: listings.excluded.group(:exclusion_reason).count
    }
  end
end
```
A rising exclusion rate signals:
- Model degradation
- Source data quality issues
- Category drift in the scrape target
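One way to act on that signal is to compare each window's exclusion rate against a baseline and alert when it jumps. A sketch in plain Ruby; the 1.5× multiplier and minimum sample size are arbitrary assumptions, not recommendations:

```ruby
# Flags a window whose exclusion rate is meaningfully above baseline,
# which usually means the model or the source data has shifted.
class ExclusionRateAlert
  def initialize(baseline_rate:, multiplier: 1.5, min_sample: 50)
    @baseline_rate = baseline_rate
    @multiplier = multiplier
    @min_sample = min_sample
  end

  def triggered?(excluded:, total:)
    return false if total < @min_sample # avoid noise on small windows
    current_rate = excluded.to_f / total
    current_rate > @baseline_rate * @multiplier
  end
end

alert = ExclusionRateAlert.new(baseline_rate: 0.04)
alert.triggered?(excluded: 9, total: 100) # => true  (9% vs a 6% ceiling)
alert.triggered?(excluded: 5, total: 100) # => false
```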
## When to Use This Pattern
Human-in-the-loop is essential when:
- Errors have real consequences — Bad job recommendations waste candidate time
- Model confidence varies — Some extractions are certain, others aren't
- Edge cases are hard to enumerate — You can't write rules for everything
- Trust is important — Users need to believe the data
## The Efficiency Question
"But review doesn't scale!"
True — if you review everything. The goal is efficient review:
| Approach | Items Reviewed | Quality |
|---|---|---|
| No review | 0% | Model accuracy (85-95%) |
| Full review | 100% | Human accuracy (99%+) |
| Confidence-based | 5-15% | Near-human (98%+) |
Reviewing 10% of items catches most errors while scaling with volume.
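The back-of-envelope math behind the confidence-based row looks like this. Every rate here is an illustrative assumption, not a measurement; the point is that errors concentrate in the low-confidence slice, so reviewing that slice removes most of them:

```ruby
# Blended accuracy when humans review only the low-confidence slice
model_error_rate = 0.10 # model is wrong on 10% of items overall
review_fraction  = 0.10 # humans review the lowest-confidence 10%
error_capture    = 0.85 # share of model errors landing in that slice
human_accuracy   = 0.99

# Accuracy of the auto-approved 90%: only the errors NOT captured
# by the review slice remain in it
auto_accuracy = 1 - (model_error_rate * (1 - error_capture)) / (1 - review_fraction)

blended = (1 - review_fraction) * auto_accuracy + review_fraction * human_accuracy
puts (blended * 100).round(1) # prints 98.4
```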
## Implementation Checklist
- [ ] Exclusion/approval fields on the model
- [ ] Admin attribution for decisions
- [ ] Confidence scoring from ML pipeline
- [ ] Review queue prioritized by risk
- [ ] Auto-approval for high-confidence items
- [ ] Feedback loop to prevent re-processing
- [ ] Metrics dashboard for monitoring
## The Takeaway
Production ML isn't about removing humans. It's about leveraging humans efficiently:
- Models handle volume (thousands of items per hour)
- Humans handle judgment (is this really a nursing position?)
- Systems route appropriately (low confidence → human review)
Build the human-in-the-loop infrastructure before you ship. Adding it later means untangling bad data that's already in production.