Development scrapers work differently from production scrapers. In development, you scrape 10 pages, verify the output, and iterate. In production, you scrape 2,000 pages, encounter rate limits, timeouts, and malformed HTML — and need to recover gracefully.
This article covers the patterns that make scrapers production-ready.
The Architecture
```
┌────────────────┐     ┌──────────────┐     ┌──────────────┐
│   Pagination   │  →  │    Worker    │  →  │  Checkpoint  │
│   Collector    │     │     Pool     │     │    Writer    │
└────────────────┘     └──────────────┘     └──────────────┘
        ↓                      ↓                    ↓
┌────────────────┐     ┌──────────────┐     ┌──────────────┐
│  All Listings  │     │  Semaphore   │     │  Resume CSV  │
│  (in memory)   │     │ (rate limit) │     │  (on disk)   │
└────────────────┘     └──────────────┘     └──────────────┘
```
Three concerns, separated:
1. Collection — Gather all URLs to process
2. Processing — Concurrent workers with rate limiting
3. Persistence — Checkpoint progress for resume
Concurrent HTTP Fetching
Sequential fetching is too slow. Concurrent fetching with a worker pool:
```ruby
def process_listings(listings, concurrency: 8)
  queue = Queue.new
  listings.each { |listing| queue << listing }

  results = []
  mutex = Mutex.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        # Non-blocking pop raises ThreadError on an empty queue;
        # the rescue turns that into a clean worker exit
        listing = queue.pop(true) rescue break
        result = process_listing(listing)
        mutex.synchronize { results << result }
      end
    end
  end

  workers.each(&:join)
  results
end
```
The worker count (`concurrency`) balances speed against target server capacity. Start with 4-8 workers and increase only if the target handles it.
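To exercise the pool without a network, here is a runnable sketch of the same pattern with a stubbed `process_listing`. The stub and the `concurrency:` keyword argument are demo assumptions, not production code:

```ruby
# Stand-in for fetch + parse; real work would do HTTP here
def process_listing(listing)
  { id: listing[:id], status: "ok" }
end

def process_listings(listings, concurrency: 4)
  queue = Queue.new
  listings.each { |listing| queue << listing }

  results = []
  mutex = Mutex.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        listing = queue.pop(true) rescue break # empty queue ends the worker
        result = process_listing(listing)
        mutex.synchronize { results << result }
      end
    end
  end

  workers.each(&:join)
  results
end

results = process_listings((1..100).map { |i| { id: i } })
results.size # => 100
```

Completion order is nondeterministic, but every item comes back exactly once; sort the results afterward if order matters.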
Rate Limiting LLM Calls
LLM APIs have stricter rate limits than web servers. Use a semaphore:
```ruby
class Semaphore
  def initialize(permits)
    @permits = permits
    @mutex = Mutex.new
    @condition = ConditionVariable.new
  end

  def synchronize
    acquire
    yield
  ensure
    release
  end

  private

  def acquire
    @mutex.synchronize do
      # ConditionVariable#wait releases the mutex while blocked and
      # reacquires it before returning, so the permit check stays safe
      @condition.wait(@mutex) until @permits > 0
      @permits -= 1
    end
  end

  def release
    @mutex.synchronize do
      @permits += 1
      @condition.signal
    end
  end
end

# Usage
@openai_semaphore = Semaphore.new(2) # Max 2 concurrent LLM calls

def parse_with_llm(text)
  @openai_semaphore.synchronize do
    LLM::Provider::OpenAI.new(messages: [...]).call
  end
end
```
HTTP fetching runs with 8 concurrent workers; LLM calls are capped at 2. The semaphore absorbs the difference and prevents rate-limit errors.
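To convince yourself the semaphore actually caps concurrency, run more threads than permits and record the high-water mark of simultaneous holders. The `Semaphore` class is repeated here so the snippet runs on its own:

```ruby
# Semaphore as defined above, repeated so this check is standalone
class Semaphore
  def initialize(permits)
    @permits = permits
    @mutex = Mutex.new
    @condition = ConditionVariable.new
  end

  def synchronize
    acquire
    yield
  ensure
    release
  end

  private

  def acquire
    @mutex.synchronize do
      @condition.wait(@mutex) until @permits > 0
      @permits -= 1
    end
  end

  def release
    @mutex.synchronize do
      @permits += 1
      @condition.signal
    end
  end
end

semaphore = Semaphore.new(2)
active = 0
peak = 0
counter_lock = Mutex.new

# Eight threads contend for two permits
threads = Array.new(8) do
  Thread.new do
    semaphore.synchronize do
      counter_lock.synchronize { active += 1; peak = [peak, active].max }
      sleep 0.01 # simulate an LLM call
      counter_lock.synchronize { active -= 1 }
    end
  end
end
threads.each(&:join)

peak # never exceeds 2
```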
Checkpoint and Resume
Long scrapes fail: network errors, rate limits, process crashes. Checkpointing lets you resume instead of starting over:
```ruby
require "csv"
require "set"
require "fileutils"

class CheckpointWriter
  def initialize(path)
    @path = path
    @mutex = Mutex.new
  end

  def append(data)
    @mutex.synchronize do
      CSV.open(@path, "a") do |csv|
        csv << data.values
      end
    end
  end

  def processed_ids
    return Set.new unless File.exist?(@path)
    Set.new(CSV.read(@path).map { |row| row[0] })
  end

  def reset!
    FileUtils.rm_f(@path)
  end
end
```
On startup, load already-processed IDs:
```ruby
def run
  writer = CheckpointWriter.new(resume_path)
  processed = writer.processed_ids
  remaining = all_listings.reject { |l| processed.include?(l[:id]) }

  puts "Resuming: #{processed.size} done, #{remaining.size} remaining"

  remaining.each do |listing|
    result = process_listing(listing)
    writer.append(result)
    processed << listing[:id]
  end
end
```
If the process dies at item 500 of 2000, restart and it picks up at 501.
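The checkpoint cycle can be exercised end to end with a temporary file. `CheckpointWriter` is repeated here so the sketch runs standalone; the listing IDs are made up:

```ruby
require "csv"
require "set"
require "fileutils"
require "tmpdir"

# CheckpointWriter as defined above, repeated for a standalone run
class CheckpointWriter
  def initialize(path)
    @path = path
    @mutex = Mutex.new
  end

  def append(data)
    @mutex.synchronize do
      CSV.open(@path, "a") { |csv| csv << data.values }
    end
  end

  def processed_ids
    return Set.new unless File.exist?(@path)
    Set.new(CSV.read(@path).map { |row| row[0] })
  end

  def reset!
    FileUtils.rm_f(@path)
  end
end

path = File.join(Dir.mktmpdir, "resume.csv")
writer = CheckpointWriter.new(path)

# First run: three of five items complete, then the process "crashes"
%w[a b c].each { |id| writer.append(id: id, status: "ok") }

# Second run: resume skips whatever is already on disk
all_ids = %w[a b c d e]
remaining = all_ids.reject { |id| writer.processed_ids.include?(id) }
remaining # => ["d", "e"]
```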
Validating Checkpoints
Stale checkpoints cause problems. If the source data changed, the checkpoint may reference items that no longer exist:
```ruby
def prepare_checkpoint!(writer, current_ids)
  processed = writer.processed_ids
  stale = processed - current_ids

  if stale.any?
    puts "Checkpoint stale (#{stale.size} items removed). Starting fresh."
    writer.reset!
    return Set.new
  end

  processed
end
```
Retry Logic
Transient failures need retries. Permanent failures need graceful degradation:
```ruby
def fetch_with_retry(url, attempts: 3)
  attempts.times do |i|
    response = http_client.get(url)
    return response.body if response.success?

    if response.status == 429 # Rate limited: honor Retry-After, else back off exponentially
      sleep(retry_after(response) || 2**i)
      next
    end

    # Other 4xx responses are permanent; retrying won't help
    break if response.status >= 400 && response.status < 500

    sleep(2**i) # Server error (5xx): back off before retrying
  end
  nil # Return nil on failure, let the caller handle it
end
```
```ruby
def process_listing(listing)
  html = fetch_with_retry(listing[:url])
  return empty_result(listing, "fetch_failed") unless html

  parse_html(html, listing)
rescue StandardError => e
  log_error(listing, e)
  empty_result(listing, "parse_error")
end
```
Failed items get empty results with status flags. You can identify and retry them later.
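Picking out the rows to re-run is then a one-line filter. A sketch with made-up result rows, assuming each row carries `:id` and `:status` as above:

```ruby
# Results as produced by process_listing: successes plus flagged failures
results = [
  { id: 1, status: "ok" },
  { id: 2, status: "fetch_failed" },
  { id: 3, status: "parse_error" },
]

# Anything not "ok" is a candidate for a later retry pass
retryable = results.reject { |r| r[:status] == "ok" }
retryable.map { |r| r[:id] } # => [2, 3]
```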
Progress Reporting
Long-running tasks need visibility:
```ruby
class ProgressReporter
  def initialize
    @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @last_update = @start_time - 1 # ensure the first update is not throttled away
  end

  def update(message)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    return if now - @last_update < 0.5 # Throttle terminal writes

    elapsed = (now - @start_time).round
    print "\r#{message} (#{format_duration(elapsed)})"
    @last_update = now
  end

  def finish(message)
    elapsed = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - @start_time).round
    puts "\n#{message} in #{format_duration(elapsed)}"
  end

  private

  def format_duration(seconds)
    mins, secs = seconds.divmod(60)
    mins > 0 ? "#{mins}m #{secs}s" : "#{secs}s"
  end
end
```
Putting It Together
```ruby
class ProductionScraper
  def initialize(concurrency: 8, llm_concurrency: 2)
    @http_concurrency = concurrency
    @llm_semaphore = Semaphore.new(llm_concurrency)
    @reporter = ProgressReporter.new
  end

  def run
    @reporter.update("Collecting listings...")
    all_listings = collect_all_listings

    writer = CheckpointWriter.new(resume_path)
    processed = prepare_checkpoint!(writer, all_listings.map { |l| l[:id] }.to_set)
    remaining = all_listings.reject { |l| processed.include?(l[:id]) }

    @reporter.update("Processing #{remaining.size} listings...")
    process_concurrently(remaining, writer, processed)

    finalize!(writer)
    @reporter.finish("Completed #{processed.size} listings")
  end
end
```
The Checklist
Production scrapers need:
- [ ] Concurrent HTTP fetching with configurable worker count
- [ ] Semaphore-limited LLM calls
- [ ] Checkpoint file for resume after failure
- [ ] Checkpoint validation against current source data
- [ ] Retry logic with exponential backoff
- [ ] Progress reporting with elapsed time
- [ ] Graceful degradation for failed items
- [ ] Final output separate from checkpoint file
Build these patterns once. Reuse them across every scraping project.