Development scrapers work differently from production scrapers. In development, you scrape 10 pages, verify the output, and iterate. In production, you scrape 2,000 pages, encounter rate limits, timeouts, and malformed HTML — and need to recover gracefully.
This article covers the patterns that make scrapers production-ready.
The Architecture
```
┌────────────────┐     ┌──────────────┐     ┌──────────────┐
│   Pagination   │  →  │    Worker    │  →  │  Checkpoint  │
│   Collector    │     │     Pool     │     │    Writer    │
└────────────────┘     └──────────────┘     └──────────────┘
        ↓                      ↓                    ↓
┌────────────────┐     ┌──────────────┐     ┌──────────────┐
│  All Listings  │     │  Semaphore   │     │  Resume CSV  │
│  (in memory)   │     │ (rate limit) │     │  (on disk)   │
└────────────────┘     └──────────────┘     └──────────────┘
```
Three concerns, separated:
1. Collection — Gather all URLs to process
2. Processing — Concurrent workers with rate limiting
3. Persistence — Checkpoint progress for resume
Concurrent HTTP Fetching
Sequential fetching is too slow. Concurrent fetching with a worker pool:
```ruby
def process_listings(listings, concurrency: 8)
  queue = Queue.new
  listings.each { |listing| queue << listing }

  results = []
  mutex = Mutex.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        # Non-blocking pop raises ThreadError on an empty queue;
        # the rescue turns that into a clean worker exit
        listing = queue.pop(true) rescue break
        result = process_listing(listing)
        mutex.synchronize { results << result }
      end
    end
  end

  workers.each(&:join)
  results
end
```
The worker count (`concurrency`) balances speed against target server capacity. Start with 4-8 workers and increase only if the target handles it.
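To exercise the pool without a network, here is a runnable sketch of the same pattern with a stubbed `process_listing`. The stub and the `concurrency:` keyword argument are demo assumptions, not production code:

```ruby
# Stand-in for fetch + parse; real work would do HTTP here
def process_listing(listing)
  { id: listing[:id], status: "ok" }
end

def process_listings(listings, concurrency: 4)
  queue = Queue.new
  listings.each { |listing| queue << listing }

  results = []
  mutex = Mutex.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        listing = queue.pop(true) rescue break # empty queue ends the worker
        result = process_listing(listing)
        mutex.synchronize { results << result }
      end
    end
  end

  workers.each(&:join)
  results
end

results = process_listings((1..100).map { |i| { id: i } })
results.size # => 100
```

Completion order is nondeterministic, but every item comes back exactly once; sort the results afterward if order matters.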
Rate Limiting LLM Calls
LLM APIs have stricter rate limits than web servers. Use a semaphore:
```ruby
class Semaphore
  def initialize(permits)
    @permits = permits
    @mutex = Mutex.new
    @condition = ConditionVariable.new
  end

  def synchronize
    acquire
    yield
  ensure
    release
  end

  private

  def acquire
    @mutex.synchronize do
      # ConditionVariable#wait releases the mutex while blocked and
      # reacquires it before returning, so the permit check stays safe
      @condition.wait(@mutex) until @permits > 0
      @permits -= 1
    end
  end

  def release
    @mutex.synchronize do
      @permits += 1
      @condition.signal
    end
  end
end

# Usage
@openai_semaphore = Semaphore.new(2) # Max 2 concurrent LLM calls

def parse_with_llm(text)
  @openai_semaphore.synchronize do
    LLM::Provider::OpenAI.new(messages: [...]).call
  end
end
```
HTTP fetching runs with 8 concurrent workers; LLM calls are capped at 2. The semaphore absorbs the difference and prevents rate-limit errors.
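To convince yourself the semaphore actually caps concurrency, run more threads than permits and record the high-water mark of simultaneous holders. The `Semaphore` class is repeated here so the snippet runs on its own:

```ruby
# Semaphore as defined above, repeated so this check is standalone
class Semaphore
  def initialize(permits)
    @permits = permits
    @mutex = Mutex.new
    @condition = ConditionVariable.new
  end

  def synchronize
    acquire
    yield
  ensure
    release
  end

  private

  def acquire
    @mutex.synchronize do
      @condition.wait(@mutex) until @permits > 0
      @permits -= 1
    end
  end

  def release
    @mutex.synchronize do
      @permits += 1
      @condition.signal
    end
  end
end

semaphore = Semaphore.new(2)
active = 0
peak = 0
counter_lock = Mutex.new

# Eight threads contend for two permits
threads = Array.new(8) do
  Thread.new do
    semaphore.synchronize do
      counter_lock.synchronize { active += 1; peak = [peak, active].max }
      sleep 0.01 # simulate an LLM call
      counter_lock.synchronize { active -= 1 }
    end
  end
end
threads.each(&:join)

peak # never exceeds 2
```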
Checkpoint and Resume
Long scrapes fail: network errors, rate limits, process crashes. Checkpointing lets you resume instead of starting over:
```ruby
require "csv"
require "set"
require "fileutils"

class CheckpointWriter
  def initialize(path)
    @path = path
    @mutex = Mutex.new
  end

  def append(data)
    @mutex.synchronize do
      CSV.open(@path, "a") do |csv|
        csv << data.values
      end
    end
  end

  def processed_ids
    return Set.new unless File.exist?(@path)
    Set.new(CSV.read(@path).map { |row| row[0] })
  end

  def reset!
    FileUtils.rm_f(@path)
  end
end
```
On startup, load already-processed IDs:
```ruby
def run
  writer = CheckpointWriter.new(resume_path)
  processed = writer.processed_ids
  remaining = all_listings.reject { |l| processed.include?(l[:id]) }

  puts "Resuming: #{processed.size} done, #{remaining.size} remaining"

  remaining.each do |listing|
    result = process_listing(listing)
    writer.append(result)
    processed << listing[:id]
  end
end
```
If the process dies at item 500 of 2000, restart and it picks up at 501.
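The checkpoint cycle can be exercised end to end with a temporary file. `CheckpointWriter` is repeated here so the sketch runs standalone; the listing IDs are made up:

```ruby
require "csv"
require "set"
require "fileutils"
require "tmpdir"

# CheckpointWriter as defined above, repeated for a standalone run
class CheckpointWriter
  def initialize(path)
    @path = path
    @mutex = Mutex.new
  end

  def append(data)
    @mutex.synchronize do
      CSV.open(@path, "a") { |csv| csv << data.values }
    end
  end

  def processed_ids
    return Set.new unless File.exist?(@path)
    Set.new(CSV.read(@path).map { |row| row[0] })
  end

  def reset!
    FileUtils.rm_f(@path)
  end
end

path = File.join(Dir.mktmpdir, "resume.csv")
writer = CheckpointWriter.new(path)

# First run: three of five items complete, then the process "crashes"
%w[a b c].each { |id| writer.append(id: id, status: "ok") }

# Second run: resume skips whatever is already on disk
all_ids = %w[a b c d e]
remaining = all_ids.reject { |id| writer.processed_ids.include?(id) }
remaining # => ["d", "e"]
```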
Validating Checkpoints
Stale checkpoints cause problems. If the source data changed, the checkpoint may reference items that no longer exist:
```ruby
def prepare_checkpoint!(writer, current_ids)
  processed = writer.processed_ids
  stale = processed - current_ids

  if stale.any?
    puts "Checkpoint stale (#{stale.size} items removed). Starting fresh."
    writer.reset!
    return Set.new
  end

  processed
end
```
Retry Logic
Transient failures need retries. Permanent failures need graceful degradation:
```ruby
def fetch_with_retry(url, attempts: 3)
  attempts.times do |i|
    response = http_client.get(url)
    return response.body if response.success?

    if response.status == 429 # Rate limited: honor Retry-After, else back off exponentially
      sleep(retry_after(response) || 2**i)
      next
    end

    # Other 4xx responses are permanent; retrying won't help
    break if response.status >= 400 && response.status < 500

    sleep(2**i) # Server error (5xx): back off before retrying
  end
  nil # Return nil on failure, let the caller handle it
end
```
```ruby
def process_listing(listing)
  html = fetch_with_retry(listing[:url])
  return empty_result(listing, "fetch_failed") unless html

  parse_html(html, listing)
rescue StandardError => e
  log_error(listing, e)
  empty_result(listing, "parse_error")
end
```
Failed items get empty results with status flags. You can identify and retry them later.
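Picking out the rows to re-run is then a one-line filter. A sketch with made-up result rows, assuming each row carries `:id` and `:status` as above:

```ruby
# Results as produced by process_listing: successes plus flagged failures
results = [
  { id: 1, status: "ok" },
  { id: 2, status: "fetch_failed" },
  { id: 3, status: "parse_error" },
]

# Anything not "ok" is a candidate for a later retry pass
retryable = results.reject { |r| r[:status] == "ok" }
retryable.map { |r| r[:id] } # => [2, 3]
```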
Progress Reporting
Long-running tasks need visibility:
```ruby
class ProgressReporter
  def initialize
    @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @last_update = @start_time - 1 # ensure the first update is not throttled away
  end

  def update(message)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    return if now - @last_update < 0.5 # Throttle terminal writes

    elapsed = (now - @start_time).round
    print "\r#{message} (#{format_duration(elapsed)})"
    @last_update = now
  end

  def finish(message)
    elapsed = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - @start_time).round
    puts "\n#{message} in #{format_duration(elapsed)}"
  end

  private

  def format_duration(seconds)
    mins, secs = seconds.divmod(60)
    mins > 0 ? "#{mins}m #{secs}s" : "#{secs}s"
  end
end
```
Putting It Together
```ruby
class ProductionScraper
  def initialize(concurrency: 8, llm_concurrency: 2)
    @http_concurrency = concurrency
    @llm_semaphore = Semaphore.new(llm_concurrency)
    @reporter = ProgressReporter.new
  end

  def run
    @reporter.update("Collecting listings...")
    all_listings = collect_all_listings

    writer = CheckpointWriter.new(resume_path)
    processed = prepare_checkpoint!(writer, all_listings.map { |l| l[:id] }.to_set)
    remaining = all_listings.reject { |l| processed.include?(l[:id]) }

    @reporter.update("Processing #{remaining.size} listings...")
    process_concurrently(remaining, writer, processed)

    finalize!(writer)
    @reporter.finish("Completed #{processed.size} listings")
  end
end
```
The Checklist
Production scrapers need:
- [ ] Concurrent HTTP fetching with configurable worker count
- [ ] Semaphore-limited LLM calls
- [ ] Checkpoint file for resume after failure
- [ ] Checkpoint validation against current source data
- [ ] Retry logic with exponential backoff
- [ ] Progress reporting with elapsed time
- [ ] Graceful degradation for failed items
- [ ] Final output separate from checkpoint file
Build these patterns once. Reuse them across every scraping project.