Structured Data Extraction with LLM JSON Schemas

This is part 2 of the AI Recruiting Pipeline Epic.

The traditional approach to LLM extraction is prompting for prose and parsing it:

Extract the shift from this job description. Return "1st", "2nd", "3rd", or "Varies".

This fails in production because LLMs don't follow instructions precisely. They add context, hedging, or explanation:

"Based on the description mentioning 'day shift hours', this appears to be a 1st shift position."

Now you're parsing natural language output to extract structured data — exactly what you were trying to avoid.

JSON Schemas Fix This

OpenAI's structured outputs enforce a schema on the response. The model can only return valid JSON matching your specification:

def response_format
  {
    type: "json_schema",
    json_schema: {
      name: "job_description_schema",
      strict: true,
      schema: {
        type: "object",
        properties: {
          shift: {
            type: "string",
            enum: ["1st", "2nd", "3rd", "Varies", "Unknown"]
          },
          hours_per_week: {
            type: ["integer", "null"]
          },
          benefit_eligible: {
            type: "boolean"
          }
        },
        required: ["shift", "hours_per_week", "benefit_eligible"],
        additionalProperties: false
      }
    }
  }
end

With strict mode, the response is guaranteed to be valid JSON matching your specification (the main exceptions are explicit refusals and responses truncated by token limits). Parse it directly into your domain:

result = JSON.parse(provider.content)
job_listing.update!(
  shift: result["shift"],
  hours_per_week: result["hours_per_week"],
  benefit_eligible: result["benefit_eligible"]
)

Designing the Schema

Schema design determines extraction quality. Key principles:

1. Use Enums for Categorical Fields

{
  type: "string",
  enum: ["Full-time", "Part-time", "PRN", "None", "Unknown"]
}

The model must pick from your options. No creative interpretations.
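Even so, a cheap guard on the application side catches drift when the schema and the domain model fall out of sync. A minimal sketch (the constant and method names here are illustrative, not from the pipeline):

```ruby
# Application-side mirror of the schema's enum for the shift field.
SHIFT_VALUES = ["1st", "2nd", "3rd", "Varies", "Unknown"].freeze

# Raises loudly if an extracted value ever falls outside the known set,
# e.g. after a schema edit that was not propagated to the domain model.
def validate_shift!(value)
  raise ArgumentError, "unexpected shift value: #{value.inspect}" unless SHIFT_VALUES.include?(value)
  value
end
```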

2. Allow Nulls for Optional Information

{
  type: ["integer", "null"],
  description: "Hours per week if explicitly stated, null otherwise"
}

This prevents the model from inventing data when the source is ambiguous.

3. Use Arrays for Multi-valued Fields

{
  required_credentials: {
    type: "array",
    items: {
      type: "object",
      properties: {
        name: { type: "string" },
        evidence: { type: ["string", "null"] }
      },
      required: ["name", "evidence"],
      additionalProperties: false
    }
  }
}

The evidence field captures why the model made the extraction, which is useful for debugging. Note that strict mode requires required and additionalProperties: false on every nested object, not just the top level.
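In practice the evidence field turns a misclassification report into a one-line lookup. A sketch of how extracted credentials might be reviewed (the payload is a made-up example, not real pipeline output):

```ruby
require "json"

# Example extraction payload in the shape of the schema above.
payload = <<~JSON
  {"required_credentials": [
    {"name": "RN", "evidence": "Active RN license required"},
    {"name": "BLS", "evidence": null}
  ]}
JSON

# Pair each credential with the text the model cited for it, flagging
# extractions that arrived without supporting evidence.
report = JSON.parse(payload)["required_credentials"].map do |cred|
  evidence = cred["evidence"] || "NO EVIDENCE - review manually"
  "#{cred['name']}: #{evidence}"
end
puts report
```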

4. Descriptions Guide Interpretation

{
  education_requirements: {
    type: "array",
    items: {
      type: "object",
      properties: {
        name: {
          type: "string",
          enum: ["HS", "Associate", "BA/BS", "MA/MS", "PhD", "None"]
        }
      },
      required: ["name"],
      additionalProperties: false
    },
    description: "Required education levels explicitly stated in the posting. Leave empty if not specified."
  }
}

The description tells the model when to include entries versus leaving the array empty.

The System Prompt

The system prompt sets extraction rules:

def system_prompt
  <<~PROMPT
    You extract structured data from healthcare job descriptions.

    Rules:
    - Return only facts stated in the posting. Do not infer or guess.
    - Extract raw credential mentions exactly as written (e.g., "RN", "BLS", "CPR").
    - Use null for fields without explicit information.
    - Leave arrays empty when the posting does not provide relevant data.
  PROMPT
end

The rules prevent common failure modes:
- No inference — Models love to "help" by guessing
- Preserve original text — Don't normalize until you have explicit mappings
- Explicit nulls — Make absence of data explicit
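When normalization does happen, it can live in an explicit, reviewable table instead of inside the model. A sketch with a hypothetical alias map (these mappings are illustrative, not the pipeline's actual table):

```ruby
# Explicit mapping from raw credential mentions (exactly as extracted)
# to canonical names; unmapped strings pass through untouched for review.
CREDENTIAL_ALIASES = {
  "CPR"  => "BLS",
  "BCLS" => "BLS",
  "R.N." => "RN"
}.freeze

def normalize_credential(raw)
  CREDENTIAL_ALIASES.fetch(raw, raw)
end
```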

Handling Extraction Failures

Even with schemas, extraction can fail:

def call_with_result
  provider = build_provider
  provider.call

  return rate_limited_result if provider.rate_limited?
  return error_result(provider.error) if provider.error

  content = provider.content
  return empty_result("empty_response") if content.blank?

  Result.new(
    data: JSON.parse(content),
    status: "success"
  )
rescue JSON::ParserError => e
  Result.new(
    data: empty_payload,
    status: "invalid_json",
    error_message: e.message
  )
end

The Result struct lets callers distinguish between successful extraction and various failure modes.
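The struct itself can be as small as the following sketch (field names mirror the code above; retry_after supports the rate-limit handling):

```ruby
# Minimal result object: extracted data plus a status the caller can
# branch on, with optional error and retry metadata.
Result = Struct.new(:data, :status, :error_message, :retry_after, keyword_init: true) do
  def success?
    status == "success"
  end
end
```

With keyword_init: true, call sites like Result.new(data: {}, status: "invalid_json", error_message: e.message) read exactly as written above.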

Retry Logic

Rate limits and transient failures need retries:

def call_with_retry(attempts: 3)
  result = nil
  attempts.times do |i|
    result = call_with_result
    return result if result.status == "success"

    if result.status == "rate_limited"
      sleep(result.retry_after || [i + 1, 5].min)
      next
    end

    return result if i == attempts - 1
  end
  result # exhausted attempts while rate limited: return the last result
end

Each retry waits longer. Rate limits respect the API's requested delay.

Production Patterns

After running this system on ~10,000 job descriptions:

What works:
- Strict schemas eliminate parsing code entirely
- Evidence fields help debug misclassifications
- System prompt rules reduce hallucination significantly

What to watch:
- Token costs scale with input length — truncate verbose descriptions
- Rate limits hit hard with concurrent scraping — use semaphores
- Model updates can change behavior — pin versions and test regularly
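Ruby's standard library has no counting semaphore, but a SizedQueue of tokens behaves like one and caps the number of in-flight API calls. A sketch, with the thread body standing in for the real extraction call:

```ruby
# A SizedQueue pre-filled with N tokens acts as a counting semaphore:
# pop to acquire (blocks when all tokens are taken), push to release.
MAX_CONCURRENT = 3
semaphore = SizedQueue.new(MAX_CONCURRENT)
MAX_CONCURRENT.times { semaphore.push(:token) }

results = Queue.new
threads = 10.times.map do |i|
  Thread.new do
    semaphore.pop                # acquire: blocks once 3 calls are in flight
    begin
      results.push("job #{i}")   # stand-in for call_with_retry
    ensure
      semaphore.push(:token)     # release, even if the call raised
    end
  end
end
threads.each(&:join)
```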

The structured output approach turns LLM extraction from art into engineering. You define the contract, the model fills it, and you write zero parsing code.
