LLM Provider Abstraction: Design for Swappability

This is part 1 of the AI Recruiting Pipeline Epic.

When you first integrate an LLM, the temptation is to call the API directly:

client = OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])
response = client.chat(parameters: { model: "gpt-4o", messages: [...] })

This works until:

  • You need to switch from GPT-4o to Claude for better reasoning
  • OpenAI rate-limits you and you need automatic retry logic
  • You want to test without making real API calls
  • A new model launches and you want to A/B test it

The fix is a provider abstraction that separates what you want from how it happens.

The Base Class

# app/services/llm/provider/base.rb
module LLM
  module Provider
    class Base
      attr_reader :messages, :response_format, :model, :max_tokens, :temperature
      attr_reader :response, :error

      def initialize(messages:, response_format: nil, model: nil, max_tokens: nil, temperature: nil)
        @messages = messages
        @response_format = response_format
        @model = model
        @max_tokens = max_tokens
        @temperature = temperature
      end

      def call
        raise NotImplementedError
      end

      def success?
        response.present? && error.nil?
      end

      def content
        raise NotImplementedError
      end

      def rate_limited?
        false
      end

      def retry_after
        nil
      end
    end
  end
end

The base class defines the interface every provider must implement:

  • call — Make the API request
  • success? — Did it work?
  • content — Extract the response text
  • rate_limited? — Should we wait and retry?
  • retry_after — How long to wait?
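
Any caller can drive that interface without knowing which provider sits behind it. A minimal sketch, using a hypothetical in-memory EchoProvider in place of a real API-backed subclass:

```ruby
# Hypothetical in-memory provider -- stands in for a real subclass so the
# sketch runs without network access or the rest of the app.
class EchoProvider
  attr_reader :response, :error

  def initialize(messages:)
    @messages = messages
  end

  def call
    # A real provider would make an API request here.
    @response = @messages.last[:content].upcase
    self
  end

  def success?      = !response.nil? && error.nil?
  def content       = response
  def rate_limited? = false
  def retry_after   = nil
end

provider = EchoProvider.new(messages: [{ role: "user", content: "hello" }])
provider.call
puts provider.content if provider.success?  # prints HELLO
```

Every call site in the rest of this article depends only on these five methods, which is what makes swapping providers safe.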

The OpenAI Provider

# app/services/llm/provider/openai.rb
module LLM
  module Provider
    class OpenAI < Base
      RATE_LIMIT_PATTERN = /Please try again in ([0-9.]+)(ms|s)/i

      def call
        @response = client.chat(parameters: request_parameters)
        self
      # Faraday::Error already subclasses StandardError, so one rescue covers both
      rescue StandardError => e
        @error = e
        self
      end

      def content
        return nil unless response
        response.dig("choices", 0, "message", "content")&.strip
      end

      def rate_limited?
        return false unless error
        error_code == "rate_limit_exceeded" || error.is_a?(Faraday::TooManyRequestsError)
      end

      def retry_after
        return nil unless error
        retry_after_from_headers || retry_after_from_message
      end

      private

      # Referenced by rate_limited? -- pulls the code out of the Faraday error
      # body when one is available (shape: { "error" => { "code" => ... } }).
      def error_code
        return nil unless error.respond_to?(:response)
        error.response&.dig(:body, "error", "code")
      end

      def client
        @client ||= ::OpenAI::Client.new(
          access_token: Rails.application.credentials.dig(:open_ai, :api_key),
          request_timeout: 120
        )
      end

      def request_parameters
        params = { model:, messages: }
        params[:response_format] = response_format if response_format
        params[:max_tokens] = max_tokens if max_tokens
        params[:temperature] = temperature if temperature
        params
      end

      def retry_after_from_headers
        return nil unless error.respond_to?(:response)
        error.response&.dig(:headers, "retry-after")&.to_f
      end

      def retry_after_from_message
        match = error.message.to_s.match(RATE_LIMIT_PATTERN)
        return nil unless match
        value, unit = match[1].to_f, match[2].downcase
        unit == "ms" ? value / 1000.0 : value
      end
    end
  end
end

This encapsulates all OpenAI-specific behavior: response parsing, error handling, rate limit detection.
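
Because rate_limited? and retry_after live on the interface, the retry loop can be written once for all providers. A sketch, assuming a hypothetical call_with_retry helper; FlakyProvider simulates one rate-limited response before succeeding:

```ruby
# Hypothetical helper: re-run a provider while it reports rate limiting,
# sleeping for whatever delay the provider suggests.
def call_with_retry(provider_class, attempts: 3, **args)
  provider = nil
  attempts.times do
    provider = provider_class.new(**args).call
    break unless provider.rate_limited?
    sleep(provider.retry_after || 1.0)
  end
  provider
end

# Stub that is rate-limited on the first call, then succeeds.
class FlakyProvider
  @@calls = 0
  attr_reader :response, :error

  def initialize(messages:)
    @messages = messages
  end

  def call
    @@calls += 1
    if @@calls == 1
      @error = RuntimeError.new("rate limited")
    else
      @response = "ok"
    end
    self
  end

  def rate_limited? = !error.nil?
  def retry_after   = 0.01
  def success?      = error.nil? && !response.nil?
  def content       = response
end

puts call_with_retry(FlakyProvider, messages: []).content  # prints ok
```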

Using the Abstraction

Services that need LLM calls accept the provider class as a dependency:

class ParseJobDescription
  def initialize(text, provider_class: LLM::Provider::OpenAI)
    @text = text
    @provider_class = provider_class
  end

  def call
    provider = @provider_class.new(
      messages: build_messages,
      response_format: json_schema,
      model: "gpt-4o",
      temperature: 0
    )
    provider.call

    return fallback_result unless provider.success?
    JSON.parse(provider.content)
  end
end

Testing becomes trivial:

class FakeProvider < LLM::Provider::Base
  def initialize(response_content:, **options)
    super(**options)
    @response_content = response_content
  end

  def call
    @response = { "choices" => [{ "message" => { "content" => @response_content } }] }
    self
  end

  def content
    @response_content
  end
end

RSpec.describe ParseJobDescription do
  it "parses the response" do
    fake_provider = Class.new(FakeProvider) do
      def initialize(**options)
        super(response_content: '{"shift": "1st"}', **options)
      end
    end

    result = described_class.new("Day shift position", provider_class: fake_provider).call
    expect(result["shift"]).to eq "1st"
  end
end
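
The failure path deserves the same treatment. A hypothetical always-failing fake (FailingProvider is illustrative, not part of the code above) exercises the `unless provider.success?` branch:

```ruby
# Hypothetical fake whose call always errors -- use it to assert on
# the service's fallback behavior.
class FailingProvider
  attr_reader :response, :error

  def initialize(**)
    # Accepts and ignores the usual keyword arguments (messages:, model:, ...).
  end

  def call
    @error = StandardError.new("simulated outage")
    self
  end

  def success? = false
  def content  = nil
end

provider = FailingProvider.new(messages: []).call
puts provider.success?  # prints false
```

Injected via ParseJobDescription.new(text, provider_class: FailingProvider), a spec can then assert that the result equals the service's fallback.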

Adding a Second Provider

When Claude's reasoning is better for your use case:

module LLM
  module Provider
    class Claude < Base
      def call
        @response = client.messages.create(
          model:,
          max_tokens:,
          messages:
        )
        self
      rescue StandardError => e
        @error = e
        self
      end

      def content
        response&.content&.first&.text
      end

      private

      def client
        @client ||= Anthropic::Client.new(
          api_key: Rails.application.credentials.dig(:anthropic, :api_key)
        )
      end
    end
  end
end

Switching providers is a one-line change:

ParseJobDescription.new(text, provider_class: LLM::Provider::Claude)

Or configure per-environment:

# config/initializers/llm.rb
Rails.configuration.default_llm_provider =
  ENV["LLM_PROVIDER"] == "claude" ? LLM::Provider::Claude : LLM::Provider::OpenAI
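
Services can then default to the configured class instead of hard-coding one (assuming the initializer above has run):

class ParseJobDescription
  def initialize(text, provider_class: Rails.configuration.default_llm_provider)
    @text = text
    @provider_class = provider_class
  end
end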

Why This Pattern Matters

  1. Testability — Fake providers run in milliseconds with no API costs
  2. Reliability — Rate limit handling is implemented once, correctly
  3. Flexibility — Switch models for A/B testing or cost optimization
  4. Observability — Add logging, metrics, or tracing in one place
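
Point 4 in action: a hypothetical decorator factory that subclasses any provider to add timing and logging in one place (instrumented and the log format are illustrative; StubProvider stands in so the sketch runs standalone):

```ruby
require "logger"

# Hypothetical decorator factory: returns a subclass of provider_class whose
# #call also logs duration and outcome -- observability lives in one place.
def instrumented(provider_class, logger: Logger.new($stdout))
  Class.new(provider_class) do
    define_method(:call) do
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      result = super()
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      logger.info("provider=#{provider_class.name} success=#{success?} seconds=#{elapsed.round(3)}")
      result
    end
  end
end

# Tiny stand-in provider so the sketch is self-contained.
class StubProvider
  attr_reader :response, :error

  def initialize(messages:)
    @messages = messages
  end

  def call
    @response = "done"
    self
  end

  def success? = error.nil? && !response.nil?
  def content  = response
end

provider = instrumented(StubProvider).new(messages: []).call
puts provider.content  # prints done
```

It drops in at the call site: ParseJobDescription.new(text, provider_class: instrumented(LLM::Provider::OpenAI)).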

The provider abstraction is table stakes for production AI features. Build it before your first LLM call, not after you're rate-limited in production.
