Streaming Responses

RubyLLM provides streaming capabilities that let you receive AI responses in real time as they are generated, rather than waiting for the complete response. This creates a more interactive experience and is especially useful for long responses or applications with real-time UI updates.

Basic Streaming

To stream responses, simply provide a block to the ask method:

chat = RubyLLM.chat

chat.ask "Write a short story about a programmer" do |chunk|
  # Each chunk contains a portion of the response
  print chunk.content
end

Understanding Chunks

Each streamed chunk is an instance of RubyLLM::Chunk (which inherits from RubyLLM::Message) and provides:

chunk.content       # The text fragment in this chunk
chunk.role          # Always :assistant for streamed chunks
chunk.model_id      # The model generating the response
chunk.input_tokens  # Input token count (usually only in the final chunk)
chunk.output_tokens # Output token count (usually only in the final chunk)
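
For example, you can watch this metadata arrive alongside the text. A minimal sketch using only the attributes listed above (the prompt is just an illustration):

chat = RubyLLM.chat

chat.ask("Explain recursion in one paragraph") do |chunk|
  print chunk.content

  # Token counts are usually nil until the final chunk
  if chunk.output_tokens
    puts "\n[#{chunk.model_id}: #{chunk.output_tokens} output tokens]"
  end
end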

Accumulated Response

Even when streaming, RubyLLM still returns the complete final message:

final_message = chat.ask "Write a poem" do |chunk|
  print chunk.content
end

# You can use the final message as normal
puts "\nFinal message length: #{final_message.content.length}"
puts "Token usage: #{final_message.output_tokens} tokens"

Web Application Integration

Rails with ActionCable

# In your controller
def ask
  @chat = Chat.find(params[:id])

  @chat.ask(params[:message]) do |chunk|
    ActionCable.server.broadcast(
      "chat_#{@chat.id}",
      { content: chunk.content }
    )
  end

  head :ok
end
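
For the broadcast above to reach the browser, a channel needs to stream from the same name. A minimal sketch (this ChatChannel class isn't part of RubyLLM; it's a standard ActionCable channel shown here as an assumption):

# In your channel (app/channels/chat_channel.rb)
class ChatChannel < ApplicationCable::Channel
  def subscribed
    # Must match the stream name used in the controller broadcast
    stream_from "chat_#{params[:id]}"
  end
end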

# In your JavaScript
const channel = consumer.subscriptions.create({ channel: "ChatChannel", id: chatId }, {
  received(data) {
    // Append incoming chunk to the display
    document.getElementById('response').innerHTML += data.content;
  }
});

Rails with Turbo Streams

class ChatJob < ApplicationJob
  queue_as :default

  def perform(chat_id, message)
    chat = Chat.find(chat_id)

    chat.ask(message) do |chunk|
      # Append each chunk to the "response" element as it arrives
      Turbo::StreamsChannel.broadcast_append_to(
        "chat_#{chat.id}",
        target: "response",
        html: chunk.content
      )
    end
  end
end
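
On the page, subscribe to the same stream and provide the target element. A minimal view sketch (the file path and markup are assumptions; only the stream name and the "response" id need to match the job above):

<%# In your view (e.g. app/views/chats/show.html.erb) %>
<%= turbo_stream_from "chat_#{@chat.id}" %>
<div id="response"></div>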

Sinatra with Server-Sent Events (SSE)

get '/chat/:id/ask' do
  content_type 'text/event-stream'

  chat = Chat.find(params[:id])

  stream do |out|
    chat.ask(params[:message]) do |chunk|
      # Send each chunk as an SSE event
      out << "data: #{chunk.content}\n\n"
    end

    # Send completion signal
    out << "event: complete\ndata: {}\n\n"
  end
end

Error Handling

Errors that occur during streaming need special handling:

begin
  chat.ask("Tell me a story") do |chunk|
    print chunk.content
  end
rescue RubyLLM::Error => e
  puts "\nError during streaming: #{e.message}"
end

Common errors during streaming:

  • ServiceUnavailableError - The AI service is temporarily unavailable
  • RateLimitError - You’ve exceeded your API rate limit
  • BadRequestError - There was a problem with your request parameters
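
For transient failures such as rate limits or temporary outages, a simple retry with backoff is often enough. A minimal sketch, assuming these error classes are namespaced under RubyLLM like RubyLLM::Error above (the attempt count and backoff are arbitrary choices):

attempts = 0

begin
  chat.ask("Tell me a story") do |chunk|
    print chunk.content
  end
rescue RubyLLM::RateLimitError, RubyLLM::ServiceUnavailableError => e
  attempts += 1
  if attempts <= 3
    # Note: retrying restarts the response from the beginning
    sleep(2**attempts)
    retry
  else
    puts "\nGiving up after #{attempts} attempts: #{e.message}"
  end
end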

Provider-Specific Considerations

OpenAI

OpenAI’s streaming implementation provides small, frequent chunks for a smooth experience.

Anthropic

Claude models may return slightly larger chunks with potentially longer pauses between them.

Google Gemini

Gemini streaming is highly responsive but may show slightly different chunking behavior.

Streaming with Tools

When using tools, streaming works a bit differently:

chat.with_tool(Calculator)
   .ask("What's 123 * 456?") do |chunk|
     # Tool call execution isn't streamed
     # You'll receive chunks after tool execution completes
     print chunk.content
   end

The tool call execution introduces a pause in the streaming, as the model waits for the tool response before continuing.
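
The Calculator tool referenced above isn't defined in this example. A minimal sketch, assuming the standard RubyLLM::Tool interface (description, param, and an execute method):

class Calculator < RubyLLM::Tool
  description "Evaluates simple arithmetic expressions"
  param :expression, desc: "An arithmetic expression, e.g. '123 * 456'"

  def execute(expression:)
    # Illustration only: allow digits, whitespace, and basic operators before evaluating
    return "Unsupported expression" unless expression.match?(/\A[\d\s+\-*\/().]+\z/)
    eval(expression).to_s
  end
end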

Performance Considerations

Streaming typically uses the same number of tokens as non-streaming responses but establishes longer-lived connections to the AI provider. Consider these best practices:

  1. Set appropriate timeouts for streaming connections (see the sketch after this list)
  2. Handle network interruptions gracefully
  3. Consider background processing for long-running streams
  4. Implement rate limiting to avoid overwhelming your servers
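
For example, request timeouts and automatic retries can be set globally. A minimal sketch, assuming your RubyLLM version exposes request_timeout and max_retries as configuration options:

RubyLLM.configure do |config|
  # Generous timeout (in seconds) for long streamed responses (assumed option)
  config.request_timeout = 120

  # Retry transient failures a few times before giving up (assumed option)
  config.max_retries = 3
end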

Tracking Token Usage

Token usage information is typically only available in the final chunk or completed message:

total_tokens = 0

chat.ask("Write a detailed explanation of quantum computing") do |chunk|
  print chunk.content

  # Token counts usually arrive only with the final chunk
  if chunk.output_tokens
    total_tokens = chunk.input_tokens.to_i + chunk.output_tokens.to_i
  end
end

puts "\nTotal tokens: #{total_tokens}"

Custom Processing of Streamed Content

You can process streamed content in real-time:

accumulated_text = ""

chat.ask("Write a list of 10 fruits") do |chunk|
  new_content = chunk.content
  accumulated_text += new_content

  # Count fruits as they come in
  if new_content.include?("\n")
    fruit_count = accumulated_text.scan(/\d+\./).count
    print "\rFruits listed: #{fruit_count}/10"
  end
end

Rails Integration

When using RubyLLM’s Rails integration with acts_as_chat, streaming still works seamlessly:

class Chat < ApplicationRecord
  acts_as_chat
end

chat = Chat.create!(model_id: 'gpt-4o-mini')

# Stream responses while persisting the final result
chat.ask("Tell me about Ruby") do |chunk|
  ActionCable.server.broadcast("chat_#{chat.id}", { content: chunk.content })
end

# The complete message is saved in the database
puts chat.messages.last.content

Next Steps

Now that you understand streaming, you might want to explore: