Scale with Async
Scale your AI features without breaking the bank. Efficient concurrency on a single server.
Table of contents
- Why Async for LLMs?
- How RubyLLM Works with Async
- Concurrent Operations
- Background Processing with Async::Job
- Rate Limiting with Semaphores
- Summary
After reading this guide, you will know:
- Why LLM applications benefit dramatically from async Ruby
- How RubyLLM automatically works with async
- How to perform concurrent LLM operations
- How to use async-job for background processing
- How to handle rate limits with semaphores
For a deeper dive into Async, Threads, and why Async Ruby is perfect for LLM applications, including benchmarks and architectural comparisons, check out my blog post: Async Ruby is the Future of AI Apps (And It’s Already Here)
Why Async for LLMs?
LLM operations are unique - they take 5-60 seconds and spend 99% of that time waiting for tokens to stream back. Using traditional thread-based job queues (Sidekiq, GoodJob, SolidQueue) for LLM operations creates a problem:
# With 25 worker threads configured:
class ChatResponseJob < ApplicationJob
def perform(conversation_id, message)
# This occupies 1 of your 25 slots for 30-60 seconds...
response = RubyLLM.chat.ask(message)
# ...even though the thread is 99% idle
end
end
# Your 26th user? They're waiting in line.
Async solves this by using fibers instead of threads (see the sketch after this list):
- Threads: OS-managed, preemptive, heavy (each needs its own database connection)
- Fibers: Userspace, cooperative, lightweight (thousands can share a few connections)
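To make the difference concrete, here is a minimal sketch with no LLM calls at all, assuming Ruby 3.x and async v2 (where plain sleep is fiber-scheduler aware): a thousand fibers, each waiting one second, finish in roughly one second of wall-clock time because every sleep yields to the scheduler instead of blocking a thread.
require 'async'
Async do
  start = Time.now
  # 1,000 concurrent fibers in a single thread
  1_000.times.map { Async { sleep 1 } }.each(&:wait)
  puts "Elapsed: #{(Time.now - start).round(2)}s" # roughly 1s, not 1,000s
end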
How RubyLLM Works with Async
The beautiful part: RubyLLM automatically becomes non-blocking when used in an async context. No configuration needed.
require 'async'
require 'ruby_llm'
# This is all you need for concurrent LLM calls
Async do
10.times.map do
Async do
# RubyLLM automatically becomes non-blocking
# because Net::HTTP knows how to yield to fibers
message = RubyLLM.chat.ask "Explain quantum computing"
puts message.content
end
end.map(&:wait)
end
This works because RubyLLM uses Net::HTTP, which cooperates with Ruby’s fiber scheduler.
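You can see the same cooperation with plain Net::HTTP and no RubyLLM at all. A minimal sketch, with example.com standing in for any HTTP endpoint:
require 'async'
require 'net/http'
Async do
  3.times.map do
    Async do
      # While this request waits on the socket, the other fibers keep running
      Net::HTTP.get(URI("https://example.com/"))
    end
  end.each(&:wait)
end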
Concurrent Operations
Multiple Chat Requests
Process multiple questions concurrently:
require 'async'
require 'ruby_llm'
def process_questions(questions)
Async do
tasks = questions.map do |question|
Async do
response = RubyLLM.chat.ask(question)
{ question: question, answer: response.content }
end
end
# Wait for all tasks and return results
tasks.map(&:wait)
end.result
end
questions = [
"What is Ruby?",
"Explain metaprogramming",
"What are symbols?"
]
results = process_questions(questions)
results.each do |result|
puts "Q: #{result[:question]}"
puts "A: #{result[:answer]}\n\n"
end
Batch Embeddings
Generate embeddings efficiently:
def generate_embeddings(texts, batch_size: 100)
Async do
embeddings = []
texts.each_slice(batch_size) do |batch|
task = Async do
response = RubyLLM.embed(batch)
response.vectors
end
embeddings.concat(task.wait)
end
# Return text-embedding pairs
texts.zip(embeddings)
end.result
end
texts = ["Ruby is great", "Python is good", "JavaScript is popular"]
pairs = generate_embeddings(texts)
pairs.each do |text, embedding|
puts "#{text}: #{embedding[0..5]}..." # Show first 6 dimensions
end
Parallel Analysis
Run multiple analyses concurrently:
def analyze_document(content)
Async do
summary_task = Async do
RubyLLM.chat.ask("Summarize in one sentence: #{content}")
end
sentiment_task = Async do
RubyLLM.chat.ask("Is this positive or negative: #{content}")
end
{
summary: summary_task.wait.content,
sentiment: sentiment_task.wait.content
}
end.result
end
result = analyze_document("Ruby is an amazing language with a wonderful community!")
puts "Summary: #{result[:summary]}"
puts "Sentiment: #{result[:sentiment]}"
Background Processing with Async::Job
The real power comes from using Async::Job for background processing. Unlike traditional thread-based job processors, which get blocked during long LLM operations, Async::Job uses fibers to handle thousands of concurrent jobs efficiently.
Setup with Falcon (Recommended)
Falcon is a Ruby application server built on fibers. With Falcon, async just works™.
# Gemfile
gem 'falcon'
gem 'async-job-adapter-active_job'
# config/application.rb
config.active_job.queue_adapter = :async_job
# config/initializers/async_job_adapter.rb
require 'async/job/processor/inline'
Rails.application.configure do
config.async_job.define_queue "default" do
dequeue Async::Job::Processor::Inline
end
end
That’s it. Start your server with bin/dev and your jobs run concurrently without any additional infrastructure: one process, thousands of concurrent LLM operations.
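If your bin/dev drives a Procfile.dev, pointing the web process at Falcon can look like this (one possible invocation, assuming the default config.ru; adjust the bind address and the rest of the file to your setup):
# Procfile.dev
web: bundle exec falcon serve --bind http://localhost:3000
css: bin/rails tailwindcss:watch # or your CSS processor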
Note on Puma
Still using Puma? You’ll need a Redis-backed job processor for concurrent execution:
# Gemfile additions
gem 'async-job-processor-redis'
# config/initializers/async_job_adapter.rb
require 'async/job/processor/redis'
Rails.application.configure do
config.async_job.define_queue "default" do
dequeue Async::Job::Processor::Redis
end
end
Then run these processes:
Option 1: Add to Procfile.dev (Recommended)
# Procfile.dev
web: bin/rails server
css: bin/rails tailwindcss:watch # or your CSS processor
redis: redis-server
async_job: bundle exec async-job-adapter-active_job-server
Then just run bin/dev to start everything.
Option 2: Separate terminals
# Terminal 1: Redis
redis-server
# Terminal 2: Job processor (auto-scales to CPU cores)
bundle exec async-job-adapter-active_job-server
# Terminal 3: Rails
bin/dev
This setup requires more infrastructure but still delivers the concurrency benefits of async for your LLM operations.
Your Jobs Work Unchanged
Here’s the key insight: you don’t need to modify your jobs at all. Async::Job runs each job inside an async context automatically:
class DocumentAnalyzerJob < ApplicationJob
def perform(document_id)
document = Document.find(document_id)
# This automatically runs in an async context!
# No need to wrap in Async blocks
response = RubyLLM.chat.ask("Analyze: #{document.content}")
document.update!(
analysis: response.content,
analyzed_at: Time.current
)
end
end
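You enqueue it like any other Active Job; nothing about the call site changes:
# e.g. from a controller action or a model callback
DocumentAnalyzerJob.perform_later(document.id)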
Mixing Job Adapters: Best of Both Worlds
You don’t have to go all-in. Use async-job only for LLM operations while keeping your existing job processor for everything else:
# Keep your existing adapter as default
config.active_job.queue_adapter = :solid_queue # or :sidekiq, :good_job, etc.
# Base class for all LLM jobs
class LLMJob < ApplicationJob
self.queue_adapter = :async_job
end
# LLM jobs inherit the async adapter
class ChatResponseJob < LLMJob
def perform(conversation_id, message)
# Runs with async-job - perfect for streaming
response = RubyLLM.chat.ask(message)
# ...
end
end
# Regular jobs use your default adapter
class ImageProcessingJob < ApplicationJob
def perform(image_id)
# Runs with solid_queue - better for CPU work
# ...
end
end
This approach lets you optimize each job type for its workload without disrupting your existing infrastructure.
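To double-check which adapter a given job class resolves to, inspect it in a Rails console; queue_adapter is the standard Active Job class-level reader:
# In a Rails console
ChatResponseJob.queue_adapter      # the async-job adapter
ImageProcessingJob.queue_adapter   # your default adapter (Solid Queue, Sidekiq, ...)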
Rate Limiting with Semaphores
When making many concurrent requests, use a semaphore to respect rate limits:
require 'async'
require 'async/semaphore'
class RateLimitedProcessor
def initialize(max_concurrent: 10)
@semaphore = Async::Semaphore.new(max_concurrent)
end
def process_items(items)
Async do
items.map do |item|
Async do
# At most max_concurrent items are processed at once
@semaphore.acquire do
response = RubyLLM.chat.ask("Process: #{item}")
{ item: item, result: response.content }
end
end
end.map(&:wait)
end.result
end
end
# Usage
processor = RateLimitedProcessor.new(max_concurrent: 5)
items = ["Item 1", "Item 2", "Item 3", "Item 4", "Item 5", "Item 6"]
results = processor.process_items(items)
The semaphore ensures that at most 5 requests run concurrently, preventing rate limit errors while still maintaining high throughput.
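Async::Semaphore can also spawn the limited tasks for you: semaphore.async waits for a free slot and then starts the block in a child task. Here is a sketch of the same process_items method written that way; the net effect is the same, at most max_concurrent requests in flight and results returned in order:
def process_items(items)
  Async do
    items.map do |item|
      # Waits for a free slot, then runs the block in a new child task
      @semaphore.async do
        response = RubyLLM.chat.ask("Process: #{item}")
        { item: item, result: response.content }
      end
    end.map(&:wait)
  end.result
end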
Summary
Key takeaways:
- LLM operations are perfect for async (99% waiting for I/O)
- RubyLLM automatically works with async - no configuration needed
- Use async-job for LLM background jobs without changing your job code
- Use semaphores to manage rate limits
- Keep thread-based processors for CPU-intensive work
The combination of RubyLLM and async Ruby gives you the ability to handle thousands of concurrent AI conversations on modest hardware - something that would require massive infrastructure with traditional thread-based approaches.
Ready to dive deeper? Read the full architectural comparison: Async Ruby is the Future of AI Apps