Audio Transcription

v1.9.0+

Convert speech to text with support for multiple languages and speaker diarization

Basic Transcription
Choosing Models
Language Hints
Speaker Diarization
  Identifying Known Speakers
Improving Accuracy with Prompts
  Gemini prompt tips
Segments and Timestamps
Handling Longer Files
Error Handling
Next Steps

After reading this guide, you will know:

How to transcribe audio files to text.
How to identify different speakers with diarization.
How to improve accuracy with language hints and prompts.
How to access segments and timestamps.

Basic Transcription

Transcribe audio with the global RubyLLM.transcribe method:

transcription = RubyLLM.transcribe("meeting.wav")

puts transcription.text
# => "Welcome to today's meeting. Let's discuss..."

puts transcription.model
# => "whisper-1"

RubyLLM.transcribe accepts MP3, M4A, WAV, WebM, OGG, and other audio formats.

Choosing Models

Pass the model: option to choose a transcription model:

# Whisper-1 (default, good for general use)
RubyLLM.transcribe("audio.mp3", model: "whisper-1")

# GPT-4o Transcribe (faster, better for technical content)
RubyLLM.transcribe("audio.mp3", model: "gpt-4o-transcribe")

# GPT-4o Mini Transcribe (fastest, lowest cost)
RubyLLM.transcribe("audio.mp3", model: "gpt-4o-mini-transcribe")

# Diarization model (identifies speakers)
RubyLLM.transcribe("meeting.wav", model: "gpt-4o-transcribe-diarize")

# Gemini 2.5 Flash/Pro (Google's multimodal transcription)
RubyLLM.transcribe(
  "lecture.wav",
  model: "gemini-2.5-flash",
  prompt: "Return only the verbatim transcript."
)

Configure the default globally:

RubyLLM.configure do |config|
  config.default_transcription_model = "gpt-4o-transcribe"
end

Language Hints

Improve accuracy by specifying the language:

RubyLLM.transcribe("entrevista.mp3", language: "es")
RubyLLM.transcribe("conference.mp3", language: "fr")

Use ISO 639-1 codes (en, es, fr, de, etc.).
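Application code often carries full locale tags such as "es-MX" rather than bare two-letter codes. The sketch below shows one way to reduce such a tag before passing it as language:. The helper normalize_language_hint is hypothetical (not part of RubyLLM), and it only checks the two-letter shape, not whether the code is actually registered in ISO 639-1.

```ruby
# Hypothetical helper (not part of RubyLLM): reduce a locale tag such as
# "es-MX" or "EN_us" to a lowercase two-letter code.
# It validates the shape only, not membership in ISO 639-1.
def normalize_language_hint(tag)
  code = tag.to_s.strip.downcase.split(/[-_]/).first.to_s
  unless code.match?(/\A[a-z]{2}\z/)
    raise ArgumentError, "expected a two-letter language code, got #{tag.inspect}"
  end
  code
end

puts normalize_language_hint("es-MX") # prints "es"
puts normalize_language_hint("FR")    # prints "fr"
```

The result can then be passed along, for example language: normalize_language_hint(user_locale), where user_locale is whatever locale string your application already stores.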

Speaker Diarization

The diarization model identifies different speakers:

transcription = RubyLLM.transcribe(
  "team-meeting.wav",
  model: "gpt-4o-transcribe-diarize"
)

transcription.segments.each do |segment|
  puts "#{segment['speaker']}: #{segment['text']}"
  puts "  (#{segment['start']}s - #{segment['end']}s)"
end
# Output:
# A: Hi everyone.
#   (0.5s - 1.2s)
# B: Happy to be here.
#   (2.8s - 3.5s)
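Diarized output can contain several consecutive segments from the same speaker. As a minimal sketch, assuming segments are hashes with the string keys used above, the hypothetical helper speaker_turns merges each run into a single turn. The sample data is made up for illustration.

```ruby
# Sample data shaped like the segments above; the values are illustrative only.
segments = [
  { "speaker" => "A", "text" => "Hi everyone.",       "start" => 0.5, "end" => 1.2 },
  { "speaker" => "A", "text" => "Thanks for coming.", "start" => 1.4, "end" => 2.3 },
  { "speaker" => "B", "text" => "Happy to be here.",  "start" => 2.8, "end" => 3.5 }
]

# Hypothetical helper (not part of RubyLLM): merge consecutive segments
# from the same speaker into one turn.
def speaker_turns(segments)
  segments.chunk_while { |a, b| a["speaker"] == b["speaker"] }.map do |run|
    {
      "speaker" => run.first["speaker"],
      "text"    => run.map { |s| s["text"] }.join(" "),
      "start"   => run.first["start"],
      "end"     => run.last["end"]
    }
  end
end

speaker_turns(segments).each do |turn|
  puts "#{turn['speaker']} (#{turn['start']}s - #{turn['end']}s): #{turn['text']}"
end
# Output:
# A (0.5s - 2.3s): Hi everyone. Thanks for coming.
# B (2.8s - 3.5s): Happy to be here.
```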

Identifying Known Speakers

Provide 2-10 second reference clips to map speakers to names:

transcription = RubyLLM.transcribe(
  "team-meeting.wav",
  model: "gpt-4o-transcribe-diarize",
  speaker_names: ["Alice", "Bob"],
  speaker_references: ["alice-voice.wav", "bob-voice.wav"]
)

# Now segments use the provided names
# Alice: Hi everyone.
# Bob: Happy to be here.

Speaker references accept file paths, URLs, IO objects, or ActiveStorage attachments.
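The example above passes names and reference clips as two parallel arrays, which appear to be paired by position (an assumption drawn from the example, not a documented guarantee). A rough sketch of keeping them in sync is to build both from a single Hash; speaker_options is a hypothetical helper, not part of RubyLLM.

```ruby
# Hypothetical helper (not part of RubyLLM): turn a name => reference-clip
# Hash into the two parallel arrays used in the example above.
# Assumption: names and references are matched by position.
def speaker_options(clips_by_name)
  raise ArgumentError, "provide at least one speaker" if clips_by_name.empty?

  {
    speaker_names: clips_by_name.keys,
    speaker_references: clips_by_name.values
  }
end

options = speaker_options({
  "Alice" => "alice-voice.wav",
  "Bob"   => "bob-voice.wav"
})

p options[:speaker_names]      # prints ["Alice", "Bob"]
p options[:speaker_references] # prints ["alice-voice.wav", "bob-voice.wav"]
```

The resulting Hash could then be splatted into the call, for example RubyLLM.transcribe("team-meeting.wav", model: "gpt-4o-transcribe-diarize", **options).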

Note: Gemini models currently return plain text transcripts without segment metadata. Use OpenAI’s diarization models when you need speaker labels or timestamps.

Improving Accuracy with Prompts

Guide the model with context about technical terms or domain-specific vocabulary:

RubyLLM.transcribe(
  "developer-talk.mp3",
  prompt: "Discussion about Ruby, Rails, PostgreSQL, and Redis."
)

RubyLLM.transcribe(
  "product-demo.mp3",
  prompt: "Product demo for ZyntriQix, Digique Plus, and CynapseFive."
)
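When the vocabulary lives in a list (product names from a database, for instance), the prompt can be assembled rather than hand-written. This is only a sketch: vocabulary_prompt is a hypothetical helper, and the sentence shape simply mirrors the prompts shown above.

```ruby
# Hypothetical helper (not part of RubyLLM): join vocabulary terms into
# one short context sentence for the prompt: argument.
def vocabulary_prompt(intro, terms)
  list =
    case terms.length
    when 0 then raise ArgumentError, "terms must not be empty"
    when 1 then terms.first
    when 2 then terms.join(" and ")
    else "#{terms[0..-2].join(', ')}, and #{terms.last}"
    end
  "#{intro} #{list}."
end

puts vocabulary_prompt("Discussion about", %w[Ruby Rails PostgreSQL Redis])
# prints "Discussion about Ruby, Rails, PostgreSQL, and Redis."
```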

Gemini prompt tips

Gemini treats transcription requests like any other conversation. Use the prompt: argument to steer formatting (for example, “Respond with plain text only.”), and combine it with language: when you want a specific locale in the final transcript. RubyLLM automatically adds the language hint to the Gemini request.

Segments and Timestamps

Access detailed timing information:

transcription = RubyLLM.transcribe("interview.mp3", model: "gpt-4o-transcribe")

puts "Duration: #{transcription.duration} seconds"

transcription.segments.each do |segment|
  puts "#{segment['start']}s - #{segment['end']}s: #{segment['text']}"
end
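One common use of segment timing is subtitle generation. The following sketch converts segments into SRT text, assuming each segment is a Hash with 'start', 'end' (in seconds), and 'text' keys as in the loop above. Both helpers are hypothetical and the sample segment is made up.

```ruby
# Hypothetical helper: format a number of seconds as an SRT timestamp (HH:MM:SS,mmm).
def srt_timestamp(seconds)
  total_ms = (seconds.to_f * 1000).round
  hours, rest = total_ms.divmod(3_600_000)
  minutes, rest = rest.divmod(60_000)
  secs, millis = rest.divmod(1000)
  format("%02d:%02d:%02d,%03d", hours, minutes, secs, millis)
end

# Hypothetical helper: render segments as SRT subtitle text.
def segments_to_srt(segments)
  segments.each_with_index.map do |segment, index|
    "#{index + 1}\n" \
      "#{srt_timestamp(segment['start'])} --> #{srt_timestamp(segment['end'])}\n" \
      "#{segment['text']}\n"
  end.join("\n")
end

sample = [{ "start" => 0.5, "end" => 1.2, "text" => "Hi everyone." }]
puts segments_to_srt(sample)
# Output:
# 1
# 00:00:00,500 --> 00:00:01,200
# Hi everyone.
```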

Handling Longer Files

The default timeout is 5 minutes. Increase it for longer audio:

RubyLLM.configure do |config|
  config.request_timeout = 600 # 10 minutes
end

The OpenAI transcription API accepts files up to 25 MB. For larger files, use a compressed format (MP3, M4A) or split the audio into chunks.
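Checking the file size locally avoids a wasted upload. The sketch below assumes "25 MB" means 25 * 1024 * 1024 bytes, which is an assumption rather than something this guide specifies; within_upload_limit? is a hypothetical helper.

```ruby
# Hypothetical helper (not part of RubyLLM): check a file against the upload limit.
# Assumption: the 25 MB limit is interpreted as 25 * 1024 * 1024 bytes.
def within_upload_limit?(path, limit_bytes: 25 * 1024 * 1024)
  File.size(path) <= limit_bytes
end

path = "meeting.wav" # placeholder path
if File.exist?(path) && !within_upload_limit?(path)
  warn "#{path} exceeds the upload limit; compress it or split it into chunks first."
end
```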

Error Handling

Rescue specific errors before the general RubyLLM::Error:

begin
  transcription = RubyLLM.transcribe("audio.mp3")
  puts transcription.text
rescue RubyLLM::BadRequestError => e
  puts "Invalid request: #{e.message}"
rescue RubyLLM::TimeoutError => e
  puts "Transcription timed out: #{e.message}"
rescue RubyLLM::Error => e
  puts "Transcription failed: #{e.message}"
end
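Transient failures such as timeouts are often worth retrying. The sketch below is a generic retry wrapper with exponential backoff; with_retries and its parameters are hypothetical, and the demo simulates a flaky call with IOError so that it runs without any API access.

```ruby
# Hypothetical helper (not part of RubyLLM): retry a block when one of the
# listed error classes is raised, doubling the delay after each failure.
def with_retries(errors:, attempts: 3, base_delay: 1.0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *errors
    raise if tries >= attempts # out of attempts: re-raise the last error
    sleep(base_delay * (2**(tries - 1)))
    retry
  end
end

# Simulated flaky operation: fails twice, then succeeds.
result = with_retries(errors: [IOError], base_delay: 0) do |attempt|
  raise IOError, "simulated failure" if attempt < 3
  "succeeded on attempt #{attempt}"
end
puts result # prints "succeeded on attempt 3"
```

In a real application the block would wrap the transcription call, for example with_retries(errors: [RubyLLM::TimeoutError]) { RubyLLM.transcribe("audio.mp3") }, using the error class from the example above.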

Next Steps

Chatting with AI Models: Learn about conversational AI.
Image Generation: Generate images from text.
Error Handling: Master handling API errors.