xAI Launches Grok STT and TTS APIs, Outperforming Rivals on Speech Accuracy
Summary
- xAI released standalone Grok Speech-to-Text and Text-to-Speech APIs on April 17, 2026
- Grok STT achieves 6.9% overall WER, beating ElevenLabs (9.0%), Deepgram (11.0%), and AssemblyAI (12.9%)
- Phone call WER is 5.0% vs 12.0% for ElevenLabs, the largest competitive gap in the benchmarks
- Pricing: $0.10/hour batch, $0.20/hour streaming; supports 25+ languages, diarization, and expressive TTS tags
Details
xAI released Grok STT and TTS APIs on April 17, 2026
Both APIs are built on the same audio stack that already powers Grok Voice, Tesla vehicles, and Starlink customer support, so they ship with production validation at scale rather than as greenfield launches.
Grok STT supports word-level timestamps, diarization, 25+ languages, and Inverse Text Normalization
Grok STT exposes a REST API for batch processing and a WebSocket API for real-time streaming. Intelligent Inverse Text Normalization automatically converts spoken numbers, dates, and currencies into their written forms, a common pain point in business transcription workflows for medical, legal, and financial applications.
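To make the idea of inverse text normalization concrete, here is a toy sketch in Python. It is not xAI's implementation and does not call any Grok API; the function names and the tiny vocabulary are assumptions chosen for illustration. It only shows the general shape of the task: rewriting spoken-form numbers and dollar amounts as written forms.

```python
# Toy inverse text normalization (ITN): spoken-form numbers -> written form.
# Illustrative only; all names here are hypothetical, not part of any xAI API.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def is_number_word(token: str) -> bool:
    return token in UNITS or token in TENS or token == "hundred"


def words_to_int(words: list[str]) -> int:
    """Convert a run of number words (values below 1000) to an integer."""
    total = 0
    for word in words:
        if word == "hundred":
            total = max(total, 1) * 100
        else:
            total += UNITS.get(word, 0) + TENS.get(word, 0)
    return total


def normalize_spoken_numbers(text: str) -> str:
    """Replace runs of number words with digits; 'N dollars' becomes '$N'."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and is_number_word(tokens[j].lower()):
            j += 1
        if j == i:
            # Not a number word: copy the token through unchanged.
            out.append(tokens[i])
            i += 1
            continue
        value = words_to_int([t.lower() for t in tokens[i:j]])
        written = str(value)
        if j < len(tokens):
            word = tokens[j].lower()
            core = word.rstrip(".,")
            if core in ("dollar", "dollars"):
                # Fold the currency word into a "$" prefix, keep punctuation.
                written = f"${value}" + word[len(core):]
                j += 1
        out.append(written)
        i = j
    return " ".join(out)


print(normalize_spoken_numbers("the invoice is for one hundred twenty five dollars."))
```

A production system also has to handle dates, ordinals, larger numbers, and ambiguous words such as "one" in "no one knows", which this sketch would wrongly convert.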
Grok STT: 6.9% overall WER vs ElevenLabs 9.0%, Deepgram 11.0%, AssemblyAI 12.9%
Phone call WER is 5.0% for Grok vs 12.0% for ElevenLabs, an error rate about 2.4x lower in the most business-critical audio domain. Meeting transcription WER is 10.9% vs 12.2% for ElevenLabs. On video and podcasts, Grok ties ElevenLabs at 2.4%.
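For readers who want to reproduce this kind of comparison on their own audio, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal Python version under simple assumptions (lowercasing and whitespace tokenization). The text normalization used in the published benchmarks is not stated in the source, so results from this function may not be directly comparable to the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown dog jumps"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.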
Pricing: $0.10/hour batch transcription, $0.20/hour real-time streaming
Pricing is usage-based. The batch rate is competitive for high-volume offline transcription workloads; the streaming rate targets live voice agents and real-time transcription use cases.
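As a rough budgeting aid, the two published rates can be turned into a cost estimate. The Python sketch below assumes billing is strictly linear in audio hours, with no minimums, rounding increments, or volume discounts; the source does not confirm those details, and the function name is hypothetical.

```python
# Published per-hour rates from the announcement (USD).
RATES_USD_PER_HOUR = {
    "batch": 0.10,      # REST batch transcription
    "streaming": 0.20,  # WebSocket real-time streaming
}


def estimate_transcription_cost(audio_hours: float, mode: str = "batch") -> float:
    """Estimate cost in USD, assuming purely linear per-hour billing."""
    if mode not in RATES_USD_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    return round(audio_hours * RATES_USD_PER_HOUR[mode], 2)


print(estimate_transcription_cost(1000, "batch"))
print(estimate_transcription_cost(250, "streaming"))
```

Under these assumptions, streaming the same audio costs twice as much as batch processing, so latency-insensitive workloads are cheaper to run in batch.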
Grok TTS supports expressive speech tags for fine-grained prosody control
Tags include [laugh], [sigh], [whisper], emphasis, slow, and pause — giving developers fine-grained vocal delivery control for voice agents, podcasts, and interactive audio experiences without complex markup.
xAI directly targets enterprise verticals — medical, legal, financial — via entity recognition accuracy
Grok STT's benchmark lead on phone call audio and Intelligent Inverse Text Normalization make it a credible option for regulated-industry transcription, putting xAI in direct competition with established players AssemblyAI and Deepgram.
What This Means
xAI is moving fast to monetize its audio infrastructure by opening APIs already battle-tested in demanding production environments, giving developers access to speech models with a compelling accuracy advantage over established players. The benchmark results, especially 5.0% WER on phone calls versus 12.0% for ElevenLabs, make Grok STT a serious option for any team building voice agents or transcription pipelines for business audio. Builders in medical, legal, and financial sectors should evaluate Grok STT given its entity recognition focus and the high cost of transcription errors in those domains.
