
xAI Launches Grok STT and TTS APIs, Outperforming Rivals on Speech Accuracy

Products · 1 source · 1d ago

Summary

  • xAI released standalone Grok Speech-to-Text (STT) and Text-to-Speech (TTS) APIs on April 17, 2026
  • Grok STT achieves a 6.9% overall word error rate (WER), beating ElevenLabs (9.0%), Deepgram (11.0%), and AssemblyAI (12.9%)
  • Phone call WER is 5.0% vs 12.0% for ElevenLabs, the largest gap in the reported benchmarks
  • Pricing: $0.10/hour batch, $0.20/hour streaming; supports 25+ languages, diarization, and expressive TTS tags

Details

1. Product Launch

xAI released Grok STT and TTS APIs on April 17, 2026

Both APIs are built on the same audio stack that already powers Grok Voice, Tesla vehicles, and Starlink customer support, so they ship with production validation at scale rather than as greenfield launches.

2. New Tech

Grok STT supports word-level timestamps, diarization, 25+ languages, and Inverse Text Normalization

Grok STT offers a REST API for batch processing and a WebSocket API for real-time streaming. Intelligent Inverse Text Normalization automatically converts spoken numbers, dates, and currencies into their written form, a common pain point in business transcription workflows for medical, legal, and financial applications.
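To make the idea of inverse text normalization concrete, here is a minimal rule-based sketch in Python. It is a toy illustration only, not xAI's implementation: the function names, the tiny vocabulary (numbers below one thousand, plus "dollars"), and the rules are all assumptions made for this example.

```python
# Toy inverse text normalization (ITN): spoken-form numbers -> written form.
# Hypothetical illustration; a production system handles far more cases.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def is_number_word(word: str) -> bool:
    return word in UNITS or word in TENS or word == "hundred"


def words_to_int(words: list[str]) -> int:
    """Convert e.g. ['three', 'hundred', 'forty', 'two'] to 342."""
    total = 0
    for w in words:
        if w == "hundred":
            total = max(total, 1) * 100
        else:
            total += UNITS.get(w, 0) + TENS.get(w, 0)
    return total


def normalize(text: str) -> str:
    """Replace runs of number words with digits; fold 'dollars' into '$'."""
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and is_number_word(tokens[j].lower()):
            j += 1
        if j == i:  # not a number word: copy through unchanged
            out.append(tokens[i])
            i += 1
            continue
        value = words_to_int([t.lower() for t in tokens[i:j]])
        if j < len(tokens) and tokens[j].lower() in ("dollar", "dollars"):
            out.append(f"${value}")
            j += 1  # consume the currency word
        else:
            out.append(str(value))
        i = j
    return " ".join(out)


print(normalize("the invoice total is three hundred forty two dollars"))
print(normalize("call back in twenty five minutes"))
```

The sketch ignores punctuation, dates, digit sequences such as phone numbers, and ambiguous words like "one" used as a pronoun, which hints at why this step is hard to get right in practice.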

3. Stat

Grok STT: 6.9% overall WER vs ElevenLabs 9.0%, Deepgram 11.0%, AssemblyAI 12.9%

Phone call WER is 5.0% for Grok vs 12.0% for ElevenLabs, less than half the error rate in the most business-critical audio domain. Meeting transcription WER is 10.9% vs 12.2% for ElevenLabs. On video and podcasts, Grok ties ElevenLabs at 2.4%.
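For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The Python sketch below shows that standard calculation; the function name and the simple lowercase, whitespace-based tokenization are assumptions for illustration, and the benchmark's actual normalization rules are not described in the source.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and hyp[:j]; we keep only one previous row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution (five -> nine) and one deletion (tomorrow)
# over 7 reference words.
wer = word_error_rate("please call me at five pm tomorrow",
                      "please call me at nine pm")
print(f"{wer:.1%}")

# Ratio of the reported phone-call figures: 12.0% vs 5.0%.
print(12.0 / 5.0)
```

On the reported phone-call numbers, the ElevenLabs error rate is 2.4 times the Grok figure, which is the basis for the "less than half" comparison.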

4. Financials

Pricing: $0.10/hour batch transcription, $0.20/hour real-time streaming

Straightforward usage-based pricing. Batch rate is competitive for high-volume offline transcription workloads; streaming rate targets live voice agent and real-time transcription use cases.
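As a rough budgeting aid, the two published rates can be turned into a simple cost estimate. The Python sketch below assumes cost scales linearly with audio hours; billing granularity, rounding, minimums, and volume discounts are not stated in the source, and the names used here are hypothetical.

```python
from decimal import Decimal

# Published per-hour rates from the announcement, in US dollars.
RATES_PER_HOUR = {
    "batch": Decimal("0.10"),
    "streaming": Decimal("0.20"),
}


def estimate_cost(audio_hours: float, mode: str) -> Decimal:
    """Estimate spend assuming strictly linear, per-hour pricing."""
    if mode not in RATES_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    # Decimal avoids binary floating-point drift in money arithmetic.
    return Decimal(str(audio_hours)) * RATES_PER_HOUR[mode]


for mode in ("batch", "streaming"):
    print(mode, estimate_cost(1000, mode))
```

Under these assumptions, 1,000 hours of audio costs $100 in batch mode and $200 when streamed.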

5. New Tech

Grok TTS supports expressive speech tags for fine-grained prosody control

Tags include [laugh], [sigh], [whisper], emphasis, slow, and pause — giving developers fine-grained vocal delivery control for voice agents, podcasts, and interactive audio experiences without complex markup.
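Because the tags are written inline with the text, a script containing them may need a pre-flight check before it is sent for synthesis, or a tag-free copy for captions. The Python sketch below is one hypothetical way to do that. It models only the three bracketed tags quoted above; the syntax for emphasis, slow, and pause is not shown in the source, so those are left out, and the helper name is an assumption.

```python
import re

# Only the bracketed tags quoted in the announcement are modeled here.
KNOWN_TAGS = {"laugh", "sigh", "whisper"}
TAG_RE = re.compile(r"\[([a-z]+)\]")


def split_tags(text: str) -> tuple[str, list[str], list[str]]:
    """Return (plain text, all tags found, tags not in KNOWN_TAGS)."""
    tags = TAG_RE.findall(text)
    unknown = [t for t in tags if t not in KNOWN_TAGS]
    # Remove the tags, then collapse the whitespace they leave behind.
    plain = re.sub(r"\s+", " ", TAG_RE.sub("", text)).strip()
    return plain, tags, unknown


script = "Well [sigh] that went better than expected [laugh]"
plain, tags, unknown = split_tags(script)
print(plain)
print(tags, unknown)
```

Flagging unknown tags before a request is made is a cheap way to catch typos such as `[laughs]` that might otherwise be read aloud literally.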

6. Market Impact

xAI directly targets enterprise verticals — medical, legal, financial — via entity recognition accuracy

Grok STT's benchmark lead on phone call audio and Intelligent Inverse Text Normalization make it a credible option for regulated-industry transcription, putting xAI in direct competition with established players AssemblyAI and Deepgram.

Product Launch = new API release; New Tech = specific technical capabilities; Stat = benchmark performance figure; Financials = pricing; Market Impact = competitive and industry implications

What This Means

xAI is moving fast to monetize its audio infrastructure by opening APIs already battle-tested in demanding production environments, giving developers access to speech models with a lower reported word error rate than established players. The benchmark results, especially 5.0% WER on phone calls versus 12.0% for ElevenLabs, make Grok STT a serious option for teams building voice agents or transcription pipelines for business audio. Builders in the medical, legal, and financial sectors should evaluate Grok STT given its entity recognition focus and the high cost of transcription errors in those domains.
