xAI is pushing Grok deeper into voice. The company says Grok's voice engine is now the default behind Vapi's 12 core voices, making it the text-to-speech layer for what it describes as more than 2.5 million voice agents built on the Vapi platform.
The pitch is naturalness and emotional range. xAI points to a blind evaluation run by Vapi, its partner rather than an independent third party, in which it says Grok Voice took the top spot against other providers. It also cites a poll on X in which, it says, more than 4,500 people split roughly 50/50 when trying to tell a Grok voice clone from the human original.
What developers get
For teams already on Vapi, Grok now appears as a selectable text-to-speech option in the dashboard, alongside Grok speech-to-text. For teams that go to xAI directly, its Voice API offers five built-in voices, transcription in 25 languages, and custom voice cloning from a reference clip of up to two minutes, pitched for narration, podcasts, advertising, and voiceover.
xAI lists text-to-speech at $15 per million characters, real-time voice agents at $3 an hour, and transcription at 10 to 20 cents an hour, with what it calls sub-second latency. It says voice audio is processed in real time and never stored or used for training, and points to SOC 2, HIPAA, and GDPR coverage for enterprise use.
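To put those list prices in concrete terms, here is a rough cost sketch in Python. It assumes the rates exactly as quoted above and ignores any minimums, tiers, or platform fees, none of which are described here. The function and constant names are hypothetical, chosen for illustration, and are not part of any xAI or Vapi API.

```python
from typing import Tuple

# Rates as quoted in the article (USD). These are assumptions for
# illustration and may not match current or negotiated pricing.
TTS_USD_PER_MILLION_CHARS = 15.0
AGENT_USD_PER_HOUR = 3.0
STT_USD_PER_HOUR_LOW = 0.10
STT_USD_PER_HOUR_HIGH = 0.20


def tts_cost(characters: int) -> float:
    """Estimated text-to-speech cost for a given character count."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


def agent_cost(minutes: float) -> float:
    """Estimated cost of a real-time voice agent session."""
    return minutes / 60 * AGENT_USD_PER_HOUR


def transcription_cost_range(minutes: float) -> Tuple[float, float]:
    """Low and high transcription estimates for the quoted 10 to 20 cent range."""
    hours = minutes / 60
    return hours * STT_USD_PER_HOUR_LOW, hours * STT_USD_PER_HOUR_HIGH


if __name__ == "__main__":
    # Hypothetical workloads, picked only to show the arithmetic.
    print(f"200,000-character script: ${tts_cost(200_000):.2f}")
    print(f"10-minute agent call:     ${agent_cost(10):.2f}")
    low, high = transcription_cost_range(120)
    print(f"2 hours of transcription: ${low:.2f} to ${high:.2f}")
```

At these rates, a 200,000-character narration script would come to about $3 in text-to-speech, and a 10-minute agent call to about 50 cents.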
Why it matters
Voice is becoming a battleground for the frontier labs, and bundling a model into a large agent platform is a fast way to reach developers. The quality claims here are the companies' own — a partner-run blind test and a self-reported X poll — so treat them as marketing until independent comparisons land. The distribution is the real story: default placement across millions of agents matters more than any single benchmark.
