Voice AI cost models — when 5,000 minutes of Vapi adds up to your margin

Voice AI looks cheap until you scale it. The per-minute economics of Vapi, ElevenLabs, and STT providers compound quickly. Here's the cost model that lets you price client retainers without bleeding.

AcquireOS · April 1, 2026 · 6 min read

[Image: a spreadsheet showing voice AI per-minute costs scaling against monthly volume]

Voice AI looks cheap on the demo. Five cents a minute, sometimes less. The pitch math is irresistible — replace a $25/hour receptionist with a $0.05/minute AI that never takes a break.

The reality once you scale is that the per-minute cost is the floor of the cost stack, not the ceiling. By the time you add up Vapi's orchestration fee, the underlying LLM cost, the speech-to-text cost, the text-to-speech cost (especially with premium voices), the inbound/outbound carrier fees, and the call recording storage, the all-in cost per minute is roughly 2.5-3x the headline number, and higher still on premium models.

For agency operators, that gap matters because it determines whether your $1,500/month retainer for a voice-AI receptionist is producing 60% margin or 5% margin. Here's the cost model that lets you price the retainer without bleeding.

The cost stack

A single one-minute Vapi call has costs at five layers:

1. LLM inference. The model that thinks during the call. Claude Haiku is roughly $0.0008 per 1K input tokens and $0.004 per 1K output tokens in 2026. A typical 3-4 minute call produces 6-12K tokens combined. Cost per minute: $0.005-0.018 depending on model.

2. Speech-to-text (STT). Converts the caller's audio to text in real time. Deepgram's standard tier runs roughly $0.004/minute. OpenAI Whisper is similar. Cost per minute: $0.004-0.005.

3. Text-to-speech (TTS). Converts the AI's response back to audio. This is where premium voices add up. ElevenLabs Flash v2.5 (the production-grade voice model) is approximately $0.075 per 1,000 characters at standard tier pricing in 2026. A typical 4-minute call has the AI speaking ~600-900 characters per minute. Cost per minute: $0.045-0.067.

4. Vapi orchestration. The platform fee for orchestrating the call. Standard pricing is $0.05/minute. This is the headline number most operators see and assume is the total.

5. Carrier fees. The phone-number layer. Inbound is roughly $0.013/minute on Twilio-equivalent pricing; outbound is $0.013-0.015/minute. Call-tracking numbers add a per-number monthly fee plus pass-through usage.

All-in per-minute cost: roughly $0.12-0.16 for a production-quality voice agent with premium voices and Claude Haiku underneath. Cheaper if you use lower-quality voices and lower-tier models. More expensive if you use Sonnet or Opus, GPT-4o, or have heavy STT/TTS usage.

The headline $0.05 Vapi fee is closer to one-third of the actual cost.
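To make the arithmetic explicit, here is a minimal sketch that sums the five layers. The ranges are copied from the list above; the dictionary keys and function name are illustrative, not part of any provider's API.

```python
# Per-minute cost ranges (low, high) in USD, taken from the five layers above.
COST_STACK = {
    "llm": (0.005, 0.018),
    "stt": (0.004, 0.005),
    "tts": (0.045, 0.067),
    "vapi": (0.050, 0.050),
    "carrier": (0.013, 0.015),
}


def all_in_range(stack):
    """Sum the low ends and the high ends of every layer."""
    low = sum(lo for lo, _ in stack.values())
    high = sum(hi for _, hi in stack.values())
    return low, high


low, high = all_in_range(COST_STACK)
vapi_fee = COST_STACK["vapi"][0]

print(f"all-in: ${low:.3f}-${high:.3f} per minute")
# Share of the total that the headline Vapi fee represents.
print(f"Vapi share: {vapi_fee / high:.0%} to {vapi_fee / low:.0%}")
```

Swapping in your own negotiated rates for any layer shows quickly which line dominates. With these inputs it is TTS and orchestration, not the LLM.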

What 5,000 minutes/month looks like

Take a mid-sized HVAC client with a reasonable inbound call volume:

- 25 calls/day average, 30 days = 750 calls
- Average call length: 4 minutes
- Total minutes/month: ~3,000

For this client, voice AI costs run roughly $360-480/month at the all-in rate. If the operator charges $1,500/month for receptionist services, the gross margin after voice costs is 68-76%. That works.

Now scale up to a larger client (a multi-location HVAC, dental group, or high-volume restaurant):

- 80 calls/day average = 2,400 calls/month
- Average call length: 4.5 minutes
- Total minutes/month: ~10,800

Voice AI cost: roughly $1,300-1,730/month. If the retainer is $1,500/month, the operator is at break-even or losing money before any other delivery costs. The same retainer that produced about 70% margin on the small client produces somewhere between +14% and -15% on the large one.

This is the failure mode that catches most operators in their first scale moment. The cost model worked at small scale and silently inverted at larger volume.
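The inversion is easier to see as a calculation. The sketch below reruns both clients against the $0.12-0.16 all-in range and adds a break-even figure; the function names are illustrative, and margin here means margin after voice costs only.

```python
def monthly_minutes(calls_per_day, avg_minutes, days=30):
    """Total call minutes in a month."""
    return calls_per_day * days * avg_minutes


def gross_margin(retainer, minutes, cost_per_minute):
    """Margin after voice costs only, as a fraction of the retainer."""
    return (retainer - minutes * cost_per_minute) / retainer


def break_even_minutes(retainer, cost_per_minute):
    """Monthly minutes at which voice costs consume the whole retainer."""
    return retainer / cost_per_minute


ALL_IN = (0.12, 0.16)  # all-in per-minute range from the cost stack
RETAINER = 1500

for name, calls, length in [("small", 25, 4), ("large", 80, 4.5)]:
    minutes = monthly_minutes(calls, length)
    best = gross_margin(RETAINER, minutes, ALL_IN[0])
    worst = gross_margin(RETAINER, minutes, ALL_IN[1])
    print(f"{name}: {minutes:,.0f} min, margin {worst:.0%} to {best:.0%}")

print(f"break-even at worst-case rate: {break_even_minutes(RETAINER, ALL_IN[1]):,.0f} min")
```

The break-even number is the one worth putting in the contract: it tells you the volume at which a flat retainer stops working.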

The pricing implications

Three responses to the cost asymmetry, and the right answer depends on the operator's positioning.

Response 1: Tiered minute allowances

Bundle a baseline minute allowance into the retainer; charge metered for overage.

| Tier | Retainer | Included minutes | Overage |
|---|---|---|---|
| Base | $1,200 | 2,500 | $0.18/min |
| Pro | $1,800 | 5,000 | $0.16/min |
| Volume | $2,800 | 10,000 | $0.14/min |

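The math: at 2,500 minutes the Base tier produces $1,200 - $400 = $800 contribution, using the high end of the all-in cost range. At overage rates, the per-minute spread covers your cost plus a small margin, although the Volume rate of $0.14/min only clears cost if your all-in rate sits in the lower half of the $0.12-0.16 range. This is the cleanest model and the one most operators should default to.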
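A minimal sketch of how the tier table turns into an invoice and a contribution figure. The `Tier` class and function names are illustrative; the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    retainer: float
    included_minutes: int
    overage_rate: float  # USD per minute beyond the allowance


TIERS = [
    Tier("Base", 1200, 2500, 0.18),
    Tier("Pro", 1800, 5000, 0.16),
    Tier("Volume", 2800, 10000, 0.14),
]


def invoice(tier, minutes_used):
    """Retainer plus metered overage for minutes beyond the allowance."""
    overage = max(0, minutes_used - tier.included_minutes)
    return tier.retainer + overage * tier.overage_rate


def contribution(tier, minutes_used, cost_per_minute):
    """What is left after paying the all-in voice cost."""
    return invoice(tier, minutes_used) - minutes_used * cost_per_minute


for tier in TIERS:
    at_cap = contribution(tier, tier.included_minutes, 0.16)
    print(f"{tier.name}: ${at_cap:,.0f} contribution at the allowance, worst-case cost")
```

Running `contribution` at 1.5x or 2x the allowance is a quick way to check that a heavy month still leaves money on the table at each tier.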

Response 2: Charge per call, not per minute

Some niches care about call volume more than call length. Per-call pricing eliminates the ambiguity of long calls and aligns the metric with what the client cares about.

A $4 per-handled-call price at an average 4-minute call length covers your cost (~$0.55) with comfortable margin. Clients evaluate this against the cost of a human receptionist (~$10-15 per call after overhead) and the per-call price feels reasonable.

The downside: per-call pricing requires defining what counts as a "handled call." Calls that ring through and disconnect at hello? Calls that make it to qualification but don't book? Calls that book? You need clear definitions, and the agreement text gets longer as a result.
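One way to pin this down is to write the definition as data. The sketch below is hypothetical throughout: the outcome names and the rule that anything reaching qualification is billable are assumptions for illustration, not a recommended contract term.

```python
from enum import Enum


class Outcome(Enum):
    HANGUP_AT_HELLO = "hangup_at_hello"
    QUALIFIED_NO_BOOKING = "qualified_no_booking"
    BOOKED = "booked"


# Hypothetical contract rule: bill any call that reached qualification.
BILLABLE = {Outcome.QUALIFIED_NO_BOOKING, Outcome.BOOKED}


def per_call_revenue(outcomes, price_per_call):
    """Revenue from a list of call outcomes under the billable rule."""
    return sum(price_per_call for o in outcomes if o in BILLABLE)


def per_call_margin(price_per_call, call_minutes, cost_per_minute):
    """Margin on one billable call of a given length."""
    cost = call_minutes * cost_per_minute
    return (price_per_call - cost) / price_per_call


def break_even_call_minutes(price_per_call, cost_per_minute):
    """Call length at which voice cost equals the per-call price."""
    return price_per_call / cost_per_minute


calls = [Outcome.BOOKED, Outcome.HANGUP_AT_HELLO, Outcome.QUALIFIED_NO_BOOKING]
print(per_call_revenue(calls, 4.0))
print(f"{per_call_margin(4.0, 4, 0.16):.0%}")
print(break_even_call_minutes(4.0, 0.16))
```

Unbilled calls still consume minutes, so the share of hang-ups in a client's traffic belongs in the margin estimate alongside average call length.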

Response 3: Pass-through with markup

For high-volume clients especially, operators sometimes pass voice costs through at a transparent markup (cost + 25-40%). The operator's retainer covers everything else (setup, configuration, monitoring, optimization), and the voice usage is its own line.

This model works well for sophisticated clients who want transparency. It works less well for clients who want a simple all-in fee — they'll perceive the variable line as nickel-and-diming.
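As a rough sketch of the pass-through line, assuming the markup is applied to the all-in voice cost (the function names are illustrative):

```python
def pass_through_line(minutes, cost_per_minute, markup):
    """Voice usage line item: provider cost plus a transparent markup."""
    return minutes * cost_per_minute * (1 + markup)


def margin_on_line(markup):
    """Margin on the usage line as a fraction of what the client pays."""
    return markup / (1 + markup)


billed = pass_through_line(10_800, 0.14, 0.30)
print(f"billed: ${billed:,.2f}")
for markup in (0.25, 0.40):
    print(f"{markup:.0%} markup -> {margin_on_line(markup):.1%} margin on the line")
```

A markup is quoted on cost and a margin on price, so a 25-40% markup is a 20-29% margin on the usage line. The retainer has to carry the rest.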

The model selection lever

The single biggest cost lever is which LLM is underneath the voice agent.

| Model | Cost per minute (LLM only) | Use case |
|---|---|---|
| Claude Haiku 4.5 | $0.005-0.012 | Standard receptionist, basic qualification |
| GPT-4o-mini | $0.005-0.011 | Equivalent quality, similar cost |
| Claude Sonnet 4.5 | $0.025-0.045 | Complex qualification, multi-step reasoning |
| GPT-4o | $0.030-0.050 | Same range as Sonnet |
| Claude Opus | $0.080-0.140 | Reserved for high-stakes calls only |

For 90% of receptionist work, Haiku-class models are the right answer. The capability gap to Sonnet doesn't justify the 3-5x cost in a phone conversation that mostly resolves to "do you want to book an appointment, what time, what's your address."

The model selection happens at the agent template level, not per-call. Most agency stacks should default to Haiku for inbound receptionist agents and reserve Sonnet for outbound sales agents where the per-call value justifies the cost.
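A sketch of what template-level defaults could look like, using the per-minute ranges from the table. The template names and model labels are hypothetical keys for illustration, not provider model IDs or a Vapi configuration format.

```python
# LLM-only cost per minute (low, high) in USD, from the table above.
LLM_COST_PER_MIN = {
    "claude-haiku-4.5": (0.005, 0.012),
    "gpt-4o-mini": (0.005, 0.011),
    "claude-sonnet-4.5": (0.025, 0.045),
    "gpt-4o": (0.030, 0.050),
    "claude-opus": (0.080, 0.140),
}

# Hypothetical agent templates mapped to their default model.
TEMPLATE_DEFAULTS = {
    "inbound_receptionist": "claude-haiku-4.5",
    "outbound_sales": "claude-sonnet-4.5",
}


def monthly_llm_cost(template, minutes):
    """Low and high monthly LLM spend for a template at a given volume."""
    low, high = LLM_COST_PER_MIN[TEMPLATE_DEFAULTS[template]]
    return minutes * low, minutes * high


for template in TEMPLATE_DEFAULTS:
    low, high = monthly_llm_cost(template, 5000)
    print(f"{template}: ${low:,.0f}-${high:,.0f} per month at 5,000 minutes")
```

Keeping the mapping in one place means a model change is a single edit to the template defaults rather than a per-client decision.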

The voice selection lever

ElevenLabs voices are the de facto premium voice in 2026 — they sound natural enough to fool most callers. The price is real. Lower-tier providers (Deepgram TTS, Azure neural voices) cost 5-10x less per character but the quality difference is audible.

For an agency serving service businesses where the call quality matters to client retention, the premium voice is non-negotiable — a robotic-sounding receptionist costs the operator client retention faster than it saves on TTS fees. For lower-stakes use cases (outbound voicemail drops, after-hours auto-attendant), the lower-tier voice is fine.

This is the case where the $0.045-0.067 per minute on TTS is well-spent margin, not waste.
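The TTS line is characters spoken times price per character. The sketch below uses the $0.075 per 1,000 characters and 600-900 characters per minute quoted earlier, and treats "5-10x less" as a simple divisor, which is an assumption about how the cheaper providers price.

```python
def tts_cost_per_minute(chars_per_minute, price_per_1k_chars):
    """TTS cost for one minute of call time."""
    return chars_per_minute / 1000 * price_per_1k_chars


PREMIUM_PRICE = 0.075  # USD per 1,000 characters, premium voice
MINUTES = 5000         # example monthly volume

for chars in (600, 900):
    premium = tts_cost_per_minute(chars, PREMIUM_PRICE)
    budget = tts_cost_per_minute(chars, PREMIUM_PRICE / 5)  # assumed 5x cheaper
    saving = (premium - budget) * MINUTES
    print(f"{chars} chars/min: premium ${premium:.4f}/min, "
          f"monthly saving from the cheaper voice ${saving:,.0f}")
```

At this volume the saving is a few hundred dollars a month, which is the number to weigh against the retainer you risk if the voice sounds robotic.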

The infrastructure leverage

The all-in per-minute cost above is roughly what voice AI costs on unmanaged infrastructure, where the operator buys every layer at list price. Operators on platforms that handle the orchestration get partial economies of scale:

- Pre-negotiated rates with Vapi, ElevenLabs, Deepgram
- Shared model routing infrastructure (the LLM cost gets optimized at the platform level)
- Smart fallback routing when one provider has an outage (avoids dropped calls and client incidents)

For an operator running 30 clients each at 5K minutes/month (150K total minutes), the platform leverage can be 15-25% on per-minute cost. If voice is the main cost of delivery, that turns a 60% gross margin into roughly 66-70% without changing the retainer price.
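The margin effect of a per-minute discount is a short calculation. This sketch assumes voice is the only cost counted against the retainer; the function name is illustrative.

```python
def margin_after_discount(margin, cost_reduction):
    """New gross margin when costs fall by a fraction and price is unchanged."""
    cost_share = 1 - margin
    return 1 - cost_share * (1 - cost_reduction)


for reduction in (0.15, 0.25):
    new_margin = margin_after_discount(0.60, reduction)
    print(f"{reduction:.0%} cheaper per minute: 60% margin becomes {new_margin:.0%}")
```

The lower the starting margin, the more a given discount moves it, so the leverage matters most on the high-volume clients where margins are thinnest.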

How AcquireOS handles voice economics

The platform bundles voice AI minutes by tier:

- Operator tier: 400 minutes/month included, $0.16/min overage
- Agency tier: 750 minutes/month included, $0.14/min overage
- Partner tier: custom volume pricing

Per-minute costs are absorbed at the platform level, with operators seeing a single bundled fee. The model routing layer auto-selects between Haiku, Sonnet, and provider fallbacks based on the call type — operators don't have to think about whether to upgrade a particular agent. The result is that the voice economics work cleanly at every scale, instead of inverting on the client that grew faster than the operator's pricing model anticipated.

The principle: voice AI is not as cheap as the headline pitch. The operators who treat it as a $0.05/minute commodity build pricing that fails at scale. The operators who model the full cost stack and price tiers around real margins build pricing that holds. The math compounds; small assumptions early become large losses at volume.
