"AI receptionist" is a marketing phrase that means three or four different things depending on who's selling it. Some of them are useful. Some of them are slideware. This article is the concrete version - exactly what happens during an inbound call to a real AI receptionist, in the order it happens, with the parts that work and the parts that don't.
If you're an owner-operator evaluating one for your business, this is the version you actually need.
The setup (what's running before the phone rings)
Before any call comes in, four things are quietly running:
A voice agent - a large language model wrapped in a real-time voice layer. Most production deployments today use ElevenLabs, OpenAI's Realtime API, or Deepgram for the voice rendering plus GPT-4-class or Claude-class models for the reasoning.
A knowledge base - your hours, services, pricing logic, FAQ, escalation rules. Usually 5-20 pages of structured content the agent has been trained or prompt-conditioned on.
A telephony bridge - Twilio, Vonage, or a similar SIP provider routing the call from your existing business number into the agent. To the caller, they're calling the same number they always have.
Integration plumbing - webhooks into your CRM, calendar, and email so the agent can actually take action.
None of this is exotic. It's the standard 2026 stack. The variation is mostly in how cleanly the plumbing is wired and how good the knowledge base is.
What happens when the phone rings
Second 0. The call hits the SIP provider. The provider routes it to the agent.
Second 0.4. The agent picks up - faster than a human can. Standard latency in current deployments is 400-800ms before the first word is spoken. Anything over 1.5 seconds feels long; anything under 600ms feels eerily natural.
Second 1. The agent says its opening line. It's not "press 1 for sales." It's a normal greeting using your firm's name and the time of day. Example: "Bluestone Law, this is the after-hours line - what can I help you with?"
Seconds 2-60. This is the actual conversation. The agent listens, the caller's speech is transcribed to text, and the model decides what's being asked and responds. Latency per turn is 1.2-2.4 seconds - slightly slower than a human, but conversational. The caller can interrupt; the agent stops talking.
What the agent does well in this window:
- Answers factual questions ("what are your hours," "do you handle landlord-tenant disputes," "what's the consultation fee")
- Qualifies the caller ("are you looking for criminal or civil," "rough volume of matters per month")
- Books meetings into a real calendar with a real time slot
- Logs callback requests
- Escalates to a human cell phone if the caller says it's urgent
What the agent does poorly in this window:
- Anything that requires reading something the caller can't say out loud (a date on a document, a number on a screen)
- Complex multi-turn reasoning ("what would you charge if I had three matters at $X each, two were settled, and the third went to discovery?") - the model can answer but doesn't always commit to a figure
- Emotional de-escalation (a furious caller is still a human's job)
After the call hangs up
Second 1 after hangup. A transcript is generated from the audio. This is automatic and reliable - speech-to-text in 2026 is essentially solved for clean phone audio.
Second 5. A structured summary is generated. Caller name, phone number, intent, qualification answers, action taken (booked, escalated, info given), and any flags ("caller was upset," "asked about a service we don't offer").
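To make the summary concrete, here is a minimal sketch of what that record might look like as a data structure. The field names, the `ALLOWED_ACTIONS` labels, and the sample caller are all illustrative assumptions, not any vendor's actual schema.

```python
from dataclasses import dataclass, field, asdict
import json

# Hypothetical action labels, mirroring the examples in the text.
ALLOWED_ACTIONS = {"booked", "escalated", "info_given"}


@dataclass
class CallSummary:
    """One record per call; field names are illustrative, not a vendor schema."""
    caller_name: str
    phone_number: str
    intent: str
    qualification_answers: dict[str, str]
    action_taken: str
    flags: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject free-text actions so downstream automation can branch reliably.
        if self.action_taken not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action_taken!r}")

    def to_json(self) -> str:
        # Serialize to JSON, the usual shape for a webhook payload.
        return json.dumps(asdict(self), sort_keys=True)


# Made-up sample data for illustration only.
summary = CallSummary(
    caller_name="Jane Doe",
    phone_number="+1-555-0100",
    intent="landlord-tenant consultation",
    qualification_answers={"matter_type": "civil"},
    action_taken="booked",
    flags=["asked about a service we don't offer"],
)
print(summary.to_json())
```

Keeping `action_taken` to a fixed set of values is a design choice in this sketch: it lets the CRM and calendar steps branch on it without parsing free text.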
Second 10. The CRM record is created or updated. A calendar invite goes to the caller's email and to the business owner. A summary email goes to whoever should know.
Second 15. If escalation rules triggered, a text message hits the owner's cell phone.
This sub-30-second back-end is where the actual ROI shows up. Not in the conversation itself. The conversation is the front door. The CRM record, the calendar invite, and the summary email are what save you from spending an hour on Monday morning reconstructing what happened over the weekend.
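The post-call sequence can be sketched as a small planning function. This is a rough illustration under stated assumptions: the step names are hypothetical labels rather than real API calls, and sending the calendar invite only when a meeting was booked is my reading of the timeline, not something the vendor stack guarantees.

```python
def plan_post_call_actions(action_taken: str, escalation_triggered: bool) -> list[str]:
    """Return the back-end steps to run, in order, for one finished call."""
    # Every call gets a transcript, a structured summary, and a CRM update.
    steps = ["generate_transcript", "generate_summary", "upsert_crm_record"]
    # Assumption: an invite only makes sense if a meeting was actually booked.
    if action_taken == "booked":
        steps.append("send_calendar_invite")
    steps.append("send_summary_email")
    # The owner's phone only buzzes when an escalation rule fired.
    if escalation_triggered:
        steps.append("text_owner_cell")
    return steps


print(plan_post_call_actions("booked", escalation_triggered=False))
print(plan_post_call_actions("info_given", escalation_triggered=True))
```

Returning a plan instead of performing the steps directly keeps the sketch testable and makes it easy to see which calls would have paged the owner.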
What the market charges (May 2026 benchmarks)
| Tier | Who builds it | Build cost | Quality bar |
|---|---|---|---|
| DIY off-the-shelf | Owner configures Bland.ai / Synthflow / Voiceflow | $0-$3K + subscription | Generic, fragile, no integration |
| Solo freelancer | One developer | $3K-$12K CAD | Decent if they're good |
| Small dev shop | 3-10 person studio | $6K-$23K CAD | Variable |
| Boutique AI agency | Compass-tier specialists | $8K-$30K CAD | Premium, scoped, operator-led |
| Large consultancy | Deloitte, KPMG, other Big 4 firms | $40K-$150K+ CAD | Slide-heavy, slow to ship |
On top of the build cost: a monthly retainer ($300-$800 CAD) for hosting, monitoring, and iteration on flagged calls, plus per-minute voice-model and telephony costs of $0.08-$0.15/minute. A 400-call month with 3-minute calls is 1,200 minutes: about $144 at $0.12/minute, or $96-$180 across the full range.
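The usage arithmetic is easy to rerun with your own call volume. A minimal sketch, where the function name is mine and the $0.12 midpoint rate is an assumption chosen because it reproduces the $144 figure:

```python
def monthly_usage_cost(calls: int, minutes_per_call: float, rate_per_minute: float) -> float:
    """Per-minute voice-model + telephony cost for one month of calls."""
    return calls * minutes_per_call * rate_per_minute


# 400 calls x 3 minutes = 1,200 billable minutes.
low = monthly_usage_cost(400, 3, 0.08)   # bottom of the quoted rate range
mid = monthly_usage_cost(400, 3, 0.12)   # assumed midpoint rate
high = monthly_usage_cost(400, 3, 0.15)  # top of the quoted rate range

print(f"low ${low:.2f}, mid ${mid:.2f}, high ${high:.2f}")
```

This covers usage only; the retainer and the one-time build cost sit on top of it.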
Where Compass sits: scope-dependent within the boutique-agency range. Smaller voice-only after-hours builds anchor near $8K. Full multi-system integrations for regulated industries (Law Society compliance, audit trails, escalation logic, CRM bidirectional sync) sit at $25K-$30K. The scope is locked at the Charting stage, after the Bearings call - there's no "estimate" you discover halfway through the build.
Where humans still have to step in
Three places, in this order of frequency:
Complex pricing or scoping conversations. The agent can quote standard rates. If the caller wants a custom proposal, the agent should book a callback with the owner, not improvise pricing.
Anything emotionally heavy. A grieving family calling a funeral home, a tenant in crisis calling a property manager, a frustrated existing client calling about a billing dispute - the agent should recognize tone and escalate cleanly.
The first month of go-live. No matter how well-trained the agent is, the first 100-200 real calls surface edge cases nobody anticipated. This is where the iteration retainer earns its keep.
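The first two hand-off rules above can be written down as an explicit routing decision. This is a minimal sketch, not how any particular agent implements it: the three boolean signals are assumed to come from some upstream classifier, and the priority order (emotional or urgent calls beat pricing questions) is my assumption.

```python
from enum import Enum


class Route(Enum):
    AGENT_HANDLES = "agent_handles"
    BOOK_OWNER_CALLBACK = "book_owner_callback"
    ESCALATE_TO_HUMAN = "escalate_to_human"


def route_call(wants_custom_pricing: bool, emotionally_heavy: bool, says_urgent: bool) -> Route:
    """Decide who owns the rest of the call, given hypothetical upstream signals."""
    # Emotional or urgent calls go to a person first, whatever else was asked.
    if emotionally_heavy or says_urgent:
        return Route.ESCALATE_TO_HUMAN
    # Custom proposals get a callback with the owner, never improvised pricing.
    if wants_custom_pricing:
        return Route.BOOK_OWNER_CALLBACK
    # Standard rates, hours, and bookings stay with the agent.
    return Route.AGENT_HANDLES


print(route_call(wants_custom_pricing=True, emotionally_heavy=False, says_urgent=False))
print(route_call(wants_custom_pricing=True, emotionally_heavy=True, says_urgent=False))
```

Writing the rules down this plainly is also a useful thing to ask a vendor for: if they can't show you the equivalent of this table, the escalation logic probably lives in a prompt and nowhere else.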
How to evaluate one before you buy
If you're getting a pitch from any vendor - including Compass - ask for these:
Listen to three real call recordings from another deployment. Marketing demos are scripted. Real call recordings (anonymized) show what actually happens.
Ask about the escalation rules. "What happens if a caller starts crying" is a real question. So is "what happens if a caller asks for a refund."
Get the cost breakdown line-by-line. Voice model, telephony, retainer, build. If the vendor can't break it out, they're either inexperienced or hiding markup.
Ask about the failure modes. A good vendor has a list of things their agent doesn't do well and will tell you in the first conversation.
If you want to hear what a real Compass-built receptionist sounds like, the number on the homepage is live - call it.
- Bobby