Key Takeaways
- A prospect judges your agent's voice in about 400 milliseconds, before words register.
- Keep the first reply under 800 milliseconds or abandonment climbs sharply.
- Open with a warm greeting and admit you are AI only when asked.
Every AI outbound call is won or lost in the opening, and at Dograh we build agents that treat those first seconds as an engineering problem, long before the pitch begins.
The first 15 seconds are a gut decision
A prospect decides whether to keep listening in seconds, and that decision runs on instinct.
Someone who picks up an unknown number is already halfway to hanging up. Data from Leads at Scale shows prospects decide whether to stay on a cold call in about 8 seconds, and around 82% hang up within the first 30 seconds when nothing grabs them. The first 15 seconds are the whole game.
The math gets worse before the call even connects. CallHippo's 2026 cold calling data puts the answer rate near 28%, with roughly 1 in 6 dials reaching a live person. When so few calls get picked up, every answered call is precious, and you cannot afford to waste its opening.
There is a biological clock running too. A 2025 study in the Quarterly Journal of Experimental Psychology found listeners form trait impressions of a voice, including how trustworthy it sounds, within about 400 milliseconds of the first word. A single "hello" is enough. Your agent is being judged before it finishes its own greeting.
That is why the opening deserves real engineering attention. The voice agent market is climbing fast, from about $2.54B in 2025 toward a projected $35.24B by 2033 per Grand View Research, and the outbound segment is growing quickest. More agents will be dialing, so the ones that clear the first 15 seconds win the category.
Latency is the first thing a prospect hears
Before your words land, the delay in front of them already tells the prospect what they are talking to.
Humans expect a reply almost instantly. In natural conversation the turn-taking window sits around 200 to 300 milliseconds. AssemblyAI's research on voice AI latency shows that once a voice agent crosses one second of delay, conversations start to feel broken and abandonment climbs by more than 40%.
The bar for a natural-feeling agent is under about 800 milliseconds end to end. Hamming AI's latency benchmarks put that number as the line between smooth and awkward, with quality falling off past 1,500 milliseconds. Most agents still land between 800 milliseconds and 2 seconds because delay compounds at every stage, from turning speech into text to turning the model's reply back into speech.
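To make the compounding concrete, here is a minimal sketch of a latency budget check. The stage names and millisecond values are hypothetical placeholders, not measurements from Dograh or any vendor; only the 800 millisecond target comes from the text above.

```python
from dataclasses import dataclass

BUDGET_MS = 800  # target from the text: under about 800 ms end to end


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int


def total_latency_ms(stages):
    """Sum per-stage delays; the stages run in sequence, so their delays add up."""
    return sum(stage.latency_ms for stage in stages)


def over_budget_ms(stages, budget_ms=BUDGET_MS):
    """Return how far the pipeline exceeds the budget, or 0 if it fits."""
    return max(0, total_latency_ms(stages) - budget_ms)


# Hypothetical numbers for illustration only.
pipeline = [
    Stage("network_in", 60),
    Stage("speech_to_text", 250),
    Stage("llm_first_token", 400),
    Stage("text_to_speech", 200),
    Stage("network_out", 60),
]
```

With these made-up values the total is 970 milliseconds, 170 over budget, even though no single stage looks slow on its own. That is the sense in which delay compounds.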
That stack is where openings die. A half-second pause after the prospect says "hello" reads as a machine thinking. We keep first-response latency low by colocating the speech and model services so there are fewer network hops between them, which is far easier when you can self-host the pieces. The full breakdown of how to hit that budget lives in our sub-800ms speech latency playbook.
A human voice beats a clever script
The words matter less than whether the voice sounds like a person the prospect would actually talk to.
You can write the perfect opener and still lose if it sounds synthetic. The reason ties back to that 400 millisecond judgment. If the voice feels off, the prospect stops listening to the words and starts listening for the seam where the robot shows through.
Our approach is to open with a real human recording. In Dograh, the agent plays a pre-recorded human clip when one fits the moment, and only falls back to generated speech when the conversation goes off-script. We recommend recording the opener with a real voice, then cloning that same voice for the text-to-speech fallback so the call sounds like one consistent person from the first word to the last.
This hybrid also cuts cost by roughly 3x against full text-to-speech, and it trims latency because a pre-recorded clip plays instantly with no synthesis step. The opening line, the part that decides the call, is the part most worth capturing as real audio.
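The recording-first idea can be sketched as a small selection step. This is an illustration under assumptions, not Dograh's actual API: `AudioPlan`, `CLIPS`, `plan_audio`, and the clip paths are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioPlan:
    source: str   # "recording" or "tts"
    payload: str  # clip path for a recording, text to synthesize for TTS


# Hypothetical clip library mapping a scripted step to a pre-recorded file.
CLIPS = {
    "opener": "clips/opener.wav",
    "ask_permission": "clips/ask_permission.wav",
}


def plan_audio(step: Optional[str], reply_text: str, clips=CLIPS) -> AudioPlan:
    """Prefer a human recording for scripted steps; otherwise fall back to TTS."""
    if step is not None and step in clips:
        return AudioPlan("recording", clips[step])
    # Off-script turn: synthesize the reply (with the cloned voice, per the text).
    return AudioPlan("tts", reply_text)
```

The scripted opener resolves to a file that can play immediately, while an unscripted reply carries text that still has to go through synthesis.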
Faking a human is a mistake, and so is a flat synthetic greeting. The opener should carry the warmth of a real person, because the prospect's ear makes up its mind in milliseconds. How you say the first line is worth more than what it says.
Give them a reason only a real call would have
A personalized hook in the opening line is what separates a welcome call from a robocall.
Generic openers get treated like spam. The fix is to say something in the first sentence that proves the call is about this specific prospect.
The lift here is large. Martal Group reports that referencing recent company news or a specific pain point in the opener can raise call-to-meeting conversions by up to 70%, while AI-driven personalization improves conversion by 30 to 50%. That only works if the agent knows the detail before it speaks.
Dograh handles this with a pre-call fetch that pulls fresh data before the call connects, such as an order ID or a renewal date from your CRM. The agent walks in already knowing why it called this specific person. Writing that opening line well is its own craft, and we cover the patterns in our voice AI prompting guide. For the wider view on outbound that converts, see our guide to making AI outbound calls work.
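A rough sketch of that flow, assuming the CRM is a plain dictionary: the function names, field names (`first_name`, `renewal_date`, `order_id`), and opener wording are hypothetical and stand in for whatever your CRM and script actually use.

```python
def fetch_prospect_context(crm: dict, prospect_id: str) -> dict:
    """Stand-in for a CRM lookup done before dialing; returns {} if nothing is found."""
    return dict(crm.get(prospect_id, {}))


def build_opener(context: dict) -> str:
    """Build the first line from whatever was fetched, most specific detail first."""
    name = context.get("first_name")
    greeting = f"Hi {name}," if name else "Hi there,"
    if "renewal_date" in context:
        return f"{greeting} I'm calling about your plan that renews on {context['renewal_date']}."
    if "order_id" in context:
        return f"{greeting} I'm calling about order {context['order_id']}."
    # No specific detail available: fall back to a generic line.
    return f"{greeting} do you have a quick moment?"


# Hypothetical CRM data for illustration.
FAKE_CRM = {
    "p-1": {"first_name": "Ana", "renewal_date": "March 3"},
    "p-2": {"first_name": "Raj", "order_id": "A-1042"},
}
```

Because the lookup runs before the call connects, the opener is ready as a finished string when the prospect says hello, and none of the latency budget is spent on it.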
Handling the "is this a robot?" moment
How and when the agent addresses being AI changes whether the prospect stays or drops off.
At some point in a good call, the prospect wonders who they are really talking to. Leading with "I am an AI assistant" as the very first words tends to backfire. Regal.ai found that when AI identity is announced abruptly at the start, people hang up faster or collapse into short, robotic yes-or-no answers, which kills the conversation you were trying to have.
The other extreme, pretending to be human when asked directly, breaks trust and can cross legal lines depending on where you call. The workable path sits in the middle. Open naturally with a warm, personalized greeting, and when the prospect asks whether they are talking to a real person, answer plainly and keep the conversation moving.
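One way to picture the middle path is a check that runs on each prospect utterance. The sketch below uses a simple regular expression as a stand-in; the phrasings it covers and the disclosure wording are assumptions, and a production agent would more likely rely on the language model or an intent classifier.

```python
import re

# Hypothetical phrasings of the "is this a robot?" question.
_IDENTITY_QUESTION = re.compile(
    r"\b(?:(?:are|is)\s+(?:you|this)|talking\s+(?:to|with))\s+"
    r"(?:an?\s+)?(?:real\s+)?(?:robot|bot|ai|human|person|machine|recording)\b",
    re.IGNORECASE,
)


def asks_about_identity(utterance: str) -> bool:
    """True when the prospect directly asks whether they are talking to a machine."""
    return _IDENTITY_QUESTION.search(utterance) is not None


def next_reply(utterance: str, planned_reply: str) -> str:
    """Answer the identity question plainly, then keep the conversation moving."""
    if asks_about_identity(utterance):
        return "I'm an AI assistant calling on behalf of the team. " + planned_reply
    return planned_reply
```

The disclosure is prepended to the planned reply, so the honest answer and the next step of the conversation arrive in the same turn.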
A big part of handling that moment gracefully is turn detection, the model's ability to tell when the prospect has actually finished speaking. Weak turn detection creates awkward pauses and agents that talk over people, both of which scream "robot." Newer models like Deepgram's Flux turn-detection system cut those collisions so the exchange feels human even when someone interrupts.
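To see why weak turn detection produces both failure modes, consider the naive baseline that model-based systems improve on: a fixed silence timeout. The sketch below is that baseline only, not how Flux or any other model works, and the function name, frame size, and thresholds are assumptions chosen for illustration.

```python
def end_of_turn_index(frames_are_speech, frame_ms=20, silence_ms=500):
    """Return the index of the frame at which the turn is declared over, or None.

    frames_are_speech: iterable of booleans, one per audio frame (True = speech).
    The turn ends once speech has been heard and is then followed by at least
    silence_ms of continuous silence.
    """
    needed = -(-silence_ms // frame_ms)  # ceiling division: silent frames required
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(frames_are_speech):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


# A prospect who pauses for 200 ms mid-sentence, then finishes (20 ms frames).
frames = [True] * 5 + [False] * 10 + [True] * 5 + [False] * 25
```

With a 500 millisecond timeout the agent waits through the mid-sentence pause but then adds a full half second of dead air before replying. Drop the timeout to 200 milliseconds and it fires during the pause, talking over the prospect. No fixed threshold avoids both problems, which is the gap model-based turn detection aims to close.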
Put all of this together and the opening is really an infrastructure decision. A permission-based opener with an honest recording disclosure is good practice, and the words of the first line still matter. For an AI call, though, the prospect reacts to the delivery layer first, the speed of the reply and the humanity of the voice, before a single word of the script has time to work.
So the opening we build at Dograh runs as one system. Low first-response latency comes from colocating open-source speech and model services to cut network hops, which is far easier when you self-host the stack. The human opener is a pre-recorded clip with a cloned-voice fallback, and the relevant hook comes from the pre-call fetch that runs before the phone rings. Because the platform is open source and self-hostable, the call data never has to leave your own servers, which matters for regulated outbound like collections and healthcare reminders. We go deeper on that in our take on why on-prem will win enterprise voice AI.
The opening is the cheapest place to win an outbound call and the easiest place to lose one. Get the first reply out fast and make the first line sound like a person who has a real reason to call. Do that, and the rest of the call finally gets a chance to happen.
Glossary
- Turn detection: The model's judgment of when a speaker has actually finished a turn, so the agent replies at the right moment instead of interrupting or leaving an awkward gap.
- Time to first token: The delay between the prospect finishing speaking and the language model producing its first piece of output. It is a major contributor to the pause a caller hears before a reply.
- Barge-in: The ability for a caller to interrupt the agent mid-sentence and have it stop and listen, the way a person would in a real conversation.
- Voice cloning: Creating a synthetic copy of a specific real voice so the text-to-speech fallback matches the human recording used for the opener, keeping one consistent voice across the call.
Frequently asked questions
Short answers to what people ask about the opening seconds of an AI outbound call.
