The CFO Was Never On the Call: Deepfake-Driven BEC in 2026
A finance director joins a Zoom call. The CFO is on the screen, voice and face perfectly familiar, requesting an urgent wire transfer. The transfer goes through. The CFO never logged in.
In 2024, this exact playbook cost engineering firm Arup roughly twenty-five million dollars in Hong Kong. In 2026, the cost of running this attack has fallen below five US dollars and requires under thirty seconds of public training audio. The infrastructure to do this at industrial scale is now sitting in consumer SaaS products.
// the threat model has shifted
Traditional BEC playbooks assume a text-based attack: spoofed email, lookalike domain, social-engineered urgency. Defensive guidance was built around DMARC, DKIM, SPF, and "verify the sender's email domain." All of that still matters. None of it covers the current attack vector.
The current attack vector is real-time voice and video synthesis, deployed on live conferencing platforms. Open-source models like FaceFusion and commercial offerings like ElevenLabs Pro have collapsed the technical barrier. The latency required for a convincing real-time conversation has dropped below two hundred milliseconds. The training audio requirement has dropped to under a minute.
Sora 2 and Veo 3 enable pre-recorded video that survives casual scrutiny. The combination — pre-recorded video for the appearance plus real-time voice cloning for the dialogue — is what attackers are using now.
// what mfa cannot save you from
The first thing to understand: this attack does not bypass authentication. It bypasses the human in the loop. Your finance director has authenticated correctly. They are on the right Zoom call. They are talking to what looks like the right person. The compromise is not at the auth layer — it is at the trust-the-call layer.
Identity verification at the start of the call does not help, because the attacker is on the same call as a legitimate participant. Speaker verification on the conferencing platform does not help — the platform sees a verified meeting host inviting a guest. The guest just happens to look and sound like the CEO.
// what actually works
The defensive controls below are not novel. They are operational discipline that most organizations have not implemented because, until recently, they felt like overkill. They no longer do.
Out-of-band callback verification. Any wire transfer above an organizationally defined threshold should require verification via a callback to a pre-shared phone number. Not the number on the email. Not the number offered on the call. The number stored in the procurement system from when the relationship was established, captured before any social engineering took place.
Code phrases. Yes, like spy films. Pre-agreed code phrases between executives and finance teams, rotated quarterly, used as a final challenge for any approval over a defined value. The reason this technique appears in fiction is that it works in reality. A deepfake of someone's voice cannot reproduce a code phrase the original person never spoke.
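A sketch of what quarterly rotation looks like in practice, assuming phrases are provisioned out of band and stored only as salted hashes. The store, salt handling, and rotation scheme here are illustrative, not a standard; the load-bearing details are the constant-time comparison and failing closed when no phrase is on file for the current quarter.

```python
# Quarterly code-phrase verification (sketch). Phrases are stored hashed,
# keyed by quarter; a missing phrase for the current quarter fails closed.
import hashlib
import hmac
from datetime import date

def current_quarter(today: date) -> str:
    return f"{today.year}Q{(today.month - 1) // 3 + 1}"

def hash_phrase(phrase: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", phrase.encode(), salt, 100_000)

SALT = b"per-deployment-salt"  # in practice: random, stored per record

# Provisioned at each rotation; the phrase itself is a made-up example.
PHRASE_STORE = {"2026Q1": hash_phrase("emerald lighthouse", SALT)}

def verify_phrase(spoken: str, today: date) -> bool:
    expected = PHRASE_STORE.get(current_quarter(today))
    if expected is None:
        return False  # no phrase on file this quarter: fail closed
    # compare_digest avoids leaking match position via timing.
    return hmac.compare_digest(hash_phrase(spoken, SALT), expected)
```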
Liveness challenges. Real-time deepfake models still degrade noticeably under unscripted physical motion. Ask the person to turn their head sharply, hold up a specific number of fingers, or move the camera. Pre-recorded video fails immediately. Real-time synthesis fails on novel gestures. This is a stopgap — the technology will improve — but in the current threat landscape it is effective.
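The property that makes the challenge work is unpredictability: it must be chosen at call time from real randomness, so neither pre-recorded video nor a scripted real-time model can anticipate it. A minimal sketch, with an illustrative gesture list:

```python
# Unpredictable liveness challenge (sketch). secrets, not random, so the
# choice is not reproducible from a guessable seed.
import secrets

GESTURES = [
    "turn your head sharply to the left, then to the right",
    "hold up {n} fingers on your right hand",
    "pick up your camera and pan it across the room",
    "cover your mouth with your hand, then lower it slowly",
]

def liveness_challenge() -> str:
    """Pick a gesture at call time; fill in any numeric parameter randomly."""
    template = secrets.choice(GESTURES)
    return template.format(n=secrets.randbelow(4) + 2)  # 2..5 fingers
```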
Dual control. No single human should be able to approve a transfer above a meaningful threshold based on a video call alone. Requiring two distinct authenticated approvals through the financial system, not through the conferencing platform, moves the trust boundary back to systems with stronger guarantees than the human eye and ear.
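As a data-model sketch, dual control is a small invariant: above the threshold, a transfer is releasable only with two distinct authenticated approver identities. The `Transfer` type and threshold below are assumptions for illustration; the point is that the invariant lives in the financial system's release logic, where a deepfake on a video call cannot touch it.

```python
# Dual control enforced in the financial system itself (sketch).
from dataclasses import dataclass, field

DUAL_CONTROL_THRESHOLD = 25_000  # USD; assumed organizational threshold

@dataclass
class Transfer:
    amount: float
    approvals: set[str] = field(default_factory=set)  # authenticated user ids

    def approve(self, user_id: str) -> None:
        # Each call represents a distinct authenticated session's approval.
        self.approvals.add(user_id)

    def releasable(self) -> bool:
        if self.amount < DUAL_CONTROL_THRESHOLD:
            return len(self.approvals) >= 1
        # A set cannot count the same approver twice, so "two approvals"
        # necessarily means two distinct people.
        return len(self.approvals) >= 2
```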
Targeted training. Generic phishing training does not cover this. Finance staff, executive assistants, and treasury operators need specific tabletop exercises against deepfake scenarios. They need to feel the social pressure of being asked by a "C-level" to bypass procedure, and they need explicit organizational backing to refuse. "Trust your instincts" is not a control — clear procedural authority is.
// detection technology
Several vendors are building real-time deepfake detection for conferencing platforms — Reality Defender, Pindrop, Sensity AI. The technology exists. It is not yet good enough to be the only line of defense. Detection accuracy degrades against the latest generation of synthesis models, and the false positive rate creates real friction for legitimate calls.
The honest assessment in 2026: deploy the detection technology where you can, but do not depend on it. The procedural controls above carry the load.
// the larger pattern
This category of attack is the leading edge of a broader shift. The attack surface is no longer the email, the network, or the application. It is the trusted communication channel that humans use to coordinate work. The voice you recognize. The face on the screen. The conversational dynamics that signal legitimacy.
Application security as a discipline has historically been about code, infrastructure, and data flows. The discipline now extends to the human protocols that surround those systems. The threat model that does not include synthetic media is incomplete.
If your incident response runbook does not include "what we do when an employee reports an executive impersonation," it is missing a chapter that 2026 has made mandatory.