The Evolution of Video Models

MOTION · FOUR VEO GENERATIONS · ONE SCENE

The brief got quieter.

I've been re-prompting the same eight-second scene since late 2024. Me at the Moog in the studio. A slow four-chord progression, warm key light, dust in the air. Nothing fancy. The kind of clip I'd cut into a portfolio reel.

The output got better, predictably. The brief got shorter, much less predictably. Veo 2 needed 260 words and a workaround for the silence. Omni needed fourteen words and a follow-up note. Watch the four briefs side by side and the architectural story tells itself.

BRIEF · 260 WORDS

A medium-shot, eye-level cinematic video. A bald man in his early forties — full beard, light stubble at the jawline, dark slate-grey hoodie under a heavier black overshirt — is seated at a vintage analog synthesizer in a converted Rotterdam warehouse studio. The synth is a Moog Voyager, wood side-panels, two-handed posture, his right hand on the keys at mid-keyboard, his left hand adjusting the cutoff knob in the upper-left corner of the panel.

Soft, even key light from camera-left at roughly 3200K. A cooler 4500K rim light from camera-right, separating him from the dark concrete wall behind. Visible dust particles in the shafts of light. Camera: 35mm lens, slight push-in at 5% over the eight-second duration. The subject is fully concentrated; he does not look at the camera. He plays three or four chords across the eight seconds. His hand on the cutoff knob moves once, slowly, mid-clip.

Frame the shot tight enough to read his face but wide enough to include the synth panel from edge to edge. Exposure values warmer in the midtones, deep blacks in the shadows. Style reference: cinematic film, 1080p, 24 fps, slight cinematic film grain. No text on screen.

*No music in the generated audio — the synth itself should not be audible, this clip will be scored separately in post.*

OUTPUT · VEO 2

VEO 2 · DEC 2024 · 260 WORDS

Architectural state: Latent diffusion, no native audio path, no multi-turn memory.

What the brief has to carry: The entire world. Coordinates, color temperatures, lens choice, posture, motion timing. The last line of the prompt is a workaround for a capability the model doesn't have.

Call it the Briefing Inversion. The relationship between director and renderer is flipping. The Veo 2 brief reads like stage directions because the model couldn't infer anything — the prompt had to be the world. The Omni brief reads like a note to a competent colleague who has already read the room. Same scene. Same imperfect reference. Different cognitive contract.

The four-column truth

Generation	Brief carries	Model infers
Veo 2	The whole world. Hardware-level specs. A workaround for silence.	Almost nothing — it renders what you describe.
Veo 3	Scene + mood + audio direction.	Hardware-level visual details. The bridge between cutoff knob and sound.
Veo 3.1	A paragraph of intent. A reference image. Two compute tiers.	Character identity. Wardrobe. Camera path from an adjective.
Omni	A sentence. A series of follow-up notes that each cost nothing.	Physical continuity across edits. The architecture of the scene.

The Briefing Inversion. The brief used to be the world. Now the model is the world, and the brief is a note to it.

That contract change is what the rest of this page is about. The four sections below are four different ways of looking at why it happened.

CATALOG · WHAT EACH MODEL IS ACTUALLY FOR

Four lineages. One family.

The shrinking brief in Section 1 is the consequence. The cause is that Google quietly built four parallel model families between 2023 and 2026, and then collapsed them into one. To understand the collapse, it helps to see what each was actually for — and what it's still actually for, even after Omni absorbed the rendering jobs.

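The four lineages: Veo (four generations), Lyria (three generations plus RealTime), Genie (three generations plus Project Genie), and Gemini Omni (Flash now, Pro upcoming). Pick the one you're closest to using.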

All four still exist as products. Omni didn't kill them — it consolidated the rendering pipeline, not the roadmap. Veo 3.1 Lite still ships because cost-tiered batch rendering is a different job than conversational editing. Lyria RealTime still ships because adaptive game music is a different job than one-shot composition. Pick the model that matches the job. The marketing department's "Omni does everything" is true at a high level and unhelpful at a workflow level.

ARCHITECTURE · CASCADE VS. SINGLE-PASS

Why one model replaced three.

Until May 2026 the standard way to make AI video with sound was a chain of tools. Veo renders the picture. A separate model dubs in audio. A separate editor handles the cuts. A separate watermarking pass embeds provenance. Four tools, four context windows, four moments where the seams can show.

The seams do show. Audio drifts a few frames out of lip-sync by the second cut. The character's hand looks slightly different in the close-up because the second tool didn't see the first tool's full latent state. Six hours of post-production go into hiding what is, structurally, the same problem: every tool only sees its own slice.

Gemini Omni doesn't fix that problem. It refuses to have it.

Here's what changed under the hood.

PANEL A · CASCADE

Accumulates errors. Loses character consistency across cuts. Audio drifts. Six hours of post-production go into hiding the seams.

PANEL B · SINGLE-PASS

Single coherent pass. The model sees every modality at once. Audio amplitude peaks are mapped to visual timeline as part of the same computation that places the pixels.

What it enabled — three jobs.

Change the butterfly to a bee. Don't change anything else.

The edit lives in the conversation, not the regenerated scene. Veo would have given you a new flower in a new room.

Visuals choreographed to a track you upload.

The model isn't generating audio to match video. It's parsing your audio's amplitude peaks and treating them as a visual timeline. Veo couldn't read the file at all.
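To make "amplitude peaks as a visual timeline" concrete, here is a minimal sketch of the idea in isolation. It is not how Omni does it internally, which isn't documented here. The function names, the sample envelope, the 10 Hz envelope rate, and the 0.5 threshold are all assumptions chosen for illustration; only the 24 fps figure comes from the Veo 2 brief above.

```python
def amplitude_peaks(envelope, threshold):
    """Return indices of local maxima in `envelope` at or above `threshold`."""
    peaks = []
    for i in range(1, len(envelope) - 1):
        is_local_max = envelope[i] > envelope[i - 1] and envelope[i] >= envelope[i + 1]
        if envelope[i] >= threshold and is_local_max:
            peaks.append(i)
    return peaks


def peaks_to_frames(peak_indices, envelope_rate_hz, fps=24):
    """Convert envelope sample indices into the nearest video frame numbers."""
    return [round(i * fps / envelope_rate_hz) for i in peak_indices]


# Hypothetical loudness envelope: 10 samples per second, 2 seconds of audio.
envelope = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1,
            0.2, 0.8, 0.2, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1]

peaks = amplitude_peaks(envelope, threshold=0.5)
cue_frames = peaks_to_frames(peaks, envelope_rate_hz=10, fps=24)
print(peaks, cue_frames)
```

The three loud moments land on three frame numbers, and those frame numbers are the "timeline" a visual event would be pinned to. In a cascade you'd compute something like this in a separate tool and hand it over; the claim above is that a single-pass model does the equivalent as part of one computation.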

Words on screen that don't garble.

Diffusion text usually drifts mid-frame. Single-pass tokenization locks letters to the temporal grid. This was the hardest of the three, technically.

Long-form continuity. Still not solved.

Omni Flash caps at 10 seconds. That's a deployment decision: the model could go longer, but the GPU bill becomes prohibitive. Rigid-body physics edge cases (collisions, collapses) also still misbehave. The architectural shift didn't fix everything; it shifted the frontier.

HONEST LIMITATION

The architectural collapse is real, but it isn't total. Three creative jobs became natively possible. One important class of job didn't get solved. The marketing department's "Omni does everything" is a useful sentence in a keynote and a misleading one in a procurement meeting.

The most visible consequence of single-pass attention is that the editing surface is no longer a timeline. It's a conversation. Section 4 lets you have one.

SIMULATOR · CONVERSATIONAL EDITING · MOCK

Now you try.

Reading about conversational editing isn't the same as typing into it. The simulator below isn't real — it's a fake with scripted responses. But it's faithful to how Omni's editing surface actually behaves: each turn modifies the previous state instead of regenerating from scratch.

The seed scene is the one from Section 1 — me at the Moog, eight seconds. Try a few edits. Watch what changes and what doesn't.

OMNI FLASH · MOCK

TURN 1 OF UNLIMITED

Push-in: slow · Reference: 08-desk.png · Audio: synth synced

YOU

[reference image attached: 08-desk.png] Me at the Moog. Eight seconds. Cinematic, slow chord progression. Warm + cool light mix.

OMNI (MOCK)

Generated. The character matches the reference's body and wardrobe — face is reconstructed from the partial signal. Eight-second clip at 1080p with synced synth audio. Slow push-in across the duration.

Two things to notice about your last edit:

One: the room stayed. The lighting stayed. The Moog stayed. Even when you asked for a new element, the rest of the scene was preserved. That's the multi-turn context surviving across the conversation.

Two: you wrote like you were talking to someone. Not specifying coordinates. Not writing in third person. The mode of writing the brief shifted before you noticed.
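The first point, that each turn modifies the previous state instead of regenerating it, can be sketched as a data structure. This is a toy model of the behaviour, not Omni's API: `SceneState`, `apply_edit`, the field names, and the example edits are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneState:
    """Hypothetical snapshot of what the conversation has established so far."""
    subject: str
    instrument: str
    lighting: str
    camera: str
    extras: tuple = ()


def apply_edit(state, **changes):
    """Return a new state with only the named fields changed."""
    return replace(state, **changes)


seed = SceneState(
    subject="me, seated",
    instrument="Moog",
    lighting="warm + cool mix",
    camera="slow push-in",
)

# Turn 2: add one element. Everything not named carries over from the seed.
turn2 = apply_edit(seed, extras=seed.extras + ("coffee cup on the amp",))

# Turn 3: change only the camera move.
turn3 = apply_edit(turn2, camera="static")

print(turn3)
```

After two edits the instrument and lighting are untouched because nobody mentioned them. A regenerate-from-scratch model has no `seed` to copy from, which is why the Veo 2 brief had to restate the whole world every time.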

That conversational comfort has a cost. Every turn is a fresh inference pass on a very large, expensive-to-run model. Section 5 makes that cost visible.

COMPUTE · WHY FAST, LITE, AND REALTIME EXIST

The cost of a frame.

Generative video looks like it has no marginal cost. It has one. Pull the slider below to see what a clip actually costs in electricity. The number is the reason Google's product line looks the way it does, and it's the unspoken constraint behind every roadmap conversation a CMO is about to have with a vendor.

10 seconds

~200 Wh per second generated

TOTAL ENERGY: 2.00 kWh

120 MICROWAVE MINUTES

167 SMARTPHONE CHARGES

222 LED HOURS (9W)

Source: extrapolated from public AI compute studies. Specific per-model disclosures from Google are not available.

The video number is the one that matters. Generating AI video burns roughly 200 watt-hours per second of footage, which is about twelve minutes of running a microwave for every second. A four-minute social cut is roughly forty dishwasher loads' worth of electricity.
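The arithmetic behind those comparisons is simple enough to check by hand, and a short sketch makes the assumptions explicit. The 200 Wh per second figure is the page's own extrapolated estimate, not a Google disclosure. The appliance figures (a 1,000 W microwave, 12 Wh per phone charge, 1.2 kWh per dishwasher load) are my back-calculated assumptions; they are the values that reproduce the numbers shown above.

```python
WH_PER_SECOND = 200        # the page's extrapolated estimate, not a vendor figure
MICROWAVE_WATTS = 1000     # assumed: implied by 120 microwave minutes for 2 kWh
PHONE_CHARGE_WH = 12       # assumed: implied by 167 charges for 2 kWh
LED_WATTS = 9              # from the "9W LED" label
DISHWASHER_LOAD_WH = 1200  # assumed: implied by ~40 loads for a four-minute cut


def clip_energy(seconds):
    """Translate clip length into energy and household equivalents."""
    wh = seconds * WH_PER_SECOND
    return {
        "kwh": wh / 1000,
        "microwave_minutes": round(wh / MICROWAVE_WATTS * 60),
        "phone_charges": round(wh / PHONE_CHARGE_WH),
        "led_hours": round(wh / LED_WATTS),
        "dishwasher_loads": round(wh / DISHWASHER_LOAD_WH),
    }


ten_seconds = clip_energy(10)
four_minutes = clip_energy(240)
print(ten_seconds)
print(four_minutes)
```

Ten seconds comes out at 2 kWh, 120 microwave minutes, 167 phone charges and 222 LED hours, matching the slider. Four minutes comes out at 48 kWh, or forty dishwasher loads. Change `WH_PER_SECOND` and every downstream number moves with it, which is the honest way to read an extrapolated estimate.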

That's why Veo 3.1 Fast costs 90% less than standard Veo 3.1. That's why Veo 3.1 Lite exists. That's why Lyria RealTime ships in 2-second chunks instead of full songs. That's why Gemini Omni Flash caps at 10 seconds.

The tiering isn't product-line confusion. It's the visible surface of a thermodynamic constraint.

Cost-and-latency table

Model · tier	Estimated training compute	Typical inference latency	What it's optimized for
Veo 3.1 (Premium)	Largest training scale (undisclosed)	60–90 s for 8-second clip	Cinematic 4K, high-physics fidelity
Veo 3.1 Fast	Compressed variant	12–20 s for 8-second clip	Cost containment, iteration speed
Veo 3.1 Lite	Further compressed	Sub-15 s for 1080p	Lowest-cost batch rendering
Lyria 3 Pro	Medium training scale	15–30 s for 3-min track	Structured composition with intro/verses/chorus
Lyria RealTime	Streaming architecture	~2 s per 2-second chunk	Continuous interactive scoring
Gemini Omni Flash	Unified multimodal core	10–15 s for 10-second clip	Multi-turn conversational editing
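One way to read the latency column is to normalise it: seconds of waiting per second of output. The sketch below does that using only the figures in the table. Veo 3.1 Lite is left out because the table gives no clip length for it, and the `ROWS` structure and function name are mine.

```python
# (model, min latency in s, max latency in s, seconds of output produced)
ROWS = [
    ("Veo 3.1 (Premium)", 60, 90, 8),
    ("Veo 3.1 Fast", 12, 20, 8),
    ("Lyria 3 Pro", 15, 30, 180),
    ("Lyria RealTime", 2, 2, 2),
    ("Gemini Omni Flash", 10, 15, 10),
]


def wait_per_output_second(min_latency, max_latency, output_seconds):
    """Return (low, high) seconds of waiting per second of generated output."""
    return (min_latency / output_seconds, max_latency / output_seconds)


ratios = {
    name: wait_per_output_second(lo, hi, out)
    for name, lo, hi, out in ROWS
}

for name, (lo, hi) in ratios.items():
    print(f"{name}: {lo:.2f} to {hi:.2f} s of waiting per second of output")
```

Premium Veo 3.1 sits at roughly 7.5 to 11 seconds of waiting per second of footage, Fast at 1.5 to 2.5, and Omni Flash at 1 to 1.5. Lyria RealTime lands at exactly 1, which is what "real time" has to mean. Latency is only a proxy for cost, but it is the proxy you can measure from outside.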

Latency is what the API tells you. Cost is what the procurement team asks about. They're the same conversation in different vocabularies.

So when a vendor's deck shows you a Premium tier and a Fast tier and a Lite tier for what looks like the same model, the choice isn't really between quality and speed. It's between quality and how many of these you can afford to render this quarter. That's the procurement question. The resolution number isn't.

And it's why Omni Flash caps at 10 seconds. Not because the model can't go longer. Because at the unit cost of a single second of unified-multimodal inference, longer is a business decision someone hasn't authorized yet.

Which brings us to the only paragraph on this page that's actually about your meeting on Thursday.

INVITATION · WHAT TO DO WITH THIS

The brief, on Thursday.

The version of this conversation that happens in your office on Thursday isn't about resolution. It's about which vendor's roadmap survives 2027.

Here's the heuristic I keep coming back to. When a generative-video vendor pitches you, ask two questions. First: how many words is the brief that produces their best-case demo? If it's still 200 words of stage direction, you're looking at last-generation architecture in fresh marketing wrap. Second: how does the per-second compute cost trend across their tier line? If their Lite tier is approaching parity with their Premium tier on output quality, they've found the architecture that scales. If it isn't, they're going to lose to a competitor who has.

Neither of those is on a vendor's spec sheet. Both of them are sitting in plain sight in any demo a vendor will gladly give you. The shrinking brief and the flattening cost curve are the two leading indicators. Resolution numbers are a lagging indicator at best.

The shift isn't the model. It's the contract.
Between director and renderer. Between brief and output. Between vendor and buyer.

The Multiplier Myth.

The boardroom mistake that turns a multiplier into a margin-chop also turns a creative-AI roadmap into a procurement cliff. Different vocabulary, same shape.

If you're working on a serious version of this question inside your own organisation, the drop-me-a-line invitation from the business articles applies here too. I won't pitch you anything. I'd just like to know what's actually working.
