ARTICLE · 05 / 05 · CREATIVE

How the podcast is made.

7 MIN READ · PUBLISHED JUNE 2026 · FILED UNDER CREATIVE · METHOD

Rutger Tuit
Written in a personal capacity; views are his own, not Google's.

The Seam is an AI-made podcast, and the hard part was never generating the speech — that’s solved. The hard part is making it sound like three people in a room instead of three clean voiceovers. The workflow does that by deliberately writing the imperfection back in: scripts staged like a screenplay with the stumbles left on, voices on ElevenLabs’ most expressive setting, real overlapping audio, one shared “microphone” across every speaker, a little room, and a seeded layer of chairs and coughs that never repeats. Prompted, then chosen.

01 The uncanny part

The first time I listened back to a finished episode, the unsettling thing wasn’t that the voices were synthetic. It was that one of them was mine — a clone of my own voice, arguing about marketing with four people who have never existed, in a studio that was never built. And it sounded, more or less, like a real conversation.

That “more or less” is the whole job. Getting a machine to say words convincingly is, by 2026, a solved problem. Getting a machine to sound like a room — people breathing, bumping the mic, talking over each other, thinking out loud — is not. The Seam is my standing experiment in closing that last gap, and this is how it actually gets made.

A vintage broadcast microphone beside its faintly out-of-sync reflection in dark glass; warm rust light, fading to black.

02 Clean is the tell

Here’s the counter-intuitive part. The instinct with a new tool is to chase fidelity — cleaner audio, crisper words, fewer artefacts. But clean is exactly what gives a synthetic podcast away. A real recording is full of small failures: a chair creaks, someone starts a sentence twice, two people land on the same word and one backs off, a cup goes down on the table mid-thought. Strip all of that out and you get something technically perfect and instantly fake — the audiobook uncanny valley.

So the work runs in the opposite direction to the tooling. The model wants to give you a flawless read; the craft is putting the flaws back in, on purpose, in the right places. Realism lives in the seams — which is, conveniently, the name of the show.

A flawless acoustic tile beside a worn, chipped one on a workbench; the worn one lit warm, the perfect one in shadow.

03 The pipeline

The workflow is a small set of scripts, run in order. First the script itself, written like a screenplay rather than an essay: a fixed cast with distinct registers, a real argument to have, and — critically — the disfluency written into the page. Half-finished sentences, “well, no, I mean”, a host who thinks out loud and a guest who is dry and certain. The model performs what’s on the page, so if the page is too tidy, the read is too tidy.

Then the voices. Each character is an ElevenLabs voice — mine is a clone of my own; the rest are designed — rendered through the v3 text-to-dialogue model, which takes the whole conversation at once and matches prosody across speakers. The expressiveness dial is set to its loosest (“Creative”), which trades a little stability for a read that hesitates and lands like a person rather than a narrator. On top sits a bookend trailer and a small ad system, so each episode opens and closes like a real show.
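To make the "screenplay, not essay" idea concrete, here is a minimal sketch of how such a script could be turned into render-ready turns. It is an illustration, not the show's actual code: the `SPEAKER: text` line format, the `Turn` class and `parse_script` are all assumptions. The only detail taken from the workflow is that tags such as "[interrupting]" sit inline in the dialogue, and that the disfluencies stay in the text untouched.

```python
import re
from dataclasses import dataclass, field

# Hypothetical line format: "SPEAKER: [optional tag] spoken text".
LINE_RE = re.compile(r"^(?P<speaker>[A-Z][A-Z ]*):\s*(?P<body>.+)$")
TAG_RE = re.compile(r"\[([a-z ]+)\]")


@dataclass
class Turn:
    speaker: str
    text: str  # kept verbatim, stumbles and inline tags included
    tags: list[str] = field(default_factory=list)

    @property
    def interrupts(self) -> bool:
        # Flag turns that need a genuine overlap in the later mix.
        return "interrupting" in self.tags


def parse_script(script: str) -> list[Turn]:
    """Split a screenplay-style script into speaker turns."""
    turns = []
    for raw in script.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # blank lines and stage directions are skipped
        body = match["body"]
        turns.append(Turn(match["speaker"].title(), body, TAG_RE.findall(body)))
    return turns


demo = """HOST: So the thing is, well, no, I mean...
GUEST: [interrupting] It isn't.

(a cup goes down on the table)"""

for turn in parse_script(demo):
    print(turn.speaker, turn.interrupts, turn.text)
```

Keeping the text verbatim is the point: nothing in the parser tidies the half-finished sentence, because the read can only be as untidy as the page.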

Marked-up script pages resting on the faders of an analogue mixing console, raked by warm rust light.

04 Engineering the flaws back in

This is the part that actually sells it — a stack of deliberate imperfections, layered after the voices are generated:

Real talk-over. The dialogue model renders turns one after another, so a written “[interrupting]” never actually overlaps. To make two people genuinely collide, the interrupting line is rendered as its own clip and mixed back over the tail of the previous one — so you hear them both for a beat, the way people really cut in.
One microphone. Each cloned voice arrives with a slightly different tone. A single shared channel strip — a broadcast EQ in the spirit of an SM7B — is printed on every speaker, so instead of three voices on three mics you get one consistent room.
A little room. A very slight reverb sits in front of that “mic”, so the voice reads as captured in a space, not in a dead vacuum — air around it, not an echo.
Foley that never repeats. A small generated library of chair creaks, cups, paper, soft coughs and breaths is sprinkled under the conversation — sparse, low, and jittered every time (pitch, level, stereo position, timing, a per-episode seed) so no two placements sound the same. It’s felt, not noticed.
Dynamics left alone. A gentle bus compressor and a wide loudness range, plus a faint room-tone bed under the whole thing, so the track breathes instead of sitting flat and over-polished.
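The talk-over step is the easiest of these to show in code. The sketch below is a simplified illustration under stated assumptions, not the production mix: audio is treated as plain lists of mono float samples, and `mix_overlap`, its half-second default overlap and its gain value are hypothetical choices. A real pipeline would more likely work on arrays and leave headroom for the summed section.

```python
def mix_overlap(previous, interruption, sample_rate, overlap_seconds=0.5, gain=0.9):
    """Lay `interruption` over the tail of `previous`.

    Both inputs are sequences of mono float samples. The result contains the
    untouched head of `previous`, a region where both clips are summed, and
    the remainder of `interruption`.
    """
    # Never overlap by more than the previous clip actually contains.
    overlap = min(int(overlap_seconds * sample_rate), len(previous))
    start = len(previous) - overlap

    # Extend the output only if the interruption runs past the previous clip.
    out = list(previous) + [0.0] * max(0, len(interruption) - overlap)

    # Sum sample by sample; no clipping protection in this sketch.
    for i, sample in enumerate(interruption):
        out[start + i] += gain * sample
    return out


# Tiny example at a toy sample rate of 10 Hz: 1 s of speech, then a 0.6 s cut-in.
line = [1.0] * 10
cut_in = [0.5] * 6
mixed = mix_overlap(line, cut_in, sample_rate=10, overlap_seconds=0.5, gain=1.0)
print(len(mixed))
```

The design choice that matters is that the interrupting clip is positioned relative to the end of the previous clip, so the collision always lands on the tail of the line being cut off.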

None of these are clever individually. Together they move the result from “impressive demo” to “wait, were they actually in the same room?”
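The foley layer can be sketched the same way. This is a hedged illustration of seeded jitter, assuming a hypothetical `place_foley` function and `FoleyEvent` record; every number in it (average gap, level range, pitch and pan spread) is a placeholder, not a setting from the show. The only idea carried over is that pitch, level, stereo position and timing are all randomised from a per-episode seed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FoleyEvent:
    clip: str               # which sound from the library
    time_s: float           # where it lands in the episode
    gain_db: float          # kept low so it is felt, not noticed
    pitch_semitones: float  # small detune so repeats never match
    pan: float              # -1.0 = hard left, 1.0 = hard right


def place_foley(clips, duration_s, episode_seed, mean_gap_s=45.0):
    """Scatter sparse, jittered foley events across an episode."""
    rng = random.Random(episode_seed)  # same seed gives the same placement
    events = []
    t = rng.expovariate(1.0 / mean_gap_s)  # irregular gaps, not a fixed grid
    while t < duration_s:
        events.append(FoleyEvent(
            clip=rng.choice(clips),
            time_s=round(t, 2),
            gain_db=rng.uniform(-30.0, -22.0),
            pitch_semitones=rng.uniform(-1.0, 1.0),
            pan=rng.uniform(-0.6, 0.6),
        ))
        t += rng.expovariate(1.0 / mean_gap_s)
    return events


library = ["chair_creak", "cup_down", "paper", "soft_cough", "breath"]
for event in place_foley(library, duration_s=300.0, episode_seed=5):
    print(event)
```

Seeding per episode means a render can be reproduced exactly while it is being judged, and the next episode still gets a different room.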

Patch cables tangled with foley objects — a chair leg, a coffee cup, crumpled paper — in warm rust light, fading to black.

05 Prompted, then chosen

The line at the bottom of this whole site is “prompted, then chosen,” and the podcast is the clearest example of it. The model generates endlessly; it will hand you a competent version forever. What it can’t do is decide which take has a pulse, where a beat should breathe, when an interruption is funny versus annoying, or whether the room finally sounds real. That judgement, the choosing, is the part that stays human, and as the tools get better it only becomes more valuable, not less.

So: nothing here was recorded in a studio, and every second of it was chosen by one person. If you want to hear where the seams landed, the show is a click away. Go listen for the chair that creaks at exactly the wrong moment. That one was on purpose.

A row of identical tape reels in shadow with one pulled forward into warm light — the chosen take.

SOURCES & METHODOLOGY

Where the numbers came from.

The podcast itself — all episodes of The Seam.
Listen at rutgertuit.nl/podcasts. Every voice is synthetic; the views are personal, not Google’s.
The voice model: ElevenLabs Eleven v3 (text-to-dialogue).
Multi-speaker generation with shared prosody and audio tags — elevenlabs.io/docs.
The full toolchain is listed openly.
Models, infrastructure and prompt pipelines are itemised in the colophon.

If any claim here is mis-cited or out of date, mail me at rt.nl/contact and I'll fix or retract.