I Built an AI Film Crew That Directs Its Own Cartoons

The most useful thing my AI built this week was an argument. With itself.

One pass writes the cartoon — a director laying out a scene-by-scene shot list with an actual story arc. A second pass reads what the first one wrote, decides it’s not good enough, and rewrites it. Same project, two jobs, one model talking itself out of its own first draft.

That sounds like a gimmick. It isn’t. It’s the only reason any of this works, because the model doing the writing is a free one — weak, cheap, prone to wandering off — and a single weak pass produces garbage. The fix wasn’t a smarter model. It was a crew.

Here’s the part I want on the record up front, because the internet is full of “my AI makes videos” posts where the video is a screenshot: this thing makes real cartoon shorts you can download right now. It does not upload them anywhere by itself yet. I’ll be exact about that line later. First, the crew.

The pitch: one model, a whole film crew

The tool I’m building directs, writes, animates, and voices its own vertical cartoon shorts. Stickman cartoons — but with a real arc, real camera changes, a real voice reading real lines, and original music underneath. There’s a live work-in-progress demo sitting at an unlisted /lab/stickman/ URL while I keep breaking and fixing it.

The trick is to stop treating the language model like a magic box and start treating it like a film set with roles:

The director/writer pass plans the short — scene by scene, beat by beat — into a structured shot list. What’s on screen, what pose the character holds, what it says, where the camera sits. This runs on glm-4.5-flash, a free GLM model, behind a Cloudflare Worker.
The editor pass then reads the director’s draft like a critic and fixes it. Pacing too flat? Scene that doesn’t earn its place? Line that doesn’t land? The editor flags it and rewrites before a single frame renders.

Two passes, one model, and the output of round one is just raw material for round two. That self-critique loop is the difference between a coherent thirty-second story and a stickman flailing through disconnected nonsense.
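
A minimal sketch of that two-pass loop, to make the shape concrete. The `Scene` fields, the prompt wording, and the `callModel` function are illustrative assumptions, not the project's real schema or API. The model call is passed in so the same loop works against a Worker endpoint or a test stub.

```typescript
// Hypothetical shot-list entry; the field names are assumptions.
interface Scene {
  pose: string;       // a pose name the rig knows, e.g. "sit"
  background: string; // a set name, e.g. "cafe"
  line: string;       // what the character says
  camera: string;     // e.g. "wide" or "close"
}

// Any text-in, text-out model call, e.g. a fetch to a Worker endpoint.
type CallModel = (prompt: string) => Promise<string>;

const SCENE_KEYS = ["pose", "background", "line", "camera"] as const;

// Parse a model reply, rejecting anything that is not a non-empty
// array of scenes with every field present as a string.
function parseShotList(raw: string): Scene[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data) || data.length === 0) {
    throw new Error("shot list must be a non-empty array");
  }
  return data.map((item: unknown, i: number): Scene => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`scene ${i} is not an object`);
    }
    const fields = item as Record<string, unknown>;
    for (const key of SCENE_KEYS) {
      if (typeof fields[key] !== "string") {
        throw new Error(`scene ${i} is missing "${key}"`);
      }
    }
    return {
      pose: fields.pose as string,
      background: fields.background as string,
      line: fields.line as string,
      camera: fields.camera as string,
    };
  });
}

// Pass one drafts the short; pass two critiques and rewrites that draft.
async function runCrew(brief: string, callModel: CallModel): Promise<Scene[]> {
  const draft = parseShotList(
    await callModel(`You are the director. Return a JSON shot list for: ${brief}`)
  );
  const edited = await callModel(
    "You are the editor. Fix flat pacing, cut scenes that do not earn their place, " +
      `and sharpen weak lines. Return JSON only.\n${JSON.stringify(draft)}`
  );
  try {
    return parseShotList(edited);
  } catch {
    return draft; // an unusable edit falls back to the director's draft
  }
}
```

Falling back to the draft when the editor's reply fails to parse is one possible policy; retrying the editor pass would be another.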

Why a free model needs a cage to be useful

Here’s the honest engineering reality nobody puts in the demo reel: a weak free model will derail. Give it room and it invents poses that don’t exist, writes captions that don’t match the action, forgets the character is supposed to be asleep and stands it bolt upright in a bedroom.

You don’t fix that with a nicer prompt. You fix it with a harness — hard rails the model physically cannot cross.

So after the director and editor have had their say, two more crew members step in, and neither of them is an AI:

The enforcer (deterministic, no opinions)

The enforcer is plain code. It takes the model’s shot list and forces it to be physically true. The clearest example: if a scene says the character is asleep, the enforcer guarantees the renderer actually lays the figure down in a bed — not standing, not “kind of slouched,” lying down. The word “asleep” has to become a body in a bed, every time, whether or not the model remembered to ask for it.

Getting that right was real work. “Asleep” looking like sleep, and “coffee” being a cup held in a hand instead of floating next to the character — those two tiny things ate more time than the entire scripting layer. The model gives you intent. The enforcer turns intent into something that doesn’t look broken.
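
As a rough illustration of what "plain code, no opinions" can mean here, this is a sketch of a rule table that overrides whatever the model asked for. The `state` field, the rule names, and the prop names are assumptions made up for the example; only the sleep and sip poses come from the rig described in this post.

```typescript
// Hypothetical scene shape for the example.
interface StagedScene {
  state?: string;   // what the scene claims, e.g. "asleep"
  pose: string;
  background: string;
  props: string[];
}

// Each rule maps a claimed state to the facts the renderer must show.
const RULES: Record<string, { pose: string; requiredProp: string }> = {
  asleep: { pose: "sleep", requiredProp: "bed" },
  drinking_coffee: { pose: "sip", requiredProp: "coffee" },
};

// Deterministically rewrite a scene so its pose and props match its state.
// Scenes with no state, or a state without a rule, pass through untouched.
function enforce(scene: StagedScene): StagedScene {
  const rule = scene.state !== undefined ? RULES[scene.state] : undefined;
  if (rule === undefined) return scene;
  const props = scene.props.includes(rule.requiredProp)
    ? scene.props
    : [...scene.props, rule.requiredProp];
  return { ...scene, pose: rule.pose, props };
}
```

Because the function never consults the model, the same input always produces the same corrected scene, which is the property that makes it a rail rather than a suggestion.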

The diversity verifier (catches the lazy take)

The other guard is a diversity verifier. Weak models love to repeat themselves — same pose, same background, same beat, four scenes in a row. The verifier catches that sameness and makes the crew vary it, so a short actually moves instead of stalling on one frame for thirty seconds.
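
One simple way to detect that sameness is to count how long the same pose and background have run back to back. This is a sketch under assumptions: the `Shot` shape and the `maxRun` threshold are invented for the example, and the real verifier may compare more than these two fields.

```typescript
// Hypothetical minimal view of a scene for the repetition check.
interface Shot {
  pose: string;
  background: string;
}

// Return the indexes of scenes that extend a run of identical
// pose + background beyond maxRun consecutive scenes.
function findStaleScenes(shots: Shot[], maxRun = 2): number[] {
  const stale: number[] = [];
  let run = 1;
  for (let i = 1; i < shots.length; i++) {
    const prev = shots[i - 1];
    const same = shots[i].pose === prev.pose && shots[i].background === prev.background;
    run = same ? run + 1 : 1;
    if (run > maxRun) stale.push(i);
  }
  return stale;
}
```

The flagged indexes can then go back to the crew for a rewrite, or be varied by code.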

Director writes. Editor critiques. Enforcer makes it physically true. Verifier makes it not boring. That’s the whole reason a free model can carry this — it’s never trusted alone. If the idea of stacking dumb-but-reliable parts into something that behaves sounds familiar, it’s the same bet I made building a full web app with AI agents for $0: the moat is never the one clever model call, it’s the boring orchestration around it.

The renderer: a stickman rig that travels through a little world

The crew can plan a brilliant short, but something has to draw it. That something is a hand-rolled Canvas-2D skeletal rig — I didn’t import a game engine, I built the figure from joints up.

What it can do:

~20 named poses — sleep, sit, sip, run, dance, celebrate, facepalm, and a dozen more — so the shot list can call a pose by name and the rig snaps to it.
~20 props the character can actually interact with, held in the right place (yes, including that hard-won coffee cup).
Nine different backgrounds — bedroom, night, city, office, cafe, outdoors, sky, workshop, and a grid. A single short journeys through several of them, so the character moves from set to set like it’s walking through a tiny game world instead of standing in one void the whole time.

That last point matters more than it looks. Nine sets is what turns “a stickman in a box” into “a story that goes somewhere.” The director can stage a scene in the bedroom, cut to the cafe, end on the city skyline — and the renderer just has those places ready.
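
For readers who have not built a rig before, here is a stripped-down sketch of the idea behind named poses: a pose is a table of joint angles, and the rig walks the bone hierarchy to turn angles into points. The three-bone skeleton, the bone lengths, and the angle values are invented for illustration and are far smaller than a real figure.

```typescript
interface Point { x: number; y: number; }
interface Bone { name: string; parent: string | null; length: number; }
type Pose = Record<string, number>; // bone name -> angle in radians, relative to its parent

// Hypothetical three-bone rig; parents must be listed before children.
const SKELETON: Bone[] = [
  { name: "spine", parent: null, length: 40 },
  { name: "upperArm", parent: "spine", length: 20 },
  { name: "forearm", parent: "upperArm", length: 18 },
];

// Named poses are lookup-table entries (canvas y grows downward,
// so an angle of -PI/2 points up).
const POSES: Record<string, Pose> = {
  stand: { spine: -Math.PI / 2, upperArm: Math.PI, forearm: 0 },
  wave: { spine: -Math.PI / 2, upperArm: -Math.PI / 4, forearm: -Math.PI / 4 },
};

// Forward kinematics: accumulate angles down the hierarchy and
// return the end point of every bone for the named pose.
function solvePose(root: Point, poseName: string): Record<string, Point> {
  const pose = POSES[poseName];
  if (pose === undefined) throw new Error(`unknown pose: ${poseName}`);
  const ends: Record<string, Point> = {};
  const angles: Record<string, number> = {};
  for (const bone of SKELETON) {
    const start = bone.parent === null ? root : ends[bone.parent];
    const parentAngle = bone.parent === null ? 0 : angles[bone.parent];
    const angle = parentAngle + (pose[bone.name] ?? 0);
    angles[bone.name] = angle;
    ends[bone.name] = {
      x: start.x + bone.length * Math.cos(angle),
      y: start.y + bone.length * Math.sin(angle),
    };
  }
  return ends;
}
```

Drawing is then one `moveTo`/`lineTo` pair per bone on a 2D canvas context. Rejecting unknown pose names at this layer is also a cheap guard against a model that invents poses.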

This is the same engine I’ve been hammering on for weeks — except now there’s a director, an editor, and a world to move through, not just a figure that can pose.

Voice and music: timed to the words, original every time

A silent cartoon is half a cartoon. So the short gets a real voice and a real score.

Voiceover runs through ElevenLabs, and here’s the detail I’m proud of: the timing is audio-driven. The captions don’t guess — they’re paced to the actual generated speech, so the words on screen land exactly when the voice says them. No drift, no captions racing ahead of the audio.
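
To show what "audio-driven" can look like in code, here is a sketch that turns per-character speech timings into word-level caption cues. It assumes the speech step can return a start and end time for every character; the `CharTimings` shape and its field names are made up for this example and are not a real API response.

```typescript
// Hypothetical timing data: one start/end time per spoken character.
interface CharTimings {
  characters: string[];
  startSeconds: number[];
  endSeconds: number[];
}

interface Cue { text: string; start: number; end: number; }

// Group characters into words; each word becomes a caption cue that
// spans from its first character's start to its last character's end.
function toWordCues(t: CharTimings): Cue[] {
  const cues: Cue[] = [];
  let text = "";
  let start = 0;
  let end = 0;
  for (let i = 0; i < t.characters.length; i++) {
    const ch = t.characters[i];
    if (ch.trim() === "") {
      // Whitespace closes the current word, if there is one.
      if (text !== "") cues.push({ text, start, end });
      text = "";
      continue;
    }
    if (text === "") start = t.startSeconds[i];
    text += ch;
    end = t.endSeconds[i];
  }
  if (text !== "") cues.push({ text, start, end });
  return cues;
}

// The cue to show at a given playback time, or undefined during a pause.
function cueAt(cues: Cue[], time: number): Cue | undefined {
  return cues.find((c) => time >= c.start && time < c.end);
}
```

Since the cue times come from the generated audio rather than from an estimate like words per minute, the captions cannot drift away from the voice.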

Music is generated procedurally, from scratch, unique to each video. Not a library track, not a loop everyone else is using: original background audio my code composes per short. Nobody else has a claim to a waveform that was invented thirty seconds ago, so there’s no third-party track to license or clear.
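
This post does not spell out how the music is composed, so the following is only a guess at one common approach: a seeded random walk over a pentatonic scale. The scale, the walk rule, and every function name here are assumptions for illustration.

```typescript
// Small seeded PRNG (mulberry32-style) so a given seed always
// produces the same sequence of numbers in [0, 1).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// C major pentatonic as MIDI note numbers (an arbitrary choice).
const PENTATONIC_MIDI = [60, 62, 64, 67, 69];

// Standard MIDI-to-frequency conversion, with A4 (note 69) at 440 Hz.
function midiToHz(midi: number): number {
  return 440 * Math.pow(2, (midi - 69) / 12);
}

// Random walk over the scale: each note moves at most one scale
// step up or down from the previous one, clamped to the scale.
function composeMelody(seed: number, length: number): number[] {
  const rand = seededRandom(seed);
  let index = Math.floor(rand() * PENTATONIC_MIDI.length);
  const notes: number[] = [];
  for (let i = 0; i < length; i++) {
    notes.push(midiToHz(PENTATONIC_MIDI[index]));
    const step = Math.floor(rand() * 3) - 1; // -1, 0, or +1
    index = Math.min(PENTATONIC_MIDI.length - 1, Math.max(0, index + step));
  }
  return notes;
}
```

In a browser, those frequencies could drive Web Audio oscillators. Seeding the generator per video gives each short its own tune while keeping any single render reproducible.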

All of it — the writing, the voice, the music, the render — runs on free tiers. $0. That’s the constraint the entire project is built inside.

The honest wall: assembly happens in your browser

Every real build has a wall you didn’t see coming. Mine is assembly — turning frames plus voice plus music into one playable video file.

Normally you’d reach for ffmpeg. But my stack lives on Cloudflare Workers, and a Worker can’t run ffmpeg: it can’t spawn a native binary, and there’s no persistent filesystem to stage frames on. So the final video is recorded and assembled in the browser tab itself, client-side, and handed to you as a download.

It works. It’s also the least glamorous sentence in this whole post, and it’s where most of the genuinely brutal debugging went. The honest version of “I built an AI that makes cartoons” includes “and I spent days fighting video encoding inside a browser tab.”
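
One small, concrete piece of in-browser recording is picking a container and codec that the current browser can actually record. The sketch below is an assumption about how that choice might be made, not a description of this project's code. The candidate list is illustrative, and the support check is passed in so the logic can be exercised outside a browser.

```typescript
// Illustrative preference order; what a browser supports varies.
const CANDIDATE_TYPES: readonly string[] = [
  "video/webm;codecs=vp9,opus",
  "video/webm;codecs=vp8,opus",
  "video/webm",
  "video/mp4",
];

// Return the first candidate the environment says it can record,
// or null when none of them is supported.
function pickMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = CANDIDATE_TYPES
): string | null {
  for (const mimeType of candidates) {
    if (isSupported(mimeType)) return mimeType;
  }
  return null;
}
```

In a browser the check would be `(m) => MediaRecorder.isTypeSupported(m)`. The chosen type then goes to a `MediaRecorder` wrapped around a stream that combines the video track from `canvas.captureStream()` with the audio track, and the recorded chunks become the downloadable file.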

Status, with zero spin

Precision is the brand here, so let me draw the line clean:

Working today: the director→editor→enforcer→verifier crew, the stickman engine with ~20 poses, ~20 props, and nine backgrounds, ElevenLabs voiceover with audio-synced captions, original per-video music, and final assembly into a downloadable video. You run it, you get a cartoon. It lives at an unlisted /lab/stickman/ — a work in progress, not something I’m pointing an audience at yet.
NOT built: auto-upload. It does not post to YouTube or anywhere else on its own. It makes the file; I download it. Anyone who tells you their AI “posts by itself” should show you the account, not the demo. Mine isn’t there yet, and I’m not going to pretend it is.

Why I’m building it this way

The scripting crew runs on the GLM Coding Plan — it’s the cheapest way I’ve found to keep an AI agent grinding on a multi-pass, multi-runtime project all day without a metered-token meltdown. (That’s my referral link — it costs you nothing extra and helps fund the compute behind this build.) When the model itself is free and weak, you spend your effort on the cage around it, not the bill.

But the tool is the thesis, not the plan behind it: AI agents plus one broke human, building real software for $0, out loud. I’ve been automating things since before GPT-2 was a headline, and the pattern never changed — the win was never a smarter model, it was the unglamorous scaffolding a stranger with a chat window can’t reproduce. A director that argues with its editor is just that idea wearing a costume.

What’s next

The next milestone is the one I’m deliberately saving for last: closing the loop so a finished short walks itself to an upload, hands-free. The hard creative parts — the crew, the world, the voice, the score — are the parts that are real today. The upload is wiring, and I’ll log the day it actually holds.

Until then, it makes the cartoon and hands it to me. A film crew the size of one free model, drawing little stories in a stickman world. That’s the build. The rest is in public, one runtime at a time.
