I'm Building an AI That Writes, Animates, and Voices Its Own YouTube Shorts

The first video my tool will ever publish to YouTube is an ad for the tool — written by the tool, animated by the tool, voiced by the tool, uploaded by the tool. A machine making a commercial for itself, hands-free, while I watch.

That’s the goal. It is not done. I want to be dead honest about that up front, because the internet is drowning in “I built an AI that does everything” posts where the everything is a screenshot and a prayer. This is a build log, written from the middle of the build. Some of it is live and playable right now. Most of it is still wiring on my workbench. I’ll tell you exactly which is which.

Here’s the part most people get wrong before they even start.

The thing you can’t get from a single prompt

Ask any chat model: “make me a vertical YouTube Short about today’s news.” You’ll get a script. Maybe a thumbnail idea. A list of tags.

You will not get an MP4 on your channel. And you never will — not because the model is dumb, but because of what a chat prompt is. A prompt is stateless. It runs once, returns text, and forgets you exist. It cannot run a pixel loop for 900 frames. It cannot hold a server-side secret to sign an OAuth upload. It cannot duck a music track under a voice line at the right millisecond. It cannot schedule itself to wake up tomorrow and do it again.

The value isn’t the one model call. The value is the orchestration — a dozen runtimes stitched into a pipeline that survives between steps:

a scheduler that fires on its own
server-side secrets that never touch the browser
a deterministic pixel render of a character that looks the same every frame
frame-accurate audio/video muxing
an authenticated upload that posts to a real account

If a single prompt could do all that, there’d be no reason to build it — and no moat. The whole reason this is worth my time is that it can’t be one prompt. The orchestration is the product. (If “orchestration” still feels abstract, I broke down what an AI agent actually is, in plain English — that’s the mental model this whole thing runs on.)

So that’s the destination. Here’s the full machine I’m building toward, and then the one piece that’s actually alive.

The full pipeline (the map, not the territory yet)

End to end, the autonomous loop looks like this:

Scrape the news. Pull the latest headlines so the tool invents its own video ideas instead of waiting for me to feed it prompts.
Script it. An LLM (I’m running GLM) turns a headline into a tight, punchy Shorts script with scene beats.
Animate it. A stickman engine renders the visual — a consistent character acting out the script.
Voice it. ElevenLabs, or in-browser Kokoro, reads the script.
Score it. Procedurally generated, original background music, ducked under the voice.
Assemble it. Everything muxes into one vertical MP4.
Upload it. Auto-post to YouTube with an AI-written title, description, and viral tags.
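
The seven steps above can be read as a chain of stages, each one consuming the previous step's output. Here is a minimal sketch under assumptions: every name in it (`Stage`, `runPipeline`, the stub stages) is hypothetical, and the stubs stand in for the real scraper, LLM call, renderer, and uploader, none of which are shown in this post.

```typescript
// Hypothetical sketch: stage names and data shapes are illustrative
// assumptions, not the tool's actual internals.
type Stage<I, O> = { name: string; run: (input: I) => Promise<O> };

interface StageLog { name: string; ms: number }

// Runs stages in order, feeding each output into the next stage.
// The first failure aborts the run, so a broken step never reaches the upload.
async function runPipeline(
  stages: Stage<any, any>[],
  seed: unknown,
): Promise<{ result: unknown; log: StageLog[] }> {
  const log: StageLog[] = [];
  let value: unknown = seed;
  for (const stage of stages) {
    const started = Date.now();
    try {
      value = await stage.run(value);
    } catch (err) {
      throw new Error(`stage "${stage.name}" failed: ${String(err)}`);
    }
    log.push({ name: stage.name, ms: Date.now() - started });
  }
  return { result: value, log };
}

// Stub stages so the runner can be exercised without any network calls.
// 30 frames per beat is an arbitrary placeholder.
const demoStages: Stage<any, any>[] = [
  { name: "scrape", run: async () => ({ headline: "Example headline" }) },
  {
    name: "script",
    run: async (h: { headline: string }) => ({
      ...h,
      beats: ["idle", "point", "celebrate"],
    }),
  },
  {
    name: "assemble",
    run: async (s: { beats: string[] }) => ({ frames: s.beats.length * 30 }),
  },
];
```

The point of the shape is that each stage can be swapped from stub to real implementation one at a time, which matches building the loop one runtime at a time.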

Seven steps, seven different problems. And the order I’m solving them in is deliberately backwards from the glamour. The script and the tags are the easy part — that’s the one-prompt stuff everyone can already do. I started with the part nobody talks about, because it’s the part that actually decides whether any of this is real.

What’s actually built and live right now

The stickman animation engine. It exists, it runs, and you can play with it.

It’s a hand-rolled Canvas-2D skeletal rig — I didn’t import a game engine, I built the skeleton from scratch:

a joint tree (hips → spine → shoulders → elbows → hands, the whole hierarchy)
forward kinematics so rotating a shoulder carries the forearm and hand with it, like a real arm
a named pose library: idle, think, point, type, jump, celebrate, and more — so the script can call a pose by name and the rig snaps to it
scene backgrounds and props so the character isn’t acting in a void
blink and breathing micro-motion, because a perfectly still figure reads as dead — the tiny involuntary stuff is what makes it feel alive
captions synced to scenes, so the words land with the action
and it records the animation to a video right in the browser
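
To make the joint tree and forward kinematics concrete, here is a minimal sketch, not the engine's actual code. The `Joint` shape, the three-joint arm, the bone lengths, and the two poses are all assumptions chosen for illustration; angles are in radians.

```typescript
// Hypothetical rig: joint names, bone lengths, and angles are illustrative only.
interface Joint {
  name: string;
  length: number; // bone length in pixels, from the parent joint to this joint
  angle: number;  // rotation in radians, relative to the parent bone
  children: Joint[];
}

interface Point { x: number; y: number }

// Forward kinematics: walk the tree from the root and accumulate rotation,
// so rotating a parent carries every descendant with it.
function solve(
  joint: Joint,
  origin: Point,
  parentAngle = 0,
  out: Map<string, Point> = new Map(),
): Map<string, Point> {
  const worldAngle = parentAngle + joint.angle;
  const end: Point = {
    x: origin.x + Math.cos(worldAngle) * joint.length,
    y: origin.y + Math.sin(worldAngle) * joint.length,
  };
  out.set(joint.name, end);
  for (const child of joint.children) solve(child, end, worldAngle, out);
  return out;
}

// A pose is a table of joint name -> local angle; unnamed joints keep their angle.
type Pose = Record<string, number>;

function applyPose(joint: Joint, pose: Pose): void {
  const angle = pose[joint.name];
  if (angle !== undefined) joint.angle = angle;
  joint.children.forEach((child) => applyPose(child, pose));
}

const arm: Joint = {
  name: "shoulder", length: 0, angle: 0, children: [
    { name: "elbow", length: 40, angle: 0, children: [
      { name: "hand", length: 30, angle: 0, children: [] },
    ] },
  ],
};

const POSES: Record<string, Pose> = {
  idle: { shoulder: 0, elbow: 0, hand: 0 },
  point: { shoulder: Math.PI / 2, elbow: 0, hand: 0 },
};
```

Calling `applyPose(arm, POSES.point)` and then `solve(arm, { x: 0, y: 0 })` moves the elbow and the hand together even though only the shoulder angle changed. That is the whole trick behind calling a pose by name.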

That last one matters more than it sounds, and it’s the source of the single hardest engineering problem in this whole project. Which brings me to the wall.

The hard part nobody warns you about: assembly

Here’s the trap. You imagine the AI parts are the hard parts. They’re not. The script is a prompt. The voice is an API call. The animation I built by hand, and that was real work — but the genuinely brutal problem is assembly: turning frames + voice + music into one playable MP4.

Normally you’d reach for ffmpeg, the Swiss Army knife of video. Except my stack is built on free infrastructure (Cloudflare Workers and friends), and Workers can’t run the ffmpeg binary: there’s no filesystem to write intermediate files to and no way to spawn a native process. (That free-tier-everything constraint is the same one I write about in free website hosting with Cloudflare, no credit card. It’s a feature, but it has sharp edges, and this is one of them.)

So the render and assemble happen in the browser instead:

Canvas-2D draws every frame of the stickman.
WebCodecs encodes those frames in the tab, and a muxer wraps the encoded chunks into an MP4, with no server involved.
A WebM fallback catches browsers where the MP4 path isn’t available.
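
Choosing between the two paths comes down to feature detection. The sketch below is one way to structure it, not the tool's actual logic: `VideoEncoder` and `MediaRecorder` are real browser globals, but `detectCapabilities` and `pickEncodePath` are hypothetical names, and using MediaRecorder for the WebM fallback is my assumption.

```typescript
type EncodePath = "webcodecs-mp4" | "mediarecorder-webm" | "unsupported";

// Capabilities are passed in as plain data so the decision can be tested
// outside a browser.
interface Capabilities {
  hasVideoEncoder: boolean;
  hasMediaRecorder: boolean;
}

// Looks for the browser globals on the given scope (globalThis by default).
function detectCapabilities(scope: any = globalThis): Capabilities {
  return {
    hasVideoEncoder: typeof scope.VideoEncoder === "function",
    hasMediaRecorder: typeof scope.MediaRecorder === "function",
  };
}

// Prefer the WebCodecs path, fall back to WebM recording, otherwise give up.
function pickEncodePath(caps: Capabilities): EncodePath {
  if (caps.hasVideoEncoder) return "webcodecs-mp4";
  if (caps.hasMediaRecorder) return "mediarecorder-webm";
  return "unsupported";
}
```

In a real tab, the presence of `VideoEncoder` is not the whole story. A fuller check would also ask `VideoEncoder.isConfigSupported()` whether the specific codec and resolution are available before committing to the MP4 path.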

Doing frame-accurate video encoding client-side, with the audio lined up to the visuals, is the kind of problem that eats a week. That’s the real difficulty of this entire project: not the AI, the mux. I want that on the record, because the honest version of “I’m building an AI video tool” is “I spent most of my time fighting a codec in a browser tab.”
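
The arithmetic behind "frame-accurate" is small enough to write down. This is a hedged sketch: WebCodecs expresses timestamps in microseconds, so each frame's timestamp can be derived from its index instead of accumulated. The function names are hypothetical, and the 30 fps and 48 kHz figures in the note after the code are assumptions for illustration.

```typescript
// Timestamp of frame i in microseconds, computed from the index rather than
// accumulated frame by frame, so rounding error never drifts.
function frameTimestampUs(frameIndex: number, fps: number): number {
  return Math.round((frameIndex * 1000000) / fps);
}

// Index of the first audio sample that belongs to frame i.
function audioSampleAt(
  frameIndex: number,
  fps: number,
  sampleRate: number,
): number {
  return Math.round((frameIndex * sampleRate) / fps);
}
```

At an assumed 30 fps, frame 900 lands at exactly 30,000,000 µs, and at 48 kHz it lines up with audio sample 1,440,000. Adding a rounded 33,333 µs per frame instead would end 300 µs short after 900 frames, which is why the sketch multiplies rather than accumulates.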

The licensing landmine (and how I’m walking around it)

There’s a quieter problem hiding in the voice and the music: rights.

ElevenLabs’ free tier is great, but it doesn’t include a commercial license. The second this tool publishes a video that could make a dollar, that free voice becomes a liability. So for production, the plan is Kokoro running in the browser, a commercial-clean model, so the output is mine to monetize without a lawyer in the loop.

Same logic on music. Instead of licensing tracks or gambling on “royalty-free” libraries that turn out to have strings attached, the background score is generated from scratch — original audio, procedurally composed, ducked under the voice. Zero licensing risk because nobody else has a claim to a waveform my code invented thirty seconds ago.
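
"Ducked under the voice" has a simple shape in code: a per-sample gain curve that drops while someone is speaking. The sketch below is illustrative, not the tool's composer. The sine tone is a stand-in for the real procedural score, and the function names, the 0.25 ducked gain, and the 50 ms ramp are all assumptions.

```typescript
interface Span { startSec: number; endSec: number }

// A plain sine tone, standing in for the procedurally generated score.
function sineTone(freqHz: number, seconds: number, sampleRate: number): Float32Array {
  const out = new Float32Array(Math.round(seconds * sampleRate));
  for (let i = 0; i < out.length; i++) {
    out[i] = Math.sin((2 * Math.PI * freqHz * i) / sampleRate);
  }
  return out;
}

// Per-sample gain curve: 1.0 normally, `duckedGain` while a voice span is
// active, with a linear ramp of `rampSec` on either side to avoid clicks.
function duckEnvelope(
  totalSamples: number,
  sampleRate: number,
  voice: Span[],
  duckedGain = 0.25,
  rampSec = 0.05,
): Float32Array {
  const gain = new Float32Array(totalSamples);
  for (let i = 0; i < totalSamples; i++) {
    const t = i / sampleRate;
    let g = 1;
    for (const v of voice) {
      // Time distance outside the voice span; 0 when inside it.
      const outside = Math.max(v.startSec - t, t - v.endSec, 0);
      // 0 = fully ducked, 1 = full volume. A zero ramp means a hard cut.
      const k = rampSec > 0 ? Math.min(outside / rampSec, 1) : outside > 0 ? 1 : 0;
      g = Math.min(g, duckedGain + (1 - duckedGain) * k);
    }
    gain[i] = g;
  }
  return gain;
}

// Multiplies samples by the gain curve; samples past the curve's end are untouched.
function applyGain(samples: Float32Array, gain: Float32Array): Float32Array {
  return samples.map((s, i) => s * (gain[i] ?? 1));
}
```

A typical use would be `applyGain(sineTone(220, 30, 48000), duckEnvelope(30 * 48000, 48000, voiceSpans))`, where `voiceSpans` is a hypothetical list of the times the narration is speaking.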

This is the kind of decision that doesn’t show up in a demo but absolutely shows up in a takedown notice. Building broke means you can’t afford a legal mistake, so you engineer the mistake out of existence instead.

The status, with no spin

Let me be precise, because precision is the whole brand here:

Live and playable: the stickman engine. Skeletal rig, pose library, scenes, props, blink/breathing, synced captions, in-browser recording. It sits at an unlisted /lab/ URL — a work-in-progress demo, not a finished product, not something I’m pointing an audience at yet.
In progress / not done: the autonomous loop around it. News scraping, the GLM scripting handoff, the production voice swap to Kokoro, the procedural music, the full MP4 assembly pipeline, and the OAuth YouTube upload.

So no — the all-in-one autonomous Shorts factory is not finished. What’s finished is the hardest visual piece and a real proof that the in-browser render path works. The rest is wiring I’m doing in public, one runtime at a time.

Why I’m building it this way

I run the scripting on the GLM Coding Plan — it’s the cheapest way I’ve found to keep an AI agent grinding on a multi-runtime project all day without a metered-token meltdown. (That’s my referral link; it costs you nothing extra and helps fund the compute behind this build.) When the work is “stitch a scheduler to a render loop to an OAuth flow,” you want an agent you can let run, not one you’re rationing by the token.

But the tool itself is the thesis, not the plan behind it. AI agents plus one broke human, building real software for $0, out loud. I’ve been automating things since before GPT-2 was a headline — scrapers, cron jobs, glue code — and the pattern never changed: the moat was never the clever model call. It was always the boring, unglamorous orchestration that a stranger with a chat window can’t reproduce. That’s the same bet I made on building a full web app with AI agents for $0, and it’s the same bet here.

What’s next

The next milestone is closing the loop on one full short — headline in, MP4 out — even if it’s ugly. Then the upload. And the first thing it’ll publish, once the wiring holds, is that ad for itself: a stickman explaining the tool that drew it, narrated by a voice the tool generated, scored by music the tool composed, posted by code the tool runs.

A machine making a commercial for itself. That’s the open loop I’m chasing, and I’ll log the day it actually closes.

If you want the origin of why I’m doing all of this broke and in the open, it starts at day zero, rebuilt from nothing. The stickman is just the first thing that learned to move.