
AI Faceless YouTube Channel

Author
Mario Mora
A blog about my journey through InfoSec, Infrastructure, Cloud, and Security.

A note on this post: The first draft of this article was generated with AI assistance (Claude) based on our actual working session transcript. I reviewed, edited, and validated everything against what we actually did. All the technical details, errors, and lessons are real — the AI just helped me write it up faster. Kind of fitting given the topic.

I didn’t plan to spend an entire afternoon setting up a local AI YouTube pipeline. It started with a simple question: can I build a faceless educational channel using AI tools, mostly for free, and automate the whole thing?

Spoiler: yes. But the journey there was messier, more educational, and honestly more fun than I expected.


Where It Started

The idea was straightforward. I want to build two YouTube channels — one in English covering tech fundamentals (operating systems, networking, Linux), and one in Spanish targeting a Latin American audience covering personal finance basics. Both faceless, both AI-generated, both automated enough that I can batch-produce content without it consuming my life.

As someone who’s spent 17 years in DevOps and infrastructure, I’m comfortable with tooling and automation. But the AI content creation space was new territory. I knew the names — Stable Diffusion, LoRA, ComfyUI — but hadn’t touched any of it seriously.

So I dove in.


The Stack We Built

Here’s what ended up running locally on my MacBook M4 Pro with 64GB unified memory:

Image generation — Flux.1 Dev via ComfyUI, with a Stick Figure LoRA from CivitAI for the whiteboard animation style. Getting Flux running on Apple MPS (Metal) was its own adventure — fp8 quantization isn’t supported, so we had to fall back to fp16/default precision. Once that clicked, images started generating in about 2 minutes at 1280×720.

Voiceover — Kokoro TTS running locally via ONNX. Free, surprisingly good quality, and it has solid Spanish voice options for the Latin American audience. No ElevenLabs account needed.

LLMs for scripting: Ollama with llama3.2 for quick tasks and qwen2.5:72b for serious script writing. The 72B model runs at a very usable speed on 64GB of unified memory, and its bilingual English/Spanish output is noticeably better than that of smaller models, especially for Latin American Spanish.
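
As a rough illustration of the scripting step, here is a minimal sketch that asks a local Ollama server for a script through its `/api/generate` endpoint. The prompt wording and the `build_script_request` / `generate_script` helper names are assumptions made for this example, not the prompts or code used in the session.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_script_request(topic: str, lang: str, model: str = "qwen2.5:72b") -> dict:
    """Build a non-streaming /api/generate payload for one narration script."""
    language = {"en": "English", "es": "Latin American Spanish"}[lang]
    # Hypothetical prompt wording, for illustration only.
    prompt = (
        f"Write a short educational video script in {language} "
        f"about: {topic}. Split it into numbered scenes."
    )
    return {"model": model, "prompt": prompt, "stream": False}


def generate_script(topic: str, lang: str) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_script_request(topic, lang)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the payload builder separate from the network call makes it easy to inspect the prompt without having the 72B model loaded.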

Video generation — Wan2.1 (both the 1.3B text-to-video and 14B image-to-video models). Still working through Apple MPS compatibility — the fp8 models don’t run on Metal, and the native bf16 models require some workflow adjustments. This piece is still in progress.

Assembly — FFmpeg for stitching images and audio into video segments, then concatenating into a final MP4.
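
The assembly step can be sketched roughly like this: one FFmpeg call per scene to pair a still image with its narration, then the concat demuxer to join the segments. The file names and helper functions are hypothetical, and the encoder settings are common defaults rather than the exact ones from the session.

```python
import subprocess
from pathlib import Path


def segment_cmd(image: str, audio: str, out: str) -> list[str]:
    """Still image + narration audio -> one H.264/AAC video segment."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", image,      # repeat the single frame
        "-i", audio,
        "-c:v", "libx264", "-tune", "stillimage",
        "-pix_fmt", "yuv420p",          # broad player compatibility
        "-c:a", "aac",
        "-shortest",                    # stop when the audio ends
        out,
    ]


def concat_cmd(list_file: str, out: str) -> list[str]:
    """Join segments listed in a concat-demuxer file without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", out]


def write_concat_list(segments: list[str], list_file: str) -> None:
    """Write the "file '<path>'" lines the concat demuxer expects."""
    lines = [f"file '{Path(s).resolve()}'" for s in segments]
    Path(list_file).write_text("\n".join(lines) + "\n")


def assemble(scenes: list[tuple[str, str]], final: str) -> None:
    """scenes is a list of (image_path, audio_path) pairs."""
    segments = []
    for i, (image, audio) in enumerate(scenes):
        seg = f"segment_{i:03d}.mp4"
        subprocess.run(segment_cmd(image, audio, seg), check=True)
        segments.append(seg)
    write_concat_list(segments, "segments.txt")
    subprocess.run(concat_cmd("segments.txt", final), check=True)
```

Stream-copying in the concat step only works because every segment is encoded with the same settings, which is one reason to build all segments through a single helper.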

The automation glue — a Python pipeline script that takes a topic string and chains everything together: Claude API for script generation → ComfyUI API for images → Kokoro for voiceover → FFmpeg for assembly → YouTube Data API for upload.
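
For the image step, ComfyUI accepts a workflow exported in API format as JSON on its `/prompt` endpoint. The sketch below assumes the server runs on its default address and that the workflow contains a text-encoding node whose id you already know; the `with_prompt` and `queue_image` names and the node id are illustrative only.

```python
import copy
import json
import urllib.request
import uuid

COMFY_URL = "http://127.0.0.1:8188/prompt"  # ComfyUI's default listen address


def with_prompt(workflow: dict, node_id: str, text: str) -> dict:
    """Return a copy of an API-format workflow with one text input replaced."""
    patched = copy.deepcopy(workflow)
    patched[node_id]["inputs"]["text"] = text
    return patched


def queue_image(workflow: dict) -> str:
    """Queue the workflow on a running ComfyUI server and return its prompt_id."""
    payload = {"prompt": workflow, "client_id": str(uuid.uuid4())}
    req = urllib.request.Request(
        COMFY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]
```

Patching a copy of the workflow keeps the exported template reusable across scenes.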


What I Learned About Diffusion Models

One of the more interesting rabbit holes was actually understanding how image generation works rather than just running it.

The short version: you start with pure random noise and iteratively remove it, guided by a mathematical representation of your text prompt. Each step asks “given this noise and this description, what noise should I remove to move closer to the target?” Do that 20–35 times and something coherent emerges.
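
To make the loop concrete, here is a deliberately tiny toy version. It is not a real diffusion model: the "network" is faked by a function that already knows the target, which stands in for the trained, prompt-guided noise predictor. What it does show is the shape of the process: start from noise, predict the noise, remove part of it, repeat.

```python
import random


def toy_denoise(target: list[float], steps: int = 20, seed: int = 0) -> list[float]:
    """Iteratively strip noise from a random vector until it matches the target."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from pure noise
    for i in range(steps):
        # Stand-in for the trained network: "predict" the noise still present.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove a fraction so the noise shrinks linearly to zero by the last step.
        remaining = steps - i
        x = [xi - ni / remaining for xi, ni in zip(x, predicted_noise)]
    return x


result = toy_denoise([0.2, -1.0, 3.5], steps=20)
```

In a real sampler the prediction comes from the model and the step sizes come from a noise schedule, but the outer loop has the same structure.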

A few things that genuinely surprised me:

LoRAs are elegant. A Low-Rank Adaptation file is only 50–200MB but dramatically shifts a 22GB model’s behavior. It’s not retraining: it adds small low-rank weight updates on top of the frozen base model. The analogy that clicked for me: the base model is a professional actor, the LoRA is a costume and character brief.
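
The size difference is easier to see in code. In the usual LoRA formulation, the adapter stores two thin matrices, B and A, whose scaled product is added to a frozen weight matrix W. The sketch below uses plain Python lists and made-up numbers; the 4096-wide layer and rank 16 are illustrative assumptions, not the dimensions of Flux or of the stick figure LoRA.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself is left untouched."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def param_counts(d_out, d_in, rank):
    """Parameters in the full weight matrix vs. in the LoRA pair (B, A)."""
    return d_out * d_in, rank * (d_out + d_in)


# Tiny example: a 3x3 weight adjusted by a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]          # 3 x 1
A = [[0.5, 0.0, 0.5]]              # 1 x 3
W_adapted = apply_lora(W, B, A, alpha=1.0, rank=1)

# Hypothetical layer size: the adapter needs 128x fewer parameters here.
full, lora = param_counts(4096, 4096, 16)
```

Because only B and A are stored, the file stays small even though the product B @ A touches every entry of the layer it adapts.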

CFG scale matters differently for different models. Flux works best at 1.0–2.0 CFG, while SDXL typically needs 7–9. Running SDXL settings on Flux produces either blank images or artifacts. This one cost me a few failed runs before I understood why.
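
For intuition about what the number does, this is the standard classifier-free guidance formula applied to a toy three-value "prediction". The vectors are invented for illustration, and how a particular model or ComfyUI node consumes the value may differ from this textbook form.

```python
def apply_cfg(uncond: list[float], cond: list[float], scale: float) -> list[float]:
    """Classifier-free guidance: move from the unconditional prediction toward the prompt."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, -0.5, 0.25]   # prediction with an empty prompt (toy values)
cond = [0.5, 0.25, 0.0]      # prediction with the real prompt (toy values)

at_1 = apply_cfg(uncond, cond, 1.0)   # equals the prompt-conditioned prediction
at_8 = apply_cfg(uncond, cond, 8.0)   # the prompt direction amplified 8x
```

At a scale of 1.0 the formula returns the conditioned prediction unchanged, while at 8.0 the values land far outside the range of either input, which gives a feel for why a setting tuned for one model family can push another into artifacts.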

Apple MPS has real limitations. fp8 quantization — which most optimized Flux and Wan models use — isn’t supported on Metal. You either fall back to fp16 (larger memory footprint, works fine) or find non-quantized weights. With 64GB unified memory this isn’t a dealbreaker, but it means the “just download the optimized model and run it” path often needs adjustment on Apple Silicon.
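
One way to encode that lesson in a pipeline is a small lookup that downgrades the requested precision when the device cannot run it. This is a hypothetical helper, not part of PyTorch or ComfyUI; the table only records the fp8-on-MPS limitation described here and would need extending for other cases.

```python
# Assumption: only the limitation described in this post is recorded.
UNSUPPORTED = {"mps": {"fp8"}}
# Hypothetical fallback order, from most to least compact.
FALLBACK_ORDER = ["fp8", "fp16", "fp32"]


def pick_precision(device: str, requested: str) -> str:
    """Return the requested precision, or the next one the device can use."""
    blocked = UNSUPPORTED.get(device, set())
    start = FALLBACK_ORDER.index(requested)
    for precision in FALLBACK_ORDER[start:]:
        if precision not in blocked:
            return precision
    raise ValueError(f"no usable precision for {device!r} starting at {requested!r}")


mac_choice = pick_precision("mps", "fp8")     # falls back to fp16
cuda_choice = pick_precision("cuda", "fp8")   # fp8 kept as requested
```

The helper only chooses a label; you still need weights in that format on disk, which is the part that usually means a second download.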

Trigger words are non-negotiable. The whiteboard LoRA I’m using has trigger words STCKFGRCE_STYLE and A STICK FIGURE DRAWING. Without them in the prompt, the LoRA barely activates even when loaded. Learned this the hard way after a few confusingly generic outputs.
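
A cheap guard against forgetting them is to let the pipeline add the trigger words itself. The sketch below is a hypothetical helper; the case-insensitive matching and the choice to prepend rather than append are assumptions of this example.

```python
TRIGGER_WORDS = ("STCKFGRCE_STYLE", "A STICK FIGURE DRAWING")


def with_triggers(prompt: str, triggers: tuple[str, ...] = TRIGGER_WORDS) -> str:
    """Prepend any trigger phrase the prompt does not already contain."""
    missing = [t for t in triggers if t.lower() not in prompt.lower()]
    return ", ".join(missing + [prompt]) if missing else prompt


scene_prompt = with_triggers("a teacher explaining RAM on a whiteboard")
```

Running every scene prompt through one function like this means a script generated by an LLM cannot silently drop the style.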


The Workflow In Practice

The goal is a single command that produces two finished videos:

python3 pipeline.py --topic "What is an Operating System?" --lang both --upload

That command should:

  1. Call the Claude API to generate a structured script with scene-by-scene image prompts
  2. Call the ComfyUI API to generate one image per scene
  3. Pass each scene’s narration text through Kokoro TTS
  4. Use FFmpeg to sync each image with its audio segment
  5. Concatenate all segments into a final video
  6. Upload both EN and ES versions to YouTube as Private drafts
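
The six steps above can be sketched as a skeleton like the following. Only the command-line flags come from the command shown earlier; every stage function is a placeholder that returns a fake file name, standing in for the real Claude, ComfyUI, Kokoro, FFmpeg, and YouTube calls.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Topic -> narrated video(s)")
    parser.add_argument("--topic", required=True)
    parser.add_argument("--lang", choices=["en", "es", "both"], default="en")
    parser.add_argument("--upload", action="store_true")
    return parser.parse_args(argv)


def languages(lang: str) -> list[str]:
    """Expand 'both' into the two channel languages."""
    return ["en", "es"] if lang == "both" else [lang]


# Placeholder stages: each would wrap the real tool in a full pipeline.
def write_script(topic, lang):
    return [{"narration": f"[{lang}] {topic}", "image_prompt": topic}]


def render_image(scene, index):
    return f"scene_{index:03d}.png"


def render_voice(scene, lang, index):
    return f"scene_{index:03d}_{lang}.wav"


def assemble(pairs, lang):
    return f"final_{lang}.mp4"


def upload_private(video):
    return f"uploaded:{video}"


def run(args):
    results = []
    for lang in languages(args.lang):
        scenes = write_script(args.topic, lang)                       # step 1
        pairs = [(render_image(s, i), render_voice(s, lang, i))      # steps 2-3
                 for i, s in enumerate(scenes)]
        video = assemble(pairs, lang)                                 # steps 4-5
        results.append(upload_private(video) if args.upload else video)  # step 6
    return results


# Example: run(parse_args(["--topic", "What is an Operating System?", "--lang", "both"]))
```

Keeping each stage behind its own function matches the "one tool at a time" approach: any stage can be swapped for the real implementation and tested without the others running.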

On the M4 Pro with 64GB, the estimate is around 12–18 minutes per video pair. At that rate an overnight batch of three topics needs roughly 36–54 minutes of compute and produces six videos, three per channel, which is a full week of content for both channels.

The Wan2.1 video generation piece, once stable, would replace the static images with short animated clips, which would meaningfully differentiate the channel from the typical static-image explainer format.


Honest Reflections

A few things I’d do differently starting fresh:

Start with one tool, not five. The temptation is to set up the entire stack at once. The reality is that each component has its own quirks, and debugging is much easier when you isolate them. Get image generation working cleanly before touching voiceover. Get voiceover stable before touching video.

Firefly for prototyping, local for production. Adobe Firefly’s free tier produced noticeably higher quality whiteboard images than my local Flux + LoRA setup at current settings. For testing scripts and validating content, Firefly is faster and cleaner. For production volume, local is the right call — once dialed in.

The M4 is genuinely exceptional for this. Most local AI tutorials assume NVIDIA CUDA. Apple Silicon via MPS works, but needs adjustments. The 64GB unified memory architecture means you can hold a 22GB image model together with its 11GB text encoder, or a 47GB language model, and still have working memory to spare. All three at once would add up to 80GB, so the image stack and the large LLM have to take turns, but either load on its own would require multiple separate GPU cards on a typical PC setup.

Text in AI images is still unreliable. Every model struggles with rendering clean readable text inside generated images. The practical solution: generate the image without text, add text overlays in post using CapCut or Canva. It’s actually cleaner and gives you more typographic control anyway.


What’s Next

The pipeline is functional but not yet fully automated. Next steps:

  • Stabilize Wan2.1 video generation on MPS for animated clips
  • Fine-tune the ComfyUI workflow settings for better whiteboard consistency
  • Publish the first video and iterate based on actual performance data
  • Explore training a custom LoRA on my own style once the channel has a defined visual identity

The channels themselves, both the English tech channel and the Spanish finance channel, are still being named and branded. That’s a separate post.

If you’re a DevOps person curious about AI content creation, the main takeaway is this: the tooling is more accessible than it looks, the learning curve is real but manageable, and having a powerful local machine changes what’s possible. Image generation, voiceover, and scripting can all run locally, so you don’t need cloud APIs for the generation work, even at scale.

All the scripts, workflows, and notes from this session will be linked from my GitLab once cleaned up.


This post is part of my public learning journal. Everything here represents real hands-on work, mistakes included.