# A Practical, Production-Ready Tool for Summarizing YouTube Videos with LLMs and Agents
This post is about a tool. It’s a CLI that uses LLM tool calling and a tiny agent loop to extract IDs, fetch transcripts/metadata, and produce high-quality summaries of long YouTube videos without tripping rate limits or context windows.
## TL;DR
| What it is | Why it's useful | How it works | How to run |
|---|---|---|---|
| A CLI tool that summarizes YouTube videos using an LLM with tools/agents | Handles long videos and real-world edge cases (no transcript, quotas) | LangChain tool calling + fixed/recursive chains + chunked map-reduce summarization + bare-LLM finalization | `python youtube_tool_agent.py summarize --url "<video-url>" --language en` |
## What this tool does

- Extracts YouTube video IDs (robust to `watch?v=`, `youtu.be/`, `embed/`)
- Fetches transcripts (and can be extended to fall back to captions/description via `yt-dlp`)
- Pulls metadata & thumbnails via `yt-dlp` (title, views, likes, chapters, image sizes)
- Searches YouTube by query (PyTube)
- (Best-effort) Trending by region
- Generates succinct, expert-oriented summaries of long videos using a chunked map-reduce flow that avoids TPM/context explosions
- Includes a CLI and verbose tracing so you can see exactly what the agent is doing
## Install & first run

Install the dependencies, run the `summarize` command from the TL;DR, and watch the verbose trace report each tool call as it happens. A setup sketch follows below.
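This is a minimal setup sketch based on the libraries named in this post; check the repo's requirements file for the authoritative list, and note that an OpenAI-backed LangChain model is an assumption here.

```bash
# Assumed dependency set (LangChain, yt-dlp, and PyTube are named in this
# post; youtube-transcript-api is a common choice for transcript fetching).
pip install langchain langchain-openai yt-dlp pytube youtube-transcript-api

export OPENAI_API_KEY="sk-..."   # assuming an OpenAI-backed model

# Deterministic summary of a long video (command shown in the TL;DR)
python youtube_tool_agent.py summarize --url "<video-url>" --language en
```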
## Why agents for this problem?

A "just call an API and prompt" script breaks on real-world YouTube:

- Some videos have no official transcript (only auto-captions or nothing).
- Transcripts can be huge, exceeding model context or your org's TPM limits.
- Tool-calling models often respond with empty content plus a request to call another tool, unless you orchestrate correctly.

An agent (even a tiny one) solves this:

- It decides which tool to call and in what order (ID → transcript → metadata → summarize).
- It loops until it has what it needs (recursive chain), then finalizes the answer.
- You get traceability: every tool call is logged with payload sizes so you can reason about cost and latency.
## Core ideas (architectural choices that matter)

### 1) Tool calling (LangChain)

We expose capabilities as annotated tools the model can call by name: `extract_video_id`, `fetch_transcript`, `get_full_metadata`, `get_thumbnails`, `search_youtube`, `get_trending_videos`. These functions return plain JSON/strings, which keeps the glue simple and makes debugging obvious.

Source code can be found here: JordiCorbilla/langgraph-cookbook
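Here is a minimal sketch of one such tool, assuming LangChain's `@tool` decorator; the regex and error handling are illustrative, not the repo's exact code.

```python
import re
from langchain_core.tools import tool

# Matches the watch?v=, youtu.be/, and embed/ URL shapes; an 11-character
# video ID follows each of these markers.
_ID_PATTERN = re.compile(r"(?:v=|youtu\.be/|embed/)([A-Za-z0-9_-]{11})")

@tool
def extract_video_id(url: str) -> str:
    """Extract the 11-character YouTube video ID from a URL."""
    match = _ID_PATTERN.search(url)
    if not match:
        raise ValueError(f"No video ID found in: {url}")
    return match.group(1)
```

Because the tool returns a plain string, the model's next tool call can consume it directly and a human can read it in the trace.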
### 2) Two orchestration modes

- **Fixed chain** for deterministic summarize: 1) Extract ID → 2) Fetch transcript → 3) Summarize (chunked) → 4) Polish (bare LLM)
- **Universal (recursive) chain** for an open "ask": keep calling tools while the model requests them; stop when it returns text. (A minimal loop sketch follows below.)
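A minimal sketch of the recursive chain, assuming LangChain chat models and the tools above; the model choice and names like `TOOLS` are illustrative.

```python
from langchain_core.messages import HumanMessage, ToolMessage
from langchain_openai import ChatOpenAI

TOOLS = [extract_video_id]            # plus fetch_transcript, get_full_metadata, ...
TOOLS_BY_NAME = {t.name: t for t in TOOLS}

llm = ChatOpenAI(model="gpt-4o-mini")  # assumed model choice
llm_with_tools = llm.bind_tools(TOOLS)

def run_universal_chain(question: str, max_steps: int = 10) -> str:
    messages = [HumanMessage(question)]
    for _ in range(max_steps):
        response = llm_with_tools.invoke(messages)
        messages.append(response)
        if not response.tool_calls:        # model answered in plain text: done
            return response.content
        for call in response.tool_calls:   # execute each requested tool
            result = TOOLS_BY_NAME[call["name"]].invoke(call["args"])
            messages.append(ToolMessage(content=str(result), tool_call_id=call["id"]))
    # Out of steps: force a textual answer with a bare (no-tools) model.
    return llm.invoke(messages).content
```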
### 3) Chunked map-reduce summarization

Why: long transcripts blow up context and TPM.

How: split the transcript into overlapping chunks → summarize each chunk (map) → merge the bullet lists (reduce) → optional polish for clarity.

Key tunables:

- Chunk size: 3k–4k chars (or larger if you have headroom)
- Overlap: 200–300 chars
- A short `sleep` between chunk calls to smooth TPM bursts
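A sketch of the map-reduce flow under the tunables above; `llm` is the chat model from the previous snippet, and the prompt wording is illustrative.

```python
import time

def chunk_text(text: str, size: int = 3500, overlap: int = 250) -> list[str]:
    """Overlapping character windows keep sentence context across boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def summarize_transcript(transcript: str) -> str:
    # Map: summarize each chunk independently, throttled to smooth TPM bursts.
    partials = []
    for chunk in chunk_text(transcript):
        partials.append(llm.invoke(
            f"Summarize this transcript chunk as terse expert bullets:\n\n{chunk}"
        ).content)
        time.sleep(1)  # short pause between chunk calls
    # Reduce: merge the bullet lists into one coherent summary.
    return llm.invoke(
        "Merge these partial summaries, deduplicating overlap:\n\n" + "\n\n".join(partials)
    ).content
```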
### 4) Bare-LLM finalization

Tool-calling models often return empty content when they want another tool. The final step uses a bare LLM (no tools bound). That forces a textual answer and prevents "one more tool?" loops.
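The polish step in sketch form: note the unbound `llm`, not `llm_with_tools`, so no further tool calls are possible (the prompt wording is assumed).

```python
def polish(summary: str) -> str:
    # A bare model (no tools bound) can only answer with text.
    return llm.invoke(
        "Rewrite this summary for clarity and flow, keeping it succinct:\n\n" + summary
    ).content
```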
## Architecture at a glance

URL → `extract_video_id` → `fetch_transcript` → chunked map-reduce summarize → bare-LLM polish → final answer

- Why a "bare" LLM at the end? Tool-calling models often return empty `content` when they still want to call tools. Finalizing with a no-tools model guarantees text.
## CLI you'll actually use

- Point tools (fast, no LLM)
- Deterministic, chunked summary
- Agentic "do what it takes"
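Hedged examples of the three modes: only the `summarize` invocation is taken verbatim from the TL;DR; the `tool` and `ask` subcommands here are hypothetical stand-ins for the repo's actual names.

```bash
# Point tool, no LLM (hypothetical subcommand)
python youtube_tool_agent.py tool extract_video_id --url "<video-url>"

# Deterministic, chunked summary (as shown in the TL;DR)
python youtube_tool_agent.py summarize --url "<video-url>" --language en

# Agentic, open-ended ask (hypothetical subcommand)
python youtube_tool_agent.py ask "What are the key takeaways from <video-url>?"
```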
## Under the hood (high-level)

Why the fallback? In production you'll meet videos with no official transcript. Add a `yt-dlp` captions fallback and, as a last resort, summarize metadata/description so you always produce something useful.
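A sketch of that fallback order, assuming `yt-dlp`'s Python API; the shape of the returned info dict is as documented, but the surrounding logic is illustrative.

```python
import yt_dlp

def fallback_text(url: str) -> str:
    """Best-effort text when there is no official transcript."""
    opts = {"skip_download": True, "quiet": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    # yt-dlp exposes uploaded subtitles and auto-captions keyed by language;
    # downloading and parsing a caption track is the natural next step.
    tracks = info.get("subtitles") or info.get("automatic_captions") or {}
    if tracks:
        print(f"caption tracks available: {sorted(tracks)[:5]}")
    # Last resort: summarize what metadata we have.
    return f"{info.get('title', '')}\n\n{info.get('description', '')}"
```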
## Operational tips (so it stays fast and cheap)

- **TPM discipline:** throttle chunk calls with short sleeps; prefer smaller models for map steps, then a slightly stronger model for merge/polish.
- **Chapters-aware chunking:** if `yt-dlp` returns chapters, chunk on chapter boundaries first for better topical coherence and fewer duplicates.
- **Cache transcripts/metadata** to disk (simple JSON files); you'll save both time and money on reruns. (A minimal cache sketch follows below.)
- **Observability:** log tool names and payload sizes. When costs drift, you'll know why.
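A minimal cache sketch, assuming plain JSON files keyed by video ID; the paths, names, and the `fetch_transcript_fn` helper in the usage comment are hypothetical.

```python
import json
from pathlib import Path

CACHE_DIR = Path(".cache")  # illustrative location

def cached(kind: str, video_id: str, fetch):
    """Return cached JSON for (kind, video_id), calling fetch() on a miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{video_id}.{kind}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = fetch()
    path.write_text(json.dumps(data))
    return data

# Usage (hypothetical fetcher):
# transcript = cached("transcript", vid, lambda: fetch_transcript_fn(vid))
```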
## Closing thoughts
This is a tool designed for the messiness of real YouTube content. The combination of tool calling, a minimal agent loop, chunked summarization, and bare-LLM finalization makes it predictable, debuggable, and resilient. You can drop it into a pipeline today and get reliable results on long videos.