A Practical, Production-Ready Tool for Summarizing YouTube Videos with LLMs and Agents

This post is about a tool: a CLI that uses LLM tool calling and a tiny agent loop to extract video IDs, fetch transcripts and metadata, and produce high-quality summaries of long YouTube videos without tripping rate limits or context windows.


TL;DR

  • What it is: A CLI tool that summarizes YouTube videos using an LLM with tools/agents
  • Why it’s useful: Handles long videos and real-world edge cases (no transcript, quotas)
  • How it works: LangChain tool calling + fixed/recursive chains + chunked map-reduce summarization + bare-LLM finalization
  • How to run: python youtube_tool_agent.py summarize --url "<video-url>" --language en

What this tool does

  • Extracts YouTube video IDs (robust to watch?v=, youtu.be/, embed/)

  • Fetches transcripts (and can be extended to fall back to captions/description via yt-dlp)

  • Pulls metadata & thumbnails via yt-dlp (title, views, likes, chapters, image sizes)

  • Searches YouTube by query (PyTube)

  • (Best-effort) Trending by region

  • Generates succinct, expert-oriented summaries of long videos using a chunked map-reduce flow that avoids TPM/context explosions

  • Includes a CLI and verbose tracing so you can see exactly what the agent is doing


Install & first run

# Python 3.11+ recommended
pip install -r requirements.txt

# Set your model key (OpenAI shown)
export OPENAI_API_KEY="sk-..."
# Windows PowerShell: $env:OPENAI_API_KEY="sk-..."

# Summarize a video
python youtube_tool_agent.py --verbose summarize \
  --url "https://www.youtube.com/watch?v=8TJQhQ2GZ0Y" \
  --language en

You’ll see progress like:

[LLM] OpenAI provider model=gpt-4o provider=openai
[SUM] Chunk 1/2 (30000 chars)
[SUM] Chunk 2/2 (22007 chars)
<final polished summary printed here>

Why agents for this problem?

A “just call an API and prompt” script breaks on real-world YouTube:

  • Some videos have no official transcript (only auto-captions or nothing).

  • Transcripts can be huge, exceeding model context or your org’s TPM limits.

  • Tool-calling models often respond with empty content + a request to call another tool, unless you orchestrate correctly.

An agent (even a tiny one) solves this:

  • It decides which tool to call and in what order (ID → transcript → metadata → summarize).

  • It loops until it has what it needs (recursive chain), then finalizes the answer.

  • You get traceability: every tool call is logged with payload sizes so you can reason about cost and latency.


Core ideas (architectural choices that matter)

1) Tool calling (LangChain)

We expose capabilities as annotated tools the model can call by name:

  • extract_video_id, fetch_transcript, get_full_metadata, get_thumbnails, search_youtube, get_trending_videos.

These functions return plain JSON/strings, which keeps the glue simple and makes debugging obvious.
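As a minimal sketch, a tool is just an annotated function. This assumes LangChain’s @tool decorator; the single regex here is illustrative, not the repo’s exact implementation:

import re

from langchain_core.tools import tool

@tool
def extract_video_id(url: str) -> str:
    """Extract the 11-character YouTube video ID from a URL."""
    # Covers watch?v=, youtu.be/, and embed/ style URLs.
    match = re.search(r"(?:v=|youtu\.be/|embed/)([A-Za-z0-9_-]{11})", url)
    return match.group(1) if match else "ERROR: no video ID found"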

Source code: JordiCorbilla/langgraph-cookbook on GitHub.

2) Two orchestration modes

  • Fixed chain for deterministic summarize:

    1) Extract ID → 2) Fetch transcript → 3) Summarize (chunked) → 4) Polish (bare LLM)

  • Universal (recursive) chain for open “ask”:

    • Keep calling tools while the model requests them; stop when it returns text (see the loop sketch below).
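A minimal sketch of that recursive loop, assuming a LangChain chat model and a TOOLS list like the one above (the function name and the max_turns guard are illustrative):

from langchain_core.messages import HumanMessage, ToolMessage

def run_universal_chain(llm, tools, query: str, max_turns: int = 10):
    tools_by_name = {t.name: t for t in tools}
    llm_with_tools = llm.bind_tools(tools)
    messages = [HumanMessage(content=query)]
    for _ in range(max_turns):
        response = llm_with_tools.invoke(messages)
        messages.append(response)
        if not response.tool_calls:  # model answered in text: we're done
            return response.content
        for call in response.tool_calls:  # execute each requested tool
            result = tools_by_name[call["name"]].invoke(call["args"])
            messages.append(ToolMessage(content=str(result), tool_call_id=call["id"]))
    return "ERROR: tool loop did not converge"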

3) Chunked map-reduce summarization

Why: long transcripts blow up context and TPM.
How: split the transcript into overlapping chunks → summarize each chunk (map) → merge the bullet lists (reduce) → optional polish for clarity. A minimal sketch follows the tunables below.

Key tunables:

  • Chunk size: 3k–4k chars (or larger if you have headroom)

  • Overlap: 200–300 chars

  • Short sleep between chunk calls to smooth TPM bursts
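A minimal sketch of the flow, assuming a plain LangChain chat model llm; the prompts and the one-second sleep are illustrative, not the repo’s exact values:

import time

def summarize_transcript(llm, transcript: str, chunk_size: int = 3500, overlap: int = 250) -> str:
    # Overlapping chunks so sentences cut at a boundary still appear whole somewhere.
    step = chunk_size - overlap
    chunks = [transcript[i:i + chunk_size] for i in range(0, len(transcript), step)]

    # Map: summarize each chunk into bullets.
    partials = []
    for n, chunk in enumerate(chunks, 1):
        print(f"[SUM] Chunk {n}/{len(chunks)} ({len(chunk)} chars)")
        partials.append(llm.invoke(f"Summarize this transcript excerpt as bullet points:\n\n{chunk}").content)
        time.sleep(1.0)  # smooth TPM bursts between chunk calls

    # Reduce: merge the bullet lists, then polish for clarity.
    merged = llm.invoke("Merge these bullet lists into one deduplicated summary:\n\n" + "\n\n".join(partials))
    return llm.invoke(f"Rewrite this summary for clarity and concision:\n\n{merged.content}").content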

4) Bare-LLM finalization

Tool-calling models often return empty content when they want another tool. The final step uses a bare LLM (no tools bound). That forces a textual answer and prevents “one more tool?” loops.
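A minimal sketch of that final step, assuming langchain-openai (the model name and prompt are illustrative):

from langchain_openai import ChatOpenAI

def finalize(draft_summary: str) -> str:
    # Same model family, but with no tools bound, so it must answer in text.
    bare_llm = ChatOpenAI(model="gpt-4o")  # note: no .bind_tools() here
    return bare_llm.invoke(f"Polish this summary into clear, expert-oriented prose:\n\n{draft_summary}").content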

Architecture at a glance

┌──────────────┐      ┌───────────────────────┐
│ CLI (args)   ├────▶ │ Orchestrator Chain    │
└──────────────┘      └──────────┬────────────┘
                                 │ tool_calls
                                 ▼
                      ┌─────────────────────┐
                      │ Tools (LangChain)   │
                      │ - extract_video_id  │
                      │ - fetch_transcript  │
                      │ - search_youtube    │
                      │ - get_full_metadata │
                      │ - get_thumbnails    │
                      └──────────┬──────────┘
                                 │ results (text/JSON)
                                 ▼
                      ┌─────────────────────┐
                      │ Bare LLM Call       │ ← no-tools finalization
                      │ (map→reduce→polish) │
                      └──────────┬──────────┘
                                 ▼
                          stdout (summary)

  • Why “bare” LLM at the end? Tool-calling models often return empty content when they still want to call tools. Finalizing with a no-tools model guarantees text.


CLI you’ll actually use

Point tools (fast, no LLM):

python youtube_tool_agent.py metadata --url "https://youtu.be/8TJQhQ2GZ0Y"
python youtube_tool_agent.py thumbnails --url "https://youtu.be/8TJQhQ2GZ0Y"
python youtube_tool_agent.py search --query "Retrieval-Augmented Generation"
python youtube_tool_agent.py trending --region GB

Deterministic, chunked summary:

python youtube_tool_agent.py --verbose summarize \
  --url "https://www.youtube.com/watch?v=8TJQhQ2GZ0Y" \
  --language en

Agentic “do what it takes”:

python youtube_tool_agent.py --verbose ask \
  --query "Summarize this YouTube video and explain the leveraging strategy (en): https://www.youtube.com/watch?v=8TJQhQ2GZ0Y"

Under the hood (high-level)

CLI
Orchestrator
├─ Fixed chain (summarize): ID → transcript → map → reduce → polish (bare LLM)
└─ Universal chain (ask): recurse while the model returns tool_calls
Tools (LangChain @tool)
├─ extract_video_id
├─ fetch_transcript (YouTubeTranscriptAPI; optional yt-dlp captions fallback)
├─ get_full_metadata, get_thumbnails
└─ search_youtube, get_trending_videos (best-effort)
Finalization
└─ Bare LLM (no tools) → guaranteed text

Why the fallback? In production you’ll meet videos with no official transcript. Add a yt-dlp captions fallback and, as a last resort, summarize metadata/description so you always produce something useful.
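A minimal sketch of that fallback shape, assuming the pre-1.0 youtube-transcript-api interface (the helper name and the empty-string sentinel are illustrative):

from youtube_transcript_api import YouTubeTranscriptApi

def fetch_transcript_or_fallback(video_id: str, language: str = "en") -> str:
    try:
        segments = YouTubeTranscriptApi.get_transcript(video_id, languages=[language])
        return " ".join(seg["text"] for seg in segments)
    except Exception:
        # No official transcript: the caller can try yt-dlp captions next,
        # or summarize metadata/description as a last resort.
        return ""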


Operational tips (so it stays fast and cheap)

  • TPM discipline: throttle chunk calls with short sleeps; prefer smaller models for map steps, then a slightly stronger model for merge/polish.

  • Chapters-aware chunking: if yt-dlp returns chapters, chunk on chapter boundaries first—better topical coherence and fewer duplicates.

  • Cache transcripts/metadata to disk (simple JSON files; see the sketch after this list). You’ll save both time and money on reruns.

  • Observability: log tool names and payload sizes. When costs drift, you’ll know why.
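A minimal sketch of that disk cache, keyed by video ID with one JSON file per (video, kind) pair; the directory name and helper are illustrative:

import json
from pathlib import Path

CACHE_DIR = Path(".yt_cache")

def cached_fetch(video_id: str, kind: str, fetch_fn):
    """Return cached JSON for (video_id, kind), fetching and saving on a miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{video_id}.{kind}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = fetch_fn(video_id)
    path.write_text(json.dumps(data))
    return data

# e.g. transcript = cached_fetch("8TJQhQ2GZ0Y", "transcript", fetch_transcript_fn)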


Closing thoughts

This is a tool designed for the messiness of real YouTube content. The combination of tool calling, a minimal agent loop, chunked summarization, and bare-LLM finalization makes it predictable, debuggable, and resilient. You can drop it into a pipeline today and get reliable results on long videos.
