Skills for Codex: The missing layer between prompts and reliable workflows
AGENTS.md, skills, and scripts

Most teams experimenting with coding agents are still operating at the wrong layer.
They keep trying to get reliability by writing better prompts. That helps a little. It does not solve the real problem.
The hard part is not telling the model to "be careful" or "think step by step." The hard part is turning repeated engineering workflows into artifacts the agent can reuse: repo instructions, narrowly scoped skills, and deterministic helper scripts.
That is what I built in this demo repository.
The repo is small on purpose. It contains a tiny legacy-style Python service, four concrete Codex skills, a short AGENTS.md, several Python scripts, and lightweight tests. The point is not to show off a toy prompt library. The point is to show what a serious workflow layer looks like when you want an agent to behave more like an engineer and less like an improv partner.
One-line takeaway: prompting is too ephemeral for repeated engineering work.
Another one: if a workflow matters more than once, it should probably exist as a repo artifact.
Why prompting alone breaks down
One-off prompting degrades for boring reasons:
- The same task gets described slightly differently every time.
- The agent re-derives the workflow from scratch.
- Deterministic steps get handled as prose instead of code.
- The model mixes operational guidance with local reasoning and drops part of it under pressure.
Senior engineers already know this pattern from human systems. If a workflow matters, we do not rely on memory alone. We add runbooks, scripts, templates, checks, and conventions.
Coding agents need the same thing.
"Please inspect the code carefully before changing anything" is not a reliable mechanism. It is a wish. A skill with explicit triggers, steps, outputs, and boundaries is a mechanism.
The missing layer: AGENTS.md + skills + scripts
The repo uses three layers, each doing a different job.
AGENTS.md is the repo-level contract. It tells Codex how work should be done here:
- prefer skills for recurring workflows
- use scripts for deterministic tasks
- validate changes before calling them done
- plan first when risk or ambiguity is non-trivial
That file is intentionally short. Repo instructions are more useful when they set constraints and decision rules, not when they try to micromanage every possible action.
The skills are the workflow layer. Each one has a sharp name, obvious routing description, trigger conditions, workflow steps, expected outputs, and boundaries. That matters because routing quality is part of the problem. If skill names are vague, the agent will miss them or apply them at the wrong time.
The scripts are there for the parts that should not depend on language generation at all. Parsing test failures, scanning a repo for hotspots, and extracting risk signals from a diff are all better handled with code than with a paragraph of instructions.
One-line takeaway: natural language is a poor substitute for a parser.
The example repo
The demo repo has a compact but realistic structure:
- skills/ contains four engineering workflows
- scripts/ contains Python helpers for deterministic scanning and summarization
- demo/legacy_service/ contains a deliberately awkward Python service
- demo/artifacts/ contains failing test output and a risky sample diff
- tests/ validates the scripts
The service is small, but it has the right kind of mess:
- module-level in-memory storage
- a module-level cache
- configuration read from environment at import time
- time-sensitive pricing logic
- hidden coupling between service, storage, and cache code
That is enough to make the skills meaningful. The repo is not pretending that agents only work on greenfield, well-factored code.
How to get the repo and run it
The repository is here:
https://github.com/JordiCorbilla/engineering-skills-for-codex
If you want the full example locally, clone it and run the lightweight validation:
git clone https://github.com/JordiCorbilla/engineering-skills-for-codex.git
cd engineering-skills-for-codex
python -m unittest discover -s tests -p "test_*.py" -v
python -m unittest discover -s demo/tests -p "test_*.py" -v
If make is available in your environment, the equivalent entry points are simpler:
make test
make validate
The skills themselves live under skills/. In practice, you use them by opening the repo in Codex and invoking the workflow by name when the task matches. For example:
- ask Codex to use investigate-before-patching before editing demo/legacy_service/services/order_service.py
- ask Codex to use summarize-test-failures against demo/artifacts/pytest_failure_output.txt
- ask Codex to use legacy-codebase-recon on demo/legacy_service
- ask Codex to use review-change-safely on demo/artifacts/sample_change.diff
The deterministic parts can also be run directly without the skill wrapper:
python scripts/inspect_patch_context.py --root demo --target demo/legacy_service/services/order_service.py
python scripts/summarize_test_failures.py --input demo/artifacts/pytest_failure_output.txt
python scripts/legacy_recon.py --root demo/legacy_service
python scripts/diff_review_summary.py --input demo/artifacts/sample_change.diff
If you want to reuse the skills in another repository, the practical starting point is to copy the relevant folder from skills/, bring over any supporting script from scripts/, and adapt AGENTS.md so the repo-level instructions match the codebase you actually have. Treat these skills as templates for repeatable workflows, not drop-in magic.
The skills
investigate-before-patching
This skill exists to stop speculative editing.
Before changing code, the workflow inspects the target module, local dependencies, related tests, and obvious coverage gaps. The helper script, scripts/inspect_patch_context.py, gives a deterministic first pass over imports, sibling modules, and nearby tests.
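The repo's actual script is not reproduced here, but the deterministic first pass it describes can be sketched in a few lines. This is a minimal illustration, not the real inspect_patch_context.py; the function name and output fields are invented for the example:

```python
# Minimal sketch of a pre-patch context scan (illustrative only).
# Collects a target module's imports, its sibling modules, and any
# test files under the root that mention the module by name.
import ast
from pathlib import Path

def inspect_context(root: str, target: str) -> dict:
    target_path = Path(target)
    tree = ast.parse(target_path.read_text())

    # Top-level packages imported by the target module.
    imports = sorted({
        name.name.split(".")[0]
        for node in ast.walk(tree)
        if isinstance(node, ast.Import)
        for name in node.names
    } | {
        node.module.split(".")[0]
        for node in ast.walk(tree)
        if isinstance(node, ast.ImportFrom) and node.module
    })

    # Sibling modules in the same package directory.
    siblings = sorted(
        p.name for p in target_path.parent.glob("*.py") if p != target_path
    )

    # Test files that reference the module's stem anywhere under root.
    stem = target_path.stem
    tests = sorted(
        str(p) for p in Path(root).rglob("test_*.py")
        if stem in p.read_text()
    )
    return {"imports": imports, "siblings": siblings, "tests": tests}
```

Even a scan this crude gives the agent a concrete starting inventory instead of a blank page.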
That sounds basic. It is also exactly the sort of discipline agents tend to skip when left alone. They jump to the patch because producing a patch is easier than earning one.
The important design choice here is scope. The skill does not promise a full call graph. It explicitly says the scan is a starting point and that critical modules still need to be read.
summarize-test-failures
This skill handles a very common failure mode in agent-driven work: drowning in raw test output.
The script ingests a test log, extracts parseable failures, clusters them by likely cause, and produces a debugging brief. In the demo, the sample pytest log gets reduced to three concrete buckets:
- stale state or cache invalidation
- pricing rule regression
- configuration drift
That is already more useful than dumping forty lines of traceback back at the user.
The limit is important, though: this is a triage tool, not a diagnosis engine. If the log is messy or incomplete, the summary should stay humble.
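The extract-and-cluster step can be sketched as follows. This is a hedged illustration, not the repo's summarize_test_failures.py: the keyword-to-bucket mapping is invented, and real clustering would need more signal than the test ID alone:

```python
# Illustrative triage parser: pull FAILED lines out of pytest output
# and group them into coarse cause buckets by keyword match.
import re
from collections import defaultdict

# Hypothetical keyword -> bucket mapping; purely for demonstration.
BUCKETS = {
    "cache": "stale state or cache invalidation",
    "price": "pricing rule regression",
    "config": "configuration drift",
}

def triage(log_text: str) -> dict:
    """Group FAILED test lines into coarse cause buckets."""
    failures = re.findall(r"^FAILED\s+(\S+)", log_text, re.MULTILINE)
    grouped = defaultdict(list)
    for test_id in failures:
        bucket = next(
            (label for key, label in BUCKETS.items() if key in test_id.lower()),
            "unclassified",
        )
        grouped[bucket].append(test_id)
    return dict(grouped)
```

The point is that the extraction is literal and repeatable; the model then works from the buckets, not from forty lines of raw traceback.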
legacy-codebase-recon
This is the first-pass orientation skill for inherited systems.
The script looks for entry points and scores modules for hotspots using simple heuristics: mutable globals, environment coupling, cache usage, time-sensitive behavior, and general structural weight. On the demo service, it correctly flags the order service and repository layer as the riskiest modules.
This is exactly where skills can help: not by pretending to understand a legacy system instantly, but by providing a disciplined first pass that makes the next manual inspection smarter.
The failure mode is also obvious. Heuristics are directional. A hotspot score is not architecture truth.
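To make the heuristic nature concrete, here is a rough sketch of how such a score could be computed. It is not the repo's legacy_recon.py; the signal tokens and equal weighting are simplifying assumptions:

```python
# Sketch of a heuristic hotspot score over a Python source tree.
# Each signal is a crude substring check, weighted equally.
from pathlib import Path

SIGNALS = {
    "mutable global": "global ",
    "env coupling": "os.environ",
    "cache usage": "cache",
    "time sensitivity": "datetime.now",
}

def hotspot_scores(root: str) -> list[tuple[str, int]]:
    """Score each module by how many risk signals its source contains."""
    scores = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text().lower()
        score = sum(1 for token in SIGNALS.values() if token in text)
        scores.append((str(path), score))
    # Highest-risk modules first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A ranking like this is directional at best, which is exactly why the skill frames it as a first pass rather than a verdict.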
review-change-safely
This skill is designed for reviewing diffs with an emphasis on unintended side effects.
The demo diff changes a function signature, alters the returned payload shape, and replaces a guarded key lookup with a required-key lookup. The script surfaces all three and also notes that no tests moved with the behavior change.
That is a useful baseline because many agent reviews default to a glorified summary. Serious review work needs findings, not narration.
The skill therefore forces a review shape:
- findings first
- severity and confidence
- file references
- residual risk
That structure matters. It is much easier to trust and act on than a free-form paragraph about what changed.
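The machine-detectable half of that review can be sketched like this. It is a deliberately small stand-in for the repo's diff_review_summary.py, handling only two of the checks described above, with invented finding labels:

```python
# Illustrative diff check: flag function signature changes and note
# when no test files moved alongside the change.
import re

def review_diff(diff_text: str) -> dict:
    findings = []

    # Pair removed and added `def` lines by function name.
    removed = re.findall(r"^-\s*def\s+(\w+)\(([^)]*)\)", diff_text, re.M)
    added = dict(re.findall(r"^\+\s*def\s+(\w+)\(([^)]*)\)", diff_text, re.M))
    for name, old_args in removed:
        if name in added and added[name] != old_args:
            findings.append({
                "finding": f"signature change in {name}()",
                "severity": "high",
            })

    # Files touched by the diff, from the +++ headers.
    touched = re.findall(r"^\+\+\+\s+b/(\S+)", diff_text, re.M)
    if not any("test" in path for path in touched):
        findings.append({
            "finding": "no tests changed alongside behavior change",
            "severity": "medium",
        })
    return {"findings": findings, "files": touched}
```

Returning structured findings with severities is what lets the model spend its effort on judgment rather than on re-describing the diff.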
Why scripts matter
The scripts are not there because Python is exciting. They are there because some tasks are simply better when the machine is forced to be literal.
Three examples:
First, parsing failure lines. Asking a model to "read this pytest output and summarize it" works until the output is long, repetitive, or messy. A parser is cheap and stable.
Second, scanning code for risk signals. Telling the model to "look for tight coupling and hidden state" is too vague on its own. A script can make that scan explicit and repeatable, even if the results are only heuristic.
Third, diff review. Function signature changes, payload key changes, and test coverage gaps are not subtle concepts. They should be machine-detectable.
The right split is simple:
- use scripts for deterministic extraction
- use skills for workflow logic
- use the model for judgment, prioritization, and synthesis
What worked well
The most useful part of the repo is the layering.
AGENTS.md gives the repo a stable operating model. The skills give recurring tasks structure and routing clarity. The scripts prevent a class of drift where every run invents a slightly different method. The tests keep the helpers from silently rotting.
The second useful part is restraint. The skills are short. They are not trying to encode every engineering principle. They are only trying to make a few repeated workflows reliable enough to be worth using.
One-line takeaway: a narrow skill you can trust is worth more than a broad skill you cannot.
What still does not work well
Skills do not remove the need for real engineering judgment.
They do not understand runtime behavior. They do not replace reading the code. They do not magically fix weak tests, misleading names, or missing context from the user. They also do not help much when the task is genuinely novel and there is no repeated workflow to package.
There is also a maintenance cost. Once you create a skill, you own it. If the repo changes and the skill instructions or script behavior drift, you have created a false sense of reliability.
This is why I would be cautious about building fifty skills too early. Most teams should start with a handful of painful, recurring workflows and make those solid.
When I would use this in production
I would use this pattern in production when:
- the team is repeatedly using agents on the same codebase
- the work includes bug triage, reviews, legacy recon, or change planning
- there is enough repetition to justify a workflow artifact
- the team cares more about consistency than about model theatrics
I would not bother in a repo where agent use is rare, the workflows are highly bespoke, or the codebase is changing so quickly that skill maintenance would dominate the value.
The threshold is not "do we use AI?" The threshold is "do we keep repeating the same engineering moves?"
Closing thoughts
The claim here is modest.
Coding agents do not become reliable because we found the perfect prompt. They become more useful when we stop forcing the model to reinvent stable workflows on every task.
That means packaging the workflow.
Write the repo contract in AGENTS.md. Turn repeated tasks into skills with real boundaries. Move deterministic steps into scripts. Test those scripts. Then let the model operate inside that frame.
That is less glamorous than prompt alchemy. It is also much closer to how engineering systems usually become dependable.
