Skills for Codex: The missing layer between prompts and reliable workflows
AGENTS.md, skills, and scripts

Most teams experimenting with coding agents are still operating at the wrong layer.
They keep trying to get reliability by writing better prompts. That helps a little. It does not solve the real problem.
The hard part is not telling the model to "be careful" or "think step by step." The hard part is turning repeated engineering workflows into artifacts the agent can reuse: repo instructions, narrowly scoped skills, and deterministic helper scripts.
That is what I built in this demo repository.
The repo is small on purpose. It contains a tiny legacy-style Python service, four concrete Codex skills, a short AGENTS.md, several Python scripts, and lightweight tests. The point is not to show off a toy prompt library. The point is to show what a serious workflow layer looks like when you want an agent to behave more like an engineer and less like an improv partner.
One-line takeaway: prompting is too ephemeral for repeated engineering work.
Another one: if a workflow matters more than once, it should probably exist as a repo artifact.
Why prompting alone breaks down
One-off prompting degrades for boring reasons:
- The same task gets described slightly differently every time.
- The agent re-derives the workflow from scratch.
- Deterministic steps get handled as prose instead of code.
- The model mixes operational guidance with local reasoning and drops part of it under pressure.
Senior engineers already know this pattern from human systems. If a workflow matters, we do not rely on memory alone. We add runbooks, scripts, templates, checks, and conventions.
Coding agents need the same thing.
"Please inspect the code carefully before changing anything" is not a reliable mechanism. It is a wish. A skill with explicit triggers, steps, outputs, and boundaries is a mechanism.
The missing layer: AGENTS.md + skills + scripts
The repo uses three layers, each doing a different job.
AGENTS.md is the repo-level contract. It tells Codex how work should be done here:
- prefer skills for recurring workflows
- use scripts for deterministic tasks
- validate changes before calling them done
- plan first when risk or ambiguity is non-trivial
That file is intentionally short. Repo instructions are more useful when they set constraints and decision rules, not when they try to micromanage every possible action.
The skills are the workflow layer. Each one has a sharp name, obvious routing description, trigger conditions, workflow steps, expected outputs, and boundaries. That matters because routing quality is part of the problem. If skill names are vague, the agent will miss them or apply them at the wrong time.
The scripts are there for the parts that should not depend on language generation at all. Parsing test failures, scanning a repo for hotspots, and extracting risk signals from a diff are all better handled with code than with a paragraph of instructions.
One-line takeaway: natural language is a poor substitute for a parser.
The example repo
The demo repo has a compact but realistic structure:
- skills/ contains four engineering workflows
- scripts/ contains Python helpers for deterministic scanning and summarization
- demo/legacy_service/ contains a deliberately awkward Python service
- demo/artifacts/ contains failing test output and a risky sample diff
- tests/ validates the scripts
The service is small, but it has the right kind of mess:
- module-level in-memory storage
- a module-level cache
- configuration read from environment at import time
- time-sensitive pricing logic
- hidden coupling between service, storage, and cache code
That is enough to make the skills meaningful. The repo is not pretending that agents only work on greenfield, well-factored code.
How to get the repo and run it
The repository is here:
https://github.com/JordiCorbilla/engineering-skills-for-codex
If you want the full example locally, clone it and run the lightweight validation:
git clone https://github.com/JordiCorbilla/engineering-skills-for-codex.git
cd engineering-skills-for-codex
python -m unittest discover -s tests -p "test_*.py" -v
python -m unittest discover -s demo/tests -p "test_*.py" -v
If make is available in your environment, the equivalent entry points are simpler:
make test
make validate
The skills themselves live under skills/. In practice, you use them by opening the repo in Codex and invoking the workflow by name when the task matches. For example:
- ask Codex to use investigate-before-patching before editing demo/legacy_service/services/order_service.py
- ask Codex to use summarize-test-failures against demo/artifacts/pytest_failure_output.txt
- ask Codex to use legacy-codebase-recon on demo/legacy_service
- ask Codex to use review-change-safely on demo/artifacts/sample_change.diff
The deterministic parts can also be run directly without the skill wrapper:
python scripts/inspect_patch_context.py --root demo --target demo/legacy_service/services/order_service.py
python scripts/summarize_test_failures.py --input demo/artifacts/pytest_failure_output.txt
python scripts/legacy_recon.py --root demo/legacy_service
python scripts/diff_review_summary.py --input demo/artifacts/sample_change.diff
If you want to reuse the skills in another repository, the practical starting point is to copy the relevant folder from skills/, bring over any supporting script from scripts/, and adapt AGENTS.md so the repo-level instructions match the codebase you actually have. Treat these skills as templates for repeatable workflows, not drop-in magic.
The skills
investigate-before-patching
This skill exists to stop speculative editing.
Before changing code, the workflow inspects the target module, local dependencies, related tests, and obvious coverage gaps. The helper script, scripts/inspect_patch_context.py, gives a deterministic first pass over imports, sibling modules, and nearby tests.
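The repo's actual script is not reproduced here, but the deterministic first pass it describes can be sketched in a few lines. This is a minimal illustration, not the real inspect_patch_context.py; the function name and output fields are invented for the example:

```python
# Minimal sketch of a pre-patch context scan (illustrative only).
# Collects a target module's imports, its sibling modules, and any
# test files under the root that mention the module by name.
import ast
from pathlib import Path

def inspect_context(root: str, target: str) -> dict:
    target_path = Path(target)
    tree = ast.parse(target_path.read_text())

    # Top-level packages imported by the target module.
    imports = sorted({
        name.name.split(".")[0]
        for node in ast.walk(tree)
        if isinstance(node, ast.Import)
        for name in node.names
    } | {
        node.module.split(".")[0]
        for node in ast.walk(tree)
        if isinstance(node, ast.ImportFrom) and node.module
    })

    # Sibling modules in the same package directory.
    siblings = sorted(
        p.name for p in target_path.parent.glob("*.py") if p != target_path
    )

    # Test files that reference the module's stem anywhere under root.
    stem = target_path.stem
    tests = sorted(
        str(p) for p in Path(root).rglob("test_*.py")
        if stem in p.read_text()
    )
    return {"imports": imports, "siblings": siblings, "tests": tests}
```

Even a scan this crude gives the agent a concrete starting inventory instead of a blank page.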
That sounds basic. It is also exactly the sort of discipline agents tend to skip when left alone. They jump to the patch because producing a patch is easier than earning one.
The important design choice here is scope. The skill does not promise a full call graph. It explicitly says the scan is a starting point and that critical modules still need to be read.
summarize-test-failures
This skill handles a very common failure mode in agent-driven work: drowning in raw test output.
The script ingests a test log, extracts parseable failures, clusters them by likely cause, and produces a debugging brief. In the demo, the sample pytest log gets reduced to three concrete buckets:
- stale state or cache invalidation
- pricing rule regression
- configuration drift
That is already more useful than dumping forty lines of traceback back at the user.
The limit is important, though: this is a triage tool, not a diagnosis engine. If the log is messy or incomplete, the summary should stay humble.
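The extract-and-cluster step can be sketched as follows. This is a hedged illustration, not the repo's summarize_test_failures.py: the keyword-to-bucket mapping is invented, and real clustering would need more signal than the test ID alone:

```python
# Illustrative triage parser: pull FAILED lines out of pytest output
# and group them into coarse cause buckets by keyword match.
import re
from collections import defaultdict

# Hypothetical keyword -> bucket mapping; purely for demonstration.
BUCKETS = {
    "cache": "stale state or cache invalidation",
    "price": "pricing rule regression",
    "config": "configuration drift",
}

def triage(log_text: str) -> dict:
    """Group FAILED test lines into coarse cause buckets."""
    failures = re.findall(r"^FAILED\s+(\S+)", log_text, re.MULTILINE)
    grouped = defaultdict(list)
    for test_id in failures:
        bucket = next(
            (label for key, label in BUCKETS.items() if key in test_id.lower()),
            "unclassified",
        )
        grouped[bucket].append(test_id)
    return dict(grouped)
```

The point is that the extraction is literal and repeatable; the model then works from the buckets, not from forty lines of raw traceback.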
legacy-codebase-recon
This is the first-pass orientation skill for inherited systems.
The script looks for entry points and scores modules for hotspots using simple heuristics: mutable globals, environment coupling, cache usage, time-sensitive behavior, and general structural weight. On the demo service, it correctly flags the order service and repository layer as the riskiest modules.
This is exactly where skills can help: not by pretending to understand a legacy system instantly, but by providing a disciplined first pass that makes the next manual inspection smarter.
The failure mode is also obvious. Heuristics are directional. A hotspot score is not architecture truth.
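To make the heuristic nature concrete, here is a rough sketch of how such a score could be computed. It is not the repo's legacy_recon.py; the signal tokens and equal weighting are simplifying assumptions:

```python
# Sketch of a heuristic hotspot score over a Python source tree.
# Each signal is a crude substring check, weighted equally.
from pathlib import Path

SIGNALS = {
    "mutable global": "global ",
    "env coupling": "os.environ",
    "cache usage": "cache",
    "time sensitivity": "datetime.now",
}

def hotspot_scores(root: str) -> list[tuple[str, int]]:
    """Score each module by how many risk signals its source contains."""
    scores = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text().lower()
        score = sum(1 for token in SIGNALS.values() if token in text)
        scores.append((str(path), score))
    # Highest-risk modules first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A ranking like this is directional at best, which is exactly why the skill frames it as a first pass rather than a verdict.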
review-change-safely
This skill is designed for reviewing diffs with an emphasis on unintended side effects.
The demo diff changes a function signature, alters the returned payload shape, and replaces a guarded key lookup with a required-key lookup. The script surfaces all three and also notes that no tests moved with the behavior change.
That is a useful baseline because many agent reviews default to a glorified summary. Serious review work needs findings, not narration.
The skill therefore forces a review shape:
- findings first
- severity and confidence
- file references
- residual risk
That structure matters. It is much easier to trust and act on than a free-form paragraph about what changed.
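The machine-detectable half of that review can be sketched like this. It is a deliberately small stand-in for the repo's diff_review_summary.py, handling only two of the checks described above, with invented finding labels:

```python
# Illustrative diff check: flag function signature changes and note
# when no test files moved alongside the change.
import re

def review_diff(diff_text: str) -> dict:
    findings = []

    # Pair removed and added `def` lines by function name.
    removed = re.findall(r"^-\s*def\s+(\w+)\(([^)]*)\)", diff_text, re.M)
    added = dict(re.findall(r"^\+\s*def\s+(\w+)\(([^)]*)\)", diff_text, re.M))
    for name, old_args in removed:
        if name in added and added[name] != old_args:
            findings.append({
                "finding": f"signature change in {name}()",
                "severity": "high",
            })

    # Files touched by the diff, from the +++ headers.
    touched = re.findall(r"^\+\+\+\s+b/(\S+)", diff_text, re.M)
    if not any("test" in path for path in touched):
        findings.append({
            "finding": "no tests changed alongside behavior change",
            "severity": "medium",
        })
    return {"findings": findings, "files": touched}
```

Returning structured findings with severities is what lets the model spend its effort on judgment rather than on re-describing the diff.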
Why scripts matter
The scripts are not there because Python is exciting. They are there because some tasks are simply better when the machine is forced to be literal.
Three examples:
First, parsing failure lines. Asking a model to "read this pytest output and summarize it" works until the output is long, repetitive, or messy. A parser is cheap and stable.
Second, scanning code for risk signals. Telling the model to "look for tight coupling and hidden state" is too vague on its own. A script can make that scan explicit and repeatable, even if the results are only heuristic.
Third, diff review. Function signature changes, payload key changes, and test coverage gaps are not subtle concepts. They should be machine-detectable.
The right split is simple:
- use scripts for deterministic extraction
- use skills for workflow logic
- use the model for judgment, prioritization, and synthesis
What worked well
The most useful part of the repo is the layering.
AGENTS.md gives the repo a stable operating model. The skills give recurring tasks structure and routing clarity. The scripts prevent a class of drift where every run invents a slightly different method. The tests keep the helpers from silently rotting.
The second useful part is restraint. The skills are short. They are not trying to encode every engineering principle. They are only trying to make a few repeated workflows reliable enough to be worth using.
One-line takeaway: a narrow skill you can trust is worth more than a broad skill you cannot.
What still does not work well
Skills do not remove the need for real engineering judgment.
They do not understand runtime behavior. They do not replace reading the code. They do not magically fix weak tests, misleading names, or missing context from the user. They also do not help much when the task is genuinely novel and there is no repeated workflow to package.
There is also a maintenance cost. Once you create a skill, you own it. If the repo changes and the skill instructions or script behavior drift, you have created a false sense of reliability.
This is why I would be cautious about building fifty skills too early. Most teams should start with a handful of painful, recurring workflows and make those solid.
When I would use this in production
I would use this pattern in production when:
- the team is repeatedly using agents on the same codebase
- the work includes bug triage, reviews, legacy recon, or change planning
- there is enough repetition to justify a workflow artifact
- the team cares more about consistency than about model theatrics
I would not bother in a repo where agent use is rare, the workflows are highly bespoke, or the codebase is changing so quickly that skill maintenance would dominate the value.
The threshold is not "do we use AI?" The threshold is "do we keep repeating the same engineering moves?"
Closing thoughts
The claim here is modest.
Coding agents do not become reliable because we found the perfect prompt. They become more useful when we stop forcing the model to reinvent stable workflows on every task.
That means packaging the workflow.
Write the repo contract in AGENTS.md. Turn repeated tasks into skills with real boundaries. Move deterministic steps into scripts. Test those scripts. Then let the model operate inside that frame.
That is less glamorous than prompt alchemy. It is also much closer to how engineering systems usually become dependable.
