Beyond Prompts: Building a Codex Playbook for Real Application Development

One-off prompting is weak engineering infrastructure.

It can be useful for getting unstuck, drafting code, or exploring ideas, but it does not produce repeatable engineering behavior on its own. The same prompt that works acceptably on one backend feature often degrades into generic advice or inconsistent implementation on the next one. That happens because a prompt is usually trying to do too many jobs at once: repository policy, architectural philosophy, feature workflow, review heuristics, and implementation detail.

This repository was built to separate those concerns and make them durable. 

JordiCorbilla/codex-engineering-playbook

The Core Problem

Most teams experimenting with coding agents start with giant instruction blobs. They pour style preferences, architecture opinions, code review heuristics, framework conventions, and edge-case warnings into a single file or prompt. That usually fails in predictable ways:

  • durable guidance gets buried under temporary workflow detail
  • the instructions become too large to scan and too vague to route reliably
  • project structure becomes secondary to the agent's last prompt
  • implementation becomes inconsistent across languages and application layers

The problem is not that the model needs more words. The problem is that the repository needs a better operating model.

What This Repository Builds Instead

This playbook uses four layers:

  1. AGENTS.md for durable repository-wide rules
  2. .agents/skills for reusable workflows
  3. templates and scripts for deterministic implementation support
  4. examples for concrete, inspectable reference projects

That split is the point of the repository.

The top-level AGENTS.md stays small. It says what the repository is for, when to prefer skills, what "done" means, and what kinds of validation are expected. It does not try to teach every implementation pattern directly. That restraint is important. Durable instructions should be hard to invalidate.
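One way to keep that split honest is a small layout check. The sketch below is illustrative only: AGENTS.md and .agents/skills come from the list above, while the templates, scripts, and examples directory names, the check_layout function, and the 80-line budget are assumptions, not something the repository is known to ship.

```python
from pathlib import Path

# AGENTS.md and .agents/skills are named in the post; the remaining
# directory names and the line budget are assumptions for illustration.
EXPECTED_LAYERS = {
    "durable rules": "AGENTS.md",
    "reusable workflows": ".agents/skills",
    "templates": "templates",
    "scripts": "scripts",
    "examples": "examples",
}
MAX_AGENTS_LINES = 80  # hypothetical budget for "stays small"


def check_layout(root: Path) -> list[str]:
    """Return human-readable problems; an empty list means the layout looks fine."""
    problems = []
    for layer, relative in EXPECTED_LAYERS.items():
        if not (root / relative).exists():
            problems.append(f"missing {layer}: {relative}")

    # Only measure AGENTS.md when it is present; absence is reported above.
    agents = root / "AGENTS.md"
    if agents.is_file():
        line_count = len(agents.read_text(encoding="utf-8").splitlines())
        if line_count > MAX_AGENTS_LINES:
            problems.append(
                f"AGENTS.md has {line_count} lines (budget {MAX_AGENTS_LINES})"
            )
    return problems
```

Calling check_layout(Path(".")) from the repository root would list any missing layer, and would flag a top-level file that has started to absorb workflow detail.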

The detailed workflow logic moves into focused skills, each covering a single workflow.

This is better than a giant style blob because routing matters. Reviewing a Python backend diff is not the same task as implementing a React feature or mapping an unfamiliar layered architecture. Putting those workflows into separate skills makes the instructions sharper and more reusable.

Why Durable Repo Instructions Matter

Agent behavior improves when the repository teaches stable defaults instead of relying on the operator to restate them every time.

In this repository, the durable defaults are simple:

  • inspect existing structure first
  • keep boundaries clear
  • put logic in the right layer
  • validate the touched area before calling the work done
  • explain trade-offs when architecture changes

Those rules are broad enough to matter across C#, Python, and React, but narrow enough to remain useful. They establish an operating model without pretending the languages are interchangeable.

Why Skills Beat Giant Prompt Files

A single giant prompt tends to flatten all engineering work into "follow best practices." That is almost always too soft. Real engineering work has modes, and the quality bar changes with the mode.

Feature work needs placement discipline. Review work needs risk detection. Contract review needs attention to caller-visible behavior. Architecture reconnaissance needs an accurate map, not rewrite advice.

That is why the skills in this repository are intentionally separate:

  • csharp-backend-feature cares about thin controllers, service-led behavior, async correctness, DTO boundaries, and targeted tests.
  • csharp-backend-review cares about fat controllers, sync-over-async mistakes, leaky entity boundaries, and weak exception flows.
  • python-backend-feature emphasizes typed boundaries, thin request handling, explicit services, and restrained abstraction.
  • frontend-ux-polish is about hierarchy, spacing, labels, state clarity, and affordances, not broad frontend architecture.

Separating those workflows does two things. It makes routing more reliable, and it avoids making every task carry the weight of every other task's instructions.

The C# Guidance

The C# portion of the playbook favors a layered ASP.NET Core style, but it is not blindly enterprise. The guidance in conventions/csharp.md pushes toward controllers, services, and repositories when those boundaries provide real value. It also says something many style guides avoid saying clearly: repository abstractions can become ceremony.

That trade-off matters. In the C# example app, the repository boundary exists because it helps keep persistence concerns separate and makes the service easy to test. But it stays intentionally small and in-memory. The goal is to demonstrate responsibility boundaries and async flow, not to perform ORM theater.

The example includes:

That is a good example of the repository's overall stance: use structure to clarify responsibilities, not to multiply files for their own sake.

The Python Guidance

Python code often swings between two bad extremes when teams talk about architecture. One extreme is route handlers full of business logic and IO. The other is an imported enterprise pattern that creates classes, abstract base types, and indirection that the codebase has not earned.

The Python guidance in conventions/python.md tries to avoid both.

It emphasizes:

  • thin request handling
  • typed boundaries
  • explicit validation
  • clear service or module ownership
  • simplicity over cleverness

In the Python example app, the route module translates HTTP concerns, the service owns the business path, and the data-access layer stores records in memory. The example is intentionally small, but the boundaries are inspectable.

That structure is useful because it makes future changes cheaper. A new validation rule or storage implementation has an obvious home.
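The three layers described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the example app: Task, InMemoryTaskStore, TaskService, handle_create_task, and the status codes are all hypothetical names and choices, and the handler takes a plain dict instead of a framework request object.

```python
from dataclasses import asdict, dataclass
from itertools import count


@dataclass(frozen=True)
class Task:
    id: int
    title: str


class ValidationError(ValueError):
    """Raised by the service when input breaks a business rule."""


class InMemoryTaskStore:
    """Data-access layer: keeps records in memory and knows nothing about HTTP."""

    def __init__(self) -> None:
        self._tasks: dict[int, Task] = {}
        self._ids = count(1)

    def add(self, title: str) -> Task:
        task = Task(id=next(self._ids), title=title)
        self._tasks[task.id] = task
        return task


class TaskService:
    """Service layer: owns validation and the business path."""

    def __init__(self, store: InMemoryTaskStore) -> None:
        self._store = store

    def create_task(self, title: str) -> Task:
        cleaned = title.strip()
        if not cleaned:
            raise ValidationError("title must not be empty")
        return self._store.add(cleaned)


def handle_create_task(service: TaskService, payload: dict) -> tuple[int, dict]:
    """Route layer: turns a request payload into a service call and a status code."""
    title = payload.get("title")
    if not isinstance(title, str):
        return 400, {"error": "title must be a string"}
    try:
        task = service.create_task(title)
    except ValidationError as exc:
        return 422, {"error": str(exc)}
    return 201, asdict(task)
```

In this sketch a new validation rule would go into TaskService.create_task, and a different storage implementation would replace InMemoryTaskStore without touching the handler.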

The React + TypeScript Guidance

Frontend guidance is often too aesthetic or too abstract. This playbook aims to be engineering guidance, so the React conventions focus on code structure, state handling, and contract discipline.

The priorities in conventions/react-typescript.md are:

  • keep components focused
  • separate API access from rendering
  • make loading, error, and empty states explicit
  • type props and async data honestly
  • keep hooks disciplined
  • avoid business logic hidden in JSX

The React example app is not a design system demo. It is a small application slice showing the separation between transport, state, and presentation.

That is deliberate. Many generated React examples hide the interesting part of the problem inside a single file. This one does not.

Why The Examples Matter

Example apps are not filler in a repository like this. They are operational anchors.

Without examples, convention documents drift toward abstract advice. Skills become more likely to overfit to generic framework knowledge. Templates become harder to evaluate because there is no concrete reference point for what "good" looks like in the repository.

The examples in this playbook make the guidance testable. They show the relationship between the documents and actual code. They also give Codex stable insertion points when exploring an unfamiliar but similarly structured codebase.

The practical run instructions live in the repository README. That split is intentional: this post explains the design, while the README acts as the operational starting point.

Scripts And Templates As Engineering Tools

Another failure mode in agent-oriented repositories is over-explaining deterministic work in prose. If a C# controller shape is predictable, a template is better than another paragraph. If you need to discover likely insertion points or summarize changed files, a small script is usually better than asking the model to infer everything from scratch every time.

That is why this repository includes small templates and scripts alongside the prose.

These are small on purpose. They exist to support repeatability, not to introduce another platform inside the repository.
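As an illustration of how small such a helper can be, here is a sketch of a changed-files summary. It is not the repository's own script: changed_files and summarize are hypothetical names, and grouping by top-level directory is an assumed convention. The only external call is `git diff --name-only`.

```python
import subprocess
from collections import defaultdict
from pathlib import PurePosixPath


def changed_files(base: str = "HEAD") -> list[str]:
    """List paths that differ from `base`, using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def summarize(paths: list[str]) -> dict[str, list[str]]:
    """Group changed paths by top-level directory ('.' for root-level files)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        area = parts[0] if len(parts) > 1 else "."
        groups[area].append(path)
    # Sort areas and files so the summary is stable between runs.
    return {area: sorted(files) for area, files in sorted(groups.items())}
```

Running summarize(changed_files()) inside a working tree returns the touched areas in a stable order, which gives the agent a deterministic starting point instead of a guess.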

Strengths

  • The repository is structured for reuse, not just demonstration.
  • The top-level instructions are durable enough to survive ordinary growth.
  • The skills are narrow enough to route reliably.
  • The examples make the guidance concrete instead of rhetorical.
  • The validation bar is lightweight but real.

Limitations

  • The playbook is intentionally opinionated, which means it will not fit every team culture.
  • The examples are small; they demonstrate boundaries, not the full complexity of production systems.
  • The repository does not solve framework-specific edge cases beyond the patterns it chooses to illustrate.
  • Skills can become stale if a team's real architecture evolves and the playbook does not.

There is also a rigidity trade-off. If a team treats the playbook as law rather than guidance, the codebase can settle into a local maximum: consistent, but no longer the best fit for the problem. Not every codebase needs controller/service/repository layering. Not every React feature needs another hook. Not every Python service needs another module boundary. Good playbooks create defaults, not dogma.

Where This Helps Most

This approach helps most when:

  • a team wants more repeatable coding-agent behavior
  • the codebase already values clear boundaries
  • multiple languages need a shared architectural vocabulary
  • implementation and review workflows should be routable and explicit

It helps less when:

  • the codebase is intentionally experimental
  • architecture changes weekly
  • framework-specific constraints dominate general engineering structure
  • the team wants maximal freedom and minimal convention

Pros And Cons

Pros:

  • better repeatability than one-off prompts
  • clearer separation between policy, workflow, and implementation detail
  • easier extension through focused skills
  • stronger grounding through examples and validation

Cons:

  • more repository structure to maintain
  • more upfront design work than dropping a prompt file into a repo
  • some duplication between conventions, skills, and examples is intentional and needs upkeep
  • can become too rigid if teams stop exercising judgment

Final View

The point of this repository is not to prove that Codex can follow instructions. The point is to make those instructions worth following.

That requires durable repository guidance, focused workflow skills, deterministic helpers, and concrete examples. It also requires saying no to the usual temptation to dump everything into a single mega-prompt and hope the agent will sort it out.

That temptation is understandable. It is also a weak substitute for engineering structure.
