June 2025

Prompt to Production

Turn a useful prompt into code, tests, interfaces, and operations.

The request came in as a familiar shape: a senior reviewer had built something in ChatGPT that saved hours on batch record checks. Leadership wanted to roll it out. Quality wanted to know who ran what, when, and on which data. IT wanted something in the repo, not a shared browser tab.

The prompt itself was fine. It extracted fields from PDF batch records, flagged missing signatures, and summarized deviations in plain language. Run it three times on the same file and you might get three slightly different summaries. That was acceptable for exploration. It was not acceptable for anything that would show up in an inspection.

What the demo hid

Demos hide repetition. In a meeting you run one file, skim the output, nod, and move on. Production means the same job on four hundred files this month and four hundred more next month. It means someone on night shift gets a different model behavior than day shift because the vendor shipped a silent update. It means a reviewer needs to reopen a case from March and see exactly what the system said then, not what it would say today.

The team had no version on the prompt. No log of inputs and outputs tied to record IDs. No way to override a bad extraction without copying text into email. Approvals happened outside the tool. That is not a production system. It is a very fast intern with no employee ID.

What we built instead

We kept the useful part and replaced the runtime. The prompt became a versioned template stored in git, not a string in someone's notebook. A small Django app pulled PDFs from the existing document store, called the model through a fixed API wrapper, and wrote structured results to Postgres. Every run got a row: file hash, model version, prompt version, timestamp, raw response, parsed fields.

Reviewers worked in a queue, not a chat window. They saw extracted fields beside the PDF, edited what was wrong, and signed off. Exceptions routed to a second reviewer. Nothing marked complete until a human with the right role clicked approve. The model still did the tedious reading. The database owned the truth.

Tests were boring on purpose. Golden files with expected extractions. CI failed if a prompt change shifted outputs without an explicit version bump. That annoyed people who wanted to tweak wording daily. It also meant audit could ask what changed between release 1.4 and 1.5 and get an answer in one query.

What I would do again

Start from the audit question, not the prompt. Ask what record someone needs to reconstruct six months later. Build that first. Then wire the model in as a step with clear inputs and outputs.

Keep the first production scope narrow. We did not try to automate final release decisions. We automated read-and-structure, then let humans keep judgment where the SOP required it. That shipped in ten weeks. A bigger agent fantasy would still be in slides.

Treat the prompt like a dependency. Semver it. Review changes like code. When the model vendor updates, run the golden files and look at the diff. If you would not deploy a library upgrade without tests, do not deploy a prompt change without them either.

When a prompt is enough

Sometimes the answer is still a prompt. Ad hoc analysis, one-off investigations, personal productivity. The line is ownership and repetition. If the output becomes a record, touches a regulated process, or runs unattended, it needs code around it.

The QC team still uses AI every day. They just do not bet the audit trail on a chat session. That is the whole point of prompt to production: keep what works, put it inside software your team owns, and know exactly what ran.