Prompts are code. This observation is not novel — people have been making it since early 2023 — but the organizational and tooling implications still haven't fully landed in most enterprise AI teams. A system prompt running in production is a piece of logic that determines how an AI system behaves across potentially millions of interactions. Changing it changes the system's behavior. A bad change can degrade performance, create compliance exposure, or produce outputs that are wrong in subtle ways that don't immediately surface in monitoring. And yet, in most organizations, prompts are managed with less rigor than any other piece of production software — edited directly in production, with no version control, no testing, and no deployment review process.
The reason this gap persists is partly cultural. "Prompt engineering" emerged as a discipline that looked more like creative writing than software engineering. The skillset emphasized iteration, intuition, and the ability to coax desired behavior from a model through careful phrasing — not formal testing, version control, and deployment pipelines. That framing was appropriate when prompts were experiments. It's inappropriate when prompts are running in critical business processes at scale.
What a production prompt management discipline looks like
Version control as a baseline requirement. Every prompt running in production should be version-controlled with the same rigor as application code. This means: each prompt has an explicit version identifier, changes are reviewed before deployment, the history of what was changed and why is preserved, and rollback is possible when a change degrades performance. None of this is technically complex. It requires organizational policy, not sophisticated tooling. The fact that most teams don't do it reflects how recently prompts entered the "production software" category, not a genuine barrier to adoption.
Automated evaluation before deployment. The equivalent of a test suite for prompts is an evaluation set: a collection of inputs with expected outputs or quality criteria, run against the new prompt version before it's deployed. Building and maintaining this evaluation set is work — it requires curating examples that cover the range of inputs the system will actually see, including edge cases and adversarial inputs. But it's the right work. Teams that run evaluation before every prompt change catch regressions before users do. Teams that don't find out about regressions from user complaints.
The infrastructure investment thesis here is that the tooling category for prompt management — versioning, evaluation, deployment pipelines, A/B testing, performance monitoring over time — is underbuilt relative to the importance of prompt management in production AI systems. The companies building this infrastructure correctly are building it as the software engineering equivalent of CI/CD for AI applications. That positioning is exactly right, and it's where we expect the durable companies in this category to emerge.