    AI Engineering

    AI Prompt Management: Treating Prompts Like Code

    Steinn Labs · 7 min read

    Key Takeaways

    • Prompts are infrastructure that need version control, testing, and rollback capabilities
    • Every prompt change should trigger evaluation against a test dataset
    • A/B testing prompts catches issues that test datasets miss
    • Langfuse plus Git-based prompt files proved more flexible for us than dedicated prompt management platforms

    Prompts Are Infrastructure

    When your product relies on AI, prompts are not just strings; they are infrastructure. A small change in wording can dramatically alter model output, break output formatting, or introduce regressions. Yet most teams still treat prompts as hardcoded strings buried in application code.

    The Problems with Inline Prompts

    • No version history: who changed the prompt and when?
    • No testing: how do you know a prompt change did not break something?
    • No meaningful review: prompt changes go through code review, but reviewers cannot evaluate prompt quality by reading a diff
    • No rollback: if a prompt change causes issues, rolling back requires a code deployment

    Our Prompt Management Approach

    Separate Prompts from Code

    We store prompts as version-controlled files in a dedicated directory. Each prompt file holds a name, a version, the prompt text, and metadata describing what the prompt does and when it was last evaluated.
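
    As a rough illustration of such a file layout, the sketch below loads one prompt from a JSON file. The JSON format, the field names (`description`, `last_evaluated`), and the `load_prompt` helper are all assumptions made for this example, not a description of our actual files.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PromptFile:
    """One prompt as stored on disk (hypothetical schema)."""
    name: str
    version: str
    text: str
    description: str      # what the prompt does
    last_evaluated: str   # ISO date of the most recent evaluation run


def load_prompt(directory: Path, name: str) -> PromptFile:
    """Read <directory>/<name>.json; raises TypeError on missing or extra fields."""
    raw = json.loads((directory / f"{name}.json").read_text(encoding="utf-8"))
    return PromptFile(**raw)


if __name__ == "__main__":
    # Demo with a throwaway directory standing in for the prompts directory.
    with tempfile.TemporaryDirectory() as tmp:
        example = {
            "name": "summarize_ticket",
            "version": "3",
            "text": "Summarize the support ticket in two sentences.",
            "description": "Short summaries for the support dashboard",
            "last_evaluated": "2024-01-15",
        }
        (Path(tmp) / "summarize_ticket.json").write_text(json.dumps(example), encoding="utf-8")
        print(load_prompt(Path(tmp), "summarize_ticket"))
```

    Because the files live in Git, the version history, review, and rollback come from the repository itself; the loader only needs to reject malformed files early.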

    Prompt Testing Pipeline

    Every prompt change triggers an evaluation pipeline that runs the new prompt against our test dataset and compares results to the previous version. We flag regressions automatically and require human review for changes that affect more than 5% of test outputs.
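
    The comparison step can be sketched as follows. This is a minimal example that assumes outputs are keyed by test-case id and that "affected" means the output text differs exactly; a real pipeline would likely use a task-specific scorer instead. The names `compare_runs` and `EvalReport` are hypothetical.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.05  # from the text: human review above 5% of test outputs


@dataclass(frozen=True)
class EvalReport:
    changed_ids: list[str]
    changed_fraction: float
    needs_human_review: bool


def compare_runs(previous: dict[str, str], candidate: dict[str, str]) -> EvalReport:
    """Compare outputs from two prompt versions, keyed by test-case id."""
    if previous.keys() != candidate.keys():
        raise ValueError("both runs must cover the same test cases")
    if not previous:
        raise ValueError("test dataset is empty")
    # Assumption: an output counts as changed when the text is not identical.
    changed = sorted(cid for cid in previous if previous[cid] != candidate[cid])
    fraction = len(changed) / len(previous)
    return EvalReport(changed, fraction, fraction > REVIEW_THRESHOLD)
```

    A CI job could call `compare_runs` after generating outputs for both versions and block the merge until someone signs off whenever `needs_human_review` is true.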

    A/B Testing Prompts

    For critical prompts, we support A/B testing where a percentage of traffic uses the new prompt while the rest uses the current version. This catches issues that test datasets miss, because live traffic includes inputs the dataset does not cover.
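
    One common way to split traffic is deterministic hash bucketing, sketched below. This is an assumed technique for illustration, not necessarily how our router works; `assign_variant`, the experiment name, and the variant labels are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, rollout_percent: int) -> str:
    """Return "candidate" for roughly rollout_percent% of users, else "current"."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    # Hash the experiment name together with the user id so that each
    # experiment gets an independent, stable split of the same users.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "candidate" if bucket < rollout_percent else "current"
```

    Hashing rather than random sampling means a given user sees the same prompt version on every request, which keeps the experience consistent and makes logged results easier to attribute.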

    Prompt Observability

    Every prompt execution is logged with the prompt version, input, output, latency, and token usage. This makes it easy to trace issues back to specific prompt versions and understand performance trends.
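
    The shape of such a log record can be sketched with the standard library alone. This example uses Python's `logging` module as a stand-in and does not show the Langfuse API; `run_logged` and the `call_model` callback (assumed to return the output text and a total token count) are hypothetical.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("prompt_executions")


def run_logged(
    prompt_name: str,
    prompt_version: str,
    user_input: str,
    call_model: Callable[[str], tuple[str, int]],
) -> str:
    """Call the model and emit one structured log record for the execution.

    call_model is a hypothetical client wrapper returning (output, total_tokens).
    """
    started = time.perf_counter()
    output, total_tokens = call_model(user_input)
    latency_ms = (time.perf_counter() - started) * 1000
    logger.info(json.dumps({
        "prompt_name": prompt_name,
        "prompt_version": prompt_version,
        "input": user_input,
        "output": output,
        "latency_ms": round(latency_ms, 2),
        "total_tokens": total_tokens,
    }))
    return output
```

    With the version recorded on every execution, filtering logs by `prompt_version` is enough to compare latency or token usage before and after a prompt change.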

    Tools We Use

    We use Langfuse for observability, custom scripts for evaluation pipelines, and Git for version control. We evaluated dedicated prompt management platforms such as PromptLayer and Humanloop, but found that a simple file-based approach with good tooling is more flexible.

    Frequently Asked Questions

    How should AI prompts be managed?

    Treat prompts as infrastructure: store them separately from code with version control, run evaluation pipelines on every change, support A/B testing, and log every execution with observability tooling.

    What tools are best for prompt management?

    Langfuse for observability, Git for version control, and custom evaluation scripts work well. File-based approaches with good tooling are often more flexible than dedicated platforms like PromptLayer.

    How do you test prompt changes?

    Run the new prompt against a test dataset and compare results to the previous version. Flag regressions automatically and require human review for changes affecting more than 5% of outputs.

    prompts
    prompt-engineering
    devops
    testing
    observability