Same DevOps, new gremlins | ustwo engineering blog

Running AI in production sounds like a whole new discipline. It isn't. It's the same DevOps loop: ship, watch, set targets, respond. What's wider now is the set of things that can go wrong inside each step.

This applies whether AI is the core of your product or just one feature in a bigger system, and whether you're calling an externally hosted model or running your own.

Everything Google's Site Reliability Engineering book describes about running software in production still applies. It just has to handle systems that didn't exist when it was written. Here's where things have changed the most.

The same four DevOps stages, with the artefacts and signals each one now has to carry

What DevOps is, and what it has never been

DevOps is a cultural and technical practice for shipping software you can stand behind. Monitoring, change management, and blameless postmortems all sit under it.

It's been adapting to new kinds of systems for nearly two decades. Error budgets were popularised by Google's Site Reliability Engineering (SRE) book in 2016. GitOps came with Kubernetes. Platform engineering came with Team Topologies. With each wave, the same disciplines had to stretch to cover a new class of artefact.

AI is the latest of those waves. In the post-ChatGPT era, SRE, GitOps, and the rest still apply. What has changed is the artefacts: prompts, model versions, and search indexes, not just code.

Shipping change

The classic CI/CD pipeline tests, builds, lints, scans, and ships. None of that goes away when you add a model to a product, whether it's Spotify's AI-generated personal podcasts, Notion's Q&A over your own docs, or Stripe's plain-English-to-SQL assistant. What changes is what gets shipped. It now includes a system prompt, the model it calls, and a search index. None of these are code in the sense the pipeline expects.

The pipeline was never designed to test them. A linter won't catch a change that makes the model's outputs worse, even though subtle formatting changes can swing accuracy by up to 76 percentage points. A unit test won't notice when you update to a new model version, even though model updates can break things that were working fine. Standard checks can't test the search index either. In one study, just changing how documents are divided up for search moved correct answers from 13% to 50%.

Three adaptations have stuck.

Evals are automated tests for AI behaviour. They check whether the model's output is good enough before a change reaches production.

They usually run in two places. Offline evals run in CI against a fixed set of test inputs with known expected answers. The output is then scored using pass/fail rules, a numeric scale, or another model acting as a judge. Online evals run against sampled live traffic to catch issues that only appear in production.
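A minimal sketch of the offline half, assuming a tiny golden set and a substring pass/fail rule. `EvalCase`, `GOLDEN_SET`, `run_offline_eval`, and `fake_model` are illustrative names, not part of any tool mentioned here, and `fake_model` stands in for the real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # substring the answer must contain to pass


# Hypothetical golden set; a real one would live in version control.
GOLDEN_SET = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is 2 + 2?", "4"),
]


def run_offline_eval(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the expected answer."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        1 for case in cases
        if case.expected.lower() in generate(case.prompt).lower()
    )
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    """Stand-in for the real model call, so the sketch runs without a provider."""
    return "The capital of France is Paris." if "France" in prompt else "I'm not sure."


pass_rate = run_offline_eval(fake_model, GOLDEN_SET)
```

A substring rule is the cheapest scorer. A numeric scale or a model judge would replace the comparison inside `run_offline_eval` and leave the rest of the loop the same.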

For systems that use retrieval, evals often need to check two things separately. First, did the system retrieve the right documents and ignore the irrelevant ones? Second, did the model answer based on that retrieved context, avoid making things up, and return the response in the expected format?
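The two checks can be scored separately. The sketch below assumes each test case lists the document IDs a human marked as relevant, and that answers carry the IDs they cite. Both function names are hypothetical, and the citation check is only a crude proxy for "answered from the retrieved context".

```python
def retrieval_precision_recall(
    retrieved: list[str], relevant: set[str]
) -> tuple[float, float]:
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


def cites_only_retrieved(cited: list[str], retrieved: list[str]) -> bool:
    """Crude grounding check: every cited document must have been retrieved."""
    return set(cited) <= set(retrieved)


# Illustrative case: four documents retrieved, three marked relevant by a human.
retrieved_ids = ["doc-1", "doc-2", "doc-7", "doc-9"]
relevant_ids = {"doc-1", "doc-2", "doc-3"}
precision, recall = retrieval_precision_recall(retrieved_ids, relevant_ids)
```

Keeping the two scores apart shows whether a bad answer came from the search step or from the model.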

Ideally, evals would run on every pull request. In practice, they can become expensive at scale, so teams often run them on the changes most likely to affect behaviour: model swaps, prompt updates, and retrieval configuration changes.
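One way to make that selection is to look at which files a pull request touches. This is a sketch under an assumed repository layout: the patterns in `BEHAVIOUR_PATHS` are made up for illustration, and the list of changed files would come from the CI system.

```python
from fnmatch import fnmatch

# Hypothetical repo layout: paths whose changes are most likely to alter model behaviour.
BEHAVIOUR_PATHS = ["prompts/*", "config/model.yaml", "retrieval/*"]


def should_run_evals(changed_files: list[str]) -> bool:
    """True if any changed file matches a behaviour-affecting path pattern."""
    return any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in BEHAVIOUR_PATHS
    )
```

A pull request that only touches `README.md` would skip the eval step. One that edits anything under `prompts/` would run it.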

GitHub's Copilot team, for example, runs more than four thousand offline tests before shipping a model. They also run over a thousand technical questions through Copilot Chat, with complex answers scored by a model judge. Tools like Promptfoo, Inspect, and Langfuse can plug this kind of evaluation into a CI step today.

Versioning the things that aren't code. Prompts, the choice of model, search configurations, and scoring criteria all change behaviour when they're updated. They go in version control and ship through pull requests, and each change runs the evals. DoorDash describes a loop where engineers iterate on prompts against datasets until the eval scores are good enough to ship.
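To make those versions visible outside the repository, one option is to hash everything that shapes behaviour into a single identifier and attach it to eval results and logs. This is an illustrative sketch, not a practice the teams above describe; `behaviour_fingerprint` and its parameters are hypothetical.

```python
import hashlib
import json


def behaviour_fingerprint(prompt: str, model: str, retrieval_config: dict) -> str:
    """Hash everything that shapes model behaviour into one short, stable identifier."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "retrieval": retrieval_config},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

If two runs share a fingerprint, a difference in their scores did not come from the prompt, the model name, or the retrieval settings.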

Checking more than tests before a release. The deploy decision used to be just "does this pass tests." Now it weighs eval scores, cost, latency, and behavioural drift. A prompt change that doubles token cost shows up in the pull request like any other regression. The subtle issues are the hardest to catch: the model starting to ignore the expected output format, refusing more requests than it used to, or shifting tone. A deploy gate ties the signals together:

# evals/deploy_gate.py
from langfuse import Langfuse

langfuse = Langfuse()

# Get the production version of the system prompt
prompt = langfuse.get_prompt("assistant-system")

# Get the test inputs with known correct answers
dataset = langfuse.get_dataset("assistant-golden")

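# Run the model on each test, then score each answer with another model acting as the grader.
# score_responses is this project's own helper, not part of the Langfuse SDK. It is assumed
# to return an object whose average_score is between 0 and 1.
results = score_responses(prompt, dataset, grader="assistant-criteria")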

# Stop the deploy if the average score is too low
if results.average_score < 0.85:
    raise SystemExit("Score below threshold, blocking deploy")

A Langfuse-backed deploy gate.
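The gate above checks one signal. A sketch of a gate that also weighs cost, latency, and refusals might look like the following. Only the 0.85 eval threshold comes from the example above. The other limits, and the `ReleaseMetrics` and `gate_failures` names, are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseMetrics:
    eval_score: float         # average eval score, 0 to 1
    cost_per_call_usd: float  # average cost per request
    p95_latency_ms: float
    refusal_rate: float       # share of requests the model declined


def gate_failures(baseline: ReleaseMetrics, candidate: ReleaseMetrics) -> list[str]:
    """Return reasons to block the deploy; an empty list means the gate passes."""
    failures = []
    if candidate.eval_score < 0.85:
        failures.append("eval score below 0.85")
    if candidate.cost_per_call_usd > 1.5 * baseline.cost_per_call_usd:
        failures.append("cost per call up more than 50%")
    if candidate.p95_latency_ms > 1.2 * baseline.p95_latency_ms:
        failures.append("p95 latency up more than 20%")
    if candidate.refusal_rate > baseline.refusal_rate + 0.02:
        failures.append("refusal rate up more than 2 points")
    return failures


# A prompt change that keeps quality but doubles cost per call.
baseline = ReleaseMetrics(0.90, 0.010, 800.0, 0.03)
candidate = ReleaseMetrics(0.91, 0.020, 820.0, 0.03)
failures = gate_failures(baseline, candidate)
```

Returning every failed check, rather than stopping at the first, lets the pull request show all the regressions at once.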

Watching production

Traditional observability catches things that fail loudly: server errors, latency spikes, error rates climbing. The trouble is that AI systems can pass every one of those checks while quietly returning wrong answers. If your model starts saying "I can't help with that" to 8% of queries when last week it was 3%, none of your existing alerts will fire.
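A refusal-rate check is one of the simpler signals to add. The sketch below assumes refusals can be spotted by phrase matching, which is crude. `REFUSAL_MARKERS` and the three-point alert margin are illustrative choices.

```python
REFUSAL_MARKERS = ("i can't help with that", "i cannot help with that")


def refusal_rate(responses: list[str]) -> float:
    """Share of responses that contain a known refusal phrase."""
    if not responses:
        return 0.0
    refused = sum(
        1 for text in responses
        if any(marker in text.lower() for marker in REFUSAL_MARKERS)
    )
    return refused / len(responses)


def refusal_alert(current: float, baseline: float, max_increase: float = 0.03) -> bool:
    """Fire when the refusal rate rises more than max_increase above the baseline."""
    return current - baseline > max_increase
```

With last week at 3% and this week at 8%, `refusal_alert(0.08, 0.03)` fires, while a drift from 3% to 4% would not.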

The signals that matter look different. The list now includes how many tokens the model processes per call, latency, how often it declines to answer, whether the output matches the expected format, and quality scores from sampled live traffic. The observability layer also has to catch the changes nobody made: a model provider quietly updating the model itself, the search index drifting because the underlying content changed, the slow decline the automated tests didn't catch.

When one of those signals moves, a trace shows where in the pipeline things went wrong. A trace records how one request moves through the system. For an AI feature, that now includes more steps: building the prompt, retrieving relevant documents, calling tools or APIs, the model call itself, and parsing the response. Tools like Phoenix and OpenLLMetry sit on top of OpenTelemetry and add AI-specific tracking. We've covered the foundations elsewhere on this blog.
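To show the shape of such a trace, here is a standard-library stand-in that records one timed span per step. It is not the OpenTelemetry API. The `Trace` class, the step names, and the attributes are all hypothetical.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects one timed span per pipeline step for a single request."""

    def __init__(self) -> None:
        self.spans: list[dict] = []

    @contextmanager
    def span(self, name: str, **attributes):
        start = time.perf_counter()
        try:
            yield attributes  # the step can add attributes while it runs
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append({"name": name, "duration_ms": duration_ms, **attributes})


trace = Trace()
with trace.span("build_prompt"):
    pass  # assemble the system prompt and user input here
with trace.span("retrieve") as attrs:
    attrs["documents_returned"] = 3
with trace.span("model_call", model="example-model") as attrs:
    attrs["output_tokens"] = 42
with trace.span("parse_response"):
    pass  # validate the output format here
```

Because each step is its own span, a slow or failing request points at retrieval, the model call, or parsing rather than at "the AI feature" as a whole.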

Two teams have written about catching this kind of failure. LinkedIn found their YAML outputs failing around 10% of the time. A parser that fixes common formatting mistakes brought that to roughly 0.01%. Netflix's payments ML team had an alert flag their routing model consistently reducing traffic to a particular route every Tuesday. Their explanation system, a tool that surfaces which inputs had the most influence on a given prediction, traced it to two features the model had learned from a previous Tuesday outage.
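The repair idea can be sketched in a few lines. This example uses JSON rather than YAML so it runs on the standard library alone, and it is not LinkedIn's parser. `parse_model_json` and the two repairs it applies are illustrative.

```python
import json
import re


def parse_model_json(raw: str) -> dict:
    """Parse model output as JSON, repairing a few common formatting mistakes first."""
    text = raw.strip()
    # Keep only the outermost {...}; this drops Markdown fences and surrounding chatter.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    text = text[start : end + 1]
    # Remove trailing commas before a closing brace or bracket.
    # Crude: this would also alter a ",}" sequence inside a string value.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)
```

Anything the repairs can't fix still raises an error, so the failure rate stays measurable instead of being hidden.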

Catching the regression is only half the job. You also need to know how much degradation is worth acting on.

What counts as good enough

That threshold has a name: the SLO (service level objective, the target performance the team sets for the system). Traditional SLOs cover availability and latency. "99.9% of requests succeed" and "p99 under 500ms" each give you a clean yes/no. AI quality doesn't work that way, because there's no single right answer to measure against.

Teams target ranges rather than exact thresholds. A team might require a quality score above 0.8 on 95% of answers, keep the rate at which the model invents facts below 2%, or cap how often the model declines to answer. Each dimension gets an error budget: an agreed tolerance for how often the system can fall short. That budget determines how fast the team can ship. Burn it too fast and they stop shipping until it recovers.
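The budget arithmetic is short enough to write down. The sketch assumes each sampled answer is labelled good or bad for one dimension over a fixed window. `budget_consumed` is a hypothetical helper, and the 2% rate matches the invented-facts example above.

```python
def budget_consumed(bad_events: int, total_events: int, max_bad_rate: float) -> float:
    """Fraction of the error budget used so far; 1.0 means the budget is exhausted.

    max_bad_rate is the share of events allowed to be bad, e.g. 0.02 for
    "fewer than 2% of answers contain an invented fact".
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = max_bad_rate * total_events
    if allowed_bad == 0:
        return float("inf") if bad_events else 0.0
    return bad_events / allowed_bad
```

With 10,000 sampled answers and a 2% limit, the budget is 200 bad answers. If 50 of them contain invented facts, about a quarter of the budget is spent, and at 200 the team stops shipping.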

Google's SRE practice puts this plainly: 100% is the wrong reliability target for basically everything, because users can't tell the difference between 100% and 99.999% availability. Other things in the path (their device, their network) are already less reliable than that. The error-budget model is how you put a number on that tolerance.

Putting numbers on quality forces the conversation into the open. "We're going to spend a quarter of our hallucination budget on this prompt change" is a claim you can push back on in a code review. "We think it'll be fine" isn't. When the quality score drifts below the SLO threshold, someone gets paged to look, and the team moves into incident response.

Responding when something breaks

With traditional code, the first question after an incident is "what commit broke this?" The list of suspects is short. With AI systems it's wider:

Recent commits
An unannounced provider update that changed how the model responds
A prompt change that made things worse without the tests catching it
A search index that's drifted because the underlying content changed
Users asking things the system wasn't built for

Playbooks are the guides on-call engineers use during an incident. For AI systems they need a triage question first: which of the above caused this? Then the diagnostic steps: replay recent traffic against the previous model version, run the tests against a fresh production sample, compare this week's system prompts against last week's.
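Two of those diagnostic steps can be sketched with plain functions. The model and scorer callables are placeholders for whatever the team already uses, and `replay_comparison` and `prompt_diff` are hypothetical names.

```python
import difflib
from typing import Callable

Model = Callable[[str], str]
Scorer = Callable[[str, str], float]  # (prompt, answer) -> score between 0 and 1


def replay_comparison(
    prompts: list[str], previous: Model, current: Model, score: Scorer
) -> dict:
    """Replay the same prompts through both versions and compare average scores."""
    if not prompts:
        raise ValueError("no prompts to replay")
    prev_avg = sum(score(p, previous(p)) for p in prompts) / len(prompts)
    curr_avg = sum(score(p, current(p)) for p in prompts) / len(prompts)
    return {"previous": prev_avg, "current": curr_avg, "delta": curr_avg - prev_avg}


def prompt_diff(last_week: str, this_week: str) -> list[str]:
    """Line-by-line diff of two versions of a system prompt."""
    return list(
        difflib.unified_diff(
            last_week.splitlines(),
            this_week.splitlines(),
            fromfile="last_week",
            tofile="this_week",
            lineterm="",
        )
    )
```

A clearly negative `delta` on the same traffic points at the model version. An empty delta with a non-empty prompt diff points the triage elsewhere.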

Postmortems get harder for the same reason. A postmortem is what the team writes up after an incident: what happened, why, and what to change. When traditional software breaks, there's usually a concrete artefact to work from: a stack trace (the chain of function calls that led to the crash). A quality regression in an AI system usually leaves none: nothing crashed, the answers just got worse. When there's no stack trace, the follow-ups look different: add a test case, broaden the test inputs, refine the scoring criteria, or tighten the SLO. The learning still happens, just somewhere different in the cycle.

Security widens the same way

The expanded artefacts bring new threats too. These don't come from DevOps. They exist because the model treats instructions and content the same way. Any system that uses an LLM picks up risks like prompt injection, retrieval poisoning, tool misuse, and context leakage between users. OWASP's LLM Top 10 catalogues them. The DevOps loop has to catch them too: scanned in CI, watched in production, gated at release, responded to like any other incident.

Wrapping up

The AI systems that hold up in production aren't the ones with the smartest model behind them. They're the ones where the team did the boring work: added an eval step to CI, traced each model call as its own step, added a hallucination target to the SLO alongside latency, and updated the playbook so the on-call engineer knows to check whether the provider quietly swapped the model. None of it matters unless it shows up in what users actually do with the system.

None of that is a new discipline. It's the same DevOps, applied to a much wider set of artefacts than it was originally built for.