Designing AI Systems That Scale Beyond the Prototype
Beyond training accuracy — what production ML actually demands from your system.
Training an accurate model is the easy part. Turning it into a system that serves real users reliably, cheaply, and maintainably — that's the engineering challenge nobody teaches you.
According to multiple industry surveys, 70–85% of ML models that are built never make it to production. Of those that do, a significant fraction are quietly deprecated within a year because they're too expensive to run, too brittle to maintain, or have degraded silently without anyone noticing.
The gap between a Jupyter notebook with 92% validation accuracy and a production AI system that reliably serves 100,000 requests per day is not a gap of model quality. It's a gap of systems thinking.
"A model is a function. A production AI system is a pipeline, an API, a data contract, a monitoring loop, and a team — with a model somewhere in the middle."
Model vs System: What Changes
In research, a model is evaluated on a fixed, clean, labeled dataset. You control everything: the data distribution, the evaluation metric, the runtime. A bad result means you retrain.
In production, the world is the dataset. Data arrives continuously, unlabeled, from real users who behave differently than your training distribution. Your evaluation metric is now business KPIs — not accuracy but click-through rate, churn reduction, or cost per decision. A bad result means user impact, revenue loss, or regulatory exposure.
| Research | Production |
|---|---|
| Static, clean training set | Continuous, unknown data stream |
| Optimise for accuracy metric | Optimise for business outcomes |
| Single-shot evaluation | 24/7 latency SLA |
| Notebook → result → publish | API → pipeline → model → log |
| You control all inputs | Users send anything |
The Anatomy of a Production AI System
A production AI system is not a model with a REST endpoint bolted on. It's composed of distinct, interacting subsystems — each with its own failure modes and scaling characteristics.
Each box is a separate engineering concern. The model is only one box — and often not the hardest one to build or maintain.
The Data Pipeline: Where Most Systems Actually Break
The most common reason ML models fail in production isn't model error — it's data pipeline error. The model receives input that doesn't match its training distribution, produces garbage output, and nobody notices for days because there are no alerts on input quality.
Training vs Serving Skew
Training-serving skew occurs when the features computed at training time differ — even slightly — from those computed at serving time. A feature computed as "average of last 7 days" in training might be computed as "average of last 7 calendar days" in serving, introducing a systematic difference on weekends. The model silently underperforms on a class of inputs and no accuracy metric in the training pipeline catches it.
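To make the failure concrete, here is a minimal, self-contained sketch (the session log, timestamps, and feature name are invented for illustration): the same "average of last 7 days" feature, computed with a trailing-hours window in training and a calendar-day window in serving, disagrees for the same user at the same moment.

```python
from datetime import datetime, timedelta

# Hypothetical session log: (timestamp, session duration in minutes).
sessions = [
    (datetime(2024, 3, 1, 13, 0), 10.0),
    (datetime(2024, 3, 3, 14, 0), 20.0),
    (datetime(2024, 3, 7, 18, 0), 30.0),
    (datetime(2024, 3, 8, 10, 0), 40.0),
]
now = datetime(2024, 3, 8, 12, 0)

def mean(xs):
    return sum(xs) / len(xs)

# Training pipeline: "last 7 days" = trailing 168 hours from now.
trailing = [d for ts, d in sessions if ts >= now - timedelta(days=7)]
train_feature = mean(trailing)   # includes the 2024-03-01 13:00 session

# Serving pipeline: "last 7 days" = last 7 *calendar* days.
midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
cutoff = midnight - timedelta(days=6)
calendar = [d for ts, d in sessions if ts >= cutoff]
serve_feature = mean(calendar)   # drops that session: systematic skew

print(train_feature, serve_feature)  # 25.0 30.0
```

No accuracy metric computed in the training pipeline can see this: both values are "correct" under their own definition, and the discrepancy only exists at serving time.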
The Feature Store
A feature store solves training-serving skew at scale. Features are computed once, stored in a low-latency key-value store (Redis, DynamoDB), and retrieved identically by both training jobs and serving infrastructure. The training pipeline reads from the same feature store as the API — the computation path converges.
```python
# Training: read historical features for model training
features_df = feature_store.get_historical_features(
    entity_df=entity_df,  # user IDs + timestamps
    feature_refs=["user_7d_avg_session", "user_device_type"],
)

# Serving: read online features in real-time (<5ms)
features = feature_store.get_online_features(
    entity_rows=[{"user_id": user_id}],
    feature_refs=["user_7d_avg_session", "user_device_type"],
)
# Same feature refs, same values — skew eliminated
```
The Inference Pipeline: Latency Is a Feature
A model that takes 800ms to return a prediction is useless in a real-time recommendation system where the page load SLA is 200ms. Latency isn't an implementation detail — it's a product requirement, and it determines what model architectures are even viable.
Where Latency Comes From
In a naive implementation, a single inference request might involve: a database read (feature lookup), model forward pass, a post-processing transformation, and a cache write. Each adds latency, and they're often done sequentially when they could be parallelised or eliminated.
(Figure: per-request latency breakdown: feature lookup (Redis, in-memory) → model forward pass (ONNX / GPU batch) → post-processing + cache write.)
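As a sketch of the parallelisation point above (the function names and sleep durations are stand-ins, not a real serving stack): user and item feature lookups are independent, so they can be issued concurrently with asyncio.gather, and only the forward pass gates the response.

```python
import asyncio

# Stand-ins for two independent feature-store reads (hypothetical latencies).
async def fetch_user_features(user_id):
    await asyncio.sleep(0.005)
    return {"user_7d_avg_session": 12.3}

async def fetch_item_features(item_id):
    await asyncio.sleep(0.005)
    return {"item_ctr": 0.04}

async def model_forward(features):
    await asyncio.sleep(0.002)  # stand-in for the forward pass
    return 0.91

async def handle_request(user_id, item_id):
    # The two lookups are independent: run them concurrently rather
    # than awaiting one after the other.
    user_f, item_f = await asyncio.gather(
        fetch_user_features(user_id),
        fetch_item_features(item_id),
    )
    return await model_forward({**user_f, **item_f})

score = asyncio.run(handle_request("u-1", "sku-9"))
print(score)  # 0.91
```

Sequentially, the two lookups would cost their sum; gathered, they cost only the slower of the two.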
Optimization Levers
- ONNX export: Converting a PyTorch model to ONNX and running it with ONNX Runtime typically yields 2–4× faster CPU inference than raw PyTorch, because the runtime applies graph optimizations and uses optimized BLAS kernels.
- Model quantization: Reducing weights from float32 to int8 cuts model size by 4× and speeds up inference — with usually less than 1% accuracy drop for classification tasks.
- Request batching: Grouping concurrent requests into a single batch improves GPU utilization dramatically. A GPU forward pass over a batch of 32 requests takes nearly the same wall-clock time as a pass over a single request.
- Prediction caching: For inputs that repeat (same product ID, same user segment), caching model outputs in Redis with a short TTL eliminates redundant inference entirely.
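A minimal sketch of the caching lever, using an in-process TTL cache as a stand-in for Redis (the class, function names, and dummy score are hypothetical; in production you would use redis.setex / redis.get with the same pattern):

```python
import time

class TTLCache:
    """Toy in-memory key-value store with per-entry expiry."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expiry_time, value)

    def get(self, key):
        hit = self.store.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]
        return None  # missing or expired

    def set(self, key, value):
        self.store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=60)
forward_passes = 0

def predict(product_id):
    global forward_passes
    cached = cache.get(product_id)
    if cached is not None:
        return cached          # redundant inference eliminated
    forward_passes += 1        # stand-in for the expensive forward pass
    score = 0.87               # dummy model output
    cache.set(product_id, score)
    return score

predict("sku-123")
predict("sku-123")
print(forward_passes)  # 1: the second request never touched the model
```

A short TTL bounds staleness: a 60-second entry means a repriced product is scored fresh within a minute while absorbing every duplicate request in between.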
Why ML Models Fail in Production
The causes are remarkably consistent across companies and domains. Understanding them is more valuable than knowing how to fix any one of them, because they reveal the systematic gaps between research and production thinking.
| Failure Mode | Root Cause | Detection |
|---|---|---|
| Data drift | Input distribution shifts over time (seasonality, user behaviour change) | Monitor input feature statistics vs training baseline |
| Label shift | The relationship between inputs and outputs changes (e.g., fraud patterns evolve) | Business metric monitoring, delayed label collection |
| Training-serving skew | Features computed differently at train vs serve time | Log live feature values, compare to training distribution |
| Feedback loops | Model predictions affect future training data (echo chamber) | Causal analysis, exploration strategies (ε-greedy) |
| Silent errors | Bad predictions returned with high confidence, no alerting | Confidence calibration, anomaly detection on outputs |
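The ε-greedy mitigation named in the feedback-loops row can be sketched in a few lines (a toy serving policy, not a production bandit; the scores dict is invented):

```python
import random

def epsilon_greedy(scores, epsilon=0.05, rng=random):
    """Serve the model's top-scored item most of the time, but with
    probability epsilon serve a uniformly random item, so the logged
    training data still covers items the current model ranks low."""
    items = list(scores)
    if rng.random() < epsilon:
        return rng.choice(items)   # explore: break the echo chamber
    return max(items, key=scores.get)  # exploit: serve the top item

print(epsilon_greedy({"a": 0.2, "b": 0.9}, epsilon=0.0))  # b
```

Without the exploration branch, tomorrow's training data contains only items today's model already favoured, and the loop reinforces itself.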
Monitoring: The Feedback Loop That Keeps the System Honest
Software systems are monitored on uptime, latency, and error rate. AI systems need an additional dimension: prediction quality over time. A server that responds in 10ms but returns increasingly wrong predictions is not healthy — even if your SRE dashboard says it is.
What to Monitor
- Input data quality — null rates, value ranges, distribution of categorical features. Alert when a feature's average drifts beyond 2σ from the training baseline.
- Prediction distribution — the distribution of model outputs. A recommendation model that suddenly suggests the same 5 items to everyone has a problem that won't show in latency metrics.
- Business proxy metrics — click-through rate, conversion, churn. These lag by hours or days but are the ground truth for whether the model is serving its purpose.
- Ground truth when available — for systems with delayed labels (fraud detection, churn prediction), collect outcomes and compute rolling accuracy metrics weekly.
```python
from datetime import datetime
from scipy.stats import ks_2samp as ks_test

# Log every prediction with its input features
prediction_log = {
    "timestamp": datetime.utcnow().isoformat(),
    "model_version": "v2.3.1",
    "input_features": features,    # raw values
    "prediction": output,
    "confidence": float(prob.max()),
    "request_id": request_id,      # for joining labels later
}

# Nightly job: compare live input distribution vs training
for feature in monitored_features:
    drift_score = ks_test(live[feature], baseline[feature])
    if drift_score.pvalue < 0.05:
        alert(f"Drift detected in {feature}: p={drift_score.pvalue:.3f}")
```
Model Versioning and Safe Rollout
Deploying a new model version is riskier than deploying new application code. A bug in a route handler shows up immediately. A regression that only affects a specific demographic may take weeks of business-metric analysis to surface.
This is why shadow mode and canary deployments are essential for ML systems specifically:
- Shadow mode: The new model runs in parallel with the old one. Both make predictions; only the old model's results are served. Compare outputs offline to find regressions before they affect users.
- Canary rollout: Route 5% of traffic to the new model. Monitor business metrics for 48–72 hours. If no regression, increase to 20%, 50%, 100%. Each step has an automated rollback trigger if the metrics drop.
- A/B testing: Split user populations by a consistent hash of user ID (not request ID), so the same user always gets the same model. Compute per-model business outcomes over a statistically valid period before deciding.
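The consistent-hash split can be sketched like this (the function name and the 5% default are illustrative):

```python
import hashlib

def assign_model(user_id: str, treatment_fraction: float = 0.05) -> str:
    # Hash the *user* ID (not the request ID) so every request from the
    # same user routes to the same variant, across processes and restarts.
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "candidate" if bucket < treatment_fraction else "control"

# Same user, same answer, every time:
print(assign_model("user-42") == assign_model("user-42"))  # True
```

Because assignment is a pure function of the user ID, no assignment table is needed, and ramping the canary from 5% to 20% only moves the threshold: users already in the treatment bucket stay there.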
The System Is the Product
An ML model is a component. A production AI system is an engineering discipline. The practitioners who bridge this gap — who can reason about latency SLAs as readily as loss functions, who design data pipelines with the same rigour as model architectures — are the ones who build AI products that actually matter.
The model is almost never the hard part. The hard part is building the pipeline that keeps it honest, the infrastructure that serves it cheaply, the monitoring that catches it silently failing, and the deployment process that lets you update it without fear. That's systems thinking applied to machine learning — and it's what separates a prototype from a product.
Key Takeaways
- 70-85% of ML models never reach production — the gap is systems thinking, not model accuracy.
- Training-serving skew happens when features are computed differently at train vs serve time; a feature store eliminates it.
- Silent degradation is the most dangerous failure mode — drift detection alerts must be in from day one.
- ONNX export + int8 quantization typically gives 3–6× inference speedup with under 1% accuracy drop on classification tasks.
- Shadow mode before canary — run the new model in parallel first, compare outputs offline, then gradually shift live traffic.