The Four Serving Patterns
ML serving is not one problem—it's four distinct problems with fundamentally different constraints. Picking the wrong pattern is one of the most common and costly mistakes in production ML.
| Pattern | Latency | Throughput | Complexity |
|---|---|---|---|
| Online | < 100ms | Low–Medium | Medium |
| Batch | Minutes–Hours | Very High | Low |
| Streaming | Seconds | High | High |
| Embedded | ~1–20ms (no network hop) | Low | Low |
Pattern 1: Online (Synchronous) Serving
A client sends a request and waits for a prediction. The model runs on the hot path.
Architecture
Client → Load Balancer → Model Server (replicas) → Feature Store → Model
↓
Prediction Cache
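The prediction cache in the diagram can be sketched as a small time-to-live cache that sits in front of feature fetch and inference. This is a minimal in-process illustration only: the class name `TTLPredictionCache`, its parameters, and the in-memory dict are assumptions made for the example, and a production system would more likely use a shared store such as Redis.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class TTLPredictionCache:
    """Hypothetical in-process cache mapping a key to (score, expiry time)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the cache can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, float]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], float]) -> float:
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]  # fresh entry: skip feature fetch and inference
        score = compute()  # miss or expired: run the full prediction path
        self._entries[key] = (score, now + self.ttl_seconds)
        return score
```

A short TTL keeps scores reasonably fresh while absorbing repeated requests for the same entity; the right value depends on how quickly the underlying features change.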
When to use
- User-facing predictions (search ranking, recommendations, fraud detection)
- Latency SLA < 200ms
- Low to medium QPS (< 100K requests/second without horizontal scaling)
Implementation with FastAPI (traditional ML models)
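# For traditional ML models
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel
# FeatureStoreClient, FEATURE_LIST, and MODEL_VERSION are placeholders:
# import or define them for your own feature store and model registry.
class PredictRequest(BaseModel):
    user_id: str  # assumed field type; adjust to your entity key
app = FastAPI()
model = joblib.load("model.pkl")  # loaded once at startup, not per request
feature_client = FeatureStoreClient()
@app.post("/predict")
async def predict(request: PredictRequest):
    # 1. Fetch features (often the bottleneck, so use async)
    features = await feature_client.get_features(
        entity_id=request.user_id,
        feature_names=FEATURE_LIST,
    )
    # 2. Build feature vector (one row, columns in FEATURE_LIST order)
    X = np.array([[features[f] for f in FEATURE_LIST]])
    # 3. Predict (probability of the positive class)
    prediction = model.predict_proba(X)[0][1]
    return {"score": float(prediction), "model_version": MODEL_VERSION}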
Latency budget breakdown (typical fraud detection)
Total budget: 100ms
Feature fetch: 40ms (network to feature store)
Model inference: 15ms (XGBoost or small NN)
Overhead: 10ms (serialization, routing)
Buffer: 35ms
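One way to keep a budget like this honest is to time each stage and compare it against its allocation. The sketch below is an illustration under assumptions: `StageTimer`, `BUDGET_MS`, and the stage names are hypothetical and simply mirror the breakdown above.

```python
import time
from contextlib import contextmanager

# Per-stage allocations from the breakdown above, in milliseconds.
BUDGET_MS = {"feature_fetch": 40, "inference": 15, "overhead": 10}
TOTAL_BUDGET_MS = 100


class StageTimer:
    """Hypothetical helper that records wall-clock time per named stage."""

    def __init__(self):
        self.elapsed_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.elapsed_ms[name] = (time.perf_counter() - start) * 1000.0

    def over_budget(self):
        # Stages without an allocation are never reported as over budget.
        return [
            name
            for name, ms in self.elapsed_ms.items()
            if ms > BUDGET_MS.get(name, float("inf"))
        ]


def remaining_buffer_ms(budget=BUDGET_MS, total=TOTAL_BUDGET_MS):
    """Buffer left after all allocated stages: 100 - (40 + 15 + 10) = 35."""
    return total - sum(budget.values())
```

Inside a request handler, each step would be wrapped as `with timer.stage("feature_fetch"): ...`, and `timer.over_budget()` could feed a metric or log line.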
Scaling for traffic spikes
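Use horizontal pod autoscaling. The manifest below scales on CPU utilization, which the HPA supports as a built-in resource metric; scaling on GPU utilization (for neural models) or request queue depth requires exposing those values as custom or external metrics: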
# k8s HPA for model server
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
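To reason about how this autoscaler reacts to a spike, it helps to work through the scaling rule the Kubernetes documentation describes: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max bounds. The function below is a simplified sketch of that rule for back-of-the-envelope planning; `desired_replicas` is a hypothetical helper, and it ignores stabilization windows and pod readiness.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 60,
    min_replicas: int = 3,
    max_replicas: int = 50,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule using the defaults from the manifest above."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 10 replicas averaging 90% CPU against a 60% target gives a ratio of 1.5, so the sketch asks for 15 replicas.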
Pattern 2: Batch Serving
Run predictions over a large dataset on a schedule. No client waiting.
Architecture
Scheduler → Batch Job → Read from data warehouse
                      → Run model in parallel
                      → Write predictions back to data warehouse / feature store
When to use
- Pre-computing scores for all users (email targeting, daily recommendations)
- High volume where online latency SLA can't be met
- Exploratory analysis or model evaluation
Implementation with Spark + MLflow
# batch_predict.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import mlflow.pyfunc
import pandas as pd
spark = SparkSession.builder.appName("batch-inference").getOrCreate()
# Load model (broadcast to all executors)
model_uri = "models:/churn-model/production"
model = mlflow.pyfunc.load_model(model_uri)
broadcast_model = spark.sparkContext.broadcast(model)
@pandas_udf("double")
def predict_udf(features: pd.Series) -> pd.Series:
    m = broadcast_model.value
    X = pd.DataFrame(features.tolist())
    return pd.Series(m.predict(X))
# Read from data warehouse
df = spark.read.parquet("s3://data-warehouse/users/features/date=2025-04-15/")
# Generate predictions
predictions = df.withColumn(
    "churn_score",
    predict_udf(df["feature_vector"]),
)
# Write back
predictions.select("user_id", "churn_score").write.mode("overwrite").parquet(
    "s3://data-warehouse/predictions/churn/date=2025-04-15/"
)
Batch throughput tuning
- Vectorization: Use NumPy/Pandas batch predict, not single-row predict in a loop
- Parallelism: Set the number of Spark partitions to 2–4× the total executor cores so every core stays busy without creating many tiny tasks
- Memory: For large models, broadcast only if model fits in executor memory; otherwise load per-partition
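The parallelism and memory rules of thumb above can be written down as two small helpers. This is a sketch under assumptions: the function names are hypothetical, and the 25% memory fraction used to decide whether a model "fits" is an illustrative threshold, not a Spark setting.

```python
def recommended_partitions(num_executors: int, cores_per_executor: int, factor: int = 3) -> int:
    """Partition count as a multiple (2 to 4) of total executor cores."""
    if not 2 <= factor <= 4:
        raise ValueError("factor should be between 2 and 4")
    return num_executors * cores_per_executor * factor


def should_broadcast(model_bytes: int, executor_memory_bytes: int, max_fraction: float = 0.25) -> bool:
    """Broadcast only if the model uses a modest share of executor memory.

    max_fraction is an assumed safety margin; tune it for your workload.
    """
    return model_bytes <= executor_memory_bytes * max_fraction
```

The partition count would typically be applied with `df.repartition(n)` before scoring; when `should_broadcast` returns `False`, loading the model once per partition is the alternative the list above describes.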
Pattern 3: Streaming (Near-Real-Time) Serving
Events flow through a message queue; predictions are computed within seconds of new data arriving.
Architecture
Event Source → Kafka → Stream Processor (Flink/Spark Streaming)
                                ↓
                Feature Computation + Model Inference
                                ↓
                Output Topic → Downstream Systems
When to use
- Fraud detection where features depend on recent events (last 5 minutes of activity)
- Content moderation (flag posts within seconds of publishing)
- Real-time personalization with session-level features
The streaming feature problem
The hardest part of streaming inference isn't the model—it's feature computation. You need:
- Point-in-time correct features: No leakage from future events
- Windowed aggregations: "Number of transactions in last 10 minutes" requires stateful processing
- Low-latency feature store: Redis or DynamoDB, not BigQuery
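To make the windowed-aggregation and point-in-time requirements concrete, here is a minimal pure-Python sketch of a per-key sliding-window count. It is an illustration, not how Flink manages state: `SlidingWindowCounter` and `featurize` are hypothetical names, and the sketch assumes events arrive in event-time order per key (a stream processor would handle late events with watermarks).

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key within the trailing window, using event time."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def count_before(self, key, event_time: float) -> int:
        """Number of earlier events for `key` still inside the window."""
        timestamps = self._events[key]
        cutoff = event_time - self.window_seconds
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()  # evict events that fell out of the window
        return len(timestamps)

    def add(self, key, event_time: float) -> None:
        self._events[key].append(event_time)


def featurize(counter: SlidingWindowCounter, key, event_time: float) -> int:
    # Read the feature first, then record the event, so the feature only
    # reflects events that happened strictly before the one being scored.
    n = counter.count_before(key, event_time)
    counter.add(key, event_time)
    return n
```

Reading before writing is what keeps the feature point-in-time correct: the transaction being scored never counts itself, and no later event can influence it.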
# Flink Python API example
# RedisFeatureStore, build_feature_vector, kafka_source, and kafka_sink are
# placeholders defined elsewhere in your project.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import ProcessFunction
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(8)
class FraudScoringFunction(ProcessFunction):
    def open(self, runtime_context):
        # Load the model once per parallel task, not once per event
        import joblib
        self.model = joblib.load("/models/fraud_v3.pkl")
        self.feature_store = RedisFeatureStore(host="redis:6379")
    def process_element(self, transaction, ctx):
        # Fetch pre-computed user features from Redis
        user_features = self.feature_store.get(transaction["user_id"])
        # Combine with transaction features
        X = build_feature_vector(transaction, user_features)
        # Cast to a plain float so the result serializes cleanly
        score = float(self.model.predict_proba([X])[0][1])
        action = "block" if score > 0.85 else "allow"
        yield {"transaction_id": transaction["id"], "action": action, "score": score}
transactions = env.from_source(kafka_source, ...)
results = transactions.process(FraudScoringFunction())
results.sink_to(kafka_sink)
env.execute("fraud-scoring")
Pattern 4: Embedded (Edge) Inference
The model runs inside the client application—no network call, no server.
When to use
- Mobile apps where latency or offline support matters
- Privacy requirements (data never leaves device)
- Cost: serving infrastructure is expensive at scale
Model formats
ONNX → Cross-platform, good for sklearn/PyTorch/TF models
TensorFlow Lite → Android/iOS, optimized for ARM
Core ML → iOS/macOS, hardware-accelerated on Apple Silicon
GGUF → LLMs on CPU/GPU via llama.cpp
ExecuTorch → PyTorch-native mobile deployment (Meta)
Exporting to ONNX
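import torch
import onnx
# SentimentClassifier is the user-defined torch.nn.Module being exported
model = SentimentClassifier()
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()
dummy_input = torch.randint(0, 32000, (1, 128))  # batch=1, seq_len=128
torch.onnx.export(
    model,
    dummy_input,
    "sentiment.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    # Mark batch and sequence length as dynamic so other shapes work at runtime
    # (assumes logits has shape (batch, num_classes))
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "logits": {0: "batch"},
    },
    opset_version=17,
)
# Verify
onnx_model = onnx.load("sentiment.onnx")
onnx.checker.check_model(onnx_model)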
Running with ONNX Runtime
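import onnxruntime as ort
import numpy as np
session = ort.InferenceSession(
    "sentiment.onnx",
    # Providers are tried in order; CPU is the fallback when CUDA is unavailable
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# tokenized_input is a torch tensor of token IDs produced by your tokenizer
# ~5–20ms on CPU for a small model
output = session.run(
    ["logits"],
    {"input_ids": tokenized_input.numpy()},
)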
Choosing the Right Pattern: Decision Tree
Is a human waiting for the result?
  YES → Is the latency SLA < 500ms?
    YES → Online serving
    NO → Can you pre-compute? → YES → Batch (pre-compute before request)
  NO → Are features time-sensitive (stale within minutes)?
    YES → Streaming
    NO → Batch (scheduled job)
Does data leaving the device create unacceptable risk or cost?
  YES → Embedded
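The same tree can be expressed as a small function, which is handy for documenting the choice in a design review. This is a sketch: `choose_pattern` and its parameter names are hypothetical, and the tree leaves one branch open (a human is waiting, the SLA is 500ms or more, and pre-computation is not possible), where this sketch falls back to online serving as an assumption.

```python
from typing import Optional


def choose_pattern(
    human_waiting: bool,
    latency_sla_ms: Optional[float] = None,
    can_precompute: bool = False,
    features_time_sensitive: bool = False,
    data_must_stay_on_device: bool = False,
) -> str:
    # The on-device question is independent of the others, so check it first.
    if data_must_stay_on_device:
        return "embedded"
    if human_waiting:
        if latency_sla_ms is not None and latency_sla_ms < 500:
            return "online"
        if can_precompute:
            return "batch (pre-computed before the request)"
        # Assumption: the tree does not cover this branch; fall back to online.
        return "online"
    if features_time_sensitive:
        return "streaming"
    return "batch (scheduled job)"
```

For instance, `choose_pattern(human_waiting=False, features_time_sensitive=True)` returns `"streaming"`.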
Hybrid Patterns
Real systems often combine patterns:
- Batch + Online: Batch computes user embeddings nightly; online ranking uses them in real-time
- Streaming + Online: Streaming updates a real-time feature (recent activity count); online model reads it
- Embedded + Online: Embedded model handles common cases; online model handles edge cases the small model is uncertain about (cascading)
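The cascading pattern in the last bullet can be sketched as a confidence check around the small model. This is an illustration under assumptions: `cascade_predict`, the callable-based model interface, and the 0.2 to 0.8 uncertainty band are all hypothetical choices, not values from a specific system.

```python
from typing import Callable, Dict, Tuple


def cascade_predict(
    features: Dict[str, float],
    local_model: Callable[[Dict[str, float]], float],
    remote_model: Callable[[Dict[str, float]], float],
    uncertainty_band: Tuple[float, float] = (0.2, 0.8),
) -> Tuple[float, str]:
    """Return (score, source), escalating only when the local score is uncertain."""
    score = local_model(features)
    low, high = uncertainty_band
    if low < score < high:
        # The embedded model is unsure: pay the network cost for the larger model.
        return remote_model(features), "online"
    return score, "embedded"
```

Widening the band sends more traffic to the online model (higher cost, better accuracy); narrowing it keeps more predictions on the device.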