ML Model Serving Patterns: Online, Batch, Streaming, and Embedded Inference

The Four Serving Patterns

ML serving is not one problem—it's four distinct problems with fundamentally different constraints. Picking the wrong pattern is one of the most common and costly mistakes in production ML.

Pattern	Latency	Throughput	Complexity
Online	< 100ms	Low–Medium	Medium
Batch	Minutes–Hours	Very High	Low
Streaming	Seconds	High	High
Embedded	Milliseconds (no network hop)	Low	Low

Pattern 1: Online (Synchronous) Serving

A client sends a request and waits for a prediction. The model runs on the hot path.

Architecture

Client → Load Balancer → Model Server (replicas) → Feature Store → Model
                                                           ↓
                                                    Prediction Cache
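
The Prediction Cache in the diagram can be sketched as a small time-to-live cache keyed by entity ID. This is an illustration only: the in-process dictionary, the class name TTLPredictionCache, and the ttl_seconds parameter are assumptions, and a production system would more likely use a shared store so that all replicas see the same entries.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLPredictionCache:
    """In-memory prediction cache with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: Dict[str, Tuple[float, float]] = {}  # key -> (expires_at, score)

    def get(self, key: str) -> Optional[float]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, score = entry
        if self._clock() >= expires_at:
            # Stale prediction: drop it so the caller recomputes.
            del self._entries[key]
            return None
        return score

    def put(self, key: str, score: float) -> None:
        self._entries[key] = (self._clock() + self._ttl, score)


def cached_predict(cache: TTLPredictionCache, key: str, compute: Callable[[], float]) -> float:
    """Return a fresh cached score if one exists, otherwise compute and store it."""
    score = cache.get(key)
    if score is None:
        score = compute()
        cache.put(key, score)
    return score
```

The TTL bounds how stale a served score can be, so it should be chosen against how quickly the underlying features change.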

When to use

User-facing predictions (search ranking, recommendations, fraud detection)
Latency SLA < 200ms
Low to medium QPS (< 100K requests/second without horizontal scaling)

Implementation with FastAPI (traditional ML model)

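# For traditional ML models
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

# Hypothetical placeholders: replace with your own feature names and version tag.
FEATURE_LIST = ["txn_count_24h", "avg_amount_7d", "account_age_days"]
MODEL_VERSION = "v1"

class PredictRequest(BaseModel):
    user_id: str

app = FastAPI()
model = joblib.load("model.pkl")
# FeatureStoreClient stands in for your feature store's async client;
# it is not a real library import.
feature_client = FeatureStoreClient()

@app.post("/predict")
async def predict(request: PredictRequest):
    # 1. Fetch features (often the bottleneck, so use an async client)
    features = await feature_client.get_features(
        entity_id=request.user_id,
        feature_names=FEATURE_LIST,
    )

    # 2. Build feature vector (one row, columns in FEATURE_LIST order)
    X = np.array([[features[f] for f in FEATURE_LIST]])

    # 3. Predict in a worker thread so CPU-bound inference does not block the event loop
    proba = await asyncio.to_thread(model.predict_proba, X)
    prediction = proba[0][1]  # probability of the positive class

    return {"score": float(prediction), "model_version": MODEL_VERSION}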

Latency budget breakdown (typical fraud detection)

Total budget:          100ms
  Feature fetch:        40ms  (network to feature store)
  Model inference:      15ms  (XGBoost or small NN)
  Overhead:             10ms  (serialization, routing)
  Buffer:               35ms
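
The budget arithmetic above can be checked mechanically. The sketch below recomputes the buffer from the stage estimates and flags stages whose measured latency exceeds their allocation; the function names and the measured values in the usage note are hypothetical.

```python
from typing import Dict, List


def remaining_buffer_ms(total_budget_ms: float, stages_ms: Dict[str, float]) -> float:
    """Return the headroom left after subtracting every stage estimate."""
    return total_budget_ms - sum(stages_ms.values())


def over_budget_stages(measured_ms: Dict[str, float], planned_ms: Dict[str, float]) -> List[str]:
    """Return the names of stages whose measured latency exceeds the plan."""
    return sorted(
        name for name, ms in measured_ms.items() if ms > planned_ms.get(name, 0.0)
    )


# Stage estimates taken from the breakdown above.
planned = {"feature_fetch": 40, "model_inference": 15, "overhead": 10}
buffer_ms = remaining_buffer_ms(100, planned)  # 100 - (40 + 15 + 10) = 35
```

For example, over_budget_stages({"feature_fetch": 55, "model_inference": 12, "overhead": 10}, planned) returns ["feature_fetch"], which points at the feature store rather than the model as the place to optimize.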

Scaling for traffic spikes

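Use horizontal pod autoscaling to add replicas under load. The example below scales on CPU utilization, which suits CPU-bound models. For GPU-backed neural models, scale on GPU utilization or request queue depth instead; both have to be exposed to the autoscaler as custom or external metrics: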

# k8s HPA for model server
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
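
To reason about how the autoscaler above reacts to a spike, it helps to work through the scaling rule documented for the Kubernetes HPA: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (10% by default) and clamped to the min/max bounds. The sketch below is a simplified model of that rule only; it ignores stabilization windows, unready pods, and missing metrics, and the function name is an assumption.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int = 3,
    max_replicas: int = 50,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA scaling rule for a single utilization metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: no scaling action.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds declared in the HPA spec.
    return max(min_replicas, min(max_replicas, desired))
```

With the manifest's values, 3 replicas at 90% CPU against a 60% target gives ceil(3 × 1.5) = 5 replicas, while 40 replicas at 120% would ask for 80 and be capped at 50.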

Pattern 2: Batch Serving

Run predictions over a large dataset on a schedule. No client waiting.

Architecture

Scheduler → Batch Job → Read from data warehouse
                      → Run model in parallel
                      → Write predictions back to data warehouse / feature store

When to use

Pre-computing scores for all users (email targeting, daily recommendations)
High volume where online latency SLA can't be met
Exploratory analysis or model evaluation

Implementation with Spark + MLflow

# batch_predict.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import mlflow.pyfunc
import pandas as pd

spark = SparkSession.builder.appName("batch-inference").getOrCreate()

# Load model (broadcast to all executors)
model_uri = "models:/churn-model/production"
model = mlflow.pyfunc.load_model(model_uri)
broadcast_model = spark.sparkContext.broadcast(model)

@pandas_udf("double")
def predict_udf(features: pd.Series) -> pd.Series:
    m = broadcast_model.value
    X = pd.DataFrame(features.tolist())
    return pd.Series(m.predict(X))

# Read from data warehouse
df = spark.read.parquet("s3://data-warehouse/users/features/date=2025-04-15/")

# Generate predictions
predictions = df.withColumn(
    "churn_score",
    predict_udf(df["feature_vector"]),
)

# Write back
predictions.select("user_id", "churn_score").write.mode("overwrite").parquet(
    "s3://data-warehouse/predictions/churn/date=2025-04-15/"
)

Batch throughput tuning

Vectorization: Use NumPy/Pandas batch predict, not single-row predict in a loop
Parallelism: Aim for a partition count of roughly 2–4× the total number of executor cores, so every core stays busy without creating many tiny tasks
Memory: For large models, broadcast only if model fits in executor memory; otherwise load per-partition
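
The parallelism and memory rules above can be turned into two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and the 25% memory fraction used to decide whether a model "fits" is an illustrative threshold, not a Spark setting.

```python
def target_partitions(num_executors: int, cores_per_executor: int, tasks_per_core: int = 3) -> int:
    """Partition count following the 2-4x total executor cores rule of thumb."""
    if not 2 <= tasks_per_core <= 4:
        raise ValueError("tasks_per_core should be between 2 and 4")
    return num_executors * cores_per_executor * tasks_per_core


def model_load_strategy(
    model_size_bytes: int,
    executor_memory_bytes: int,
    max_fraction: float = 0.25,  # assumed threshold, tune for your workload
) -> str:
    """Pick 'broadcast' when the model is small relative to executor memory."""
    if model_size_bytes <= executor_memory_bytes * max_fraction:
        return "broadcast"
    return "load_per_partition"
```

For a cluster of 10 executors with 4 cores each, target_partitions(10, 4) returns 120, which could be passed to df.repartition(...) before applying the prediction UDF.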

Pattern 3: Streaming (Near-Real-Time) Serving

Events flow through a message queue; predictions are computed within seconds of new data arriving.

Architecture

Event Source → Kafka → Stream Processor (Flink/Spark Streaming)
                              ↓
                    Feature Computation + Model Inference
                              ↓
                         Output Topic → Downstream Systems

When to use

Fraud detection where features depend on recent events (last 5 minutes of activity)
Content moderation (flag posts within seconds of publishing)
Real-time personalization with session-level features

The streaming feature problem

The hardest part of streaming inference isn't the model—it's feature computation. You need:

Point-in-time correct features: No leakage from future events
Windowed aggregations: "Number of transactions in last 10 minutes" requires stateful processing
Low-latency feature store: Redis or DynamoDB, not BigQuery

# Flink Python API example
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import ProcessFunction

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(8)

class FraudScoringFunction(ProcessFunction):
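    def open(self, runtime_context):
        # Runs once per parallel task: load the model and connect to the
        # feature store here rather than once per element.
        import joblib
        self.model = joblib.load("/models/fraud_v3.pkl")
        # RedisFeatureStore and build_feature_vector (used below) are
        # application-specific helpers, not library APIs.
        self.feature_store = RedisFeatureStore(host="redis:6379")

    def process_element(self, transaction, ctx):
        # Fetch pre-computed user features from Redis
        user_features = self.feature_store.get(transaction["user_id"])

        # Combine with transaction features
        X = build_feature_vector(transaction, user_features)

        # Cast the NumPy scalar to a plain float so the result serializes cleanly
        score = float(self.model.predict_proba([X])[0][1])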

        if score > 0.85:
            yield {"transaction_id": transaction["id"], "action": "block", "score": score}
        else:
            yield {"transaction_id": transaction["id"], "action": "allow", "score": score}

transactions = env.from_source(kafka_source, ...)
results = transactions.process(FraudScoringFunction())
results.sink_to(kafka_sink)

env.execute("fraud-scoring")
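
The windowed aggregation requirement listed earlier in this section ("number of transactions in last 10 minutes") can be illustrated without a stream processor. The sketch below keeps per-key event timestamps in a deque and counts only events at or before the current one, which is what keeps the feature point-in-time correct. It assumes events arrive in event-time order per key; a real Flink job would rely on keyed state and watermarks to handle late or out-of-order events. The class name is hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Per-key count of events in the trailing window (in-order events only)."""

    def __init__(self, window_seconds: float):
        self._window = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def add_and_count(self, key: str, event_time: float) -> int:
        """Record an event and return how many events for this key fall in
        (event_time - window, event_time], including the new one."""
        timestamps = self._events[key]
        timestamps.append(event_time)
        cutoff = event_time - self._window
        # Evict timestamps that have slid out of the window.
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()
        return len(timestamps)
```

Because the count is computed as each event arrives, the value is the same one the model would have seen at that moment, with no leakage from later events.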

Pattern 4: Embedded (Edge) Inference

The model runs inside the client application—no network call, no server.

When to use

Mobile apps where latency or offline support matters
Privacy requirements (data never leaves device)
Cost: serving infrastructure is expensive at scale

Model formats

ONNX           → Cross-platform, good for sklearn/PyTorch/TF models
TensorFlow Lite → Android/iOS, optimized for ARM
Core ML        → iOS/macOS, hardware-accelerated on Apple Silicon
GGUF           → LLMs on CPU/GPU via llama.cpp
ExecuTorch     → PyTorch-native mobile deployment (Meta)

Exporting to ONNX

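import torch
import onnx

# SentimentClassifier is assumed to be your own nn.Module, defined or imported elsewhere.
model = SentimentClassifier()
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

dummy_input = torch.randint(0, 32000, (1, 128))  # batch=1, seq_len=128

torch.onnx.export(
    model,
    dummy_input,
    "sentiment.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    # Declare the batch dimension as dynamic on the output as well as the input
    # (assumes logits has shape (batch, num_classes)).
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "logits": {0: "batch"},
    },
    opset_version=17,
)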

# Verify
onnx_model = onnx.load("sentiment.onnx")
onnx.checker.check_model(onnx_model)

Running with ONNX Runtime

import onnxruntime as ort
import numpy as np

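# Providers are tried in order; list only CPUExecutionProvider for CPU-only devices.
session = ort.InferenceSession(
    "sentiment.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# tokenized_input is assumed to be an int64 torch tensor of shape (batch, seq_len).
# ~5–20ms on CPU for a small model
output = session.run(
    ["logits"],
    {"input_ids": tokenized_input.numpy()},
)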

Choosing the Right Pattern: Decision Tree

Is a human waiting for the result?
  YES → Is the latency SLA < 500ms?
          YES → Online serving
          NO  → Can you pre-compute? → YES → Batch (pre-compute before request)
  NO  → Are features time-sensitive (stale within minutes)?
          YES → Streaming
          NO  → Batch (scheduled job)

Would sending data off the device create unacceptable risk or cost?
  YES → Embedded
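
The decision tree above can be written as a function, which makes the branches easy to test. This is one possible encoding: the parameter names are assumptions, treating the on-device question as an override checked first is an interpretation, and the branch the tree leaves open (a human is waiting, the SLA is 500ms or more, and pre-computation is not possible) returns None rather than guessing.

```python
from typing import Optional


def choose_pattern(
    human_waiting: bool,
    latency_sla_ms: Optional[float] = None,
    can_precompute: bool = False,
    features_time_sensitive: bool = False,
    off_device_risk: bool = False,
) -> Optional[str]:
    """Map the decision-tree questions to a serving pattern name."""
    if off_device_risk:
        # Sending data off the device is unacceptable: run the model locally.
        return "embedded"
    if human_waiting:
        if latency_sla_ms is not None and latency_sla_ms < 500:
            return "online"
        if can_precompute:
            return "batch"  # pre-compute before the request arrives
        return None  # not covered by the decision tree
    # Nobody is waiting: freshness of features decides.
    return "streaming" if features_time_sensitive else "batch"
```

For example, choose_pattern(human_waiting=True, latency_sla_ms=100) selects online serving, and choose_pattern(human_waiting=False, features_time_sensitive=True) selects streaming.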

Hybrid Patterns

Real systems often combine patterns:

Batch + Online: Batch computes user embeddings nightly; online ranking uses them in real-time
Streaming + Online: Streaming updates a real-time feature (recent activity count); online model reads it
Embedded + Online: Embedded model handles common cases; online model handles edge cases the small model is uncertain about (cascading)
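
The cascading idea in the Embedded + Online combination can be sketched as a confidence gate: the small on-device model answers when it is confident, and the request falls through to the server-side model otherwise. The function name, the callable signatures, and the 0.9 threshold are assumptions for illustration; in practice the threshold would be tuned on validation data.

```python
from typing import Callable, Sequence, Tuple


def cascade_predict(
    features: Sequence[float],
    small_model: Callable[[Sequence[float]], Sequence[float]],  # returns class probabilities
    large_model: Callable[[Sequence[float]], int],  # returns a class label
    confidence_threshold: float = 0.9,  # assumed value
) -> Tuple[int, str]:
    """Return (label, source), where source is 'embedded' or 'online'."""
    probabilities = small_model(features)
    # Index of the most likely class according to the small model.
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[label] >= confidence_threshold:
        return label, "embedded"
    # Uncertain case: defer to the larger online model.
    return large_model(features), "online"
```

Logging the returned source makes it possible to track what fraction of traffic the embedded model absorbs, which is the main cost lever in this setup.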

