tutorial 2025-04-22 15 min read

Building Multimodal LLM Systems in Production: Vision, Audio, and Beyond

A practical engineering guide to deploying multimodal LLMs—handling image, audio, and video inputs alongside text—with architecture patterns and cost optimization strategies.

multimodal vision LLM GPT-4V Claude Gemini production cost optimization

Why Multimodal Adds Complexity

Adding image or audio inputs to an LLM-powered system introduces challenges that don't exist in text-only systems:

  1. Input size variance: A 1080p frame is about 6 MB uncompressed (1920 × 1080 pixels × 3 bytes), and an hour of compressed audio runs roughly 50 MB at typical MP3 bitrates. Payloads of this size break most text-oriented architectures.
  2. Cost structure changes: Vision tokens are 3–10× more expensive than text tokens per equivalent "information unit".
  3. Preprocessing pipelines: Images need resizing, audio needs chunking and transcription, video needs frame extraction
  4. Latency: Image encoding adds 100–500ms depending on model and resolution
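To make the size variance in item 1 concrete, the arithmetic can be sketched as below. This is a back-of-the-envelope helper rather than part of any library: the function names are hypothetical, and the 128 kbps bitrate is an assumed, commonly used MP3 setting.

```python
def raw_image_bytes(width: int, height: int, channels: int = 3) -> int:
    """Uncompressed size of an 8-bit image: one byte per channel per pixel."""
    return width * height * channels


def audio_bytes(duration_s: float, bitrate_kbps: float) -> int:
    """Approximate size of constant-bitrate audio (bitrate in kilobits per second)."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


# A 1080p RGB frame before any JPEG/PNG compression
print(f"1080p raw frame: {raw_image_bytes(1920, 1080) / 1e6:.1f} MB")

# One hour of audio at an assumed 128 kbps
print(f"1 h audio @128 kbps: {audio_bytes(3600, 128) / 1e6:.1f} MB")
```

Compressed images are usually far smaller than the raw figure, so the number that matters for request limits is the encoded size after preprocessing.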

This guide covers the engineering decisions that matter.


Image Input Pipelines

How vision LLMs tokenize images

Modern vision LLMs (GPT-4o, Claude 3.5, Gemini 1.5) typically encode images by splitting them into tiles or patches and running a vision encoder (usually a ViT variant) over them. The exact scheme, and therefore the token count per image, differs by provider.

GPT-4o tile pricing:

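  • Base cost: 85 tokens per image (minimum, and the flat cost in low-detail mode)
  • Each 512×512 tile: +170 tokens
  • In high-detail mode the image is first scaled to fit within 2048×2048, then scaled down so its shortest side is at most 768 px, and only then tiled
  • A 1024×1024 image is scaled to 768×768, so it costs 85 + 4×170 = 765 tokens, roughly $0.002 at GPT-4o input prices at the time of writing

Key insight: Resolution drives cost, but in steps rather than linearly. Because of the rescaling step, shrinking a 2048×2048 image to 1024×1024 saves no tokens on GPT-4o (both end up as 768×768, 765 tokens), while a 512×512 image costs 255. Downsizing to the smallest resolution that still supports the task is usually the highest-leverage cost optimization, and it also cuts upload size and latency.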
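The tile arithmetic above can be sketched as a small estimator. This follows OpenAI's published description of high-detail image handling as we understand it (fit within 2048×2048, shortest side down to 768 px, then 512 px tiles); the function name is hypothetical, and the result is an estimate to check against actual usage reported by the API.

```python
def estimate_openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens using the 85 + 170-per-tile rule."""
    if detail == "low":
        return 85  # flat charge, no tiling

    # Step 1: scale down (never up) to fit inside a 2048x2048 square.
    fit = min(1.0, 2048 / max(width, height))
    w, h = max(1, round(width * fit)), max(1, round(height * fit))

    # Step 2: scale down (never up) so the shortest side is at most 768 px.
    shrink = min(1.0, 768 / min(w, h))
    w, h = max(1, round(w * shrink)), max(1, round(h * shrink))

    # Step 3: count 512x512 tiles, rounding partial tiles up.
    tiles = ((w + 511) // 512) * ((h + 511) // 512)
    return 85 + 170 * tiles


for size in [(512, 512), (1024, 1024), (2048, 2048), (2048, 4096)]:
    print(size, estimate_openai_image_tokens(*size))
```

Running the loop shows where downsizing pays off: the token count only drops once the image falls below the 768 px threshold or crosses a tile boundary.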

Image preprocessing service

from PIL import Image
import io
import base64
from typing import Literal

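def prepare_image_for_llm(
    image_path: str,
    target_detail: Literal["low", "high", "auto"] = "auto",
    max_dimension: int = 2048,
) -> tuple[str, dict]:
    """
    Resize and encode an image for LLM vision APIs.
    Returns (base64_string, cost_estimate).
    """
    img = Image.open(image_path).convert("RGB")
    orig_w, orig_h = img.size

    # Downscale (never upscale) so the longest side fits max_dimension
    if max(orig_w, orig_h) > max_dimension:
        ratio = max_dimension / max(orig_w, orig_h)
        new_size = (max(1, round(orig_w * ratio)), max(1, round(orig_h * ratio)))
        img = img.resize(new_size, Image.LANCZOS)

    # Estimate token cost (OpenAI tile formula; other providers count differently)
    if target_detail == "low":
        estimated_tokens = 85  # low detail is a flat charge
    else:
        # "auto" is estimated as "high", the more expensive case.
        # The API scales the image to fit 2048x2048, then shrinks it so the
        # shortest side is at most 768 px, before counting 512x512 tiles.
        w, h = img.size
        fit = min(1.0, 2048 / max(w, h))
        w, h = max(1, round(w * fit)), max(1, round(h * fit))
        shrink = min(1.0, 768 / min(w, h))
        w, h = max(1, round(w * shrink)), max(1, round(h * shrink))
        tiles = ((w + 511) // 512) * ((h + 511) // 512)
        estimated_tokens = 85 + tiles * 170

    # Encode as JPEG, then base64 for the API payload
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    jpeg_bytes = buffer.getvalue()
    encoded = base64.standard_b64encode(jpeg_bytes).decode()

    return encoded, {"tokens": estimated_tokens, "size_kb": len(jpeg_bytes) // 1024}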

# Usage
b64, cost_info = prepare_image_for_llm("invoice.jpg", max_dimension=1024)
print(f"Estimated tokens: {cost_info['tokens']}, size: {cost_info['size_kb']}KB")

Calling vision APIs

from anthropic import Anthropic

client = Anthropic()

def analyze_image(image_path: str, prompt: str) -> str:
    b64_image, _ = prepare_image_for_llm(image_path)

    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": b64_image,
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return message.content[0].text

Caching image analysis results

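Image analysis is expensive, and the same image with the same prompt rarely needs a second call. Responses are not strictly deterministic, but for most extraction and classification work a stored result is as good as a fresh one, so cache by default: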

import hashlib
import redis

cache = redis.Redis(host="localhost", decode_responses=True)

def cached_analyze_image(image_path: str, prompt: str, ttl: int = 86400) -> str:
    # Hash by file content + prompt (not path, which can change)
    with open(image_path, "rb") as f:
        content_hash = hashlib.sha256(f.read() + prompt.encode()).hexdigest()[:16]

    cache_key = f"vision:{content_hash}"
    cached = cache.get(cache_key)
    if cached is not None:  # an empty string is still a valid cached result
        return cached

    result = analyze_image(image_path, prompt)
    cache.setex(cache_key, ttl, result)
    return result
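One refinement worth considering is the cache key itself. The sketch below is a hypothetical helper (the name `vision_cache_key` and its parameters are assumptions, not part of any library) that also hashes the model name, so switching models does not return stale results. It length-prefixes each field so that different image/prompt splits cannot collide.

```python
import hashlib


def vision_cache_key(image_bytes: bytes, prompt: str, model: str, prefix: str = "vision") -> str:
    """Build a cache key from everything that influences the model's answer."""
    digest = hashlib.sha256()
    for part in (model.encode(), prompt.encode(), image_bytes):
        # The length prefix makes field boundaries unambiguous:
        # (b"x", "ya") and (b"xy", "a") hash differently.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return f"{prefix}:{digest.hexdigest()}"


key = vision_cache_key(b"\xff\xd8fake-jpeg-bytes", "Extract the invoice total.", "model-a")
print(key)
```

If prompts are templated, hashing the rendered prompt (as here) is the safer choice, since two renders of the same template may ask for different things.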

Audio Input Pipelines

Transcription-first vs. native audio

Two approaches:

  1. Transcription-first: Audio → Whisper/Deepgram → text → LLM
  2. Native audio: Audio → multimodal LLM directly (GPT-4o audio, Gemini)

When to use transcription-first:

  • You need word-level timestamps
  • Audio is > 5 minutes (native audio has context window limits)
  • Cost is a priority (Whisper at $0.006/minute vs. GPT-4o audio at ~$0.10/minute)
  • You need to store/index the transcript

When to use native audio:

  • Tone, emotion, or speaker identity matters
  • Conversational AI where prosody changes meaning
  • Short clips (< 2 minutes)
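The criteria above can be folded into a small routing function. This is a sketch under stated assumptions: the `AudioRequest` fields and function names are hypothetical, the per-minute prices are the approximate figures quoted above, and the fallback to transcription-first when nothing else decides is our own assumption (it is the cheaper path).

```python
from dataclasses import dataclass

WHISPER_USD_PER_MIN = 0.006       # figure quoted above
NATIVE_AUDIO_USD_PER_MIN = 0.10   # approximate figure quoted above


@dataclass
class AudioRequest:
    duration_s: float
    needs_timestamps: bool = False
    needs_transcript_storage: bool = False
    prosody_matters: bool = False  # tone, emotion, or speaker identity


def choose_audio_pipeline(req: AudioRequest) -> str:
    # Hard requirements for a transcript win over everything else.
    if req.needs_timestamps or req.needs_transcript_storage:
        return "transcription_first"
    # Long audio runs into native-audio context limits.
    if req.duration_s > 5 * 60:
        return "transcription_first"
    # Short-to-medium clips where delivery carries meaning.
    if req.prosody_matters:
        return "native_audio"
    # Assumed default: the cheaper path.
    return "transcription_first"


def estimated_cost_usd(pipeline: str, duration_s: float) -> float:
    rate = NATIVE_AUDIO_USD_PER_MIN if pipeline == "native_audio" else WHISPER_USD_PER_MIN
    return rate * duration_s / 60


req = AudioRequest(duration_s=90, prosody_matters=True)
pipeline = choose_audio_pipeline(req)
print(pipeline, f"${estimated_cost_usd(pipeline, req.duration_s):.3f}")
```

The cost helper covers only the audio step; with transcription-first, the downstream text LLM call is billed separately.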

Chunked transcription pipeline

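from openai import OpenAI
from pydub import AudioSegment  # pydub relies on ffmpeg for most formats
import io

# Named openai_client so it does not shadow the Anthropic `client` defined earlier
openai_client = OpenAI()

def transcribe_long_audio(
    audio_path: str,
    chunk_minutes: int = 10,
    language: str = "en",
) -> str:
    audio = AudioSegment.from_file(audio_path)
    chunk_ms = chunk_minutes * 60 * 1000

    transcripts = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk = audio[start : start + chunk_ms]

        buffer = io.BytesIO()
        # 64 kbps keeps a 10-minute chunk near 5 MB, well under the API's 25 MB upload limit.
        # Transcription is billed per minute of audio, so the bitrate affects upload size, not cost.
        chunk.export(buffer, format="mp3", bitrate="64k")
        buffer.seek(0)
        buffer.name = f"chunk_{i}.mp3"  # the SDK uses the name to infer the file type

        transcript = openai_client.audio.transcriptions.create(
            model="whisper-1",
            file=buffer,
            language=language,
            response_format="verbose_json",  # also returns segment timestamps, relative to the chunk start
        )
        transcripts.append(transcript.text.strip())

    # Fixed-size chunks can cut a word in half at a boundary; split on silence
    # or overlap chunks slightly if that matters for your use case.
    return " ".join(transcripts)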
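If you need the timestamps rather than just the joined text, each chunk's segment times have to be shifted by that chunk's start offset. A minimal sketch, assuming each segment has already been converted to a plain dict with `start`, `end`, and `text` keys in seconds (the SDK returns objects, so this conversion and the function name `merge_chunk_segments` are assumptions):

```python
def merge_chunk_segments(chunks: list[tuple[float, list[dict]]]) -> list[dict]:
    """
    chunks: list of (chunk_start_s, segments), where each segment's
    start/end are relative to the beginning of its own chunk.
    Returns segments with timestamps relative to the full recording.
    """
    merged = []
    for chunk_start_s, segments in chunks:
        for seg in segments:
            merged.append({
                "start": seg["start"] + chunk_start_s,
                "end": seg["end"] + chunk_start_s,
                "text": seg["text"].strip(),
            })
    merged.sort(key=lambda s: s["start"])
    return merged


# Two 10-minute chunks: the second starts 600 s into the recording
chunks = [
    (0.0, [{"start": 0.0, "end": 2.5, "text": " Hello and welcome."}]),
    (600.0, [{"start": 1.0, "end": 3.0, "text": " Next agenda item."}]),
]
for seg in merge_chunk_segments(chunks):
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")
```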

Document Understanding (PDF + Images)

PDFs are the most common multimodal input in enterprise applications. Two approaches:

Approach A: PDF → Images → Vision LLM

Best for PDFs with complex layouts, charts, or non-standard formatting.

import base64
import io

from pdf2image import convert_from_path  # requires the poppler utilities to be installed

def pdf_to_analyzed_text(pdf_path: str, dpi: int = 150) -> str:
    # `client` is the Anthropic client created in "Calling vision APIs"
    # Convert each page to image at 150 DPI (good balance of quality/cost)
    pages = convert_from_path(pdf_path, dpi=dpi)

    results = []
    for i, page_img in enumerate(pages):
        # Encode the page as JPEG in memory
        buffer = io.BytesIO()
        page_img.convert("RGB").save(buffer, format="JPEG", quality=85)
        b64 = base64.standard_b64encode(buffer.getvalue()).decode()

        result = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": b64}},
                    {"type": "text", "text": "Extract all text content from this document page. Preserve structure including headers, tables, and lists. Output in Markdown."},
                ],
            }],
        )
        results.append(f"## Page {i+1}\n\n{result.content[0].text}")

    return "\n\n---\n\n".join(results)

Approach B: PDF text extraction → LLM

Best when PDFs are digitally created (not scanned), cost matters, and layout is simple.

import pymupdf  # formerly fitz

def extract_pdf_text(pdf_path: str) -> str:
    pages = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            # "text" returns plain text. get_text() has no Markdown mode;
            # the separate pymupdf4llm package is one option if you need Markdown.
            pages.append(page.get_text("text"))
    return "\n\n---\n\n".join(pages)
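Choosing between the two approaches can often be automated by checking whether the PDF has a usable text layer. The sketch below is a heuristic under stated assumptions: the function name and both thresholds are hypothetical, and the per-page character counts are expected to come from something like `len(page.get_text("text"))`. It does not detect complex layouts, which still call for Approach A.

```python
def choose_pdf_strategy(
    chars_per_page: list[int],
    min_chars: int = 200,          # assumed: fewer characters suggests a scanned page
    min_text_ratio: float = 0.8,   # assumed: share of pages that must have text
) -> str:
    """Return "text" if most pages have an extractable text layer, else "vision"."""
    if not chars_per_page:
        return "vision"
    pages_with_text = sum(1 for n in chars_per_page if n >= min_chars)
    if pages_with_text / len(chars_per_page) >= min_text_ratio:
        return "text"
    return "vision"


print(choose_pdf_strategy([1500, 1800, 1200]))  # digitally created report
print(choose_pdf_strategy([0, 0, 12]))          # scanned document
```

For mixed documents, the same check can be applied per page, so only the scanned pages go through the vision model.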

Cost comparison for a 10-page PDF:

| Method | Tokens | Cost (claude-opus-4-6) |
| Vision, page by page (150 DPI) | ~8,500 | ~$0.085 |
| Vision, page by page (72 DPI) | ~4,000 | ~$0.040 |
| Text extraction | ~3,000 | ~$0.030 |
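The dollar figures in the comparison follow from simple token arithmetic. The sketch below reproduces them and scales them to a monthly volume. The $10 per million tokens rate is inferred from the table (8,500 tokens ≈ $0.085) rather than taken from a price list, and the 10,000 PDFs per month volume is a made-up example.

```python
def token_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost only; output tokens are billed separately."""
    return tokens * usd_per_million_tokens / 1_000_000


IMPLIED_USD_PER_MTOK = 10.0   # inferred from the table above, not an official price
PDFS_PER_MONTH = 10_000       # hypothetical workload

methods = {
    "vision_150dpi": 8_500,
    "vision_72dpi": 4_000,
    "text_extraction": 3_000,
}

for name, tokens in methods.items():
    per_pdf = token_cost_usd(tokens, IMPLIED_USD_PER_MTOK)
    print(f"{name:16s} ${per_pdf:.3f}/PDF  ${per_pdf * PDFS_PER_MONTH:,.0f}/month")
```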

Cost Optimization Strategies

1. Route by input type and complexity

Not every image needs a frontier model:

def route_vision_request(image_path: str, task: str) -> str:
    # call_vision_model() and PROMPTS are placeholders for your own API wrapper and prompt table
    with Image.open(image_path) as img:  # lazy open: reads the header, not the pixel data
        w, h = img.size

    # Simple tasks (OCR, classification) → small fast model
    simple_tasks = ["extract_text", "classify_document_type", "detect_language"]
    if task in simple_tasks and max(w, h) < 2000:
        return call_vision_model("claude-haiku-4-5-20251001", image_path, PROMPTS[task])

    # Complex tasks → frontier model
    return call_vision_model("claude-opus-4-6", image_path, PROMPTS[task])

2. Thumbnail for classification, full resolution for extraction

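def two_stage_document_processing(image_path: str) -> dict:
    # classify_document_type(), extract_structured_data(), and VALUABLE_DOC_TYPES
    # are placeholders for your own implementations.

    # Stage 1: classify with thumbnail (cheap)
    # prepare_image_for_llm returns (base64_string, cost_estimate); only the image is needed here
    thumbnail_b64, _ = prepare_image_for_llm(image_path, max_dimension=512)
    doc_type = classify_document_type(thumbnail_b64)

    # Stage 2: extract only if worth it
    if doc_type in VALUABLE_DOC_TYPES:
        full_res_b64, _ = prepare_image_for_llm(image_path, max_dimension=2048)
        return extract_structured_data(full_res_b64, doc_type)

    return {"doc_type": doc_type, "extracted": None}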
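Whether two stages save money depends on how many documents reach stage 2, because those documents pay for both calls. A rough sketch of the expected image-token cost, with hypothetical function names and illustrative token counts (255 for a thumbnail, 765 for a full-resolution page); prompt and output tokens are ignored:

```python
def two_stage_expected_tokens(thumb_tokens: int, full_tokens: int, p_extract: float) -> float:
    """Every document pays for the thumbnail; a fraction p_extract also pays for full resolution."""
    return thumb_tokens + p_extract * full_tokens


def break_even_fraction(thumb_tokens: int, full_tokens: int) -> float:
    """Above this share of documents needing extraction, always sending full resolution is cheaper."""
    return 1 - thumb_tokens / full_tokens


THUMB, FULL = 255, 765  # illustrative values only

for p in (0.1, 0.3, 0.9):
    two_stage = two_stage_expected_tokens(THUMB, FULL, p)
    print(f"p={p:.1f}: two-stage {two_stage:.1f} tokens vs single-stage {FULL}")

print(f"break-even at p={break_even_fraction(THUMB, FULL):.2f}")
```

With these numbers the pattern pays off as long as fewer than about two thirds of documents need full extraction.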

3. Async batch processing

Vision tasks are often not latency-critical, so you can queue them and process many concurrently under a fixed concurrency cap:

import asyncio
from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()

async def process_images_batch(image_paths: list[str], concurrency: int = 5) -> list[str]:
    semaphore = asyncio.Semaphore(concurrency)

    async def process_one(path):
        async with semaphore:
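            # Image decoding and resizing is blocking work; run it in a thread to keep the event loop free
            b64, _ = await asyncio.to_thread(prepare_image_for_llm, path)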
            response = await async_client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=512,
                messages=[{"role": "user", "content": [
                    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": b64}},
                    {"type": "text", "text": "Describe this image in one sentence."},
                ]}],
            )
            return response.content[0].text

    return await asyncio.gather(*[process_one(p) for p in image_paths])
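In a large batch, one unreadable image should not discard every other result. The sketch below shows the same semaphore pattern with `return_exceptions=True`, so failures come back in place as exception objects. `gather_bounded` and `fake_describe` are hypothetical names; the fake worker stands in for the real API call so the example runs offline.

```python
import asyncio


async def gather_bounded(items, worker, concurrency: int = 5) -> list:
    """Run worker(item) for every item, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    # Results keep input order; failures are returned as exception objects.
    return await asyncio.gather(*(run_one(i) for i in items), return_exceptions=True)


async def fake_describe(path: str) -> str:
    """Stand-in for a vision API call."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError(f"cannot decode {path}")
    return f"description of {path}"


paths = ["a.jpg", "b.bad", "c.jpg"]
results = asyncio.run(gather_bounded(paths, fake_describe, concurrency=2))

succeeded = {p: r for p, r in zip(paths, results) if not isinstance(r, Exception)}
failed = {p: r for p, r in zip(paths, results) if isinstance(r, Exception)}
print(f"{len(succeeded)} ok, {len(failed)} failed")
```

The failed paths can then be retried or routed to a dead-letter queue without rerunning the whole batch.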

Integrating multimodal capabilities into a RAG system? See our guide on RAG Systems at Scale.
