Open-Source LLM Fine-Tuning and Serving Infrastructure for AI Agent Platforms
Executive Summary
As of March 2026, the open-source LLM ecosystem has reached a level of maturity that makes self-hosted deployment genuinely viable for production AI agent platforms. The gap between open-source and frontier commercial models has narrowed dramatically: Llama 4 Maverick (17B active parameters, MoE) benchmarks ahead of GPT-4o on several tasks, DeepSeek R1 matches OpenAI o1 on math and reasoning, and Qwen 2.5-72B competes with GPT-4 Turbo on coding and instruction following.
The strategic case for self-hosted LLMs in an agent platform is compelling: cost reduction of 5-20x at scale, full data sovereignty, ability to fine-tune for domain-specific behavior, and elimination of per-token API cost growth as usage scales. The practical case requires honest acknowledgment of operational complexity — GPU procurement, serving infrastructure management, fine-tuning pipelines, and ongoing maintenance.
This article covers the full stack: current open-source model landscape, fine-tuning approaches (LoRA/QLoRA, DPO, synthetic data), serving infrastructure (vLLM, SGLang, TGI), tiered routing architecture, GPU hardware decisions, and realistic cost analysis. The intended reader is an engineering team deciding how to architect a hybrid system using both self-hosted open-source models and commercial APIs.
Key recommendations:
- Start with Llama 4 Scout (10M context, single-server INT4) or Qwen 2.5-72B as the primary self-hosted model
- Use vLLM or SGLang for serving — both are production-ready with OpenAI-compatible APIs
- Implement QLoRA for fine-tuning on task-specific data; full fine-tuning only when you have 100k+ high-quality examples
- Route via complexity classification: fast/cheap self-hosted for routine agent tasks, commercial API for complex reasoning
- Plan for L40S (48GB, ~$1.5-2/hr on spot) as the sweet spot for 70B models in Southeast Asia cloud regions
The 2026 Open-Source Model Landscape
Llama 4: Meta's MoE Breakthrough
Released in early 2026, Llama 4 represents Meta's most significant architectural shift — moving from dense transformers to Mixture of Experts (MoE). Two variants are relevant for production:
Llama 4 Scout (~109B total, 17B active, 16 experts):
- Context window: 10M tokens (instruction-tuned) — the largest of any open model
- Deployable on a single server with INT4 quantization
- Benchmarks: MMLU Pro 74.3%, GPQA Diamond 57.2%
- Native multimodal (text + images) via early fusion architecture
- Multilingual: trained on 200 languages, fine-tuned for 12
Llama 4 Maverick (~400B total, 17B active, 128 experts):
- Context: 1M tokens
- Benchmarks: MMLU Pro 80.5%, GPQA Diamond 69.8%, MATH 61.2% — outperforms Llama 3.1 405B
- Requires multiple GPUs; FP8 or BF16 deployment
- Co-distilled from the larger (unreleased) Llama Behemoth
The iRoPE architecture in Llama 4 uses alternating NoPE layers (no positional encoding, full causal attention) and RoPE layers (chunked 8K attention) to achieve extreme context lengths without the quadratic attention cost. This is directly relevant for agent platforms where long conversation histories and large context windows matter.
Licensing: Custom Llama 4 Community License (must be accepted per model card). Commercial use is permitted with attribution requirements, and derivative model names must begin with "Llama" (e.g., "Llama-4-YourModel").
# Llama 4 Scout with transformers v4.51.0+
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
attn_implementation="flex_attention",
device_map="auto",
torch_dtype=torch.bfloat16,
)
Qwen 2.5: Alibaba's Production Workhorse
Qwen 2.5 offers the most complete size range in the open-source ecosystem: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. All variants share a strong multilingual foundation with particular strength in Chinese and Southeast Asian languages — directly relevant for a team operating in this region.
Qwen 2.5-72B-Instruct:
- Architecture: 80 layers, GQA (64 Q heads, 8 KV heads), RoPE + SwiGLU + RMSNorm
- Context: 131K tokens (YaRN for extended context)
- Capabilities: Strong coding, mathematics, structured output (JSON), multilingual (29+ languages including Vietnamese, Thai, Arabic)
- License: Qwen License (commercial use permitted for most use cases)
- 767K+ monthly downloads on Hugging Face as of early 2026
Qwen 2.5-Coder and Qwen 2.5-Math specialized variants provide task-specific excellence that surpasses the general-purpose 72B model on their respective domains.
The Qwen family is particularly well-suited for Southeast Asian deployments given Alibaba's optimization for regional language coverage and the fact that distilled versions of DeepSeek R1 run on Qwen base models (the R1-Distill-Qwen-32B outperforms OpenAI o1-mini on several benchmarks).
DeepSeek V3 and R1: Efficiency-First Architecture
DeepSeek has established itself as the efficiency leader — delivering frontier performance at a fraction of the training and serving cost.
DeepSeek V3 (671B total parameters, 37B activated, MoE):
- Trained on 14.8T tokens
- API pricing: $0.27/M input tokens (cache miss), $0.07/M (cache hit), $1.10/M output — the most competitive commercial pricing in the market
- Inference speed: ~60 tokens/second (3x faster than V2)
- Open weights available; self-hosting requires significant multi-GPU infrastructure
DeepSeek R1 (671B total, 37B activated):
- MIT license — the most permissive of any frontier-class open model
- Performance: AIME 2024 79.8% (vs OpenAI o1's 79.2%), MATH-500 97.3%, ArenaHard 92.3%
- The R1-Zero precursor was trained via pure reinforcement learning, demonstrating that reasoning capability can emerge from RL alone; R1 itself adds a small cold-start SFT stage before RL to improve readability and stability
- Distilled variants are the practical play for most teams:
- R1-Distill-Qwen-32B: outperforms o1-mini; runnable on 2x A100s or 1x H100
- R1-Distill-Qwen-14B: 1x A100 (80GB) or 2x L40S
- R1-Distill-Llama-70B: requires 4x A100s, competitive with full R1 on many tasks
# Serve DeepSeek R1 distilled with vLLM
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--enforce-eager
# Note: temperature is a per-request sampling parameter, not a server flag;
# send temperature=0.6 with each request to avoid repetitive reasoning traces.
Critical configuration for R1 models: use temperature 0.5-0.7 (0.6 recommended); do not use system prompts (put all instructions in the user turn); for math problems, instruct the model to "put your final answer within \boxed{}".
Mistral Large 2 (123B)
Released mid-2024, Mistral Large 2 remains competitive with 128K context, 123B parameters, and an MT-Bench score competitive with GPT-3.5 Turbo (8.30 vs 8.32). Key strengths: parallel and sequential function calls, strong multilingual (80+ coding languages, French/German/Spanish/Italian/Portuguese/Arabic/Hindi/Russian/CJK), and MMLU accuracy of 84.0%.
Licensing constraint: Mistral Research License for non-commercial use; commercial deployment requires a Mistral Commercial License. This restricts its viability for production agent platforms without negotiating licensing terms.
Model Selection Matrix
| Model | Active Params | Context | License | Best For |
|---|---|---|---|---|
| Llama 4 Scout | 17B (MoE) | 10M | Community | Long-context agent tasks, multimodal |
| Llama 4 Maverick | 17B (MoE) | 1M | Community | High-quality general reasoning |
| Qwen 2.5-72B | 72B | 131K | Qwen (commercial) | Multilingual, coding, structured output |
| Qwen 2.5-32B | 32B | 131K | Qwen (commercial) | Cost-efficient mid-tier |
| DeepSeek R1-Distill-32B | 32B | 128K | MIT | Reasoning, math, step-by-step tasks |
| DeepSeek R1-Distill-14B | 14B | 128K | MIT | Compact reasoning, edge-of-budget |
| DeepSeek V3 | 37B active | 128K | MIT | General tasks via API |
Fine-Tuning Approaches
When to Fine-Tune vs Prompt Engineer
Fine-tuning is not always the right answer. The decision tree:
- Can careful prompting + RAG solve it? → Start there. Iterate on prompts first.
- Is the task highly repetitive at scale? → Fine-tuning reduces per-call costs by moving instructions into weights.
- Do you have 1,000+ high-quality examples? → LoRA/QLoRA viable. Fewer than 500 examples often hurts more than helps.
- Do you need consistent output format/style? → Fine-tuning locks in behavior better than prompting.
- Is data privacy a hard requirement? → Self-hosted fine-tuning keeps data off commercial APIs.
For agent platforms specifically, fine-tuning targets tend to be: tool use format compliance, domain-specific terminology, consistent JSON schema adherence, and agent persona/tone.
LoRA: The Practical Standard
LoRA (Low-Rank Adaptation) freezes base model weights and trains two small matrices (B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k)) injected at each attention layer. The effective weight update is ΔW = BA with rank r ≪ min(d, k): ΔW keeps the full d×k shape, but only r(d + k) parameters are trained, typically 1-5% of total model size.
Memory comparison (Llama 2-7B):
- Full fine-tune: ~112GB VRAM (BF16 weights + optimizer states)
- LoRA (r=16): ~21GB VRAM
- QLoRA (4-bit + r=16): ~14GB VRAM
Practical numbers:
- LoRA adapter file size: ~24MB vs ~7GB for a 7B base model (288x smaller)
- Loading time: 1.44-3.46 seconds per adapter swap
- Inference overhead: negligible when fused (30% speedup via model.fuse_lora())
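The parameter savings are easy to verify with a back-of-envelope sketch. Pure Python; the dimensions are illustrative for a 7B-class attention projection, not taken from any specific checkpoint:

```python
# LoRA trains B (d x r) and A (r x k) instead of a dense d x k weight update.
d, k = 4096, 4096   # attention projection dims (illustrative, 7B-class model)
r = 16              # LoRA rank

full_update_params = d * k            # dense delta-W
lora_params = d * r + r * k           # B plus A

print(f"full:  {full_update_params:,}")   # 16,777,216
print(f"lora:  {lora_params:,}")          # 131,072
print(f"ratio: {full_update_params // lora_params}x smaller")  # 128x
```

At r=16 the adapter is 128x smaller than the dense update for this one projection; summed over all targeted layers, this is where the ~24MB adapter file size comes from.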
Optimal hyperparameters (from empirical testing):
- Rank (r): 16-256; higher r captures more task-specific behavior
- Alpha: set to 2x rank (alpha=512 with r=256 for best results)
- Target modules: q_proj, k_proj, v_proj, o_proj (attention layers)
- For MoE models (Mixtral, Llama 4): target attention layers only, not MLP/expert layers
from peft import LoraConfig, get_peft_model
peft_config = LoraConfig(
lora_alpha=128,
lora_dropout=0.05,
r=64,
bias="none",
target_modules="all-linear", # or specify ["q_proj", "v_proj", "k_proj", "o_proj"]
task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, peft_config)  # base_model: an AutoModelForCausalLM loaded beforehand
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 6,738,415,616 || trainable%: 0.6228
QLoRA: Fine-Tuning on Consumer Hardware
QLoRA combines 4-bit NormalFloat quantization (NF4) of the base model with LoRA training. The 4-bit weights are dequantized to BF16 during the forward pass, maintaining training quality while dramatically reducing memory:
QLoRA with TRL SFTTrainer (complete example for Llama-3-8B on a single 24GB GPU):
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
import torch
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # double quantization saves ~0.4 bits/param
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2", # 3x speedup on Ampere+
)
peft_config = LoraConfig(
r=64,
lora_alpha=128,
lora_dropout=0.05,
bias="none",
target_modules="all-linear",
task_type="CAUSAL_LM",
)
training_args = SFTConfig(
output_dir="./output",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
bf16=True,
gradient_checkpointing=True,
packing=True, # pack short sequences for efficiency
max_seq_length=4096,
logging_steps=10,
save_steps=100,
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
peft_config=peft_config,
)
trainer.train()
trainer.save_model()
Trade-off: QLoRA reduces VRAM by ~7GB vs standard LoRA (21GB → 14GB for 7B models) at the cost of ~30% slower training. For most agent fine-tuning tasks, QLoRA is the correct default.
DPO: Alignment Without the Reinforcement Learning Complexity
Direct Preference Optimization eliminates the RLHF complexity (reward model training, PPO optimization) by directly training on preference pairs using a binary cross-entropy loss that analytically approximates the RL objective.
DPO vs RLHF comparison:
| Aspect | RLHF/PPO | DPO |
|---|---|---|
| Steps | 4 (SFT → reward model → RL → eval) | 2 (SFT → DPO) |
| Stability | Prone to reward hacking, PPO instability | Stable supervised learning |
| Compute | Requires reward model inference during training | No separate reward model |
| Data format | Human rankings → reward model | (prompt, chosen, rejected) pairs |
| Quality | Marginally better on complex tasks | Equivalent or better for most use cases |
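The per-pair loss is compact enough to state directly: -log σ(β[(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]). A minimal sketch in plain Python, with illustrative log-probability values:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)),
    where each margin is the policy-vs-reference log-prob ratio."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))

# If the policy prefers the chosen response more than the reference does,
# the loss drops below log(2); at initialization (policy == reference) it is log(2).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # below log(2) ~ 0.693
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))  # exactly log(2)
```

The beta knob here is the same one passed to DPOConfig: smaller beta lets the policy drift further from the reference before the loss pushes back.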
For agent platforms, DPO is the right choice for: improving tool use quality, reducing refusals, aligning tone/style, and improving instruction following. RLHF/PPO is still preferred for: complex safety alignment, nuanced value alignment tasks, or when you have human labelers producing a continuous feedback stream.
from trl import DPOTrainer, DPOConfig
# Dataset format: {"prompt": "...", "chosen": "...", "rejected": "..."}
training_args = DPOConfig(
beta=0.1, # KL divergence penalty (0.1-0.5, lower = more divergence allowed)
output_dir="./dpo_output",
num_train_epochs=1,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=5e-6, # lower than SFT
bf16=True,
remove_unused_columns=False,
)
dpo_trainer = DPOTrainer(
model=sft_model,
ref_model=sft_model_ref, # frozen copy of the SFT model
args=training_args,
train_dataset=preference_dataset,
tokenizer=tokenizer,
)
dpo_trainer.train()
Synthetic Data: Scaling Training Data Without Human Labelers
For teams without large labeled datasets, synthetic data generation is the practical path. The key insight: open-source "teacher" models (Mixtral, Qwen 2.5-72B) can generate training data for "student" models without violating API terms of service (unlike using OpenAI API outputs).
Cost comparison for 1M labeled examples:
- Custom fine-tuned model inference: ~$2.70
- GPT-3.5 API annotation: ~$153
- GPT-4 API annotation: ~$3,061
The workflow:
- Define 50-200 seed examples representing desired behavior
- Use a capable open-source model (Mixtral 8x7B or Qwen 2.5-72B) with Chain-of-Thought prompting to generate expanded examples
- Apply Self-Consistency (generate 5x, majority vote) to improve label quality
- Validate 5-10% of examples manually using Argilla or LabelStudio
- Fine-tune the target (smaller) model on validated synthetic data
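Step 3 (Self-Consistency) can be sketched in a few lines. `generate_label` is a hypothetical placeholder for a sampled call to the teacher model; the agreement score doubles as a confidence filter for step 4:

```python
import random
from collections import Counter

def self_consistent_label(prompt, generate_label, n=5):
    """Sample the teacher n times and keep the majority label,
    plus an agreement score usable as a confidence filter."""
    votes = [generate_label(prompt) for _ in range(n)]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / n

# Toy stand-in for a teacher-model call (real use: sample with temperature > 0)
def fake_teacher(prompt):
    return random.choice(["billing", "billing", "billing", "tech", "other"])

label, agreement = self_consistent_label("Customer asks about an invoice", fake_teacher)
# Low-agreement examples (e.g. agreement < 0.6) go to the manual validation queue.
```

In production the same pattern applies with a batched vLLM endpoint as the teacher; only the `generate_label` callable changes.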
Chain-of-Thought annotation prompt pattern:
annotation_prompt = """
You are an expert annotator. First reason step by step about the correct response,
then provide your final answer.
ALWAYS respond in this JSON format:
{{"reasoning": "step-by-step analysis...", "response": "final answer", "label": "category"}}
Input: {user_input}
JSON response:"""
This approach consistently improves accuracy from ~91% to ~94% versus direct annotation without reasoning.
Serving Infrastructure
vLLM: The Production Standard
vLLM has emerged as the dominant open-source LLM serving framework. v0.6.0 delivered substantial performance improvements:
- 2.7x higher throughput on Llama 8B
- 5x faster TPOT (time per output token) on Llama 8B
- 1.8x higher throughput and 2x lower TPOT on Llama 70B
These gains came from four architectural improvements:
- Separated API server and inference engine processes (eliminating Python GIL contention)
- Multi-step scheduling (schedule once, execute multiple consecutive steps; -28% CPU overhead)
- Asynchronous output processing (overlap with model execution; -8.7% TPOT)
- Object caching, non-blocking data transfers (+24% throughput)
vLLM achieves highest throughput on H100 GPUs vs TensorRT-LLM, SGLang, and LMDeploy on ShareGPT and decode-heavy workloads.
Key features:
- PagedAttention for efficient KV cache memory management
- Continuous batching (23x throughput vs naive static batching)
- OpenAI-compatible API (drop-in replacement)
- Multi-GPU: tensor parallelism, pipeline parallelism
- Quantization: AWQ, GPTQ, FP8, INT4
- LoRA: dynamic adapter loading/unloading
# Production vLLM deployment
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 131072 \
--enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--max-num-seqs 256 \
--api-key "your-api-key"
# With quantization for memory reduction
vllm serve Qwen/Qwen2.5-72B-Instruct \
--port 8000 \
--tensor-parallel-size 4 \
--quantization awq \
--max-model-len 32768 \
--enable-prefix-caching
Dynamic LoRA loading (serve multiple fine-tuned variants on one base model):
vllm serve meta-llama/Meta-Llama-3-8B \
--enable-lora \
--lora-modules tool-use-adapter=./adapters/tool-use \
customer-service=./adapters/customer-svc \
--max-loras 4 \
--max-lora-rank 64
SGLang: The Speed Challenger
SGLang is an increasingly capable alternative, particularly for:
- Workloads with high prefix sharing (RAG, agent prompts with repeated system context)
- DeepSeek models (7x faster MLA execution due to specialized kernels)
- Structured output (3x faster JSON decoding via compressed finite state machines)
SGLang performance claims (as of early 2026):
- 25x inference improvement on NVIDIA GB300 NVL72
- 5x faster via RadixAttention
- 3.8x prefill and 4.8x decode throughput on GB200 NVL72
- Powers 400,000+ GPUs at organizations including xAI, AMD, NVIDIA, LinkedIn
# SGLang serving
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--port 30000 \
--tp 2 \
--mem-fraction-static 0.85 \
--enable-torch-compile # optional: JIT compilation for additional speedup
When to choose SGLang over vLLM:
- You're running DeepSeek models (SGLang has specialized MLA attention kernels)
- Your workload has high KV cache reuse (agent loops with stable system prompts)
- You need structured output (JSON, regex) at high throughput
- You're on the latest NVIDIA hardware (B200/GB300 where SGLang's optimizations are most pronounced)
TGI (Text Generation Inference): Maintenance Mode
Hugging Face's TGI is now officially in maintenance mode. The team has moved contributions to vLLM and SGLang, which they recommend going forward. New deployments should not be built on TGI. Existing TGI deployments should plan migration to vLLM or SGLang.
Ollama: Local Development Only
Ollama is excellent for developer workstations and local testing — simple installation, automatic GGUF quantization, HTTP API. It is not designed for production multi-user serving with high throughput requirements. Use it for development, use vLLM/SGLang for production.
Serving Framework Decision Tree
Is this for local development/testing?
→ Yes: Ollama
Is this for production serving?
→ Are you running DeepSeek models?
→ Yes: Consider SGLang (better MLA kernels)
→ Is structured output (JSON) a major workload?
→ Yes: SGLang (3x faster)
→ Is maximum throughput on H100 the priority?
→ Yes: vLLM (highest H100 throughput benchmark)
→ Default/uncertain:
→ vLLM (most community support, best documentation)
Continuous Batching: Why It Matters
The core innovation that makes modern LLM serving tractable is continuous batching (iteration-level scheduling). Traditional static batching holds all requests in a batch until the longest one completes, wasting GPU capacity. Continuous batching inserts new requests as slots free up.
Throughput comparison:
- Naive static batching: baseline
- NVIDIA FasterTransformer (optimized static): 4x
- Basic continuous batching (TGI, Ray Serve): 8x
- vLLM with PagedAttention: 23x
For an agent platform processing many concurrent short-to-medium length tasks, continuous batching is essential — without it, a single long-context request starves all other requests.
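A toy simulation makes the starvation effect concrete. This is an illustrative model (one slot per request, one token per step), not a serving benchmark:

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: freed slots are refilled immediately, so total
    time approaches total work spread over all slots (never faster than
    the single longest request)."""
    total_tokens = sum(lengths)
    return max(-(-total_tokens // batch_size), max(lengths))  # ceil division

lengths = [10, 10, 10, 500, 10, 10, 10, 500]  # mostly short, a few long requests
print(static_batch_steps(lengths, 4))      # 1000: every batch stalls on a 500-token request
print(continuous_batch_steps(lengths, 4))  # 500: short requests no longer wait on long ones
```

The 2x gap here is modest because the workload is small; with realistic length distributions and deep queues the gap widens toward the 8-23x figures above.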
Tiered Architecture and Model Routing
The Core Principle
Not all agent tasks require the same model capability. A well-designed routing layer should:
- Route simple, high-frequency tasks to cheap self-hosted models
- Escalate complex reasoning, ambiguous instructions, and high-stakes decisions to commercial APIs
- Provide latency-based fallback when self-hosted capacity is saturated
Routing Architecture
┌─────────────────────────────────────────────────────────┐
│ API Gateway / Router │
│ │
│ ┌──────────────┐ ┌───────────────────────────────┐ │
│ │ Classifier │───►│ Route Decision │ │
│ │ (fast, 3B) │ │ │ │
│ └──────────────┘ │ Simple/Routine → Self-hosted │ │
│ │ Complex/Reason → Commercial │ │
│ │ Long-context → Scout/10M │ │
│ │ Multimodal → Scout/Maverick│ │
│ └───────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────────┐
│ Self-Hosted │ │ Commercial API │
│ Tier │ │ Tier │
│ │ │ │
│ vLLM / SGLang │ │ Claude 3.5 Sonnet │
│ Qwen2.5-72B │ │ GPT-4o │
│ Llama4 Scout │ │ DeepSeek V3 API │
│ R1-Distill-32B │ │ │
└─────────────────┘ └──────────────────────┘
Routing Strategy Implementation
LiteLLM is the practical choice for production routing — it provides a unified OpenAI-compatible interface over any mix of models and supports multiple routing strategies:
from litellm import Router
router = Router(
model_list=[
# Self-hosted tier
{
"model_name": "self-hosted-fast",
"litellm_params": {
"model": "openai/qwen-2.5-32b",
"api_base": "http://localhost:8000/v1",
"api_key": "local-key",
},
"model_info": {"id": "qwen32b-local", "cost": 0.0001},
},
{
"model_name": "self-hosted-smart",
"litellm_params": {
"model": "openai/qwen-2.5-72b",
"api_base": "http://localhost:8001/v1",
"api_key": "local-key",
},
"model_info": {"id": "qwen72b-local", "cost": 0.0003},
},
# Commercial fallback
{
"model_name": "commercial-fallback",
"litellm_params": {
"model": "claude-sonnet-4-6",
"api_key": "sk-ant-...",
},
},
],
routing_strategy="cost-based-routing",
fallbacks=[{"self-hosted-fast": ["self-hosted-smart", "commercial-fallback"]}],
context_window_fallbacks=[{"self-hosted-smart": ["commercial-fallback"]}],
timeout=30,
num_retries=2,
)
Complexity Classification
The routing decision requires a fast classifier. Options from simplest to most sophisticated:
Rule-based (zero latency):
def classify_complexity(request: str, max_tokens: int, tools: list) -> str:
if len(request) < 200 and not tools and max_tokens < 500:
return "fast"
if len(tools) > 3 or "reason step by step" in request.lower():
return "smart"
if len(request) > 8000:
return "long-context"
return "standard"
Embedding-based classifier (5-10ms latency, best accuracy):
- Train a small classifier on labeled examples of "easy" vs "hard" agent tasks
- Embed the request using a 384-dim model (e.g., all-MiniLM-L6-v2)
- Linear classifier adds negligible overhead
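A nearest-centroid sketch of the idea, in pure Python. The 4-dim vectors below are toy stand-ins; in production they would be 384-dim embeddings from a model such as all-MiniLM-L6-v2, and a trained linear classifier would replace the centroid comparison:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Toy "embeddings" of labeled requests (illustrative values only)
easy_examples = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.1, 0.0]]
hard_examples = [[0.1, 0.9, 0.8, 0.2], [0.0, 0.8, 0.9, 0.3]]
centroids = {"fast": centroid(easy_examples), "smart": centroid(hard_examples)}

def route(embedding):
    """Route to the tier whose labeled examples the request most resembles."""
    return max(centroids, key=lambda tier: cosine(embedding, centroids[tier]))

print(route([0.85, 0.15, 0.05, 0.0]))  # "fast"
```

Embedding plus a dot product against a handful of centroids costs well under a millisecond on CPU once the embedding itself is computed.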
LLM-based meta-routing (50-100ms, highest quality):
- Use a fast 3B model (Qwen 2.5-3B) as the router
- Route its output to the appropriate tier
- Cost: ~$0.00001 per routing decision
Fallback and Circuit Breaker Patterns
Production routing must handle self-hosted capacity exhaustion gracefully:
# LiteLLM configuration for automatic fallback
router_config = {
"fallbacks": [
{"qwen-72b-local": ["claude-sonnet-4-6"]}, # capacity fallback
{"claude-sonnet-4-6": ["gpt-4o"]}, # provider fallback
],
"context_window_fallbacks": [
{"qwen-72b-local": ["llama4-scout-local"]}, # context length upgrade
],
"cooldown_time": 60, # seconds before retrying failed deployment
"num_retries": 3,
"retry_after": 0.5, # exponential backoff
"allowed_fails": 3, # failures before cooldown
}
Cost Analysis
Self-Hosted vs API: The Break-Even Calculation
The economic case for self-hosting depends on three variables: token volume, GPU utilization, and GPU rental/ownership cost.
Reference GPU costs (cloud spot/on-demand, approximate 2026 rates):
| GPU | VRAM | Models | Approx $/hr (spot) | Approx $/hr (on-demand) |
|---|---|---|---|---|
| RTX 4090 | 24GB | 7-13B quantized | $0.40-0.80 | $0.80-1.20 |
| L40S | 48GB | Up to 70B quantized | $1.20-2.00 | $2.50-3.50 |
| A100 SXM (80GB) | 80GB | Up to 70B full | $2.00-3.00 | $3.50-5.00 |
| H100 SXM (80GB) | 80GB | Up to 70B full, fastest | $3.00-5.00 | $6.00-10.00 |
Token throughput estimates (rough guidance, varies by model/batch size):
| GPU | Model | Estimated tokens/sec (output) |
|---|---|---|
| 1x L40S | Qwen 2.5-32B (AWQ) | 60-90 t/s |
| 2x L40S | Qwen 2.5-72B (AWQ) | 40-60 t/s |
| 4x A100 | Llama 4 Scout (INT4) | 80-120 t/s |
| 2x H100 | Qwen 2.5-72B (BF16) | 120-180 t/s |
Break-even analysis (Qwen 2.5-72B vs Claude Sonnet API at $3/M output tokens), at 2x L40S at $3.60/hr combined:
- A single request stream at 50 t/s produces 180K tokens/hr, i.e. $20/M output tokens, which is more expensive than the API; the economics depend entirely on batching
- With continuous batching, aggregate throughput of several hundred tokens/second across concurrent requests is realistic; at 750 t/s aggregate the cost is $3.60 ÷ 2.7M tokens/hr ≈ $1.33/M output tokens
- At high utilization, self-hosting at this tier is therefore roughly 2-3x cheaper per output token; smaller models, larger concurrency, and cheap spot capacity push the gap toward the 5-20x range cited earlier
The caveat: utilization. If the GPU runs at 10% utilization (typical for bursty agent workloads with no queuing), the effective cost per token multiplies by 10, eroding the advantage. Solutions:
- Use spot instances and scale to zero when idle
- Queue requests to maintain high utilization during serving windows
- Share GPU resources across multiple models using vLLM's multi-model serving
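The same cost arithmetic with utilization and aggregate (batched) throughput as explicit inputs. A sketch; the 750 t/s figure is an illustrative assumption for batched serving, not a benchmark:

```python
def self_hosted_cost_per_m_tokens(gpu_cost_per_hr, aggregate_tps, utilization):
    """Effective $ per million output tokens for a self-hosted endpoint.
    aggregate_tps: batched throughput summed across concurrent requests.
    utilization: fraction of GPU-hours actually spent serving (0-1)."""
    tokens_per_hr = aggregate_tps * 3600 * utilization
    return gpu_cost_per_hr / tokens_per_hr * 1_000_000

# 2x L40S at $3.60/hr, 750 t/s aggregate (illustrative)
print(self_hosted_cost_per_m_tokens(3.60, 750, 1.0))  # ~$1.33/M at full utilization
print(self_hosted_cost_per_m_tokens(3.60, 750, 0.2))  # ~$6.67/M at 20% utilization
# vs ~$3.00/M for Claude Sonnet output; utilization decides the winner.
```

Running this with your own measured aggregate throughput and traffic profile is the single most useful pre-commitment exercise before buying or renting GPUs.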
When API wins:
- < ~10M tokens/month output: API operational simplicity outweighs cost
- Unpredictable bursty traffic with no queue buffering
- Tasks requiring absolute frontier model capability (complex coding, nuanced reasoning)
When self-hosting wins:
- > 50M tokens/month output at any price sensitivity
- Data sovereignty requirements
- Need for custom fine-tuned behavior
- Stable, predictable traffic patterns
Mixture of Experts Economics
MoE models like Llama 4 Scout/Maverick and DeepSeek V3 offer a fundamentally different cost profile: large total parameters (hence large knowledge) with small activated parameters (hence fast, cheap inference).
Llama 4 Maverick: 400B total parameters, 17B activated. This means the inference compute cost is similar to a 17B dense model while having knowledge capacity closer to a 400B model. For throughput-sensitive applications, this is the right architectural choice.
Hardware Decisions and Southeast Asia Considerations
GPU Selection for AI Agent Platforms
A100 (80GB SXM): The established workhorse. Well-supported by all frameworks, good performance, widely available. Preferred for training workloads (higher memory bandwidth for gradient accumulation). NVLink for multi-GPU tensor parallelism.
H100 (80GB SXM): 2-3x faster than A100 for inference due to FP8 support and faster memory bandwidth (3.35 TB/s vs 2.0 TB/s). Higher cost, but the throughput-per-dollar is superior for continuous serving at scale. FP8 quantization on H100 achieves near-BF16 quality with 2x the throughput.
L40S (48GB): The practical sweet spot for most teams. Lower cost than A100/H100, 48GB VRAM handles 70B models quantized to 4-bit or 32B models in full precision. Ada Lovelace architecture, GDDR6X (not HBM — lower bandwidth than A100, but sufficient for inference with good batching).
RTX 4090 (24GB): Cost-effective for 7-13B models. Suitable for development, cost-sensitive edge cases, or high-density distributed inference. Consumer PCIe card — thermal and power management require attention in datacenter environments.
Southeast Asia Cloud Options
For teams based in Southeast Asia, latency to Singapore or Jakarta matters. Options:
AWS Singapore (ap-southeast-1): p4d instances (8x A100 40GB, ~$32/hr on-demand), g5 instances (A10G 24GB, ~$1-4/hr). Spot availability is variable. Best for workloads that need AWS ecosystem integration.
GCP asia-southeast1 (Singapore): A100 via A2 GPU VMs. T4 instances are cheap for small-model inference at scale. Good spot instance availability.
Vast.ai / RunPod: GPU spot markets that aggregate consumer and datacenter GPU capacity globally. Much cheaper than major clouds for GPU-heavy workloads (L40S at $1.50-2.00/hr vs $3.50+ on AWS). Lower reliability — use for batch jobs or with automatic fallback to on-demand.
On-premises in Singapore/Bangkok/KL: Capital-intensive but lowest ongoing cost at sustained utilization. 4x A100s at ~$50K-80K capital expense, amortized over 3 years at 70% utilization, work out to roughly $0.80-1.50/hr per GPU equivalent (including power and colocation), cheaper than cloud at sustained load. Worth considering once token volume exceeds 500M/month.
Recommended Starting Configuration
For a team starting self-hosted LLM serving in Southeast Asia:
Phase 1 (MVP, $1,500-3,000/month):
- 2x L40S (48GB) via RunPod spot or Lambda Labs
- vLLM serving Qwen 2.5-32B (fits in 2x L40S with AWQ, ~60-80 t/s)
- Commercial API fallback (Claude/GPT-4o) for complex tasks and overflow
- Expected coverage: 70-80% of agent requests on self-hosted
Phase 2 (Scale, $5,000-10,000/month):
- 4x A100 (80GB) or 2x H100
- Llama 4 Scout (single-server INT4) + Qwen 2.5-72B
- Dedicated fine-tuning pipeline (separate from serving cluster)
- Redis-backed LiteLLM router with usage tracking
Phase 3 (Production, $15,000+/month):
- H100 cluster (4-8 nodes) or reserved capacity
- Multiple serving endpoints per model tier
- On-premises consideration if in Singapore/Malaysia region
Mixture of Experts in Practice
Understanding MoE for Agent Platforms
Mixture of Experts is now the dominant architecture for frontier open-source models (Llama 4 Scout/Maverick, DeepSeek V3/R1, Qwen 3 MoE variants in development). Understanding the practical implications:
How it works: Each MoE layer contains N expert networks (FFN blocks). A learned router assigns each token to the top-K experts (typically K=2 or K=4). Non-selected experts perform no computation, giving the "37B active out of 671B total" numbers in DeepSeek V3.
Memory vs compute trade-off: All expert weights must fit in VRAM (or suffer expensive disk swap), but only K/N experts run per token. Practical implication: you need more VRAM than a dense model of the same quality, but get faster inference.
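A minimal sketch of top-K routing in pure Python. A real router is a learned linear layer over the token's hidden state; here the logits are illustrative values:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate
    weights; all other experts perform no computation for this token."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# 8 experts, one token's router logits (illustrative)
print(route_token([0.1, 2.3, -1.0, 0.5, 1.9, -0.2, 0.0, 0.7], k=2))
# Only experts 1 and 4 run; their gate weights sum to 1.
```

The expert outputs are then combined using these gate weights, which is why per-token FLOPs scale with K experts rather than with all N.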
Llama 4 specific behavior:
- Scout: every layer is MoE (16 experts, top-1 routing)
- Maverick: alternating MoE and dense layers (128 experts in MoE layers, top-1 routing)
- iRoPE architecture: NoPE layers (no positional encoding) enable extreme context lengths
Fine-tuning MoE models: Target attention layers only, not the expert (MLP) layers. Expert layers in MoE models don't benefit from LoRA in the same way as attention — the routing mechanism makes gradient flow through sparse paths less stable. In practice:
# For Llama 4 / Mixtral MoE fine-tuning
peft_config = LoraConfig(
r=32,
lora_alpha=64,
# Explicitly target attention only, not gate/expert MLPs
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
task_type="CAUSAL_LM",
)
Serving MoE models: Expert parallelism (distributing different experts across GPUs) is an option in both vLLM and SGLang. For DeepSeek R1 on 8 GPUs, expert parallelism + tensor parallelism can be combined for optimal throughput.
Production Deployment Patterns
Multi-Tenant LoRA Serving
One powerful pattern for agent platforms: serve a single base model with multiple LoRA adapters, each adapted for a different use case or customer:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--enable-lora \
--lora-modules \
agent-tools=./adapters/tool-use \
customer-support=./adapters/support \
code-review=./adapters/code \
data-analyst=./adapters/analysis \
--max-loras 8 \
--max-lora-rank 64 \
--max-cpu-loras 16 # cache adapters in CPU memory for fast swapping
Adapter loading takes 1.44-3.46 seconds; fused inference is 30% faster than unfused. For latency-sensitive serving, pre-warm the base model and load adapters at startup.
Prefix Caching for Agent Workloads
Agent systems often share large system prompts across requests (tool definitions, persona instructions, context). Prefix caching reuses the KV cache for identical prefixes:
vllm serve Qwen/Qwen2.5-72B-Instruct \
--enable-prefix-caching \
--block-size 16
For agent platforms where system prompts can be 2,000-10,000 tokens, prefix caching can reduce TTFT (time to first token) by 40-70% and reduce effective compute by 20-30%.
Speculative Decoding
Speculative decoding uses a small "draft" model to predict K tokens ahead, then verifies with the main model in a single forward pass. When the draft model is accurate, this gives near-linear speedup at no quality cost:
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--speculative-model meta-llama/Llama-3.2-1B \
--num-speculative-tokens 5 \
--speculative-draft-tensor-parallel-size 1
Typical gains: 1.5-2.5x latency reduction on typical agent task distributions. Most effective when output tokens are predictable (e.g., structured JSON, code generation).
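The expected gain follows from the draft acceptance rate. Under the simplifying assumption of an i.i.d. per-token acceptance probability α (as in the standard speculative sampling analysis), K draft tokens yield (1 − α^(K+1)) / (1 − α) accepted tokens per main-model forward pass:

```python
def expected_tokens_per_step(alpha, k):
    """Expected accepted tokens per main-model forward pass with k draft
    tokens and i.i.d. per-token acceptance probability alpha (0-1)."""
    if alpha == 1.0:
        return k + 1  # limit of the geometric series
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Without speculation, each forward pass yields exactly 1 token.
for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, 5):.2f} tokens/step")
# alpha=0.8 with 5 draft tokens: ~3.69 tokens/step, i.e. up to ~3.7x fewer
# main-model passes (real speedup is lower due to draft-model overhead).
```

This is why speculative decoding shines on structured JSON and code, where a 1B draft model can keep α high, and helps little on open-ended creative text.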
Monitoring and Observability
Production LLM serving requires observability at the model level, not just infrastructure:
# LiteLLM with callback-based logging
import litellm
litellm.success_callback = ["langfuse"] # or "prometheus", "datadog"
litellm.failure_callback = ["langfuse", "slack"]
# Track per-model costs and latency
response = await litellm.acompletion(
model="qwen-72b-local",
messages=messages,
metadata={
"user_id": user_id,
"task_type": task_type,
"session_id": session_id,
}
)
Key metrics to track: tokens per second (output), time to first token (TTFT), request queue depth, GPU memory utilization, cache hit rate (for prefix caching), cost per request by model tier, and fallback rate (self-hosted → commercial API).
Getting Started: Recommended Sequence
For a team starting from zero today:
Week 1: Infrastructure setup
- Deploy vLLM with Qwen 2.5-32B on 2x L40S (RunPod or Lambda Labs)
- Wrap with LiteLLM router; configure Claude Sonnet as fallback
- Validate OpenAI-compatible API works with existing agent code
Week 2: Routing
- Instrument requests to collect complexity signals (length, tool count, reasoning steps)
- Implement rule-based routing (covers 70% of cases with zero latency)
- A/B test: route 10% to commercial API, verify quality parity for simple tasks
Week 3-4: Fine-tuning pipeline
- Collect 500-2,000 examples of agent tasks with desired behavior
- Run QLoRA fine-tuning on Qwen 2.5-7B (fits on a single L40S, 2-4 hour run)
- Evaluate fine-tuned model vs base model on held-out task distribution
- Deploy via vLLM LoRA serving if quality improvement is validated
Month 2: Scale
- Expand to Qwen 2.5-72B for higher-quality self-hosted tier
- Build synthetic data pipeline for ongoing fine-tuning
- Implement DPO using collected preference data from production (chosen: accepted responses, rejected: flagged responses)
Sources: Meta AI Llama 4 release (March 2026), Qwen 2.5 HuggingFace model card, DeepSeek R1 technical report (arXiv 2501.12948), vLLM v0.6.0 performance update, SGLang GitHub documentation, HuggingFace quantization overview, TRL DPO trainer documentation, LiteLLM routing documentation, Mixtral 8x7B release post, HuggingFace synthetic data blog, HuggingFace TGI repository (maintenance mode notice)

