Edge AI & On-Device LLMs 2026

Research Date: 2026-01-10

Executive Summary

Over 42% of developers now run LLMs locally for privacy, cost, and latency benefits. NPUs delivering 80+ TOPS are becoming standard in flagship phones. The industry is converging on a hybrid local+cloud architecture.

Key Frameworks

| Framework | Best For | Strength |
|---|---|---|
| llama.cpp | Max performance | 65K stars, NVIDIA 35% speed boost |
| MLX | Apple Silicon | Unified memory, 12x vs CPU |
| MLC LLM | Cross-platform | OpenAI-compatible API |
| TensorRT Edge-LLM | Automotive/robotics | Real-time on Jetson |
| ExecuTorch | Mobile | Powers Meta apps |
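
For a feel of how lightweight the developer experience is, here is a minimal sketch using the llama-cpp-python bindings; the GGUF path is a placeholder, not a specific release:

```python
# Minimal local inference sketch via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen3-0.6b-q4_k_m.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU/Metal if available
)

out = llm(
    "Summarize the benefits of on-device inference in one sentence.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```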

Popular On-Device Models

| Model | Size | Best Use |
|---|---|---|
| Qwen3-0.6B | 0.6B | Mobile, 40 tok/s |
| Llama 3.2 | 1B/3B | iOS/Android |
| Phi-4 | 14B | Reasoning |
| Gemma 3n | - | Everyday devices |
| Ministral-3 | 3.4B | Edge, ~8GB VRAM |
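
Running one of these on Apple Silicon is similarly short; a sketch with mlx-lm, where the mlx-community repo id is an assumption (substitute any 4-bit conversion):

```python
# Sketch: small-model inference on Apple Silicon with mlx-lm (pip install mlx-lm).
from mlx_lm import load, generate

# Assumed Hugging Face repo id; any mlx-community 4-bit conversion works the same way.
model, tokenizer = load("mlx-community/Qwen3-0.6B-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Translate 'hello' to French.",
    max_tokens=64,
)
print(text)
```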

Hardware Requirements

| Model Size | Min VRAM | Quantization |
|---|---|---|
| 7B | 8GB | 4-bit |
| 13B | 16GB | 4-8 bit |
| 70B | 40GB+ | 4-bit |

New for 2026: NVIDIA NVFP4 (60% memory savings, 3x speed) and NVFP8 (40% savings, 2x speed)
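
The VRAM minimums above follow from parameters × bits per weight plus runtime overhead; a back-of-envelope sketch (the 20% overhead factor is an assumption):

```python
# Back-of-envelope VRAM estimate for a quantized model.
# The overhead multiplier (KV cache, activations, runtime buffers) is a rough assumption.
def estimate_vram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for params, bits in [(7, 4), (13, 4), (13, 8), (70, 4)]:
    print(f"{params}B @ {bits}-bit ≈ {estimate_vram_gb(params, bits):.1f} GB")
# 7B @ 4-bit ≈ 4.2 GB, 13B @ 8-bit ≈ 15.6 GB, 70B @ 4-bit ≈ 42.0 GB —
# consistent with the 8GB / 16GB / 40GB+ minimums in the table above.
```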

Performance Benchmarks

| Platform | Model | Speed |
|---|---|---|
| iPhone 15 Pro | Qwen3-0.6B | ~40 tok/s |
| RTX 4060 Ti | 7B 4-bit | 20-40 tok/s |
| Apple M-series | Various | 40-80 tok/s |
| Raspberry Pi 5 | 1B | 5-15 tok/s |

Time to first token (TTFT): Cactus achieves sub-50ms
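
Both TTFT and throughput are easy to measure yourself; a sketch using llama-cpp-python's streaming mode (the model path is a placeholder, and the chunk count is used as an approximate token count):

```python
# Measure time-to-first-token (TTFT) and decode throughput via streaming output.
import time
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.2-1b-q4_k_m.gguf", n_ctx=2048)  # placeholder path

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for chunk in llm("Explain edge AI in two sentences.", max_tokens=128, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_tokens += 1  # each streamed chunk is roughly one token

elapsed = time.perf_counter() - start
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"Throughput: {n_tokens / elapsed:.1f} tok/s")
```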

Mobile Chips 2026

| Chip | NPU | Improvement |
|---|---|---|
| Snapdragon 8 Gen 5 | 80 TOPS | +46% AI perf |
| Dimensity 9500 | NPU 990 | 2x perf, -56% power |
| Snapdragon X2 Plus | 80 TOPS | PC/laptop |

Key Players

Apple Intelligence

  • Siri 3.0 delayed to Fall 2026
  • $1B/year Google Gemini deal
  • Architecture: On-device → Private Cloud Compute → Gemini

Google

  • Gemma 3n for everyday devices
  • Coral NPU for wearables
  • LiteRT with MediaTek: 12x vs CPU

NVIDIA

  • TensorRT Edge-LLM for automotive
  • JetPack 7.1 for Jetson T4000
  • 35% faster llama.cpp, 3x ComfyUI

Use Cases

  1. Privacy: Healthcare, finance, biometrics
  2. Offline: Remote monitoring, translation
  3. Automotive: In-car AI (Bosch, ThunderSoft)
  4. Coding: Local copilots (Pieces, Private LLM)
  5. Cost: 4.3x cheaper than cloud at scale (break-even sketch below)
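
The cost ratio depends on volume and hardware amortization; a sketch of the break-even arithmetic, where every price and volume below is an illustrative assumption, not a quote:

```python
# Illustrative local-vs-cloud cost comparison; every number here is an assumption.
cloud_price_per_1m_tokens = 0.50   # $/1M tokens, hypothetical API pricing
hardware_cost = 1800.0             # one-time GPU workstation, amortized over the period
power_cost_per_1m_tokens = 0.02    # electricity, hypothetical
tokens_per_month = 500e6           # 500M tokens/month workload

months = 24
cloud_total = cloud_price_per_1m_tokens * tokens_per_month / 1e6 * months
local_total = hardware_cost + power_cost_per_1m_tokens * tokens_per_month / 1e6 * months
print(f"Cloud: ${cloud_total:,.0f}  Local: ${local_total:,.0f}  "
      f"Ratio: {cloud_total / local_total:.1f}x")
# The ratio grows with volume, since the hardware cost is fixed.
```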

Recommendations

Framework Choice:

  • Apple → MLX
  • Cross-platform → MLC LLM
  • NVIDIA → llama.cpp
  • Mobile → ExecuTorch or Cactus

Model Choice:

  • Mobile: 1-3B (Qwen3-0.6B, Llama 3.2 1B)
  • Desktop 8GB: 7B 4-bit
  • Desktop 16GB+: 13B quantized (70B needs 40GB+)

Key Insight

NPU ubiquity (80+ TOPS) + automated quantization tools (HAQA) = on-device AI becoming default for privacy-sensitive applications. Industry converging on Apple's hybrid model: local for common queries, cloud for complex reasoning.
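
A hybrid router can be a simple complexity gate in front of two backends; a minimal sketch, where the heuristic and both handler stubs are illustrative assumptions, not any vendor's actual API:

```python
# Minimal hybrid local/cloud routing sketch.
def looks_complex(prompt: str) -> bool:
    # Crude proxy: long prompts or multi-step requests escalate to cloud.
    return len(prompt.split()) > 200 or "step by step" in prompt.lower()

def answer_locally(prompt: str) -> str:
    return f"[on-device model] {prompt[:40]}..."   # stub for a local LLM call

def answer_in_cloud(prompt: str) -> str:
    return f"[cloud model] {prompt[:40]}..."       # stub for a hosted API call

def route(prompt: str) -> str:
    # Local-first: keep common queries on device, escalate complex reasoning.
    return answer_in_cloud(prompt) if looks_complex(prompt) else answer_locally(prompt)

print(route("What's on my calendar today?"))
print(route("Walk me through this contract step by step and flag risky clauses."))
```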


New topic - not previously covered in KB