Cross-Instance State Aggregation for Autonomous Agent Fleets: Data-Plane Patterns for Real-Time Multi-Agent Dashboards

Executive Summary

When you operate a fleet of autonomous AI agents, each maintaining its own local state and serving its own dashboard and read API, you need an aggregation layer that pulls together live status from N agents and presents it as a coherent view. This document covers the data plane for that aggregation: how state flows from agents to the hub, not which metrics to display.

The core design decisions are:

Transport model — periodic pull vs. streaming push, and the latency/connection-count trade-offs between them
Uniform endpoint model — treating every agent as a network endpoint rather than special-casing co-located instances
Scoped authentication — short-lived read tokens exchanged from long-lived keys, so the observation plane carries least-privilege credentials
Staleness as first-class state — distinguishing "agent idle", "agent unreachable", and "aggregator partitioned" at the data model level
Topology choice — embedded hub vs. standalone aggregator service (a packaging question, not an architectural one)

Real systems — Prometheus federation, Kubernetes list-watch/informers, OpenTelemetry Collector fan-in, Consul/Serf gossip — have each solved pieces of this puzzle. The patterns are well-established; the novelty for AI-agent fleets is that the state schema differs (context-window %, token cost, active tool, runtime heterogeneity) and the fleet sizes are currently small enough that simpler designs dominate.

The Core Problem

A hub instance needs to present the live state of N agents. Each agent exposes a read API at some network address. The hub must:

Collect current state from all agents
Detect agents that are down, unreachable, or idle
Serve a dashboard that reflects reality as closely as tolerable staleness allows
Remain useful even when individual agents or the hub itself are temporarily unavailable

The naive solution — polling each agent on a fixed interval — works fine at small N. The interesting design decisions appear as N grows, as agents run across separate machines, or as you want sub-second freshness.

Pull vs Push: Transport Trade-offs

The Pull Model

The canonical pull model is Prometheus's scrape architecture. Each target exposes a /metrics endpoint; the Prometheus server scrapes each target at a configured interval (the built-in default is 1m; deployments commonly set it to 5–30s). For fleet aggregation, a global Prometheus can federate from per-cluster leaf instances by scraping their /federate endpoints, selecting only the high-level time series it needs.

Pull's key properties:

Aggregator controls the schedule. If an agent is slow or overwhelmed, the aggregator can back off without the agent needing to implement backpressure.
No persistent connection. Each scrape is a short HTTP GET. The aggregator can scrape N agents with N short-lived connections, then idle. Memory and connection-table footprint on the aggregator is predictable.
Dead-simple agent side. Each agent only needs to serve one endpoint. No keep-alive socket management, no reconnect logic.
Latency floor = interval. You cannot know an agent's state more freshly than the last scrape. With a 15s interval, worst-case staleness is roughly 15s plus the duration of the scrape itself.
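A minimal sketch of one pull round, assuming an injectable fetch coroutine (hypothetical; in practice it would be an HTTP GET against each agent's state endpoint) and a per-agent timeout so that one slow agent cannot delay the whole round. The function and field names are illustrative, not part of any existing API.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Optional

# Hypothetical signature: fetch(base_url) returns the agent's state as a dict.
Fetch = Callable[[str], Awaitable[dict]]

async def poll_once(agents: Dict[str, str], fetch: Fetch,
                    timeout_s: float = 2.0) -> Dict[str, dict]:
    """Scrape every agent concurrently; a timeout or error yields ok=False."""
    async def scrape(name: str, url: str) -> tuple:
        started = time.monotonic()
        state: Optional[dict] = None
        try:
            state = await asyncio.wait_for(fetch(url), timeout=timeout_s)
        except Exception:
            pass  # timeouts and transport errors both count as a failed scrape
        return name, {"ok": state is not None, "state": state,
                      "duration_s": time.monotonic() - started}

    results = await asyncio.gather(*(scrape(n, u) for n, u in agents.items()))
    return dict(results)
```

Because every scrape runs under its own timeout, the round takes about as long as the slowest permitted scrape rather than the sum of all scrapes.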

Pull becomes strained when:

N is large and scrape latency varies (scrape timeouts accumulate)
You want sub-second freshness (very short intervals multiply request load on lightweight agents and on the aggregator)
Agents run behind NAT or a firewall that blocks inbound connections to the agent

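Prometheus's Pushgateway is one way around the NAT/firewall case: agents push to an intermediary that the Prometheus server can reach. Note that the Prometheus documentation positions the Pushgateway mainly for short-lived batch jobs, and that it keeps the last pushed value indefinitely, so a dead agent is not distinguishable from a quiet one without an extra timestamp. Weaveworks and Zapier have both published aggregating push gateway variants that merge pushes from multiple sources before Prometheus scrapes them.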

The Push / Streaming Model

Push models — SSE (Server-Sent Events) or WebSocket — invert the flow: each agent maintains a persistent connection to the aggregator and streams state changes as they happen.

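SSE fan-in is the simpler choice for unidirectional state streams (agent → hub). SSE runs over plain HTTP/1.1, passes through most CDN and reverse-proxy layers (Fastly, Cloudflare, nginx all handle it, although proxies that buffer responses may need buffering disabled for the stream), and has automatic reconnect with Last-Event-ID semantics defined in the EventSource specification. One caveat: SSE events flow from the HTTP server to the client that opened the connection, so for agent → hub streaming over SSE the hub subscribes to a stream endpoint on each agent; where agents must dial out (for example from behind NAT), a WebSocket or a long-lived streaming POST is the agent-initiated equivalent. Timeplus benchmarks show SSE and WebSocket have roughly equivalent throughput and latency (3ms difference) for typical dashboard payloads; the framing overhead is small compared to JSON serialization and network RTT.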

WebSocket fan-in is appropriate when agents and the hub need bidirectional communication, but for a read-only observation plane this is unnecessary complexity. A WebSocket connection stays pinned to the aggregator instance that accepted it, and any load balancer in the path must support the HTTP Upgrade handshake; as with SSE, a reconnect can land on any instance only if state is backed by a shared store.

Backpressure is the primary operational risk in push models. If the aggregator is slow to consume (CPU spike, GC pause, downstream write stall), a fast-pushing agent will exhaust its send buffer. Proper mitigation is:

Bounded per-agent write buffers on the aggregator side
A drop-newest or drop-oldest policy when the buffer fills
The agent detects a slow consumer via socket write errors and falls back to a reduced push rate
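The bounded-buffer part of this mitigation can be sketched as follows. This is an illustrative drop-oldest variant built on collections.deque; the class name and the dropped counter are assumptions added for the example.

```python
from collections import deque

class BoundedAgentBuffer:
    """Per-agent buffer that drops the oldest update when full."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen discards from the left once it is full.
        self._items = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, update: dict) -> None:
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # count discards so the dashboard can surface them
        self._items.append(update)

    def drain(self) -> list:
        """Return buffered updates in arrival order and empty the buffer."""
        items = list(self._items)
        self._items.clear()
        return items
```

Drop-oldest suits full-snapshot updates, where only the newest state matters; a drop-newest policy would instead reject the incoming update when the buffer is full.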

The FastAPI/asyncio WebSocket community has documented this clearly: "If you design for fan-out from day one — through sharding, hierarchy, backpressure, and broadcast control — you gain predictability."

Hybrid: Push with Poll Fallback

The most resilient design is push-primary with polling as degraded mode:

Each agent tries to establish a persistent SSE connection to the aggregator on startup
If the aggregator is unreachable, the agent buffers locally (bounded) and retries with exponential backoff
The aggregator also runs a background poller for agents that never established a push channel (e.g., older agent version, firewall issue)
The dashboard shows which transport each agent is using as a secondary status signal

This mirrors what the Kubernetes informer mechanism does: it starts with a list (full snapshot via poll) and then transitions to watch (persistent streaming). If the watch stream breaks, it falls back to re-list. The resourceVersion field prevents missing events during the gap. A fleet aggregator can implement the same pattern using a monotonic sequenceId per agent.
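A minimal sketch of that snapshot-then-stream pattern on the aggregator side, assuming each agent numbers its updates with a monotonically increasing integer (the sequenceId mentioned above). The class and method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AgentView:
    """Aggregator-side view of one agent, kept in sync via snapshot plus deltas."""
    state: dict = field(default_factory=dict)
    last_seq: int = -1  # -1 means no snapshot has been loaded yet

    def load_snapshot(self, seq: int, state: dict) -> None:
        # Equivalent of the "list" step: replace everything.
        self.state = dict(state)
        self.last_seq = seq

    def apply_delta(self, seq: int, changes: dict) -> bool:
        """Apply a streamed change. Returns False when a resync is needed."""
        if seq <= self.last_seq:
            return True  # duplicate or replayed event: ignore safely
        if self.last_seq < 0 or seq != self.last_seq + 1:
            return False  # gap detected: caller should fetch a fresh snapshot
        self.state.update(changes)
        self.last_seq = seq
        return True
```

When apply_delta returns False, the caller falls back to the poll path, loads a fresh snapshot, and resumes streaming from that sequence number.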

Pull vs Push Trade-off Table

Property	Pull (periodic poll)	Push (SSE/WS stream)
Freshness	Bounded by interval	Near real-time
Agent complexity	Minimal (serve one endpoint)	Must manage connection lifecycle
Aggregator connections	Short-lived, bursty	Persistent, N concurrent
Firewall/NAT	Aggregator must reach agent	Agent initiates outbound
Backpressure	Aggregator controls naturally	Explicit buffer management needed
Reconnect handling	Aggregator retries on failure	Agent reconnects with backoff
Best for	Small fleet, low-churn state	Large fleet, high-churn state

Uniform Endpoint Model

A critical simplification: treat every agent — including co-located ones on localhost — as a network endpoint with the same shape: {name, base_url, scoped_read_token}.

The temptation to special-case local agents (reading state from shared memory, a local file, or an in-process call) creates a two-code-path problem. Every local shortcut needs its own retry logic, error handling, and auth bypass. When co-located agents later move to separate machines (or when you add a remote agent), the aggregator needs surgery.

Uniformity gives you:

One code path for all agents, simplifying testing and debugging
Isolation: even a co-located agent that crashes cannot corrupt the aggregator's state because they communicate over a network boundary (even if it's loopback)
Portability: the same aggregator binary works for all-local, all-remote, or mixed deployments

The base URL for a local agent is simply http://127.0.0.1:<port>. The overhead of an HTTP call to loopback is negligible compared to the simplicity gain.
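As an illustration, the endpoint shape can be modeled as a single record type. The ports, hostnames, and the Bearer header scheme below are assumptions made for the example; the point is only that local and remote agents go through the same code path.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentEndpoint:
    name: str
    base_url: str
    scoped_read_token: str

    def request_for(self, path: str) -> tuple:
        """Return (url, headers) for a read request; identical for local and remote agents."""
        url = self.base_url.rstrip("/") + "/" + path.lstrip("/")
        # Assumption: the token travels as a Bearer credential.
        headers = {"Authorization": f"Bearer {self.scoped_read_token}"}
        return url, headers

# Hypothetical fleet: a loopback agent and a remote one share one code path.
fleet = [
    AgentEndpoint("local-1", "http://127.0.0.1:8701", "tok-local"),
    AgentEndpoint("remote-1", "https://agent.example.internal:8701/", "tok-remote"),
]
```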

The Kubernetes kubelet follows the same principle: even in single-node (kind/minikube) clusters, the kubelet talks to the API server over the same HTTPS endpoint that remote kubelets use. No special-casing.

Scoped Data-Plane Auth

For an observation-only plane, you want the aggregator to hold credentials that can only read agent state, not issue commands or modify configuration. This is the read/admin separation that AWS IAM enforces structurally: read-only policies cannot be accidentally escalated to write.

Two-Layer Token Architecture

Borrowing from AWS STS's AssumeRole pattern:

Layer 1: Long-lived API key — stored only in the aggregator's config/secrets store. Never transmitted in individual scrape requests. Used once per session to exchange for a short-lived token.
Layer 2: Short-lived scoped session token — exchanged from the API key at session start (or on TTL expiry). Scoped to read operations only. Transmitted in every HTTP request header.

The dual-layer design matters most in plaintext or no-HTTPS environments (e.g., a local development cluster on a trusted LAN, or agents communicating over a VPN without per-service TLS). If the session token is sniffed in transit, the attacker gets a short-lived credential that can only read dashboard state, not reconfigure or control agents. The long-lived key that can mint new tokens crosses the wire only during a token exchange, once per TTL instead of once per scrape, which shrinks the exposure window without eliminating it; where possible, the exchange call itself should be protected (for example by running only that call over TLS).

Session token TTLs of 15–60 minutes are in line with AWS STS AssumeRole (15-minute minimum, one-hour default) and span many scrape cycles, so the cost of the exchange is negligible. Token refresh should happen proactively (refresh at 80% of TTL), not reactively (waiting for a 401 to trigger refresh), to avoid a stall on the critical path.
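A sketch of proactive refresh, assuming a hypothetical exchange callable that trades the long-lived key for a (token, ttl_seconds) pair. In this simplified version the refresh happens on the first token() call after 80% of the TTL has elapsed; a background timer would be an alternative.

```python
import time
from typing import Callable, Optional, Tuple

class SessionTokenManager:
    """Caches a short-lived read token and refreshes it at a fraction of its TTL."""

    def __init__(self, exchange: Callable[[], Tuple[str, float]],
                 refresh_fraction: float = 0.8,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._exchange = exchange  # hypothetical: returns (token, ttl_seconds)
        self._refresh_fraction = refresh_fraction
        self._clock = clock        # injectable so the logic can be tested
        self._token: Optional[str] = None
        self._refresh_at = 0.0
        self.exchanges = 0

    def token(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._refresh_at:
            self._token, ttl = self._exchange()
            # Schedule the next refresh before the token actually expires.
            self._refresh_at = now + ttl * self._refresh_fraction
            self.exchanges += 1
        return self._token
```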

Least-Privilege Scope

The read scope should be narrowly defined:

GET /state — current agent snapshot
GET /health — liveness and heartbeat
GET /metrics — if using Prometheus exposition format

Write endpoints (POST /command, PATCH /config) must require the admin key, which the aggregator never holds. This prevents a compromised aggregator from becoming a lateral-movement pivot into agent control planes.
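On the agent side, the scope check can be as small as an allowlist lookup. The sketch below assumes two scope labels, "read" and "admin", which are illustrative names rather than an established convention.

```python
# Read allowlist taken from the endpoints listed above.
READ_SCOPE = {("GET", "/state"), ("GET", "/health"), ("GET", "/metrics")}

def is_allowed(token_scope: str, method: str, path: str) -> bool:
    """Agent-side check: read tokens reach only the read allowlist."""
    if token_scope == "admin":
        return True
    if token_scope == "read":
        return (method.upper(), path) in READ_SCOPE
    return False  # unknown scopes are denied by default
```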

Staleness and Liveness as First-Class State

The most common mistake in fleet dashboards is treating "no data" as the same as "agent is idle." These are distinct states that require different UI treatment and different operational responses.

Three Distinct Absence States

State	Meaning	Trigger
Idle	Agent is up, reachable, has no active work	Agent reports low activity; heartbeat current
Unreachable	Aggregator cannot connect; agent may be fine	Poll/push timeout, no heartbeat within TTL
Offline / Down	Agent process has exited or machine is down	No heartbeat for extended period (2–3× TTL)

The distinction between "unreachable" and "down" matters because network partitions are common and transient. Consul's Lifeguard extension to the SWIM gossip protocol explicitly addresses this: the original protocol incorrectly declared members as faulty when CPU exhaustion or network delay caused slow message processing. Lifeguard adds local health awareness: a node that suspects it is itself degraded gives its peers more time before declaring them faulty, which substantially reduces false positives.

For a fleet aggregator, the equivalent is: before marking an agent as down, check whether other recently-healthy agents are also failing (suggests aggregator-side connectivity issue, not agent failure).
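One way to sketch that check: if a large share of recently healthy agents fail in the same round, label the failures as aggregator-suspect rather than blaming each agent. The 0.5 threshold and the label names are assumptions for illustration.

```python
def classify_failures(failed: set, recently_healthy: set,
                      threshold: float = 0.5) -> dict:
    """Label each failed agent, suspecting the aggregator when failures are widespread."""
    if not recently_healthy:
        return {name: "unreachable" for name in failed}
    # Fraction of previously healthy agents that failed in this round.
    failing_fraction = len(failed & recently_healthy) / len(recently_healthy)
    label = "aggregator_suspect" if failing_fraction >= threshold else "unreachable"
    return {name: label for name in failed}
```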

Heartbeat TTL Design

                    ┌─────────────┬─────────────┬─────────────┐
time →              │  heartbeat  │  heartbeat  │  heartbeat  │
                    └─────────────┴─────────────┴─────────────┘
                    0             T             2T            3T

Agent state in aggregator (t = time since the last heartbeat was received, T = heartbeat TTL):
  t ≤ T + grace:        FRESH (last_seen within TTL)
  T + grace < t ≤ 2T:   STALE (show warning on card)
  2T < t ≤ 3T:          UNREACHABLE (show offline card)
  t > 3T:               PRESUMED DOWN (trigger alert)
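The thresholds above translate directly into a small classification function. This sketch assumes the grace period is shorter than T and treats each boundary as inclusive on the fresher side, which is a choice made for the example.

```python
def liveness(age_s: float, ttl_s: float, grace_s: float = 0.0) -> str:
    """Map time since the last received heartbeat to a liveness state."""
    if age_s <= ttl_s + grace_s:
        return "FRESH"
    if age_s <= 2 * ttl_s:
        return "STALE"
    if age_s <= 3 * ttl_s:
        return "UNREACHABLE"
    return "PRESUMED_DOWN"
```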

Clock skew between machines can cause false positives. Mitigations:

Use the aggregator's receive timestamp, not the agent's send timestamp, for TTL calculation
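Add a grace period to absorb network jitter and scheduling delay; if agent-side timestamps must be used instead, make it at least 2× the expected clock drift (NTP-synchronized machines: 50ms; unsynchronized: up to a few seconds)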
For the dashboard, display both the agent's reported timestamp and the aggregator's last-seen time so operators can diagnose skew
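A sketch of recording both timestamps on receipt, so that TTL decisions use only the aggregator's clock while the agent's timestamp is kept for diagnosis. The field names are hypothetical, and the computed difference also includes network delay, so it is only an approximation of skew.

```python
import time
from typing import Optional

def record_heartbeat(store: dict, agent: str, agent_reported_ts: float,
                     now: Optional[float] = None) -> dict:
    """Store a heartbeat keyed by agent, stamping it with the aggregator's clock."""
    received = time.time() if now is None else now
    entry = {
        "last_seen": received,                   # used for TTL decisions
        "agent_ts": agent_reported_ts,           # shown for diagnosis only
        "skew_s": agent_reported_ts - received,  # approximate: includes network delay
    }
    store[agent] = entry
    return entry
```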

Cached State on Agent Unavailability

The aggregator should serve the last known state for an unreachable agent rather than an error, with the UI overlaying a "stale" badge that shows the age of the data. Redis makes the same availability choice for replicas with replica-serve-stale-data yes (slave-serve-stale-data in older versions): a replica that has lost its master keeps answering reads with possibly stale data instead of returning an error. Redis does not annotate those replies, so the age annotation is the aggregator's own addition. Returning an error from the dashboard when one of N agents is down makes the dashboard unreliable precisely when operators need it most.
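A minimal sketch of building a dashboard card from cached state. The cache layout (a dict of entries with state and last_seen fields) is an assumption for the example.

```python
def card_for(agent: str, cache: dict, now: float, ttl_s: float) -> dict:
    """Build a dashboard card from the last known state, never raising for a missing agent."""
    entry = cache.get(agent)
    if entry is None:
        # Never seen: still return a card so the rest of the fleet view renders.
        return {"agent": agent, "state": None, "stale": True, "age_s": None}
    age = now - entry["last_seen"]
    return {"agent": agent, "state": entry["state"],
            "stale": age > ttl_s, "age_s": age}
```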

Service Registry and Discovery

For a fleet dashboard, discovery is how the aggregator learns which agents exist and where they are.

Static Registry

The simplest: a YAML/JSON config file listing {name, base_url, token_hint} for each agent. The aggregator loads it at startup (and watches for file changes).

Appropriate when: fleet size is small (2–20 agents), agents are long-running on known addresses, and operational simplicity is the priority. This covers most current AI-agent fleet deployments. Static registry has zero runtime dependencies: no Consul, no etcd, no DNS-SD.
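A sketch of loading and validating a static registry. It uses JSON so that the example needs only the standard library (a YAML file would need a third-party parser), and the validation rules are assumptions rather than a fixed format.

```python
import json

REQUIRED_FIELDS = ("name", "base_url", "token_hint")

def load_registry(text: str) -> list:
    """Parse a JSON registry and validate that each entry has the expected fields."""
    entries = json.loads(text)
    names = set()
    for entry in entries:
        missing = [f for f in REQUIRED_FIELDS if f not in entry]
        if missing:
            raise ValueError(f"agent entry missing fields: {missing}")
        if entry["name"] in names:
            raise ValueError(f"duplicate agent name: {entry['name']}")
        names.add(entry["name"])
    return entries
```

Failing fast on a malformed entry at load time keeps a typo in the config from showing up later as a permanently unreachable agent.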

Self-Registration

Each agent, on startup, registers itself with the aggregator by calling POST /registry/agents. On shutdown (or via a health-check timeout), it deregisters.

Trade-off: couples agent code to the aggregator's registration API. Also requires the aggregator to be up before agents start — or agents need retry logic. The Kubernetes service registry pattern avoids this by making registration a kubelet/controller responsibility, not the application's.

Appropriate when: agents are ephemeral (spin up/down frequently), addresses are dynamic (container IPs), or fleet size exceeds ~50 where manual config is error-prone.
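The aggregator side of self-registration can be sketched as an in-memory table with expiry. The class below is hypothetical; it shows registration, idempotent deregistration, and timeout-based removal, with the HTTP handler for POST /registry/agents left out.

```python
class AgentRegistry:
    """In-memory registry; entries expire unless the agent re-registers in time."""

    def __init__(self, ttl_s: float) -> None:
        self._ttl_s = ttl_s
        self._agents = {}  # name -> (base_url, last_registered_at)

    def register(self, name: str, base_url: str, now: float) -> None:
        self._agents[name] = (base_url, now)  # re-registering refreshes the entry

    def deregister(self, name: str) -> None:
        self._agents.pop(name, None)  # idempotent: unknown names are ignored

    def active(self, now: float) -> dict:
        """Return name -> base_url for agents registered within the TTL."""
        return {n: url for n, (url, seen) in self._agents.items()
                if now - seen <= self._ttl_s}
```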

External Discovery

DNS-SD, Consul catalog, or Kubernetes Service/Endpoint resources. The aggregator queries an external registry to enumerate agents; agents register with that registry, not directly with the aggregator.

Appropriate when: you already run Consul or Kubernetes for other purposes and want to avoid a parallel registry. Adds operational complexity for small fleets with no other orchestrator.

Recommendation by Fleet Size

Fleet Size	Recommended Approach
2–10 agents	Static config file
10–50 agents	Static config + file-watch for hot reload
50+ agents	Self-registration or external discovery

Topology: Embedded Hub vs. Standalone Aggregator

Two packaging options for the aggregation logic:

Embedded hub: the aggregation logic lives inside one existing agent's process. That agent serves both its own dashboard and a "fleet view" that aggregates peers. This is convenient for small fleets where one agent is naturally "primary."

Standalone aggregator service: a dedicated process whose sole job is aggregation. It has no agent logic of its own.

The key insight: the aggregation logic is identical in both cases. The same polling/streaming engine, the same state model, the same auth layer. The choice is purely packaging and deployment:

Embedded hub is simpler to deploy (one fewer process), but couples the fleet view to one agent's availability. If that agent restarts, the fleet view is temporarily unavailable.
Standalone aggregator decouples fleet visibility from individual agent health. It can be deployed on a management machine that has no other agent responsibilities.

Graceful degradation applies to both: if the aggregator (embedded or standalone) is unreachable, individual agents continue operating normally. The aggregation plane is observation-only; agents are not dependent on it for their own functioning. This is a hard requirement — the aggregator must never be on the critical path for agent operation.

If the aggregator goes down, individual agent dashboards remain accessible directly via their own local UI. The fleet view just goes blank until the aggregator recovers.

Consistency Model: Best-Effort Observation

This data plane is eventually consistent and best-effort. It is explicitly not a control plane, not a source of truth, and not a transactional system.

Implications:

No consensus required. The aggregator does not need quorum. It can serve its last known view from a single instance.
Partial views are acceptable. A fleet dashboard showing 7/8 agents is more useful than showing nothing while waiting for the 8th. The UI should clearly indicate which agent is missing and why.
Idempotent updates. State pushes from agents should be full snapshots, not deltas, so that a missed push doesn't leave the aggregator in a permanently inconsistent state. (Alternatively, deltas with sequence numbers and periodic snapshot resets, as Kubernetes does with list-watch.)
No transactions across agents. The aggregator never needs to present an "atomic" view of all agents at the same logical time. Each agent's state is independently timestamped.

This model aligns with how Datadog and SigNoz instrument LLM applications: they collect spans and metrics from each agent independently, aggregate asynchronously, and display dashboards that may lag by seconds. The display is informational, not authoritative.

AI-Agent Fleet State: What's Different

Traditional service metrics (CPU %, request rate, error rate) are well-understood. LLM-agent state introduces several novel dimensions:

Context Window Exhaustion

Context window capacity is effectively non-renewable within a session. Unlike memory, which can be freed, consumed context is recovered only by starting a new session or by compacting the history, which loses detail. This means:

The aggregator should display context fill as a prominent, color-coded signal (green → yellow → red)
Trend matters: an agent at 60% context that is filling at 5%/minute needs attention sooner than one at 80% that has been stable for an hour
Context exhaustion is often silent: depending on the runtime, the agent does not error, it drops or compacts older context and starts forgetting. The aggregator is often the only place this is visible across the fleet
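The trend comparison above is simple arithmetic: remaining headroom divided by fill rate. A sketch, assuming the fill rate is measured in percentage points per minute:

```python
import math

def minutes_to_exhaustion(context_pct: float, fill_rate_pct_per_min: float) -> float:
    """Estimate minutes until the context window is full at the current fill rate."""
    if fill_rate_pct_per_min <= 0:
        return math.inf  # stable or shrinking usage never exhausts
    return max(0.0, 100.0 - context_pct) / fill_rate_pct_per_min
```

For the two agents in the example, the one at 60% filling at 5%/minute has (100 - 60) / 5 = 8 minutes left, while the stable agent at 80% has no projected exhaustion time.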

Token Cost as a First-Class Signal

Unlike CPU, whose cost is an indirect function of instance pricing, token usage is a direct, real-time billing signal. The aggregator should accumulate:

Tokens in/out per agent per session
Estimated cost (input tokens × input price + output tokens × output price, per model)
Cumulative cost across the fleet for the current billing period
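A sketch of the cost estimate. The model names and per-million-token prices below are placeholders invented for the example, not real provider prices.

```python
# Hypothetical prices in USD per million tokens; real prices vary by provider and date.
PRICES_PER_MTOK = {
    "model-a": {"input": 3.00, "output": 15.00},
    "model-b": {"input": 0.50, "output": 1.50},
}

def session_cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    """Input tokens x input price + output tokens x output price, per model."""
    price = PRICES_PER_MTOK[model]
    return (tokens_in * price["input"] + tokens_out * price["output"]) / 1_000_000

def fleet_cost_usd(sessions: list) -> float:
    """Sum the estimated cost over every session record in the fleet."""
    return sum(session_cost_usd(s["model"], s["tokens_in"], s["tokens_out"])
               for s in sessions)
```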

Platforms like Datadog LLM Observability, LangSmith, and Langfuse have converged on this as a required metric. A 2025 Mavvrik study found 50% of AI-core companies don't track LLM API costs at all — fleet aggregation is an opportunity to surface this by default.

Runtime Heterogeneity

A fleet may mix Claude (Anthropic API), Codex (OpenAI API), and local models (Ollama). Each has:

Different context window sizes (Claude 3.5 Sonnet: 200k tokens; GPT-4o: 128k; local Llama 3: 8k–128k depending on model version and runtime configuration)
Different pricing models
Different rate limits (TPM/RPM per API key)

The aggregator's state schema must be model-aware. The context window fill % must be calculated relative to the specific model's limit, not a global constant. Rate limit headroom is a per-API-key concern, not a per-agent one, if multiple agents share a key.
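A sketch of computing fill percentage relative to each model's limit. The first two limits are the figures quoted above; the model keys and the local entry are a hypothetical configuration.

```python
# Limits in tokens; the local entry assumes an 8k configuration for illustration.
CONTEXT_LIMITS = {
    "claude-3.5-sonnet": 200_000,
    "gpt-4o": 128_000,
    "local-llama-3": 8_192,
}

def context_pct(model: str, tokens_used: int) -> float:
    """Context fill as a percentage of the specific model's window, capped at 100."""
    limit = CONTEXT_LIMITS[model]
    return min(100.0, 100.0 * tokens_used / limit)
```

The same 100,000 tokens is half the window for one model and more than three quarters for another, which is why a global constant would mislead.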

Active Tool as State

LLM agents typically spend significant time in tool calls (bash, file read, web search). The currently active tool is valuable fleet-level signal:

An agent that has been in bash for 10 minutes may be stuck in a long-running command or an infinite loop
An agent in WebSearch for >2 minutes may be hitting a rate limit
Aggregated tool distribution across the fleet reveals what the fleet is collectively doing
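A sketch of flagging agents that have been in one tool for too long. The per-tool thresholds come from the examples above; the default threshold and the tool_started_at field are assumptions.

```python
# Thresholds in seconds, taken from the examples above; tune per fleet.
TOOL_THRESHOLDS_S = {"bash": 600, "WebSearch": 120}

def stuck_agents(states: list, now: float,
                 default_threshold_s: float = 900) -> list:
    """Return agent ids whose current tool call has run longer than its threshold."""
    flagged = []
    for s in states:
        if s.get("active_tool") is None:
            continue  # agent is not in a tool call
        limit = TOOL_THRESHOLDS_S.get(s["active_tool"], default_threshold_s)
        if now - s["tool_started_at"] > limit:
            flagged.append(s["agent_id"])
    return flagged
```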

Traditional APM has no direct equivalent. The closest analogue is distributed tracing's active span (the OTel collector fan-in model can carry the tool name as a span attribute), but the semantics are specific to LLM agents.

Practical Recommendations

What to Build for an MVP

Static registry: a YAML file listing each agent's {name, base_url, token}. Aggregator reads it on startup.
Pull model with 5–10s interval: one goroutine/async task per agent, simple HTTP GET to /state. No persistent connections to manage.
Short-lived session tokens: each agent issues read tokens via POST /auth/session (input: long-lived key, output: 30-minute read token). Aggregator refreshes proactively.
Three-state liveness model: FRESH / STALE / UNREACHABLE based on last-seen TTL. Serve last-known-state for stale/unreachable agents with visual annotation.
Uniform endpoint model from day one: even for localhost agents. No special cases.

This is enough for a working fleet dashboard. The pull model with a handful of agents is operationally trivial and has zero persistent connection state to manage.

What to Defer

Push/streaming transport: only worth the complexity when you need sub-second freshness or the fleet exceeds ~30 agents where poll latency accumulates
Self-registration / dynamic discovery: unnecessary until agent addresses change frequently or fleet size exceeds ~50
Standalone aggregator service: consider this when the embedded hub's restarts cause noticeable fleet-view gaps, not before
Distributed aggregator (HA aggregator cluster): the aggregator is observation-only; a brief gap during a restart is acceptable. Full HA adds significant complexity for minimal operational gain at small scale
Full OTel pipeline integration: valuable for connecting fleet state to your existing observability stack, but adds an OTel Collector dependency that is overhead for a self-contained fleet

Implementation Checklist

All agents expose GET /state and GET /health on a configured port
Auth: agents issue short-lived read tokens; the aggregator sends only those tokens in per-request headers and uses the long-lived key solely for token exchange
State schema includes: {agent_id, timestamp, context_pct, tokens_in, tokens_out, cost_usd, model, runtime, active_tool, status}
Aggregator state model distinguishes FRESH / STALE / UNREACHABLE; UI shows last-known state with age badge
Aggregator is not on the critical path for any agent's operation
Dashboard shows per-agent last-seen time and transport type (for diagnosing connectivity issues)
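The state schema from the checklist can be pinned down as a typed record. The field types and units below are assumptions; the field names are the ones listed above.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class AgentState:
    agent_id: str
    timestamp: float            # agent-reported, seconds since epoch (assumed unit)
    context_pct: float          # relative to the agent's own model limit
    tokens_in: int
    tokens_out: int
    cost_usd: float
    model: str
    runtime: str
    active_tool: Optional[str]  # None when the agent is not in a tool call
    status: str

def parse_state(payload: dict) -> AgentState:
    """Build an AgentState, raising TypeError on missing or unexpected keys."""
    return AgentState(**payload)
```

Rejecting payloads with missing or unknown keys at the aggregator boundary surfaces schema drift between agent versions early.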

Conclusion

Cross-instance state aggregation for agent fleets is a well-solved problem in distributed systems observability — the patterns from Prometheus federation, Kubernetes list-watch, OTel Collector, and Consul gossip each contribute a piece. The challenge for AI-agent fleets is applying the right pattern at the right scale: most current deployments are small enough that a pull model with static registry and three-state liveness is entirely sufficient, and the complexity of streaming fan-in or dynamic service discovery is premature.

The genuine novelty is in the state schema: context window exhaustion, real-time token cost, and active tool state are LLM-specific signals with no direct analogue in traditional service observability. Getting the data plane right enables these signals to be surfaced clearly — but the data plane itself is infrastructure, and it should be boring.

Sources:

Prometheus Federation — Official Prometheus docs on hierarchical federation
Prometheus Federation Architecture & Pitfalls — Groundcover
Kubernetes Informers Deep Dive — Arriqaaq
Kubernetes ListWatch Anatomy — mgasch.com
OpenTelemetry Collector Architecture — OpenTelemetry official docs
WebSockets vs SSE — Ably
WebSocket vs SSE Performance Comparison — Timeplus
Backpressure in WebSocket Streams — Skyline Codes
Consul Gossip Protocol — HashiCorp
Making Gossip More Robust with Lifeguard — HashiCorp
AWS STS Temporary Credentials — AWS IAM docs
Heartbeats in Distributed Systems — Arpit Bhayani
Service Discovery in Microservices — Baeldung
Datadog LLM Observability — Datadog
LLM Cost Tracking — Traceloop