Runtime-Source Drift Detection for Agent Component Systems
Executive Summary
Agent component systems have a drift problem that looks familiar but is not identical to Kubernetes or infrastructure drift. An upstream change may be merged while the installed component directory still contains an older version, a runtime process may be running stale code, local hotfixes may exist outside Git, and automated upgrade tooling may leave smart-merge backups that quietly change the operational truth.
For a small agent platform, this can be dismissed as ordinary deployment hygiene. For a persistent AI agent system, it becomes a reliability boundary. Agents run scheduled jobs, answer users across channels, modify files, use credentials, and continue across days. If the operator cannot answer "what source is this running, what local changes exist, and what would change if I upgraded," then component updates become guesswork.
The infrastructure world has already developed useful patterns. Argo CD continuously compares desired Git state with live cluster state. Flux can detect Helm release drift. Terraform separates drift detection from mutation through refresh-only plans. Package managers bind installed artifacts to checksums and lockfiles. SLSA and Sigstore model provenance and verification as first-class supply-chain metadata.
The lesson for agent components is not "turn every agent install into Kubernetes." It is more specific: build a lightweight desired-versus-live control plane around component state.
The minimum useful design has five parts:
- a source manifest: repo, commit, version, lockfile, and expected file hashes
- an installed manifest: actual files, generated files, local edits, backup files, and config pointers
- a runtime manifest: PM2 process, command, cwd, environment source, uptime, and loaded version
- a drift plan: classify differences before changing anything
- a provenance record: who or what produced the installed artifact, and from which source revision
The most important product choice is to treat local mutation as operational state, not as dirt to erase. Agent platforms need detection and review before repair. Auto-healing should be optional and scoped, because local patches, config migrations, and emergency fixes may be intentional.
Why Agent Components Drift
Agent components occupy an awkward middle ground. They are not just libraries imported by an application. They are also not always immutable services deployed by a strict CI/CD pipeline. A component may include:
- executable scripts
- configuration templates
- PM2 service definitions
- local data directories
- OAuth credentials or token references
- scheduler tasks
- skills or instruction files
- generated artifacts
- user-modified runtime state
That mix creates multiple truths.
Git says what the maintainer merged. The registry says what version is available. The local install directory says what files are actually on disk. PM2 says what process is currently running. The component database says what the platform thinks is installed. Memory or documentation may say something else again.
Drift appears when these truths diverge.
Common examples:
- A PR is merged but the local component was never upgraded.
- The local component was upgraded but PM2 was not restarted.
- A smart merge produced backup files that clutter debugging and erode operator confidence.
- A local hotfix exists in the installed directory but not in Git.
- Generated instruction files were rebuilt from an older template.
- A package version was bumped but a runtime model list still differs across adapters.
- A config file points to the new service while the running process still uses the old cwd.
None of these are exotic. They are normal operations in a fast-moving agent environment. The failure is not that drift exists. The failure is when the system cannot see it.
The GitOps Lesson: Desired Versus Live State
GitOps systems make one idea operationally powerful: desired state and live state should be continuously comparable.
Argo CD calculates differences between desired state and live state to decide whether an application is out of sync, and exposes those differences in the UI. Its newer server-side diff mode uses Kubernetes server-side apply in dry-run mode to predict the live object before actually syncing. That matters because admission controllers and server defaults can affect what would really happen.
For agent components, the direct equivalent is not a Kubernetes manifest. It is a component manifest:
```json
{
  "component": "recruit",
  "source": {
    "repo": "zz-howard/zylos-recruit",
    "commit": "3b1f59f",
    "version": "0.2.9"
  },
  "install": {
    "path": "~/zylos/.claude/skills/recruit",
    "hashManifest": "sha256:..."
  },
  "runtime": {
    "service": "zylos-recruit",
    "manager": "pm2",
    "expectedCwd": "~/zylos/.claude/skills/recruit"
  }
}
```
A drift detector can then compare:
- desired source commit versus installed metadata
- expected file hashes versus actual files
- declared service cwd versus PM2 cwd
- declared start command versus PM2 command
- declared config pointers versus actual config files
- generated files versus their source templates
This is the smallest useful GitOps translation: not automated deployment, but inspectable desired-versus-live comparison.
The Terraform Lesson: Review Drift Before Repair
Terraform's drift workflow is valuable because it separates observation from mutation. Its refresh-only plan detects differences between state and real infrastructure without changing remote resources. The operator can then decide whether to import, update state, change configuration, or revert the live change.
Agent components need the same discipline. A command like zylos component drift recruit should not immediately repair anything. It should produce a plan:
```
Component: recruit
Status: Drifted

Source:
  expected:  zz-howard/zylos-recruit@3b1f59f v0.2.9
  installed: zz-howard/zylos-recruit@3b1f59f v0.2.9

Files:
  modified:
    server/routes/candidates.js
  untracked:
    backups/server.routes.candidates.js.20260425
  generated-stale:
    AGENTS.md

Runtime:
  pm2 service: zylos-recruit
  status: online
  cwd: matches
  started before last install: yes

Recommended action:
  review local file modification before upgrade
  restart service after accepting installed state
```
This plan should classify drift by intent and risk:
- source drift: installed version differs from declared source version
- file drift: installed files differ from expected hashes
- runtime drift: service is running a different path, command, version, or start time
- config drift: config keys or schema differ from expected version
- generated drift: derived files are stale relative to templates
- local patch drift: files changed locally outside the source repo
- backup drift: smart-merge backup files exist and may need review or cleanup
The key is that drift is not always a bug. Terraform explicitly allows operators to record, import, or reconcile drift depending on whether the live change was intended. Agent platforms need the same posture.
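A minimal classifier over these drift classes might look like the following sketch. The risk levels and auto-repair flags are illustrative defaults, not fixed policy; the point is that each finding carries an explicit class before anything acts on it:

```javascript
'use strict';

// Illustrative defaults per drift class. Only purely reproducible
// drift (generated files) is marked safe for automatic repair.
const DRIFT_RULES = {
  'source':      { risk: 'high',   autoRepair: false },
  'file':        { risk: 'high',   autoRepair: false },
  'runtime':     { risk: 'medium', autoRepair: false },
  'config':      { risk: 'high',   autoRepair: false },
  'generated':   { risk: 'low',    autoRepair: true  },
  'local-patch': { risk: 'review', autoRepair: false },
  'backup':      { risk: 'low',    autoRepair: false },
};

// Attach risk and repair policy to a raw finding such as
// { type: 'generated', path: 'AGENTS.md' }.
function classifyDrift(finding) {
  const rule = DRIFT_RULES[finding.type] || { risk: 'unknown', autoRepair: false };
  return { ...finding, ...rule };
}
```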
The Package-Manager Lesson: Integrity Needs a Local Ledger
Package managers solve part of this problem with lockfiles and integrity hashes. They record exactly which package version and artifact hash was installed. If the artifact changes unexpectedly, the installer can detect the mismatch.
Agent components should borrow this pattern, but extend it beyond package tarballs. A useful component ledger should record:
- component name and version
- source repository and commit
- installer version
- install timestamp
- expected file hashes
- ignored paths
- generated paths and their source templates
- config schema version
- service manager metadata
- postinstall scripts that ran
- migration scripts that ran
This ledger should live next to the component installation and also be referenced from the platform's component database. The local file is useful when the central registry is wrong or unavailable. The database reference is useful for fleet-level queries.
For AI agent systems, the ignored-path design is especially important. Runtime data, OAuth tokens, local config, logs, and memory should not be hashed as if they were source files. The manifest needs a clear boundary between:
- immutable installed code
- generated-but-reproducible files
- mutable runtime data
- secret-bearing config
- operator-owned local patches
Without that boundary, drift reports become noisy and operators stop trusting them.
The Provenance Lesson: Know Where the Artifact Came From
Drift detection answers "what changed." Provenance answers "where did this installed thing come from."
The SLSA specification frames provenance as a way to trace software artifacts back to their source and build process. Sigstore and cosign apply similar ideas through signatures and attestations. The full enterprise supply-chain stack may be too heavy for a small agent component system, but the underlying data model is useful.
At minimum, an agent component install should record:
- source repo URL
- source commit SHA
- version tag
- installer identity
- install command
- build or packaging command, if any
- artifact digest
- timestamp
- verification result
For third-party skills or components, this is a security boundary. If a component can read files, send messages, use credentials, or modify services, then the platform should know whether it was installed from a trusted release, a branch, a local path, or an unknown artifact.
The long-term version of this is signed component releases. The practical first version is a local attestation file that says: "this install claims to be component X at commit Y, installed by tool Z at time T, with these file hashes."
Runtime Drift Is Different From File Drift
Most deployment tools focus on desired versus installed files. Agent components also need runtime drift detection.
A component can be correct on disk and wrong in process. PM2 may still be running a process that started before the last upgrade. The command may point to an old path. Environment variables may have changed in .env, but the process may not have been restarted. A service may be online but serving a stale bundle.
Runtime drift checks should answer:
- Is the service online?
- What command and cwd is it running?
- When did it start?
- Was it started before or after the last install?
- Which version does its health endpoint report?
- Which config file path does it use?
- Does the service manager definition match the component manifest?
- Are there orphaned processes from previous versions?
For PM2-based systems, much of this can be collected from pm2 jlist, service health endpoints, local install metadata, and config files. The detector does not need a full orchestrator. It needs a consistent comparison model.
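A sketch of that comparison against one process entry from pm2 jlist. The pm2_env field names used here (status, pm_cwd, pm_uptime) appear in pm2's JSON output, but the exact shape should be treated as an assumption to verify against your pm2 version:

```javascript
'use strict';

// Compare a component's runtime manifest against one pm2 jlist entry.
// lastInstallMs is the install timestamp in epoch milliseconds.
function runtimeDrift(manifest, proc, lastInstallMs) {
  const env = (proc && proc.pm2_env) || {};
  const findings = [];
  if (env.status !== 'online') findings.push('service not online');
  if (env.pm_cwd !== manifest.cwd) findings.push('cwd mismatch');
  // pm_uptime is the process start time; older than the install means stale.
  if (typeof env.pm_uptime === 'number' && env.pm_uptime < lastInstallMs)
    findings.push('process started before last install');
  return findings;
}
```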
Runtime drift also creates a better upgrade UX. After an install, the platform can say:
```
Installed files updated.
Runtime drift remains: service zylos-recruit started before this install.
Run restart to activate the new version.
```
That is more honest than treating "files copied" as "component shipped."
Smart Merge Backups Are Drift Signals
Agent component systems often need to preserve local edits during upgrades. A smart merge that saves backup files is better than blindly overwriting operator changes. But those backups should not disappear from the operational model.
A backup file means one of three things:
- a local patch was intentionally preserved
- a merge conflict was avoided but still needs human review
- stale backup debris is accumulating and confusing future debugging
All three should be visible.
Backups should be recorded in the installed manifest with:
- original path
- backup path
- reason
- timestamp
- source version before merge
- source version after merge
- whether the operator acknowledged it
This turns backup files from "mysterious leftovers" into actionable state. A drift report can then say:
```
Backup drift:
  2 smart-merge backups exist
  1 acknowledged
  1 unreviewed
```
That is enough to support cleanup without losing auditability.
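Deriving those counts from recorded backup entries is trivial once the manifest carries them; the record fields in this sketch follow the list above:

```javascript
'use strict';

// Summarize backup records from the installed manifest into the
// acknowledged / unreviewed counts a drift report can print.
function backupDriftSummary(backups) {
  const acknowledged = backups.filter(b => b.acknowledged).length;
  return {
    total: backups.length,
    acknowledged,
    unreviewed: backups.length - acknowledged,
  };
}
```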
A Minimal Architecture for Agent Component Drift Detection
A lightweight implementation can be built without adopting Kubernetes, OCI registries, or a full signing pipeline.
1. Component Install Manifest
Each install writes .zylos-component/install.json:
```json
{
  "name": "recruit",
  "version": "0.2.9",
  "sourceRepo": "zz-howard/zylos-recruit",
  "sourceCommit": "3b1f59f",
  "installedAt": "2026-04-25T12:00:00Z",
  "installer": "zylos-core@0.4.13",
  "files": {
    "server/index.js": "sha256:...",
    "package.json": "sha256:..."
  },
  "generated": {
    "AGENTS.md": {
      "template": "CLAUDE.md",
      "generator": "instruction-builder@0.4.13",
      "hash": "sha256:..."
    }
  },
  "ignore": [
    "data/**",
    "logs/**",
    "config.local.json",
    ".env"
  ]
}
```
2. Runtime Manifest
Each component declares expected runtime shape:
```json
{
  "serviceManager": "pm2",
  "serviceName": "zylos-recruit",
  "cwd": "~/zylos/.claude/skills/recruit",
  "command": "node server/index.js",
  "health": "http://127.0.0.1:3020/health"
}
```
3. Drift Scanner
The scanner collects four snapshots:
- source snapshot from Git or registry metadata
- installed snapshot from the local manifest and file hashes
- runtime snapshot from PM2 and health endpoints
- config snapshot from schema version and declared config paths
Then it emits a classified drift plan.
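The scanner's top level can be sketched as a pure function from the four snapshots to a classified plan. The snapshot field names here are illustrative; real collectors would read Git metadata, the install manifest, pm2, and config files:

```javascript
'use strict';

// Combine source, installed, runtime, and config snapshots into a plan.
// Timestamps are epoch milliseconds; commits are short SHAs.
function scan({ source, installed, runtime, config }) {
  const findings = [];
  if (installed.sourceCommit !== source.commit)
    findings.push({ type: 'source',
      detail: `installed ${installed.sourceCommit}, expected ${source.commit}` });
  for (const file of installed.modifiedFiles || [])
    findings.push({ type: 'file', detail: file });
  if (runtime.startedAt < installed.installedAt)
    findings.push({ type: 'runtime', detail: 'process started before last install' });
  if (config.schemaVersion !== source.configSchemaVersion)
    findings.push({ type: 'config', detail: 'config schema mismatch' });
  return { status: findings.length ? 'Drifted' : 'Clean', findings };
}
```

Keeping the scan pure over snapshots makes it trivial to test, and keeps observation strictly separated from mutation.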
4. Operator Actions
The detector should suggest but not force actions:
- accept-local: record intentional local edits
- restore-source: overwrite local edits from source
- restart: restart runtime service
- regenerate: rebuild generated files
- cleanup-backups: remove acknowledged backups
- import: bring a local component into the component database
- upgrade: apply a newer source version
The important rule: every mutating action should be explicit and reviewable.
Why Auto-Healing Should Be Conservative
Argo CD supports self-healing when live cluster state deviates from Git. That is powerful in a Kubernetes deployment because the desired state is meant to be declarative and authoritative.
Agent components are messier. Local mutation may represent a live emergency patch, a user-specific integration, a credential migration, or a deliberate temporary override. Automatically erasing those changes can break the exact system the operator is trying to preserve.
So the default should be:
- detect continuously
- classify clearly
- alert only on meaningful drift
- repair only with an explicit command
Auto-healing can still be useful, but only for narrow classes:
- generated files that are purely reproducible
- stale runtime after a successful upgrade, when restart policy allows it
- missing non-secret cache directories
- obsolete acknowledged backup files
For source files, config, credentials, and runtime process replacement, review should come first.
What This Enables
A drift detector sounds like maintenance tooling, but it unlocks larger product capabilities.
Safer Component Upgrades
Before upgrading, the platform can show whether the local component is clean, patched, stale, or running old code. That prevents the common mistake of treating merge as ship.
Better Support and Debugging
Support can ask for a drift report instead of a vague directory listing. The report can reveal stale PM2 processes, mismatched commits, unreviewed backups, or config schema gaps.
Fleet Awareness
If many Zylos instances run the same component, a central view can show which machines are on which version, which have local patches, and which are running stale processes.
Security Review
Third-party components can be flagged when installed from unverified commits, local paths, unsigned artifacts, or modified files.
Regression Confidence
When a component bug is fixed, the platform can verify not only that the PR merged, but that the fix reached the installed files and the running process.
Implementation Roadmap
Phase 1: Local Drift Reports
Build a CLI command that compares installed manifests, file hashes, PM2 status, and source metadata for one component. Output text and JSON. Do not mutate anything.
Success criterion: the report can distinguish clean, source drift, file drift, runtime drift, and backup drift.
Phase 2: Upgrade Gate
Run drift detection before install or upgrade. If local patches or unreviewed backups exist, require explicit operator confirmation. After install, report remaining runtime drift and suggest restart.
Success criterion: upgrades no longer silently overwrite local edits or leave stale runtime state unnoticed.
Phase 3: Provenance and Verification
Record source commit, artifact digest, installer version, and verification result. For trusted components, verify tags or release artifacts. For local installs, mark provenance as local and unverified.
Success criterion: every installed component can answer where it came from.
Phase 4: Fleet Summary
Aggregate drift reports across instances. Track installed version, running version, local patch count, unreviewed backup count, and service health.
Success criterion: operators can see which instances are actually running a merged fix.
Phase 5: Policy and Selective Auto-Heal
Add policies for low-risk automatic actions: regenerate derived files, clean acknowledged backups, restart after selected upgrades, or alert on unverified third-party components.
Success criterion: automation reduces routine cleanup without erasing meaningful local state.
The Core Design Principle
The right mental model is not "keep the install directory clean." It is "make operational truth inspectable."
Agent platforms will always have local state. They will have generated files, credentials, human edits, service restarts, emergency patches, and component-specific migrations. Pretending all of that can be forced into a pure Git checkout creates brittle tooling.
A better system accepts that drift exists, records it, classifies it, and gives the operator reviewable repair paths.
That is the practical bridge between GitOps discipline and agent-runtime reality: Git remains the desired source of truth, but the live system is allowed to speak before the platform changes it.
References
- Argo CD Documentation: Diff Strategies
- Argo CD Documentation: Automated Sync Policy
- Flux Documentation: Drift detection for Helm Releases
- HashiCorp Developer: Manage resource drift
- SLSA Specification v1.2: SLSA specification
- Kubernetes Documentation: Server-Side Apply
- Sigstore Documentation: Verify signatures and attestations with cosign

