AI Agent Credential Management for Headless Deployment
Executive Summary
Deploying an AI agent for 24/7 unattended operation forces a collision between two fundamentally incompatible assumptions: OAuth and keychain-based authentication assume an interactive user session, while process supervisors (PM2, systemd, launchd) run in isolated environments that cannot access those session-bound credentials. The result is a class of failures that only appear after deploy — the agent works fine in the developer's terminal but refuses to start when the machine reboots. Solving this requires a deliberate credential strategy: picking the right token type for headless operation, storing it in a place accessible to daemon processes, and implementing lifecycle management (rotation, expiry handling) that works without human intervention. This article synthesizes current patterns from Claude Code, GitHub Copilot, and the broader AI agent security landscape into practical guidance for autonomous agent deployment.
The Core Problem: Session Isolation in Process Supervisors
The immediate trigger for this analysis was a real deployment scenario: a Claude Code agent running on a Mac Mini couldn't auto-start via PM2 after a reboot, because the OAuth credentials were stored in the macOS keychain — which is only accessible to processes running inside a user session.
This is not a PM2 bug or a macOS quirk. It is a fundamental architectural property of how operating systems separate user session context from system daemon context:
macOS: The Data Protection Keychain is only available to processes running in a user context (apps, user agents). launchd daemons and processes started at system boot run outside any user session and cannot access it. Apple's own developer documentation states: "Programs that run outside of a user context, like a launchd daemon, must target the file-based keychain." PM2 startup scripts, when generated with pm2 startup, create a launchd plist that starts at boot time — before any user logs in — placing it squarely outside keychain reach.
Linux (systemd): Systemd services run in their own isolated environment. By default, they do not inherit the calling user's shell environment. This means ~/.bashrc exports, GNOME Keyring unlocks, and session-level export statements are invisible to a systemd unit unless explicitly configured with Environment= or EnvironmentFile= directives.
tmux / screen sessions: These survive terminal disconnects but are bound to the user session that created them. If the host reboots, tmux sessions are gone. This is a reasonable workaround for development but not a production deployment strategy.
Process supervisor models and their credential access behavior:
| Supervisor | Starts at | User session required | Keychain access |
|---|---|---|---|
| launchd Daemon (macOS) | System boot | No | No (file-based only) |
| launchd Agent (macOS) | User login | Yes | Yes |
| systemd system service | System boot | No | No |
| systemd user service | User login | Yes | Limited |
| PM2 (startup mode) | System boot via launchd/systemd | No | No |
| PM2 (manual mode) | User terminal | Yes | Yes |
| Docker container | Service start | No | No |
The pattern is consistent: anything that auto-starts at boot, without a user login, cannot access session-bound credentials.
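This boundary can be detected at startup rather than discovered after a reboot. A minimal heuristic sketch, assuming a few common session marker variables (the is_user_session helper is illustrative, not part of any supervisor's API):

```python
import os

def is_user_session(env=None):
    """Heuristic: is this process running inside an interactive user session?

    Session-scoped supervisors (login shells, launchd agents, systemd user
    services) export session markers into the environment; boot-time
    daemons do not.
    """
    env = os.environ if env is None else env
    session_markers = (
        "XDG_SESSION_ID",            # systemd-logind session (Linux)
        "DBUS_SESSION_BUS_ADDRESS",  # desktop session bus (Linux)
        "SSH_TTY",                   # interactive SSH login
    )
    return any(marker in env for marker in session_markers)

# Fail fast at startup instead of hitting a confusing keychain error later:
if not is_user_session():
    print("daemon context: session-bound credentials will be unavailable")
```

An agent that logs this one line at boot turns "works in my terminal, dies under PM2" into an immediately diagnosable condition.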
Service Account vs. User Account Patterns
The most important architectural decision for a headless AI agent is whether it authenticates as a user (delegated access) or as a service (machine identity).
User Account Pattern
The agent authenticates using the user's OAuth credentials. It acts on behalf of the user, which means:
- Access tokens carry user-level permissions
- Rate limits and quotas are scoped to the user
- Token refresh requires the user's refresh token, which may be tied to their interactive session
This is what Claude Code does by default. When you run claude /login, you authenticate via browser OAuth and the resulting ~/.claude/.credentials.json contains your personal access token and refresh token. The agent then acts as you.
The headless problem surfaces because this file was created in an interactive session. The refresh token inside it may work fine on another machine or in another process, as long as that process can read the file — but if PM2 starts as a system daemon, it may run as a different user, or in an environment where the path ~/.claude/ resolves differently, and never find the credentials at all.
Service Account / API Key Pattern
The agent authenticates using a long-lived API key issued to a service identity, not tied to any individual user session. For Claude Code, this means setting ANTHROPIC_API_KEY to a key from your Anthropic Console.
Advantages for headless deployment:
- No OAuth flow, no browser, no interactive consent
- The key is a plain string that can be placed in environment files
- No expiry (until manually rotated), so no background refresh process needed
- Works identically across launchd, systemd, Docker, and any other execution context
The tradeoff: API keys are long-lived secrets. A leaked key has an indefinitely wide blast radius unless you catch it and rotate immediately. The principle of least privilege suggests scoping keys as narrowly as possible.
For Claude Code specifically, the recommended CI/CD pattern is exactly this:
# Set in PM2 ecosystem file or system environment
ANTHROPIC_API_KEY=sk-ant-xxxx
# Then run Claude Code headlessly
claude -p "your task here" --output-format json
Hybrid: Setup Tokens for Headless OAuth
Claude Code introduced a third option: claude setup-token, which generates a long-lived OAuth token that can be stored as the CLAUDE_CODE_OAUTH_TOKEN environment variable. This bridges the gap — it's still OAuth (so you retain user-level authorization semantics) but stored as a plain environment variable that works in daemon contexts.
This pattern — a "setup token" or "deployment token" that is OAuth-derived but environment-variable-compatible — is now common across AI coding tools:
- Claude Code: CLAUDE_CODE_OAUTH_TOKEN via claude setup-token
- GitHub Copilot CLI: supports COPILOT_GITHUB_TOKEN, GH_TOKEN, or GITHUB_TOKEN environment variables for non-interactive authentication. If the system keychain is unavailable (e.g., headless Linux), it falls back to a plaintext config file at ~/.copilot/config.json
- GitHub Actions: uses GITHUB_TOKEN injected by the runner, never a user session
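A headless launcher can make the precedence among these options explicit. The sketch below assumes a simple resolution order (setup token first, then long-lived API key); the resolve_credential helper is hypothetical:

```python
import os

def resolve_credential(env=None):
    """Return (kind, value) for the first available headless credential.

    Preference order: OAuth setup token, then long-lived API key. Raises
    at startup if neither is set, so the supervisor log shows a clear
    error instead of the agent failing mid-task.
    """
    env = os.environ if env is None else env
    for kind, var in (("oauth", "CLAUDE_CODE_OAUTH_TOKEN"),
                      ("api_key", "ANTHROPIC_API_KEY")):
        value = env.get(var)
        if value:
            return kind, value
    raise RuntimeError(
        "No headless credential found: set CLAUDE_CODE_OAUTH_TOKEN "
        "or ANTHROPIC_API_KEY in the daemon's environment file")
```

Resolving once at startup, rather than per request, also gives the process one place to log which credential kind it is running under.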
Token Lifecycle Management
The Refresh Token Problem in Headless Contexts
OAuth access tokens expire — often within 10–60 minutes. In an interactive session, the OAuth client silently uses the refresh token to get a new access token. In a headless context, this background refresh can fail in subtle ways:
- The process has read access to the credentials file but write access is blocked (e.g., running as a different user)
- The refresh endpoint requires the X-Device-ID header or session context that was present during initial auth
- Multiple concurrent agent instances race to refresh the same token, causing a refresh token race condition (the first refresh invalidates the token family, leaving other instances with stale tokens)
Claude Code's GitHub issues document all three of these failure modes. Issue #21765 describes credentials copied to remote machines failing with 401 because the refresh token wasn't used. Issue #24317 describes concurrent sessions triggering refresh races. Issue #28827 describes OAuth refresh failing entirely in non-interactive mode.
The practical solution: For unattended deployment, prefer tokens that do not require background refresh:
- Long-lived API keys (ANTHROPIC_API_KEY) — no expiry, no refresh
- Setup tokens with extended TTL (CLAUDE_CODE_OAUTH_TOKEN) — refreshes less frequently
- Service account client credentials (OAuth Client Credentials flow) — each invocation fetches a fresh short-lived token from the authorization server using the stable client ID and secret
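The client-credentials option works unattended because the stable secret (client ID and secret) never expires, while the tokens it mints do. A per-process cache that refreshes ahead of expiry avoids both staleness and refresh races within one process. A sketch, with the token-endpoint call injected as a callable (the response field names follow common OAuth token responses, not any specific provider's API):

```python
import time

class TokenCache:
    """Mint short-lived tokens on demand, refreshing before they expire."""

    def __init__(self, fetch_token, margin_seconds=60):
        # fetch_token() -> {"access_token": str, "expires_in": int}
        self._fetch = fetch_token
        self._margin = margin_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._token is None or now >= self._expires_at - self._margin:
            response = self._fetch()
            self._token = response["access_token"]
            self._expires_at = now + response["expires_in"]
        return self._token
```

Each agent worker holds one cache; the raw client secret stays in the environment file, and only short-lived tokens travel to the API.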
Token Rotation Strategy
Even static API keys should be rotated on a schedule. The recommended operational pattern:
# In a cron job or scheduled task (e.g., every 30 days):
# 1. Generate new key from Anthropic Console
# 2. Update the credential store (secrets manager, env file, Docker secret)
# 3. Restart the agent process
# 4. Verify agent is functioning
# 5. Revoke the old key
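The critical property of that sequence is ordering: the old key is revoked only after the new one is verified working. A sketch of the orchestration with each step injected as a callable (all five hooks are hypothetical placeholders for your console API, secret store, and supervisor):

```python
def rotate_key(generate, update_store, restart_agent, verify, revoke):
    """Rotate a static API key with verify-before-revoke ordering.

    If verification fails, the old key is left valid so the deployment
    can be rolled back instead of being locked out.
    """
    old_key, new_key = generate()      # issue a new key, keep the old one
    update_store(new_key)              # secrets manager, env file, etc.
    restart_agent()                    # reload credentials
    if not verify():                   # e.g. a cheap authenticated API call
        raise RuntimeError("new key failed verification; old key NOT revoked")
    revoke(old_key)                    # only now is the old key retired
    return new_key
```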
For production deployments using a secrets manager (Vault, AWS Secrets Manager), the rotation can be automated:
# AWS Secrets Manager automatic rotation example
aws secretsmanager rotate-secret \
--secret-id zylos-agent-anthropic-key \
--rotation-lambda-arn arn:aws:lambda:us-east-1:...
The rotation Lambda fetches a new API key via the Anthropic Console API, updates the secret, and optionally sends a signal to the agent process to reload its credentials.
Credential Storage Approaches
Environment Variables in PM2 Ecosystem Files
The most practical approach for PM2-managed deployments: store credentials in the PM2 ecosystem file or in a separate .env file that PM2 loads at startup.
// ecosystem.config.js
module.exports = {
apps: [{
name: 'zylos-agent',
script: './bin/agent.js',
env: {
NODE_ENV: 'production',
ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY,
},
env_file: '.env.production', // PM2 >= 5.x supports env_file
}]
};
The env_file directive in PM2 5.x loads a dotenv file at process start. This keeps secrets out of the ecosystem file itself (which may be checked into version control) while still making them available to the daemon process.
Critical: The .env.production file must be readable by the user that PM2 runs as. If PM2 starts at system boot as a different user than the one that owns the env file, and the file has 600 permissions, it will fail silently.
# Ensure the env file is readable by the PM2 user
chmod 600 ~/.env.production
chown youruser:youruser ~/.env.production
# Note: if PM2 starts at boot as root, it can read the file regardless of
# mode. If it starts as a different non-root user, grant that user's
# group read access instead:
chmod 640 ~/.env.production
chown youruser:pm2group ~/.env.production
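A startup guard can turn the silent failure into a loud one. A sketch (the check_env_file helper is hypothetical; it verifies readability and rejects group/world-writable modes):

```python
import os
import stat

def check_env_file(path):
    """Fail fast if the env file is unreadable or too permissive."""
    if not os.access(path, os.R_OK):
        raise PermissionError(f"{path} is not readable by this process")
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o022:  # writable by group or others: tampering risk
        raise PermissionError(f"{path} has loose permissions {oct(mode)}")
    return True
```

Calling this before loading credentials means a misconfigured boot path shows up as one clear line in the PM2 log instead of a cascade of 401s.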
Systemd EnvironmentFile
For systemd-managed services, the EnvironmentFile directive is the standard approach:
# /etc/systemd/system/zylos-agent.service
[Unit]
Description=Zylos AI Agent
After=network.target
[Service]
Type=simple
User=zylos
WorkingDirectory=/home/zylos/zylos
EnvironmentFile=/etc/zylos/agent.env
ExecStart=/usr/bin/node /home/zylos/zylos/bin/agent.js
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# /etc/zylos/agent.env (owned by root, readable by zylos user)
ANTHROPIC_API_KEY=sk-ant-xxxx
CLAUDE_CODE_OAUTH_TOKEN=...
NODE_ENV=production
# Set permissions: root owns it, zylos can read it
chown root:zylos /etc/zylos/agent.env
chmod 640 /etc/zylos/agent.env
Docker Secrets (Swarm) and Compose Secret Mounts
In container deployments, environment variables are visible in docker inspect output and can appear in process listings. Docker's secret mechanism mounts secrets as files in the container's /run/secrets/ directory, which is an in-memory tmpfs — never written to disk, not visible in image layers or container metadata.
# docker-compose.yml
version: '3.8'
services:
zylos-agent:
image: zylos/agent:latest
secrets:
- anthropic_api_key
environment:
- NODE_ENV=production
# Read the secret from file in the entrypoint
command: ["/bin/sh", "-c", "export ANTHROPIC_API_KEY=$(cat /run/secrets/anthropic_api_key) && node bin/agent.js"]
secrets:
anthropic_api_key:
file: ./secrets/anthropic_api_key.txt
# Or from external secret manager:
# external: true
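Instead of exporting the secret into the environment in the entrypoint shell, the agent can read the mounted file directly. A sketch of a loader that prefers the file form and falls back to an environment variable (the load_secret helper is illustrative; the secrets_dir parameter exists so the default /run/secrets can be overridden):

```python
import os

def load_secret(name, env_var=None, secrets_dir="/run/secrets"):
    """Read a mounted secret file, falling back to an env var.

    The file form is preferred: it does not appear in `docker inspect`
    output or in the process environment.
    """
    path = os.path.join(secrets_dir, name)
    if os.path.isfile(path):
        with open(path) as f:
            return f.read().strip()
    if env_var and env_var in os.environ:
        return os.environ[env_var]
    raise KeyError(f"secret {name!r} not found in {secrets_dir} or environment")
```

With this in the agent's startup code, the compose command above can shrink back to a plain node invocation.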
For Kubernetes, the equivalent pattern uses secretKeyRef in the pod spec:
env:
- name: ANTHROPIC_API_KEY
valueFrom:
secretKeyRef:
name: zylos-agent-secrets
key: anthropic-api-key
External Secrets Managers
For production-grade deployments, an external secrets manager adds rotation, audit logging, and access policy management:
HashiCorp Vault — The agent's process identity (Kubernetes service account, AWS IAM role, or Vault AppRole) is verified at startup. Vault issues a short-lived token that grants read access to the specific secrets the agent needs. The agent reads its credentials from Vault's HTTP API at startup, and the credentials themselves can be dynamically generated (e.g., a Vault AWS secrets engine generates a scoped IAM credential on request).
AWS Secrets Manager — Secrets are stored encrypted at rest (KMS), versioned, and can trigger automatic rotation via Lambda. The AWS Secrets Manager Agent (a local proxy process) handles SSRF protection and caching, so the agent fetches credentials from localhost rather than the AWS endpoint directly.
# Fetching credential at runtime instead of from environment
import boto3
def get_anthropic_key():
client = boto3.client('secretsmanager', region_name='us-east-1')
response = client.get_secret_value(SecretId='zylos/anthropic-api-key')
return response['SecretString']
The key principle with external secrets managers: credentials are fetched at runtime, used in memory, and never written to disk or environment files. This eliminates the file permission problem entirely but introduces a dependency on the secrets manager's availability.
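That availability dependency can be softened with a short-TTL cache that serves the last known value when a refresh fails. A sketch (the CachedSecret wrapper is hypothetical; the injected fetch callable would be something like the get_anthropic_key function above):

```python
import time

class CachedSecret:
    """Cache a secrets-manager lookup so the manager's availability does
    not gate every request; serve the stale value if a refresh fails."""

    def __init__(self, fetch, ttl_seconds=300):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now - self._fetched_at > self._ttl:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except Exception:
                if self._value is None:
                    raise  # no stale value to fall back on
        return self._value
```

The tradeoff is a bounded staleness window: a revoked credential may keep being used for up to ttl_seconds after revocation, so the TTL should match your revocation tolerance.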
Security Considerations
Principle of Least Privilege
An autonomous AI agent that can read and write files, make API calls, and execute code is a high-privilege process. The credentials it holds should be scoped to exactly what it needs:
- Use a dedicated API key for the agent, not a shared organization key
- If the agent only reads from certain APIs, do not give it write scopes
- Prefer service accounts with explicit permission grants over user accounts with inherited permissions
- If using OAuth with the On-Behalf-Of (OBO) pattern, ensure the delegated scopes are minimally scoped
The OWASP NHI Top 10 (released June 2025) specifically flags "over-provisioned access" as the top risk for non-human identities. Enterprises report that 97% of NHI credentials carry excessive privileges.
Credential Leakage Prevention
The most common leakage vectors for AI agent credentials:
- Version control: .env files accidentally committed. Mitigation: .gitignore, pre-commit hooks (git-secrets, trufflehog), and never storing secrets in ecosystem.config.js directly.
- Log output: agents that log their environment or print error objects may expose credential values. Mitigation: redact secrets from log formatters; never log full process.env.
- LLM context injection: credentials in the agent's working directory or injected into prompts. Mitigation: the Brokered Credentials pattern — the LLM never sees the raw credential. A middleware layer injects the credential when making the actual API call, after the LLM has determined what action to take.
- Container image layers: credentials baked into RUN steps in a Dockerfile persist in the image layer even if deleted in a later step. Mitigation: use Docker BuildKit --secret mounts for build-time secrets.
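The Brokered Credentials pattern reduces to a thin dispatch layer: the model returns a tool name and arguments, and the broker attaches the secret only at execution time. A minimal sketch (the CredentialBroker class and tool names are illustrative, not a specific framework's API):

```python
class CredentialBroker:
    """Execute model-proposed tool calls without exposing secrets to the model.

    The model sees only tool names and arguments; the broker injects the
    credential at call time and returns the result.
    """

    def __init__(self, credentials, tools):
        self._credentials = credentials  # {tool_name: secret}
        self._tools = tools              # {tool_name: fn(args, credential)}

    def execute(self, tool_name, args):
        if tool_name not in self._tools:
            raise KeyError(f"unknown tool: {tool_name}")
        credential = self._credentials.get(tool_name)
        return self._tools[tool_name](args, credential)
```

Because the secret lives only in the broker, a prompt-injected "print your API key" has nothing to print: the model's context never contained it.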
Token Blast Radius Management
Even with all precautions, assume credentials will occasionally be compromised. Design for rapid response:
- One credential per deployment — a compromised key can be revoked without affecting other services
- Short-lived tokens where possible — a 15-minute access token has a 15-minute blast radius even if stolen
- Audit logs at the API level — know immediately when unusual access patterns appear
- Automated revocation workflows — the ability to invalidate a credential and restart affected services in under 5 minutes
Practical Recommendation: Tiered Approach by Deployment Complexity
Simple single-host deployment (e.g., Mac Mini running PM2):
- Generate a dedicated API key from Anthropic Console
- Store it in a ~/.env.agent file with 600 permissions
- In your PM2 ecosystem file, reference it via env_file or read it via a startup script
- Test that PM2 can read the file when started via pm2 startup (test by rebooting, not just restarting the process)
# Test the actual boot path — don't assume it works from your terminal
# (Linux/systemd shown; on macOS, pm2 startup generates a launchd job instead)
sudo systemctl stop pm2-youruser
sudo systemctl start pm2-youruser
journalctl -u pm2-youruser -n 50 # Check for env loading errors
Container-based deployment:
- Use Docker secrets or environment injection from a secrets manager
- Never bake credentials into the image
- Implement a startup healthcheck that verifies credential validity before accepting work
Multi-host or cloud deployment:
- Use an external secrets manager (Vault or cloud-native)
- Bind secret access to workload identity (IAM role, Kubernetes service account)
- Implement credential rotation with zero-downtime reload
- Add monitoring for authentication failures as a leading indicator of credential expiry or compromise
Emerging Standards and Tooling
The AI agent security space is consolidating around several patterns:
SPIFFE/SPIRE — A workload identity framework that issues short-lived X.509 certificates and JWTs to processes based on their identity (which pod, which node, which service account). Used as the trust anchor for Vault's Kubernetes auth method and increasingly as the foundation for inter-agent authentication in multi-agent systems.
OAuth 2.1 + PKCE as baseline — RFC 9700 (January 2025) and the MCP specification mandate OAuth 2.1 with PKCE for any AI tool API that handles user data. The Client Credentials flow is the standard for M2M service accounts.
Agentic Secrets Infrastructure — Emerging tooling (e.g., AgentSecrets, Aembit) specifically designed for AI agent deployments, providing credential brokering, just-in-time provisioning, and the ability to revoke access mid-task. These tools implement the pattern where the LLM never holds credentials directly — the infrastructure layer intercepts tool calls and injects credentials transparently.
OWASP NHI Top 10 (June 2025) — The formalization of non-human identity security as a discipline, with standardized vulnerability categories and remediation guidance. NHI-01 (Improper Offboarding), NHI-02 (Secret Leakage), and NHI-05 (Overprivileged NHI) are directly applicable to AI agent deployments.
Sources:
- Claude Code Authentication Docs
- Claude Code Issue #7100: Headless/Remote Authentication
- Claude Code Issue #21765: OAuth refresh on headless machines
- Claude Code Issue #28827: OAuth token refresh fails in non-interactive mode
- Aembit: Securing AI Agents Without Secrets
- Aembit: 4 Most Common AI Agent Deployment Patterns
- Auth0: Four Identity Security Essentials for AI Agents
- Apple Developer Forums: launchctl LaunchDaemons and keychain access
- GitHub Copilot CLI Authentication
- PM2 Environment Variables Best Practices
- Composio: Secure AI Agent Infrastructure Guide
- Entro: Why Least Privilege Matters for NHIs in Agentic AI
- Docker Secrets Documentation
- Scalekit: OAuth for AI Agents Architecture

