Session Bootstrap Context Budgets: Designing What Persistent AI Agents Load at Startup
Executive Summary
Every long-lived agent framework converges on the same shape for session bootstrap, even though none of them describe it in exactly these terms: a small, stable, cacheable prefix of standing instructions, a pointer layer to larger stores of context that get pulled in on demand, and a volatile tail of recent history that changes every turn. The size of the stable prefix is consistently kept small - hundreds to low-thousands of tokens for "core" instructions (Letta's core memory blocks, Anthropic's own CLAUDE.md guidance, OpenAI Codex's project_doc_max_bytes cap) - not because larger context windows can't technically hold more, but because two independent effects punish bloat: (1) context rot, the measurable, non-uniform degradation in model reliability as input grows, which Chroma's research shows begins well before the nominal context limit (accuracy drops from ~95% to ~60% on some tasks, and 30%+ drops when the needed information sits mid-document) - the classic "lost in the middle" U-shaped curve; and (2) caching economics - Anthropic's prompt cache makes cached tokens ~10x cheaper than fresh ones, but only if the prefix is byte-identical across calls, so anything volatile (timestamps, live state, freshly summarized history) must be pushed after the stable instructions, not mixed into them.
The dominant architectural pattern across Claude Code, Codex, Letta/MemGPT, LangGraph, CrewAI, and Manus is tiered memory with progressive disclosure: a lean, always-injected identity/instruction layer; a pointer/reference layer that names where more detail lives (files, archival stores, vector DBs) without inlining it; and true content loaded "just-in-time" only when a specific task needs it. Ordering matters for two independent reasons - attention economics (recency bias means content placed right before generation gets disproportionate weight, which is why Manus's "recitation" trick works, and why volatile state/history should sit closest to the end) and caching economics (stable content must sit first and never change, or the entire downstream cache invalidates). Failure modes are well documented: bloated bootstraps that eat a large share of the window before the user says anything, triggering premature compaction; stale injected state that contradicts live reality (formalized in the recent STALE benchmark, where agents act confidently on invalidated memories); and instruction dilution, where a CLAUDE.md exceeding tens of thousands of tokens gets silently ignored because it exceeds the model's "effective" instruction-following capacity (empirically ~150-200 instructions per Claude Code's own system prompt budget).
For a platform like Zylos, the actionable synthesis is: keep identity + state + references as a hard-capped, cacheable prefix (order: identity -> references/pointers -> state, all stable and rarely-changing); treat recent conversation history as the necessarily-volatile tail that goes last, right before the live turn, to exploit recency for "what just happened"; enforce byte-level stability in the cacheable segment (no live timestamps, no reformatted/reordered content) to preserve Anthropic prompt-cache hit rate; and use size gates plus tiered pointer files (references.md -> reference/*.md) rather than ever inlining full history or full profiles at boot.
The Design Space
Every persistent agent framework must answer the same question at process start: of everything the agent could theoretically know, what actually goes into the first turn's token budget? This is the cold-start context injection problem, and it has three competing pressures:
- Completeness - the agent should behave consistently with its accumulated identity, obligations, and user relationship, which argues for injecting as much standing context as possible.
- Model reliability - every framework surveyed here has independently discovered that stuffing more tokens into the front of a context window does not scale linearly in usefulness; it actively degrades reliability on unrelated tasks later in the same context (context rot).
- Cost/latency - tokens injected at boot are paid for on every single session, multiplied by however many times the agent restarts (compaction, rotation, crash recovery), so bootstrap size is a recurring cost multiplier, not a one-time fee.
These pressures push designs toward the same solution independently: separate "must always be true" (identity, rules) from "might be relevant" (memory, history) from "definitely relevant right now" (the live task), and load them with decreasing eagerness. This is the load-bearing idea behind nearly every framework below.
How Major Frameworks Handle It: Comparison
| Framework | Bootstrap mechanism | What's always loaded | What's deferred / pointer-based | Documented limits |
|---|---|---|---|---|
| Claude Code | SessionStart hook injects hookSpecificOutput.additionalContext before first user turn; CLAUDE.md read as project doc | CLAUDE.md (project instructions), hook output (e.g., git status) | Referenced files (@import), skills loaded on demand | Warns/degrades near 40KB CLAUDE.md; recommends <200 lines; auto-compact historically ~95% of window, community pressure moved practical thresholds to 66-75% used, ~33-45K token reserved buffer |
| OpenAI Codex CLI | Instruction chain built once per session: AGENTS.override.md -> AGENTS.md, walked from repo root down to cwd | First non-empty AGENTS.md per directory level, first turn only | project_doc_fallback_filenames for extra files | project_doc_max_bytes config (example: 65536 bytes) caps how much of each file is read |
| Letta / MemGPT | OS-inspired "core memory" always resident in context; agent self-edits via function calls | Core memory blocks: persona, human, custom - labeled, size-limited | Archival memory (Postgres+pgvector), retrieved by similarity search on demand; recall memory for full conversation history | Core memory blocks recommended at "a few hundred tokens at most" per block, hard per-block token caps configurable |
| LangGraph | Checkpointer restores thread-scoped state automatically; Store is separate, explicit read/write | Whatever the last checkpoint captured for thread_id (structural, automatic) | Cross-thread long-term memory in namespaced Store - deliberately not automatic; developer decides what/when to write and retrieve | No fixed numeric limits; explicit design choice that checkpoint (short-term) is structural/automatic while long-term memory is a product decision |
| CrewAI | Unified Memory class assembles short-term (ChromaDB/RAG, current run), long-term (SQLite, cross-session lessons), entity memory (RAG) | Short-term memory active for current crew execution | Long-term/entity memory retrieved via RAG only when relevant | No published token budgets; retrieval-based, so budget is bounded by top-k retrieval config |
| OpenHands | Event stream + "condenser" plugin decides whether to compress history before each LLM call | keep_first N initial events always preserved | Older events replaced by LLM-generated summary once max_size exceeded; attention_window bounds unmasked recent observations | Condensation cuts API cost ~50%, changes cost scaling from quadratic to linear over a long session |
| Manus | File-system-as-memory; "recitation" (rewriting todo.md) pushes goals into recent attention; strict KV-cache-stable prefix | System prompt + stable prefix (never mutated in-place, append-only) | Full tool outputs / web pages / documents dropped from context but referenced by path/URL, restorable on demand | Cached vs uncached token cost quoted at 10x difference (Claude Sonnet: $0.30 vs $3.00/MTok); avg input:output ratio ~100:1 in production agents |
| Anthropic (general agent guidance) | No single mechanism prescribed; "context engineering" principles apply to whatever harness is used | "Minimal high-signal" system prompt sections (background_information, instructions) | "Just-in-time" retrieval via lightweight identifiers (file paths, queries, links); structured note files (e.g. NOTES.md) pulled back in as needed | No universal token %, but explicit compaction pattern: summarize + carry forward "the five most recently accessed files"; sub-agent reports condensed to ~1,000-2,000 tokens |
A pattern is visible across all seven: the "always loaded" tier is deliberately small and either hard-capped (Codex's byte limit, Letta's per-block token limit, Claude Code's 40KB warning) or automatically pruned (OpenHands condenser, LangGraph checkpoint-only). Nobody's default architecture inlines the full history or full user profile at boot; the differentiator is how the deferred tier is addressed - function-call retrieval (Letta), RAG (CrewAI), explicit store reads (LangGraph), or file-path pointers (Manus, Anthropic's NOTES.md pattern, and structurally, Zylos's own references.md).
Context Budget Economics: Caching and Degradation
Prompt caching mechanics (Anthropic). A cache breakpoint is a cache_control: {"type": "ephemeral"} marker on a content block; up to four breakpoints are allowed, with a default 5-minute TTL or an optional 1-hour TTL for bursty/batch workloads. Anthropic hashes the rendered prompt in the order tools -> system -> messages, up to each breakpoint, and reuses cached compute when the hash matches a prior request. Critically, this is a prefix match on byte-identical content - a single-token difference anywhere before a breakpoint invalidates the cache from that point forward. This has a direct architectural consequence for session bootstrap: anything that changes between sessions (timestamps, live state values, freshly-fetched data) must be placed after the stable prefix, or every session pays full, uncached price for the entire prompt. Manus's engineering write-up quantifies the stakes concretely: with Claude models, cached input tokens run $0.30/MTok vs $3.00/MTok uncached - a 10x difference - and explicitly warns against putting second-precision timestamps at the front of a prompt for exactly this reason. (Source: Anthropic prompt-caching docs, platform.claude.com; Manus engineering blog.)
Context rot. Chroma's technical research tested 18 frontier models (GPT-4.1, Claude 4, Gemini 2.5, Qwen3, etc.) and found that every one degrades measurably as input length grows - and that this degradation is non-uniform and begins well before the nominal window limit (a 200K-token model can show significant degradation by 50K tokens). Degradation severity depends on needle-question similarity, presence of distractors, and haystack structure, not just raw length. This directly undercuts the intuitive argument "context windows are huge now, so injection size doesn't matter" - reliability cost accrues continuously, not just at overflow.
Lost in the middle. The foundational finding (Liu et al., 2023/2024, TACL) is a U-shaped performance curve: models retrieve information best when it's at the very start (primacy) or very end (recency) of context, and significantly worse when it's buried in the middle - true even for models explicitly built for long context. Mechanistically, this is linked to attention sinks, where early tokens absorb disproportionate attention regardless of informational salience, compounding through layers, combined with a recency pull from causal masking and induction-head dynamics. This is not a training artifact that will simply be fixed by better models - a 2026 theoretical paper ("A Structural Theory of Position Bias in Transformers") argues the primacy/recency tilt is architecturally inherent to the causal-attention mechanism itself.
Practical numeric anchors. Several independent sources converge on similar order-of-magnitude figures for "how much standing instruction is too much":
- System prompts: a commonly cited sweet spot is 200-2,000 tokens for well-defined, consistent behavior; production conversational systems often target under 1,500 tokens.
- Response latency degrades gradually starting ~2,000 tokens even before accuracy does.
- Claude Code's own CLAUDE.md guidance: under ~200 lines / 40KB, with the rationale that "frontier models reliably follow only about 150-200 instructions, and Claude Code's system prompt already uses roughly 50 of them" - i.e., there is a real, finite instruction-following budget shared between the harness's own system prompt and user-supplied standing instructions, not just a raw token ceiling.
- Letta/MemGPT core memory blocks: recommended at "a few hundred tokens at most" per block, deliberately far smaller than the model's context window, because core memory competes with everything else the agent needs to reason about in the same turn.
These numbers should be read as order-of-magnitude anchors, not hard universal constants - they vary by model and task - but the directional agreement across independent teams (Anthropic, Letta, Claude Code linting tools, third-party prompt-engineering benchmarks) is a stronger signal than any single number.
Injection Ordering Effects
Two distinct mechanisms both argue for the same practical rule - stable-and-general first, volatile-and-specific last - but for different reasons, and it's worth keeping them separate because they interact:
-
Caching (byte-level). The cache is a literal prefix hash. Anything that must be identical across many sessions (identity, standing rules, rarely-changing references) belongs as early as possible and must never be reformatted, reordered, or interpolated with live values (no embedded "as of {timestamp}" at the top). Anything that changes constantly (today's conversation, live state that gets rewritten) belongs after the last cache breakpoint.
-
Attention (semantic/positional). Independent of caching, transformer attention exhibits primacy bias (early-token attention sinks) and recency bias (the most recent tokens before generation get outsized weight in shaping the next output), with a well-documented trough in between. This is why Manus's "recitation" pattern - repeatedly rewriting a todo.md-style summary of the current goal and appending it near the end of context - works: it deliberately exploits recency bias to keep long-horizon goals from being "lost in the middle" as the agent's own tool-call history grows. The same logic implies that injecting recent conversation history last, right before the live turn, is not just conventional but functionally important - it's the piece of context that most needs to be "front of mind" for producing a contextually appropriate reply, whereas identity and standing rules are exactly the kind of stable, rarely-decisive-in-the-moment content that can tolerate sitting in the (weaker) primacy zone as a smaller, front-loaded block.
-
Instruction hierarchy (a related but distinct concept). OpenAI's "Instruction Hierarchy" work (and the corresponding Model Spec) establishes a priority ordering - system > developer > user > tool - for conflict resolution, not physical token placement. It's worth distinguishing from the ordering effects above: instruction hierarchy is about which instruction wins when two conflict; primacy/recency and caching are about where content should physically sit in the token stream. A well-designed bootstrap should encode both: the highest-priority instructions (identity/behavioral rules) should also generally sit in the stable, front-loaded, cacheable block, which happens to align nicely - but the two concerns are logically separable, and a designer shouldn't assume solving one solves the other.
Tiered Memory and Progressive Disclosure
The consistent pattern across frameworks is a two-or-three-tier memory design:
- Tier 1 - always resident, lean. Letta's core memory blocks; Claude Code's CLAUDE.md; Zylos's own identity.md/state.md/references.md. Kept deliberately small (hundreds to low-thousands of tokens) because it is paid for and processed on every turn of every session.
- Tier 2 - pointer/reference layer. Not the content itself, but where to find it: file paths, archival-memory query hooks, vector-DB namespaces. Zylos's references.md is explicitly this tier - "a pointer file... point to the source config file instead" rather than duplicating values. Anthropic's guidance names this "just-in-time retrieval": maintain lightweight identifiers and dynamically load data at runtime rather than pre-processing everything upfront.
- Tier 3 - on-demand full content. Loaded only when Tier 1/2 signals relevance: a user profile read only when interacting with that user; a decisions.md read only when making a related decision; archival memory retrieved only via similarity search when the agent's own recall triggers it.
A concrete real-world instance of this two-tier design (found in an independent agent-memory write-up, not Zylos-specific) matches this almost exactly: a project instruction file read every session records "accomplished goals, open questions, key formulas... and documented dead ends," while full topic detail is retrieved on demand via keyword search - explicitly designed to "keep the agent's context window small at startup while providing access to the entire accumulated knowledge base when a specific topic becomes relevant." This is architecturally identical to the identity/state/references (always) vs. reference/.md and users//profile.md (on-demand) split.
Structured note-taking as a variant. Anthropic's guidance and OpenHands both converge on maintaining a living, periodically-rewritten note/summary artifact (NOTES.md, progress summaries, claude-progress.txt) that gets regenerated rather than accumulated - the file is small and current by construction, rather than growing monotonically and needing separate pruning. Anthropic's "effective harnesses for long-running agents" article documents claude-progress.txt explicitly as the memory bridge between context windows: the agent that finishes a session writes it, the agent that starts the next session reads it first, before touching code - a clean instance of a deliberately ordered, minimal, regenerated bootstrap artifact.
Truncation, Summarization, and Compaction Strategies
Several distinct techniques recur, at different points in the pipeline:
- Tool-result / stale-output clearing. Described by Anthropic as "one of the safest, lightest-touch forms of compaction" - removing redundant tool outputs from message history without touching the semantic summary of what happened. Manus's equivalent is "restorable compression": drop the full body of a web page or file but keep the URL/path, so the information isn't permanently lost, just deferred back to Tier 2/3.
- LLM-generated rolling summarization. OpenHands's condenser replaces dropped events with an LLM-generated summary once history exceeds max_size, while always preserving the first keep_first events (typically the task framing) and an attention_window of the most recent observations unmasked. This changes total-session cost scaling from quadratic to linear and cut OpenHands's own API costs roughly 50% in their reported benchmark.
- Whole-window compaction with carryover. Claude Code's approach (per Anthropic's own context-engineering write-up): summarize the full context, discard the rest, and reinitiate with the summary plus a small explicit carryover - concretely, "the five most recently accessed files" - rather than trying to summarize everything with equal fidelity. The guidance explicitly recommends "maximize recall first, then iterate to improve precision" when tuning what a compaction summary preserves (architectural decisions, unresolved bugs, implementation details are called out as must-keep).
- Sub-agent result condensation. When work is delegated to a sub-agent that may internally consume tens of thousands of tokens, the parent context only receives a condensed summary - reported at roughly 1,000-2,000 tokens - deliberately re-establishing the tiered-budget principle at the sub-agent boundary, not just at session-boot.
- Byte/line hard caps as a blunt instrument. Codex's project_doc_max_bytes and Claude Code linters' 40KB CLAUDE.md warning threshold are simpler, non-semantic truncation strategies: rather than summarizing, they simply refuse to read past a limit. This is crude but predictable, and useful as a hard backstop underneath smarter summarization.
Failure Modes and Anti-Patterns
- Bloated bootstrap causing premature compaction. If the always-loaded tier grows large (long CLAUDE.md, verbose session-start hook output, an over-eager conversation-history injection), it eats into the effective window before the user's actual request even arrives, pushing auto-compaction thresholds closer and forcing more frequent, lossy summarization cycles. Community reports around Claude Code's auto-compact threshold (historically defaulting near 95% of window use, with practitioners pushing custom thresholds to 40-70% for more headroom, and a documented reduction of the reserved buffer from ~45K to ~33K tokens) show this is a live, actively-tuned tradeoff in production, not a solved problem.
- Instruction dilution / silent truncation. An extreme documented case: a 40,000-line CLAUDE.md was effectively ignored by the model because it exceeded both hard byte limits (triggering truncation) and the model's practical instruction-following capacity (~150-200 instructions). The failure is silent - the model doesn't error, it just quietly drops fidelity to instructions, which is more dangerous than an explicit error because it's not immediately visible to the operator.
- Stale injected state vs. live reality. The recent STALE benchmark formalizes this: agents that retrieve or receive an injected memory rarely fail to find it, but regularly act confidently on an outdated belief once a later observation has implicitly invalidated it (no explicit correction was ever logged). Its authors report best-model accuracy of only 55.2% on detecting such implicit staleness, and recommend that any retrieved/injected content be treated as "a hypothesis until re-checked against the live environment," with an explicit staleness penalty and confidence gating rather than injecting it as unconditional fact. This maps directly onto the risk of injecting a state.md snapshot at boot without also cheaply verifying live state where it's cheap to check (e.g., don't inject "pending task X" as fact if a quick live check would show it already completed).
- Duplicated context. Both Zylos's own guidance (references.md is a pointer file; never duplicate .env values in it) and Manus's file-system philosophy converge on the same anti-pattern: copying live values into a summary file creates two sources of truth that can drift, and doubles token cost for content that's already available at Tier 2/3. The correct pattern is a pointer, not a copy.
- Cache invalidation from non-deterministic serialization. A subtle but costly failure: if the bootstrap-context assembly process reorders keys, reformats timestamps to second precision, or interpolates any per-request value into the "stable" segment, every request pays the ~10x uncached price silently, with no functional bug ever surfacing - only an invisible cost regression. Manus explicitly calls out deterministic serialization (e.g., JSON key ordering) as a caching requirement, not just a nicety.
- Confusing priority ordering with placement ordering. A design can correctly rank "identity rules override user requests" (instruction hierarchy) while still botching physical placement (e.g., putting the highest-priority rules in the attention trough of a long prompt, or putting them after volatile content that breaks the cache). These are separate axes and a bootstrap design needs to satisfy both.
Actionable Recommendations for Agent Platform Designers
- Cap the always-loaded tier explicitly and enforce it mechanically, not just by convention - a byte/line limit (Codex-style project_doc_max_bytes, Claude-Code-linter-style 40KB warning) catches drift that "please keep it short" documentation won't. Treat identity + state + references collectively as one budget, not three independent unlimited files.
- Order the stable prefix first, and never mutate it per-request. Identity -> references/pointers -> state should be assembled deterministically (stable key order, no live timestamps, no per-session interpolation) so it forms one unbroken cacheable prefix. Anything genuinely volatile (recent conversation history, freshly fetched live data) belongs strictly after it, ideally right at the end, both to preserve the cache and to exploit recency bias for the content that most needs to be "front of mind" for the current turn.
- Use pointers, not copies, for anything with a canonical source elsewhere - env files, per-user profiles, project docs. This avoids both duplicated-context bloat and the stale-copy failure mode; a pointer forces a fresh read at time of actual need rather than trusting a snapshot that may already be wrong.
- Treat injected "state" as a hypothesis, not fact, when correctness matters and a live check is cheap. Where the state file claims something time-sensitive (a task is pending, a service is running), prefer a cheap live verification over trusting the snapshot, following the STALE finding that agents rarely fail to retrieve stale memory - they fail to doubt it.
- Regenerate summaries rather than accumulate them. Progress/state files should be periodically rewritten to reflect current truth (Anthropic's claude-progress.txt pattern, OpenHands's condenser replacing rather than appending), so the always-loaded tier stays small and current by construction, instead of needing separate staleness detection later.
- Reserve headroom below the compaction threshold, and treat the bootstrap budget as a fixed tax against it. If auto-compaction fires at X% of window, the bootstrap injection should be sized as a small, known fraction of the remaining room (post-system-prompt), not sized independently - a growing bootstrap silently shrinks the effective working window available for the actual task before compaction kicks in.
- Separate "priority for conflict resolution" from "position in the token stream" as two explicit design decisions, and don't assume solving one solves the other - high-priority behavioral rules still need a physical placement strategy that accounts for attention decay and caching, not just a semantic ranking.
References
- Anthropic, "Effective context engineering for AI agents" - https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents - system prompt altitude calibration, just-in-time retrieval, compaction, sub-agent summary sizing.
- Anthropic, "Effective harnesses for long-running agents" - https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents - claude-progress.txt bootstrap pattern, session-start diagnostic sequence.
- Anthropic, "Prompt caching" - https://platform.claude.com/docs/en/build-with-claude/prompt-caching - cache breakpoints, TTL, byte-identical prefix requirement.
- Yichao "Peak" Ji / Manus, "Context Engineering for AI Agents: Lessons from Building Manus" - https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus - KV-cache economics (10x cost differential), stable-prefix requirements, recitation/attention manipulation, restorable compression, file-system-as-memory.
- Chroma, "Context Rot: How Increasing Input Tokens Impacts LLM Performance" - https://www.trychroma.com/research/context-rot - 18-model evaluation, degradation well before context-limit overflow.
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (TACL 2024) - https://arxiv.org/abs/2307.03172 (code at https://github.com/nelson-liu/lost-in-the-middle) - U-shaped primacy/recency retrieval curve.
- "On the Emergence of Position Bias in Transformers" - https://arxiv.org/html/2502.01951 and "A Structural Theory of Position Bias in Transformers" - https://arxiv.org/pdf/2602.16837 - architectural origins of primacy/recency (attention sinks, causal masking, induction heads).
- OpenAI, "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions" - https://openai.com/index/the-instruction-hierarchy/ and the Model Spec - https://model-spec.openai.com/2025-12-18.html - system > developer > user > tool conflict-resolution ordering.
- OpenAI Developers, "Custom instructions with AGENTS.md" - https://developers.openai.com/codex/guides/agents-md and "Configuration Reference" - https://developers.openai.com/codex/config-reference - AGENTS.override.md discovery order, project_doc_max_bytes, project_doc_fallback_filenames.
- Leonie Monigatti, "Virtual context management with MemGPT and Letta" - https://www.leoniemonigatti.com/blog/memgpt.html and "MemGPT: Towards LLMs as Operating Systems" - https://www.leoniemonigatti.com/papers/memgpt.html - core memory vs. archival memory architecture, per-block token limits.
- LangChain, "Memory overview" - https://docs.langchain.com/oss/python/concepts/memory and Forum thread "Separate Long term memory and Checkpointing" - https://forum.langchain.com/t/separate-long-term-memory-and-checkpointing/1668 - checkpointer (automatic, thread-scoped) vs. Store (explicit, cross-thread) distinction.
- CrewAI, "Memory" - https://docs.crewai.com/en/concepts/memory and DeepWiki, "Memory Configuration and Storage" - https://deepwiki.com/crewAIInc/crewAI/7.2-memory-configuration-and-storage - short-term (ChromaDB), long-term (SQLite), entity memory tiers.
- OpenHands, "OpenHands Context Condensation for More Efficient AI Agents" - https://www.openhands.dev/blog/openhands-context-condensensation-for-more-efficient-ai-agents and "Context Condenser" docs - https://docs.openhands.dev/sdk/guides/context-condenser - keep_first, max_size, attention_window, linear vs. quadratic cost scaling, ~50% cost reduction.
- arXiv, "STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?" - https://arxiv.org/abs/2605.06527v1 - implicit-conflict staleness failure mode, 55.2% best-model accuracy, staleness-penalty/confidence-gating recommendation.
- claudelint, "claude-md-size" - https://claudelint.com/rules/claude-md/claude-md-size - 40KB default threshold, 30KB early warning, modular .claude/rules/ + @import recommendation.
- Hacker News discussion, "What's the right size claude.md file in your experience?" - https://news.ycombinator.com/item?id=45689618 - practitioner consensus on <200 lines, ~150-200 effective instruction budget, 40,000-line CLAUDE.md anecdote.
- MindStudio, "How to Use Session Start Hooks to Force Context Into Every Claude Code Session" - https://www.mindstudio.ai/blog/session-start-hooks-claude-code-force-context and "Progressive Disclosure in AI Agents" - https://www.mindstudio.ai/blog/progressive-disclosure-ai-agents-context-management - SessionStart / additionalContext mechanics, progressive-disclosure framing.
- GitHub, anthropics/claude-code issues #15719 (https://github.com/anthropics/claude-code/issues/15719), #41818 (https://github.com/anthropics/claude-code/issues/41818), #28728 (https://github.com/anthropics/claude-code/issues/28728) - auto-compact threshold configurability debates and default values (~95% default, practitioner-preferred 40-70%, ~33-45K token reserved buffer).
Note: some cited blog/aggregator sources (e.g., MindStudio, IntuitionLabs, particula.tech) synthesize primary research rather than being primary sources themselves; where possible the underlying primary source (Anthropic, Manus, Chroma, arXiv papers) is also listed and should be treated as the higher-confidence reference. Specific numeric claims attributed to informal blog posts (e.g., exact latency-per-token-count figures) should be treated as illustrative rather than rigorously benchmarked.

