The Context Rot Problem: Why AI Coding Agents Get Worse As They Work — and What to Do About It
A research paper on why long-running AI coding sessions degrade in quality — and a concrete architecture (Progressive Prompt Ephemerality) for keeping context windows clean as agents accumulate work.
Ph.D., Co-Founder & CTO, Empromptu

AI coding agents degrade over long sessions because cache economics push context to grow without bound. This paper diagnoses the root tension (token-caching incentives vs cognitive quality), surveys five current mitigations, and introduces Progressive Prompt Ephemerality — a tiered context architecture that resolves both axes.
Context Rot is the gradual degradation of model output quality that occurs as accumulated, irrelevant content crowds the working context window over a long session. It is the structural consequence of two opposing pressures: token-caching economics reward keeping context additive and stable, while attention-based models degrade as context grows with irrelevant material. Progressive Prompt Ephemerality (PPE) is a tiered context architecture that resolves the tension — system instructions are permanently cached, session history is maintained as a slowly-growing incremental summary, and current-turn content is kept ephemeral and discarded after distillation. This paper names the tension, explains its mechanical origins, evaluates compaction-based mitigation, and proposes PPE as the architectural alternative.
1. The Long-Session Problem
Anyone who has used an AI coding agent for an extended session has experienced a familiar pattern: the tool starts sharp and responsive, surfaces the right abstractions, and seems to hold the whole project in mind. Then, gradually, something changes. It begins repeating suggestions. It forgets constraints established early in the session. It starts generating code that conflicts with patterns it helped create an hour ago. It loses the thread.
This degradation is not a bug in the ordinary sense. It is a predictable consequence of how large language models process information, and it interacts badly with the architectural choices most agentic systems make about context management.
Understanding why this happens, and what could be done differently, requires engaging with two bodies of technical knowledge: the performance characteristics of attention-based models, and the economic mechanics of token caching.
2. How Attention-Based Models Use Context
Large language models based on the transformer architecture do not read context the way a human reads a document. A human can put one book down and pick up another; they can skim, skip, and selectively attend based on relevance to the task at hand. Transformer models cannot. Every token in the context window participates in the attention computation for every generated token, regardless of whether it is relevant to the current task.
This has a direct and measurable consequence: context that is irrelevant to the current task degrades model performance, even when it does not contradict anything in the task description. The model is not ignoring the irrelevant material — it cannot. It is forced to integrate it into every inference step.
Research has documented what practitioners call the "lost in the middle" phenomenon: model performance on retrieval and reasoning tasks drops substantially when the relevant information is surrounded by large amounts of irrelevant context, even when the total context remains within the model's nominal window limit. The degradation is not linear, but it is consistent. Independent practitioner write-ups from Chroma Research, Diffray, and Redis have converged on the same observation under different names — context dilution, context rot, context degradation — all pointing at the same underlying mechanism.
The practical implication is direct: if you can accomplish a task with N tokens of well-chosen context, adding K tokens of material that is not relevant to that specific task — resolved prior turns, tool definitions for tools not in use, file contents not germane to the current edit — will produce a worse result. The performance penalty scales with K. This relationship is what we mean by Context Rot: the gradual degradation of model output quality as accumulated, irrelevant material crowds the working context.
3. The Token Caching Incentive
Set against this cognitive dynamic is an economic one that pushes in exactly the opposite direction.
Modern AI API providers, including Anthropic and OpenAI, offer prompt caching: a mechanism that dramatically reduces the cost of re-processing context that has been seen before. When a request begins with a prefix of tokens that is identical to a prefix seen in a recent prior request, the provider can reuse internal model states computed during that prior request rather than recomputing them from scratch. The cost reduction is substantial: up to around 90% for the cached portion of the input, depending on the provider and model.
The critical constraint is directional: caching only applies to an unbroken prefix. If a prior request contained tokens `[A B C D E F G]`, then a new request beginning with `[A B C D E]` benefits fully from caching. A request beginning with `[A B C D E Z]` still benefits on `[A B C D E]`. A request beginning with `[B C D E F G]` — the same content, but without the original prefix — receives no cache benefit at all and is billed at full input token cost.
This constraint creates a powerful economic incentive to keep early context stable and unchanging. Every time a system modifies or removes content from the beginning or middle of the context window, it breaks the prefix and forfeits cached tokens. Every modification requires those tokens to be re-processed and re-cached at full cost.
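The prefix rule described above can be sketched in a few lines. This is a minimal illustration, not any provider's billing logic: the function names and the 0.9 discount are assumptions, tokens are modeled as single letters, and real providers add details (minimum cacheable lengths, cache expiry) that are ignored here.

```python
from typing import Sequence


def cached_prefix_len(previous: Sequence[str], current: Sequence[str]) -> int:
    """Count how many leading tokens of `current` match `previous` exactly."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the first mismatch ends the reusable prefix
        n += 1
    return n


def input_cost(previous: Sequence[str], current: Sequence[str],
               cached_discount: float = 0.9) -> float:
    """Relative input cost of `current`, where 1.0 is one uncached token.

    `cached_discount` is an assumed figure for illustration only.
    """
    hit = cached_prefix_len(previous, current)
    miss = len(current) - hit
    return hit * (1.0 - cached_discount) + miss


prior = list("ABCDEFG")
for request in ("ABCDE", "ABCDEZ", "BCDEFG"):
    tokens = list(request)
    print(request, cached_prefix_len(prior, tokens),
          round(input_cost(prior, tokens), 2))
```

Under these assumptions the three requests from the example reuse 5, 5, and 0 tokens respectively: dropping the first token costs more than appending a new one, even though the third request shares almost all of its content with the prior one.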
Agentic coding systems have rationally responded to this incentive by making their context management additive by default: new turns are appended, tool call results are appended, reasoning traces are appended. The prefix stays intact and keeps accumulating cache hits. The bill for each incremental turn is low, because only the new material at the end needs fresh processing.
The consequence, of course, is that context grows without bound — and Context Rot sets in.
4. The Structural Tension
The two dynamics above are in direct conflict:
- Cognitive performance demands that context be kept compact, relevant, and free of resolved or irrelevant material.
- Cache economics reward keeping context additive, stable, and long, because any modification breaks the prefix and triggers re-processing costs.
Mainstream agentic tools have, largely by necessity, optimized for the cache economics. The result is systems that are cheap to run turn-by-turn but accumulate cost and performance degradation over a session. Users experience this as the "long-session problem" described in the introduction. Teams paying close attention to costs notice that their monthly AI bills climb in ways that are difficult to attribute to specific tasks.
Both professional intuitions (i.e. "this is getting expensive" and "this is getting less accurate") are correct, and they share this common root cause.
The five context-management strategies compared below sit at different points on this trade-off. None of the first four resolves the tension; each pushes it onto one axis or the other. PPE, the last entry, is the first pattern to align both:
Comparison: Strategy × 5 dimensions

| Strategy | Cache-cost behavior | Late-session cognitive quality | Information fidelity | Implementation complexity | Resolves the tension? |
|---|---|---|---|---|---|
| Pure additive (no compaction) | Optimal turn-by-turn | Severe Context Rot | Full (but unusable) | Trivial | No (cognitive collapse) |
| Reactive compaction (Claude Code) | Sawtooth (cache resets) | Degrades pre-compaction | Lossy (20-98% loss) | Moderate | Partial mitigation |
| Context Editing (Anthropic beta) | Sawtooth, finer-grained | Better than reactive | Operator-controlled | High | Partial, finer control |
| Active folding (Context-Folding, AgentFold) | Stable (subtree-ephemeral) | Strong | Subtask-summarized | High | On the subtask axis |
| Progressive Prompt Ephemerality | Stable, growing prefix | Compact, acutely relevant | Lossless via recall | High | Yes |

The remainder of this paper explains why PPE is the right point in the design space and why no mainstream tool implements it today.
5. Compaction: The Current Mitigation and Its Limits
The state-of-the-art response to this tension in current tools is compaction: a process by which the accumulated conversation history is summarized into a shorter representation, and the original history is replaced with that summary.
Claude Code, for example, implements a tiered compaction system. At low token pressure, it rearranges content to maximize cache hits at no LLM cost. As pressure increases, it truncates oldest messages (LRU-style), then performs staged summarization, and ultimately, at high pressure, launches a dedicated LLM subagent to produce a narrative summary of the full conversation history. Manual compaction is also available at any time via the `/compact` command. Empirically, Claude Code triggers automatic context compaction at approximately 80-95% of the context window, or roughly 160K-190K tokens in a 200K context.
At the API level, Anthropic has introduced a Context Editing beta that exposes more direct control, including the ability to clear older tool results and trigger server-side compaction at configurable thresholds. The OpenHands software agent documents an analogous approach — its `LLMSummarizingCondenser` has been shown to reduce API costs by up to 2× with no degradation in agent performance on the specific tasks studied.
These are genuine improvements. However, they leave the fundamental tension unresolved for several structural reasons.
5.1 Compaction Is (Mostly) a Reset, Not a Roll
The critical architectural distinction is this: compaction replaces the conversation prefix with a summary. By definition, the summary is a different sequence of tokens from what it replaced. This breaks the prefix cache entirely at the point of compaction (though it may preserve tool definitions and other unchanging inputs). The new summary must be re-processed and re-cached at full cost before subsequent turns can benefit from caching again. Every compaction event therefore incurs a cost spike.
This means the pattern of costs in a compaction-based system is not smooth — it is sawtooth-shaped: gradual accumulation of cache-cheapened turns, followed by a compaction event that resets the cache and causes a one-time re-processing cost, followed by gradual accumulation again.
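The sawtooth described above can be reproduced with a toy simulation. This is a sketch under simplifying assumptions, not a model of any specific tool: every turn adds the same number of tokens, everything before the new tokens is assumed to be a cache hit, and the cost of generating the summary itself is left out. The function name and all numbers are hypothetical.

```python
from typing import List, Tuple


def simulate_compaction(
    turns: int,
    tokens_per_turn: int,
    limit: int,
    summary_tokens: int,
) -> List[Tuple[int, int]]:
    """Return (context_size, uncached_tokens) for each turn of a toy session.

    Each turn appends `tokens_per_turn` new tokens. Earlier tokens are assumed
    to be served from cache. If appending would exceed `limit`, the history is
    first replaced by a summary of `summary_tokens`; the summary is new text,
    so it is billed uncached on that turn.
    """
    context = 0
    history = []
    for _ in range(turns):
        uncached = tokens_per_turn
        if context + tokens_per_turn > limit:
            context = summary_tokens      # prefix replaced: cache is reset
            uncached += summary_tokens    # the summary must be reprocessed
        context += tokens_per_turn
        history.append((context, uncached))
    return history


for turn, (size, uncached) in enumerate(simulate_compaction(8, 10, 40, 15), start=1):
    print(f"turn {turn}: context={size:3d} uncached={uncached:3d}")
```

With these toy numbers the context climbs to the limit, drops at each compaction, and climbs again, while the uncached token count spikes on exactly the turns where the prefix is replaced.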
5.2 Compaction Is Reactive, Not Proactive
Current compaction systems trigger near context limits, which means the model has already been operating with a heavily loaded, degraded context for some time before any mitigation occurs. The performance cost of Context Rot is paid throughout the late phase of each compaction cycle, and so is the cumulative financial cost of re-sending a large context on every turn, even at cached rates.
5.3 Observed Compression Ratios Are Modest
In our practical work, compaction events reduce total context size by roughly 20-45% in typical sessions. This is far from the dramatic reduction that would be needed to restore a clean, high-quality working context. A system operating at 95% capacity that compacts by 45%, the top of that range, is still at roughly 52% of capacity. It will reach capacity again relatively quickly, and the model will have been operating in a degraded state throughout the high-pressure phase.
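The arithmetic behind "relatively quickly" can be made concrete. The sketch below assumes a 200K-token window, a compaction trigger at 95% of it, and a hypothetical 4,000 new tokens per turn; only the 20-45% reduction range comes from the text above. A 45% reduction from a 95% trigger leaves the context a little above half full.

```python
def turns_until_trigger(context_tokens: int, tokens_per_turn: int,
                        trigger_tokens: int) -> int:
    """Turns of growth before the context reaches the compaction trigger again."""
    remaining = trigger_tokens - context_tokens
    if remaining <= 0:
        return 0
    return -(-remaining // tokens_per_turn)  # ceiling division on integers


WINDOW = 200_000                   # assumed context window
TRIGGER = WINDOW * 95 // 100       # assumed trigger at 95% of the window
TOKENS_PER_TURN = 4_000            # hypothetical growth per turn

for reduction_pct in (20, 45):
    after = TRIGGER * (100 - reduction_pct) // 100
    turns = turns_until_trigger(after, TOKENS_PER_TURN, TRIGGER)
    print(f"{reduction_pct}% reduction -> {after} tokens, "
          f"{turns} turns until the next compaction")
```

Under these assumptions, the next compaction arrives after 10 turns at the low end of the range and 22 turns at the high end, and every one of those turns runs with a context that is at least half full.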
5.4 Information Loss Is Aggressive At The Extremes
When context pressure becomes extreme, aggressive compaction strategies — particularly LLM-generated narrative summaries — can reduce hundreds of thousands of tokens to a few thousand. Santoni (2026) documents the observed behavior of Claude Code's native auto-compaction:
In measured sessions, Claude Code's native auto-compaction reduced 132,000 tokens of accumulated message state to approximately 2,300 tokens — a 98% reduction. The resulting summary necessarily loses most of the nuanced reasoning, architectural understanding, and convention knowledge built up over the session. Each new session starts from scratch; the cost of building context is paid repeatedly, and the resulting understanding is never preserved in a reusable form.
This is the cost the bill never itemizes. The token count is small. The cognitive capital destroyed is enormous.
6. The Research Frontier: Active Context Management
The limitations of passive compaction have not gone unnoticed by the research community. A generation of papers published in late 2025 and early 2026 has begun to explore architectural alternatives centered on active rather than reactive context management — and these papers provide direct empirical support for the approach advocated in this white paper.
6.1 Context-Folding
Sun et al. (2025) introduced the Context-Folding framework, in which an agent can procedurally branch into a sub-trajectory to handle a subtask and then "fold" it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. The results are striking: their Folding Agent achieved 62.0% on BrowseComp-Plus and 58.0% on SWE-Bench Verified using only a 32K token active context budget, matching or outperforming baselines that required 327K token contexts and significantly outperforming summarization-based methods. The key insight — that subtask context can be made genuinely ephemeral rather than accumulated into the main thread — is directly analogous to the ephemeral Tier 3 of the Progressive Prompt Ephemerality architecture described below.
A parallel paper, AgentFold, introduced a similar approach independently, framing it as inspired by "the human cognitive process of retrospective consolidation." AgentFold's "folding operation" can perform granular condensations to preserve fine-grained details or deep consolidations to abstract away entire multi-step subtasks.
6.2 Lossless Context Management
Ehrlich and Blackman (2026) directly address the information-loss problem of compaction by introducing a deterministic architecture that maintains an immutable store of all session content verbatim while assembling a compressed active context from summary nodes. The key design: "Summary nodes function as materialized views over the immutable history: they are a derived cache; the immutable history remains the sole source of truth." Any summary can be replaced by the original content via a recall tool, enabling the agent to retrieve granular historical detail without carrying it in the active context at all times. When benchmarked against Claude Code on the OOLONG long-context evaluation, the LCM-augmented agent achieved higher scores at every context length between 32K and 1M tokens.
6.3 Structured Adaptive Context Management
JetBrains Research (2025) studied the two primary approaches to context management in coding agent settings: observation masking and LLM summarization. Their finding is that a hybrid approach — combining selective masking of irrelevant content with targeted summarization — outperforms either approach in isolation, and that the key variable is not how aggressively context is compressed, but how intelligently irrelevant content is identified and excluded before compression.
6.4 Contextual Memory Virtualisation
Santoni (2026), whose documentation of compaction losses is cited above, proposes Contextual Memory Virtualisation (CMV): a system that models session history as a Directed Acyclic Graph (DAG) with snapshot, branch, and trim primitives. CMV's three-pass structurally lossless trimming algorithm preserves every user message and assistant response verbatim while reducing token counts by a mean of 20% and up to 86% for sessions with significant structural overhead, without requiring any LLM calls for the trimming process itself.
7. Progressive Prompt Ephemerality: A Better Architecture
Building on the insights from both the economics of caching and the emerging research on active context management, we propose Progressive Prompt Ephemerality (PPE) — an architecture that directly resolves the structural tension between prompt economics and LLM performance. The core insight is that different parts of the context have different rates of change and different relevance to any given task, and the context management system should be structured to reflect this.
The architecture organizes context into three distinct tiers:
Tier 1: Stable System Instructions (Fully Cached)
The first portion of the context contains material that does not change for the lifetime of the session: system instructions, tool definitions, project-level configuration, and any persistent conventions or constraints the agent must respect. This material is written once, cached once, and never modified. It forms the permanent prefix of every request and benefits maximally from caching — these tokens cost almost nothing to include after the first request.
Because this tier is stable, its cache value compounds over the session. The longer the session runs, the more requests benefit from this cached prefix, and the more cost-efficient it becomes to keep it comprehensive.
Tier 2: Accumulated Summary (Slowly Growing, Consistently Ordered, Mostly Cached)
The second portion contains a structured summary of everything that has happened in the session so far: decisions made, files modified, errors encountered and resolved, patterns established, and work completed. This summary grows slowly but continuously. New content is appended and the order of inputs is maintained, but existing content is not modified, which means the prefix remains intact and cached for all new appends.
The key difference from compaction is timing: the summary is updated incrementally after each turn rather than reactively near context limits. Each turn produces a small, structured summary snippet that is appended to Tier 2. This snippet can itself be cached on the following turn, since it has now become part of the stable prefix.
This tier can also be persisted across sessions, allowing the agent to resume work with full awareness of prior context rather than starting from scratch — a capability directly analogous to the LCM Immutable Store described above.
Tier 3: Ephemeral Recent Interactions (Not Cached, Not Retained)
The third and final portion contains only the blow-by-blow of the current turn: tool call inputs and outputs, retrieved file contents, intermediate reasoning, raw command results. This material is genuinely ephemeral: relevant to the current step and unhelpful thereafter in its raw form. After each turn, the relevant portions are distilled into a Tier 2 summary snippet, and the ephemeral content is discarded or handed to the recall tool described below.
Because Tier 3 is never cached and is discarded after each turn, it does not accumulate. And because it always sits at the end of the input context, changing it never disturbs the cached prefix that precedes it. The total context at any given moment is: (stable system instructions) + (growing but mostly-stable summary) + (current turn's ephemeral content). The first two components are almost entirely cached; the third is small and new on each turn.
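The three tiers above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not a reference implementation: the class and method names are hypothetical, tiers are joined as plain strings, and the distilled snippet is passed in by the caller where a real system would produce it with a summarization step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PPEContext:
    system_prefix: str                                   # Tier 1: never modified
    summary: List[str] = field(default_factory=list)     # Tier 2: append-only
    ephemeral: List[str] = field(default_factory=list)   # Tier 3: current turn only

    def cacheable_prefix(self) -> str:
        """Tier 1 followed by Tier 2, in a fixed order."""
        return "\n".join([self.system_prefix, *self.summary])

    def build_prompt(self) -> str:
        """Full request: the stable prefix with Tier 3 placed last."""
        return "\n".join([self.cacheable_prefix(), *self.ephemeral])

    def end_turn(self, snippet: str) -> None:
        """Append the distilled snippet to Tier 2 and drop Tier 3."""
        self.summary.append(snippet)
        self.ephemeral.clear()


ctx = PPEContext("SYSTEM: follow project conventions")
ctx.ephemeral.append("tool: read config.py -> (raw file contents)")
before = ctx.cacheable_prefix()
ctx.end_turn("Turn 1: read config.py; timeout is set in load_settings()")
after = ctx.cacheable_prefix()

# The old prefix is still an exact prefix of the new one, so it stays cacheable.
print(after.startswith(before), len(ctx.ephemeral))
```

The design choice that matters is that `end_turn` only ever appends to Tier 2. Nothing earlier in the prompt is rewritten, so each request extends the previous request's cacheable prefix instead of replacing it.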
7.1 Why This Resolves The Tension
Progressive Prompt Ephemerality aligns the cognitive and economic incentives rather than forcing a trade-off between them:
- Cache economics are served by Tiers 1 and 2, which are stable, additive, and maximally cacheable. Cache hits on these tiers are nearly perfect and improve as the session lengthens.
- Cognitive performance is served by Tier 3, which is kept compact and acutely relevant to the current task. The model is not forced to attend to resolved prior turns or irrelevant file contents.
- Information fidelity is preserved by the incremental summarization process, which converts ephemeral content into structured, compact records rather than discarding it outright. As with LCM, the full underlying record can be maintained separately and recalled on demand.
Crucially, this architecture rarely requires a full prefix reset. The cached prefix grows slowly and predictably, and the ephemeral portion is always small. There is no sawtooth cost pattern, no aggressive information loss, and no late-session degradation from context overload. Context does still grow slowly under this pattern and may eventually need full-context compression, but PPE allows significantly more turns before that point, and the compression can be scheduled before Context Rot sets in while keeping costs predictable.
7.2 The Recall Capability
A secondary benefit of this architecture is that the distilled summaries in Tier 2 (created turn-by-turn as a summary of the ephemeral inputs in Tier 3) form a structured, queryable record of the session's history. For example, Tier 3 may contain the full contents of files that were read, the specific changes made, or web research performed by subagents. A secondary database stores this full information, assigns it a tag, and writes that tag into the corresponding Tier 2 summary snippet.
If the current task requires information from several turns ago (a specific error message, the reasoning behind a design decision, or the contents of a file that was read but is now kept only as a summary in Tier 2), a recall tool can retrieve the full text referenced by the Tier 2 snippet and inject it into Tier 3 for the current turn, or for a few turns if the user or agent chooses. Because any given turn rarely needs a particular piece of past information, these "full recall" events are infrequent. The lookup is also procedural rather than model-driven, so recall is on-demand and costs very little compute. This is directly analogous to the `lcm_expand` tool in the LCM architecture, and related to AgentFold's folding operations, which choose between preserving fine-grained details and abstracting away whole subtasks. The principle is consistent across these independently derived systems: maintain a compressed active context while preserving lossless or near-lossless access to the underlying record. What PPE adds is a caching-aware layer for context construction, combining the benefits of near-lossless context compression with the cost benefits of a consistently maintained cached prefix.
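The tag-and-recall mechanism can be sketched as follows. Everything here is an illustrative assumption: the `RecallStore` class, the `rec-N` tag format, and the in-memory dictionary stand in for whatever database and tagging scheme a real system would use.

```python
import re
from typing import Dict, List


class RecallStore:
    """Keeps full Tier 3 content verbatim, keyed by a short tag."""

    def __init__(self) -> None:
        self._records: Dict[str, str] = {}

    def archive(self, full_text: str) -> str:
        """Store the raw content and return the tag that refers to it."""
        tag = f"rec-{len(self._records) + 1}"
        self._records[tag] = full_text
        return tag

    def recall(self, tag: str) -> str:
        """Plain dictionary lookup: no model call is involved."""
        return self._records[tag]


TAG_PATTERN = re.compile(r"\[(rec-\d+)\]")


def tagged_snippet(summary: str, tag: str) -> str:
    """Tier 2 snippet: a short summary plus a pointer to the full record."""
    return f"{summary} [{tag}]"


def recall_for_snippet(store: RecallStore, snippet: str) -> List[str]:
    """Fetch every full record referenced by a Tier 2 snippet."""
    return [store.recall(tag) for tag in TAG_PATTERN.findall(snippet)]


store = RecallStore()
tag = store.archive("Traceback (most recent call last): ... KeyError: 'timeout'")
snippet = tagged_snippet("Turn 3: fixed KeyError in load_settings()", tag)
print(snippet)
print(recall_for_snippet(store, snippet))
```

The snippet is what stays in the cached Tier 2 prefix; the recalled text would be injected into Tier 3 only for the turns that need it, so the prefix is never rewritten by a recall.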
8. Why Mainstream Tools Don't Do This
The obvious question is: if this architecture is superior, why don't tools like Claude Code implement it?
API constraints. Most agentic tools are built on top of provider APIs that expose an additive message history as their primary interface. Modifying or removing messages from the middle of the history — as PPE requires — has historically not been supported. Context editing APIs are only beginning to emerge, and they remain in beta with limited capabilities. As the PwC caching evaluation noted, "the benefits of prompt caching in real-world agentic workloads remain under-explored in the research literature" even as of early 2026.
Incentive alignment. Token volume is, in most provider pricing models, the primary driver of revenue. Systems that dramatically reduce token usage are not in the immediate commercial interest of the providers whose APIs those systems call. This does not imply bad faith — the economics of building and serving frontier models are genuinely demanding — but it does mean that complex efficiency-maximizing context architectures are unlikely to originate from providers themselves.
Complexity. Implementing incremental summarization correctly is harder than implementing additive accumulation. It requires reliable distillation logic, careful decisions about what to retain versus discard, and a more sophisticated orchestration layer. The JetBrains Research paper notes that the key variable in context management quality is the intelligence of relevance discrimination, not the aggressiveness of compression — and building that discrimination reliably is non-trivial engineering.
Visibility. For most users, the degradation from Context Rot is gradual and ambiguous. It is easy to attribute worse output to the model, to the prompt, or to the task complexity rather than to the mechanics of context management. Without clear visibility into context health over a session, the incentive to invest in better architecture is weak. Ideally these performance issues should be evaluated with the aid of context-aware benchmarks, but in practice this is rarely done.
9. Implications for Teams and Practitioners
For teams making decisions about AI infrastructure today, several practical implications follow from this analysis. The throughline: the orchestration layer is where context architecture lives, and owning that layer is what lets a team resolve the cache-vs-cognition tension on terms its workload actually requires — rather than waiting for the providers whose pricing models depend on token volume to solve it on their behalf.
Monitor context size actively. Most agentic tools offer some form of context inspection (e.g., `/context` in Claude Code). Use it. Treat a rapidly growing context as a signal that output quality is degrading, not just that costs are rising. The pattern is the same one the Empromptu orchestration layer surfaces for production agent workloads — context health is a metric, not a vibe.
Compact proactively, not reactively. If you are using a tool with manual compaction, trigger it at the end of discrete task phases rather than waiting for the automatic trigger near context limits. Smaller, more frequent compactions preserve more information per token and avoid the acute degradation of very late-session contexts. The principle that earlier discipline beats later mitigation is the same one Empromptu's governance layer applies to production agent traffic — compact at decision boundaries, not at capacity ceilings.
Use persistent configuration aggressively. Anything that is true about every session — architectural constraints, conventions, persistent patterns — should live in stable, session-persistent configuration (e.g., `CLAUDE.md`) rather than in conversational context. This approximates Tier 1 of the PPE architecture and benefits from caching across sessions. The same logic applies one level up: project-spanning patterns belong in your platform's orchestration layer, where they are cached once and inherited by every session your team starts.
Structure tasks to minimize context bleed. Tasks that can be decomposed into discrete, bounded sessions with clear handoffs produce better output and lower costs than open-ended sessions that grow without bound. The subagent pattern in tools like Claude Code is underused for this purpose: a subagent that completes a bounded task and returns a structured result is a natural implementation of ephemeral Tier 3 processing, closely analogous to the Context-Folding "branch-and-fold" mechanism. The same decomposition discipline shows up at the production-system level — see how Empromptu's Alchemy platform treats every agent task as a bounded session with structured handoff.
Evaluate infrastructure costs with context efficiency in mind. When comparing agentic tools, total token consumption per unit of useful work is a more meaningful metric than per-token price. A tool that uses 3× the tokens at half the per-token cost is more expensive, not cheaper. This is the kind of total-cost-of-orchestration math the Empromptu approach was designed to make legible to teams deciding whether to rent intelligence or own it.
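The comparison above is simple arithmetic, shown here with hypothetical numbers. The token counts and prices are placeholders chosen only to match the "3× the tokens at half the price" example; they do not describe any real tool or price list.

```python
def cost_per_task(tokens_per_task: int, price_per_million_tokens: float) -> float:
    """Cost of one unit of useful work, in the same currency as the price."""
    return tokens_per_task * price_per_million_tokens / 1_000_000


# Hypothetical tools: B uses 3x the tokens of A at half the per-token price.
tool_a = cost_per_task(tokens_per_task=1_000_000, price_per_million_tokens=4.0)
tool_b = cost_per_task(tokens_per_task=3_000_000, price_per_million_tokens=2.0)

print(tool_a, tool_b, tool_b / tool_a)
```

Under these assumptions the "cheaper" tool costs 1.5 times as much per task, which is why per-token price alone is a misleading basis for comparison.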
Consider third-party context management layers. The research community has produced deployable implementations of lossless context management, including open-source plugins such as Lossless Claw for OpenClaw, which implements the LCM hierarchical summary DAG architecture. For teams with sufficient engineering resources, evaluating these alternatives against native compaction may yield meaningful improvements in both cost and output quality. The Empromptu Alchemy platform treats context architecture as a first-class concern in the orchestration layer — the same load-bearing position this paper argues every production AI team should be designing for.
10. Conclusion
The long-session degradation that practitioners observe in AI coding agents is not a mystery, and it is not primarily a model capability problem. It is a structural consequence of an architectural tension between cache economics and cognitive performance. Current tools resolve this tension in favor of the former at the expense of the latter.
Compaction-based approaches represent a meaningful partial mitigation, but they remain reactive, prefix-breaking, and coarse. They reduce token costs without resolving the underlying performance degradation, and they introduce sawtooth cost patterns rather than smooth, efficient operation. The information loss documented in compaction events is a primary cost of the current approach. Because so much material is held in the main context window before each compaction, that loss occurs more often, and after more Context Rot has already set in, than it should.
Progressive Prompt Ephemerality offers a path to genuinely resolving the tension: a tiered context architecture in which stable material is cached maximally, historical context is preserved compactly in incrementally updated summaries, and the working context for any given task is kept small and acutely relevant. Emerging research from multiple independent groups — Context-Folding, LCM, AgentFold, Contextual Memory Virtualisation — confirms the core intuition from multiple angles and provides empirical evidence that active, tiered context management outperforms passive accumulation and reactive compaction on real agentic benchmarks.
The barriers to adoption are real but not fundamental. As context editing APIs mature and the cost of context rot becomes more legible to practitioners and buyers, the incentive to implement more sophisticated context management will grow.
Frequently Asked Questions
What is Context Rot?
Context Rot is the gradual degradation of model output quality that occurs as accumulated, irrelevant content crowds the working context window over a long session. It is caused by the way transformer attention forces every token in context to participate in every inference step, regardless of relevance.
What is Progressive Prompt Ephemerality?
Progressive Prompt Ephemerality (PPE) is a tiered context architecture in which system instructions are permanently cached (Tier 1), session history is maintained as a slowly-growing incrementally updated summary (Tier 2), and current-turn content is kept ephemeral and discarded after distillation (Tier 3). It resolves the structural tension between prompt cache economics and LLM cognitive performance.
Why don't mainstream coding agents implement PPE today?
Four structural barriers: API constraints (most provider APIs expose additive message history and don't support mid-context editing yet), incentive misalignment (token volume drives provider revenue), implementation complexity (incremental summarization with reliable relevance discrimination is harder than additive accumulation), and weak visibility into context health (degradation is easy to misattribute to the model).
Is compaction enough to solve the long-session problem?
No. Compaction is a reactive, prefix-breaking, coarse mitigation. It reduces token cost spikes but does not resolve the underlying performance degradation, and observed compression ratios (20-45%) are insufficient to restore a clean working context. Information loss at the extremes can exceed 98% of accumulated state.
What can a team using Claude Code do today to mitigate Context Rot?
Five immediate actions: monitor context size actively via `/context`, compact proactively at discrete task boundaries rather than waiting for the auto-trigger, push every session-stable convention into `CLAUDE.md` (Tier 1 equivalent), structure tasks into bounded subagent sessions to keep ephemeral context truly ephemeral, and evaluate AI infrastructure costs in terms of tokens-per-unit-of-useful-work rather than per-token price.
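The last of these actions, measuring tokens per unit of useful work, comes down to simple arithmetic. The sketch below uses made-up numbers, and the definition of a "unit of useful work" (for example, one completed task) is an assumption each team would need to settle for itself.

```python
def tokens_per_unit_of_work(total_tokens: int, units_completed: int) -> float:
    """Average tokens consumed per completed unit of useful work."""
    if units_completed <= 0:
        raise ValueError("units_completed must be positive")
    return total_tokens / units_completed


# Hypothetical figures for illustration only.
# Twelve bounded sessions, each completing one task:
bounded_sessions = tokens_per_unit_of_work(total_tokens=1_200_000, units_completed=12)

# One long session that accumulates context and completes ten tasks:
long_session = tokens_per_unit_of_work(total_tokens=3_000_000, units_completed=10)

# How many times more tokens the long session spends per task.
relative_cost = long_session / bounded_sessions
```

With these invented figures the long session consumes three times as many tokens per completed task, a difference that a per-token price comparison alone would not show.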
References
1. Diffray AI — Context Dilution: https://diffray.ai/blog/context-dilution/
2. Eval.16x.Engineer — LLM Context Management Guide: https://eval.16x.engineer/blog/llm-context-management-guide
3. Redis — Context Rot: https://redis.io/blog/context-rot/
4. Chroma Research — Context Rot: https://www.trychroma.com/research/context-rot
5. arXiv:2307.03172 — Lost in the Middle: https://arxiv.org/pdf/2307.03172
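6. Lumer, E. et al. (2026). "Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks." arXiv:2601.06007v1. https://arxiv.org/html/2601.06007v1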
7. AgentMarketCap — Prompt Cache Hit Rate Engineering 2026: https://agentmarketcap.ai/blog/2026/04/11/prompt-cache-hit-rate-engineering-2026
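8. Santoni, C. (2026). "Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents." arXiv:2602.22402. https://arxiv.org/pdf/2602.22402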
9. Sun, W., Lu, M., Ling, Z., Liu, K., Yao, X., Yang, Y., Chen, J. (2025). "Scaling Long-Horizon LLM Agent via Context-Folding." arXiv:2510.11967. https://arxiv.org/abs/2510.11967
10. Ye, A. et al. (2025). "AgentFold: Long-Horizon Web Agents with Proactive Context Management." arXiv:2510.24699. https://arxiv.org/abs/2510.24699
11. Ehrlich, Blackman. (2026). "LCM: Lossless Context Management." arXiv:2605.04050. https://arxiv.org/abs/2605.04050
12. Santoni, C. (2026). "Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents." arXiv:2602.22402.
13. Lumer, E. et al. (2026). "Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks." arXiv:2601.06007. PricewaterhouseCoopers. https://arxiv.org/pdf/2601.06007
14. JetBrains Research. (2025). "Cutting Through the Noise: Smarter Context Management for LLM-Powered Agents." Deep Learning for Code Workshop, NeurIPS 2025. https://blog.jetbrains.com/research/2025/12/efficient-context-management/
15. Wang, X. et al. (2024). "OpenHands: An Open Platform for AI Software Developers as Generalist Agents." arXiv:2407.16741. https://arxiv.org/abs/2407.16741. See also https://github.com/OpenHands/software-agent-sdk
16. Emergent Mind. (2025). "Context Degradation in LLMs." Survey of the literature on positional bias and attention mechanisms. https://www.emergentmind.com/topics/context-degradation-in-llms
17. Chhikara, P. et al. (2026). "SWE Context Bench: A Benchmark for Context Learning in Coding." arXiv:2602.08316.
Appendix: Glossary
Attention mechanism — The core computational operation in transformer-based language models, in which every token in the context attends to every other token to compute its representation. The source of both transformers' power and their sensitivity to irrelevant context.
Compaction — A context management strategy in which accumulated conversation history is replaced with a shorter summary. Reduces token count but generally breaks the prefix cache and discards information.
Context Rot — The gradual degradation of model output quality that occurs as accumulated, irrelevant content crowds the working context window over a long session.
Prefix caching — A provider-side optimization that reuses internal model states for token sequences identical to those processed in prior recent requests. Applies only to unbroken prefixes; any modification to cached tokens breaks the cache benefit.
Progressive Prompt Ephemerality (PPE) — A tiered context architecture in which system instructions are permanently cached, session history is maintained as a slowly growing incremental summary, and current-turn content is kept ephemeral and discarded after distillation.
Token — The basic unit of text processed by a language model. Roughly 0.75 words on average in English. Model context windows, costs, and performance limits are all measured in tokens.
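The interaction between the Prefix caching and Compaction entries above can be illustrated with a toy model. This sketch treats each message as a single opaque segment and measures how much of the previous request is an unbroken shared prefix of the next one. Real providers match on tokens and apply their own block-size and expiry rules, so the function name and the segment-level granularity are simplifying assumptions.

```python
from typing import Sequence


def shared_prefix_length(previous: Sequence[str], current: Sequence[str]) -> int:
    """Count leading segments that are identical in both requests."""
    count = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        count += 1
    return count


system = ["SYS"]
history = ["turn1", "turn2", "turn3"]

request_1 = system + history

# Additive growth: the old request is an unbroken prefix of the new one.
request_2_additive = system + history + ["turn4"]

# Compaction: history is replaced by a summary, so the prefix diverges
# right after the system segment.
request_2_compacted = system + ["summary(turn1..turn3)", "turn4"]

additive_reuse = shared_prefix_length(request_1, request_2_additive)    # 4 segments
compacted_reuse = shared_prefix_length(request_1, request_2_compacted)  # 1 segment
```

In this toy model the additive request can reuse all four earlier segments, while the compacted request can reuse only the system segment. That is the sense in which compaction is prefix-breaking.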
---
*Sean Robinson, Ph.D. is an engineer at Empromptu, where he works on production AI infrastructure. He can be reached at sean@empromptu.ai. For more on the orchestration layer Sean and the Empromptu team are building, see Alchemy by Empromptu and the platform overview.*
*Context architecture is the orchestration layer's load-bearing concern. If your team is hitting the long-session problem, the path forward is to own the orchestration layer — not to wait for the providers whose token volume drives their pricing model to solve it on your behalf.*