Bigger context windows didn't fix memory.
The newest models can read up to a million tokens at once, which sounds like more than enough to "remember" anything. In practice, agents still forget, get confused, and contradict themselves across long sessions. Three reasons:
- Cost. Attention scales quadratically with input length, so doubling the input roughly quadruples the compute. Reading a million tokens every turn is expensive and slow (see the rough numbers below).
- Context rot (Context Rot, 2025). Even when the right answer is sitting in the input, models do worse at finding it as the input gets longer. We'll cover the research on this below.
- Re-reading from scratch. Most setups feed the model raw chat history every turn. Nothing is summarized, distilled, or organized. The model does the same work over and over.
"Agent memory" in 2025 is really about solving these three problems with engineering, not by hoping the next model has a longer context window.
Four ways to give an agent memory
These are the patterns you'll see in production. Most real systems mix two or three.
"Context rot": why bigger windows aren't a free win
A 2025 study (Hong et al., 2025) measured something practitioners had been complaining about: as the input gets longer, models get worse at finding things in it, even when the answer is right there in the text. Early benchmarks (called "needle in a haystack" tests) were too easy because the answer was a single isolated sentence. Real tasks need the model to connect several pieces of information scattered across a long document, and models start to struggle past roughly 50k tokens.
The takeaway for design: don't dump everything into the context and hope the model finds the right bits. Even with a million-token window, your retrieval and summarization should narrow things down so the model only sees what it actually needs.
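A minimal sketch of that narrowing step: retrieve a handful of candidate snippets, then trim to a token budget before the model ever sees them. The `store` and `embedder` interfaces here are assumed placeholders, not a specific library:

```python
def build_context(query: str, store, embedder, k: int = 8, token_budget: int = 4_000) -> str:
    """Retrieve top-k candidate snippets, then keep only what fits the budget.

    Assumed interfaces: `embedder` maps text to a vector, `store.search`
    returns text snippets ranked best-first by similarity.
    """
    candidates = store.search(embedder(query), k=k)
    picked, used = [], 0
    for snippet in candidates:
        cost = len(snippet.split())  # crude token estimate; swap in a real tokenizer
        if used + cost > token_budget:
            break
        picked.append(snippet)
        used += cost
    # The model sees a short, relevant slice instead of the whole history.
    return "\n\n".join(picked)
```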
Open-source memory libraries you can actually use
You don't have to build memory from scratch. Four libraries have emerged as popular options, and they make different tradeoffs:
| Library | How it stores memory | What it's good at | Things to know |
|---|---|---|---|
| Mem0 (Chhikara et al., 2025) | Vector database with simple add/update/delete operations | Easy to start with, sensible defaults, lots of users | Mostly hand-tuned rules under the hood |
| Zep (Rasmussen et al., 2025) | Knowledge graph that tracks how facts change over time | Good for questions like "what did I tell you about X last week?" | Building the graph takes time; queries can be slower |
| MIRIX (Wang & Chen, 2025) | Multiple indexes at different levels of detail, then merged | Strong at finding the right memory across many past sessions | More moving parts; harder to debug |
| LiCoMemory (arXiv 2025) | Lightweight graph that updates as the conversation evolves | Reports up to 23% better accuracy than alternatives on the LongMemEval (2025) benchmark | Newer, fewer companies running it in production |
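Whichever library you choose, the integration point in your chat loop looks roughly the same: search memory before the model call, write the exchange back after it. A hypothetical sketch; the class and method names below are illustrative, not any specific library's actual API:

```python
class ChatWithMemory:
    """Wire a memory layer into a plain chat loop (interface is illustrative)."""

    def __init__(self, llm, memory, user_id: str):
        self.llm = llm          # callable: prompt -> reply
        self.memory = memory    # assumed to expose search(...) and add(...)
        self.user_id = user_id

    def turn(self, user_message: str) -> str:
        # 1. Pull a few relevant memories for this user and this message.
        recalled = self.memory.search(user_message, user_id=self.user_id, k=5)
        context = "\n".join(f"- {m}" for m in recalled)

        # 2. Answer from the recalled facts, not the whole raw history.
        reply = self.llm(f"Known facts about the user:\n{context}\n\nUser: {user_message}")

        # 3. Write the exchange back so future turns can recall it.
        self.memory.add(
            [{"role": "user", "content": user_message},
             {"role": "assistant", "content": reply}],
            user_id=self.user_id,
        )
        return reply
```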
A useful finding from the LongMemEval benchmark (Wu et al., ICLR 2025): when teams tried to improve their memory systems, changes to how they retrieved memories helped more than changes to how they stored them. The MemMachine paper (arXiv 2026) measured which tweaks moved the needle most: tuning how many memories to retrieve (+4.2%), how the retrieved context is formatted (+2.0%), the wording of search prompts (+1.8%), and removing query bias (+1.4%). Smarter sentence-splitting only added +0.8%. Lesson: when memory feels broken, look at the retrieval side first.
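Those are exactly the knobs worth exposing as configuration so you can sweep them against a held-out evaluation set instead of hard-coding guesses. A minimal sketch; the parameter names are mine, not from the paper:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RetrievalConfig:
    """Retrieval-side knobs, roughly in order of how much they tend to matter."""
    top_k: int = 5                    # how many memories to pull per query
    context_template: str = "Relevant memories:\n{memories}\n\nQuestion: {question}"
    query_rewrite_prompt: Optional[str] = None  # optionally rephrase the query before searching
    strip_query_bias: bool = True     # drop leading phrases that skew similarity search

# Sweep these against a held-out QA set before touching the storage layer.
candidates = [RetrievalConfig(top_k=k) for k in (3, 5, 10, 20)]
```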
A simple memory agent, in code
```python
from dataclasses import dataclass, field
from typing import Optional
import time
import uuid


@dataclass
class MemoryEntry:
    id: str
    text: str              # consolidated, not raw
    embedding: list         # for semantic retrieval
    source_turn_ids: list   # provenance: which raw turns produced this
    created_at: float = field(default_factory=time.time)
    accessed_count: int = 0
    last_accessed: Optional[float] = None


class MemoryAgent:
    """Active consolidation: digest raw turns into structured memory."""

    def __init__(self, llm, embedder, store):
        self.llm = llm
        self.embedder = embedder
        self.store = store

    def consolidate(self, raw_turns: list) -> list[MemoryEntry]:
        # Ask the LLM to extract durable facts, decisions, and preferences.
        # `self.llm` is assumed to return a parsed list of {"fact": ...} dicts.
        prompt = self._consolidation_prompt(raw_turns)
        extractions = self.llm(prompt, json_mode=True)
        entries = []
        for e in extractions:
            entry = MemoryEntry(
                id=self._mint_id(),
                text=e["fact"],
                embedding=self.embedder(e["fact"]),
                source_turn_ids=[t["id"] for t in raw_turns],
            )
            self.store.add(entry)
            entries.append(entry)
        return entries

    def recall(self, query: str, k: int = 5) -> list[MemoryEntry]:
        q_emb = self.embedder(query)
        hits = self.store.search(q_emb, k=k)
        for h in hits:
            h.accessed_count += 1
            h.last_accessed = time.time()
        return hits

    def forget(self, max_age_days: int = 90, min_access: int = 1):
        # Drop stale, never-accessed entries; preserve provenance to logs
        cutoff = time.time() - max_age_days * 86400
        self.store.delete_where(
            lambda e: e.created_at < cutoff and e.accessed_count < min_access
        )

    def _consolidation_prompt(self, raw_turns: list) -> str:
        # Keep the extraction prompt narrow: durable facts only, returned as JSON
        transcript = "\n".join(f"{t['role']}: {t['content']}" for t in raw_turns)
        return (
            "Extract durable facts, decisions, and user preferences from this "
            "conversation. Return a JSON list of objects, each with a 'fact' field.\n\n"
            + transcript
        )

    def _mint_id(self) -> str:
        return uuid.uuid4().hex
```
Three details from this code that matter in practice:
- Always track where memories came from (the `source_turn_ids` field). Six months from now, when your agent confidently says something wrong, you'll want to trace it back to the conversation that produced the bad memory.
- Forgetting is a feature, not a bug. Without a way to drop old, never-touched memories, your store grows forever and gets noisier. The `forget()` method here is a simple cleanup, but you need something.
- Consolidate on a schedule, not every turn. Pulling out long-term facts is expensive. Most teams run consolidation at the end of a session or as a background job, not after every message (see the sketch after this list).
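One way the "consolidate on a schedule" point can look in practice: queue raw turns cheaply during the session, then digest them in one pass when the session ends or on a timer. A minimal sketch built around the `MemoryAgent` above; the buffering and timer plumbing are assumptions, not a prescribed design:

```python
import threading

class SessionMemoryBuffer:
    """Collect raw turns cheaply during a session; consolidate in batches."""

    def __init__(self, memory_agent, flush_every_seconds: float = 900.0):
        self.memory_agent = memory_agent
        self.pending_turns: list[dict] = []
        self.flush_every_seconds = flush_every_seconds
        self._lock = threading.Lock()

    def record(self, turn: dict) -> None:
        # Cheap: just append; no LLM call on the hot path of a message.
        with self._lock:
            self.pending_turns.append(turn)

    def flush(self) -> None:
        # Expensive: one consolidation pass over everything queued so far.
        with self._lock:
            turns, self.pending_turns = self.pending_turns, []
        if turns:
            self.memory_agent.consolidate(turns)

    def start_background_flush(self) -> threading.Timer:
        # Or run flush() from a cron job / task queue instead of a timer.
        timer = threading.Timer(self.flush_every_seconds, self._periodic)
        timer.daemon = True
        timer.start()
        return timer

    def _periodic(self) -> None:
        self.flush()
        self.start_background_flush()
```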
The other big shift: reasoning models
Memory was one big change in 2025. The other was reasoning models: LLMs that "think out loud" before answering. They're trained to produce long internal chains of thought, weighing options and double-checking themselves, before giving a final response. The most popular ones are DeepSeek-R1, OpenAI's o3, Claude with "extended thinking" mode, and Qwen's QwQ.
For agent builders, two things are different now:
- Less need for hand-built planning loops. A reasoning model can often solve a multi-step problem in one call where you previously had to write a "plan, then act, then check" loop. Your agent code gets simpler.
- The "overthinking" problem. Reasoning models sometimes spiral, writing essay-length deliberations to decide trivial things like which file to open. This shows up in real systems and costs real tokens. WebCoT, arXiv 2025 and Agentic Critical Training, arXiv 2026 are research efforts to fix this by training reasoning models specifically for agent tasks (knowing when to stop thinking, when to back up and try a different approach, etc).
"Reflection": when does an agent fix its own mistakes?
Reflection means the agent looks at its own output and tries to improve it. The original work on this came from two well-known papers: Reflexion (Shinn et al., 2023) and Self-Refine (Madaan et al., 2023), both presented at NeurIPS 2023. The early excitement was that agents could just keep critiquing themselves and get better. The reality, after two more years of research, is more nuanced:
- Pure self-criticism often makes things worse. If you ask a model "is your answer correct?" with no other information, it tends to second-guess itself into bad answers. MAR (arXiv 2025) documented this carefully. The earlier "Large Language Models Cannot Self-Correct Reasoning Yet" line of work (ICLR 2024) made the same point.
- Reflection works when there's something real to check against. Code that either runs or throws an error. Tests that pass or fail. A search that returns results or doesn't. These give the agent something concrete to react to (Yao et al., 2023). Without that grounding, reflection is just guesswork (see the sketch after this list).
- Multiple different agents arguing beats one agent reflecting alone (Liu et al., ICLR 2025). If the agents come from genuinely different model families (or have very different prompts), their disagreements surface mistakes a single agent misses. If they're all the same model, they tend to agree even when wrong.
- Learn from failures, not successes. SAMULE (arXiv 2025) found that systematically studying what went wrong, and updating the agent's instructions accordingly, helps more than reinforcing what went right. Failures contain more signal.
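A minimal sketch of grounded reflection for the code case: revise only when a concrete check fails, and feed the actual error back instead of asking the model whether it's sure. The `llm` and `run_tests` callables are assumed interfaces, not a specific framework:

```python
def reflect_until_grounded(llm, task: str, run_tests, max_rounds: int = 3) -> str:
    """Generate code, then revise only in response to real test failures.

    `run_tests(code)` is assumed to return (passed: bool, error_output: str).
    """
    code = llm(f"Write code for this task:\n{task}")
    for _ in range(max_rounds):
        ok, error_output = run_tests(code)  # concrete signal: pass/fail plus the traceback
        if ok:
            return code                     # grounded success; stop reflecting
        # Reflection is anchored to the actual failure, not "are you sure?"
        code = llm(
            f"Task:\n{task}\n\nYour previous code:\n{code}\n\n"
            f"It failed these tests:\n{error_output}\n\nFix the code."
        )
    return code  # give up after max_rounds; the caller decides what to do next
```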
Practical advice
- "Big context window" isn't the same as "memory". Even a million-token context is a scratchpad for one session. Real memory survives across sessions and you have to build it as a separate system.
- Start with an existing library. Mem0 or MIRIX will get you 80% of what you need. Don't write your own memory layer until you've hit a real wall with one of these.
- Always pair reasoning with something it can check against. Tests, types, schema validation, a tool that returns success or failure. Without these, longer reasoning just produces longer wrong answers.
- Watch how much of the context window is doing real work. If 80% of your input is irrelevant filler, accuracy drops; even at 10% filler, you're still paying for tokens that do nothing. There's a sensible middle: trim the obvious noise without trying to optimize every last token.
- Track where each memory came from. When the agent gets something wrong six months from now, you need to know which conversation taught it that "fact".
This chapter covered what the agent stores about each conversation. The next chapter (Generalists & specialists) covers what the agent intrinsically knows about its domain: how generalist or specialist its scope is, where each piece of its knowledge lives (model weights, fine-tune, system prompt, tool catalog, retrieved chunks), and how the guardrails change as a result.