Bigger context windows didn't fix memory.
The newest models can read up to a million tokens at once, which sounds like more than enough to "remember" anything. In practice, agents still forget, get confused, and contradict themselves across long sessions. Three reasons:
- Cost. Attention scales quadratically with input length, so doubling the input roughly quadruples the compute. Reading a million tokens every turn is expensive and slow (see the rough numbers below).
- Context rot (Context Rot, 2025). Even when the right answer is sitting in the input, models do worse at finding it as the input gets longer. We'll cover the research on this below.
- Re-reading from scratch. Most setups feed the model raw chat history every turn. Nothing is summarized, distilled, or organized. The model does the same work over and over.
"Agent memory" in 2025 is really about solving these three problems with engineering, not by hoping the next model has a longer context window.
Four ways to give an agent memory
These are the patterns you'll see in production. Most real systems mix two or three.
"Context rot": why bigger windows aren't a free win
A 2025 study (Hong et al., 2025) measured something practitioners had been complaining about: as the input gets longer, models get worse at finding things in it, even when the answer is right there in the text. Early benchmarks (called "needle in a haystack" tests) were too easy because the answer was a single isolated sentence. Real tasks need the model to connect several pieces of information scattered across a long document, and models start to struggle past roughly 50k tokens.
The takeaway for design: don't dump everything into the context and hope the model finds the right bits. Even with a million-token window, your retrieval and summarization should narrow things down so the model only sees what it actually needs.
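A minimal sketch of that narrowing step: retrieve a handful of candidate snippets, then trim to a token budget before the model ever sees them. The `store` and `embedder` interfaces here are assumed placeholders, not a specific library:

```python
def build_context(query: str, store, embedder, k: int = 8, token_budget: int = 4_000) -> str:
    """Retrieve top-k candidate snippets, then keep only what fits the budget.

    Assumed interfaces: `embedder` maps text to a vector, `store.search`
    returns text snippets ranked best-first by similarity.
    """
    candidates = store.search(embedder(query), k=k)
    picked, used = [], 0
    for snippet in candidates:
        cost = len(snippet.split())  # crude token estimate; swap in a real tokenizer
        if used + cost > token_budget:
            break
        picked.append(snippet)
        used += cost
    # The model sees a short, relevant slice instead of the whole history.
    return "\n\n".join(picked)
```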
Open-source memory libraries you can actually use
You don't have to build memory from scratch. Four libraries have emerged as popular options, and they make different tradeoffs:
| Library | How it stores memory | What it's good at | Things to know |
|---|---|---|---|
| Mem0 (Chhikara et al., 2025) | Vector database with simple add/update/delete operations | Easy to start with, sensible defaults, lots of users | Mostly hand-tuned rules under the hood |
| Zep (Rasmussen et al., 2025) | Knowledge graph that tracks how facts change over time | Good for questions like "what did I tell you about X last week?" | Building the graph takes time; queries can be slower |
| MIRIX (Wang & Chen, 2025) | Multiple indexes at different levels of detail, then merged | Strong at finding the right memory across many past sessions | More moving parts; harder to debug |
| LiCoMemory (arXiv 2025) | Lightweight graph that updates as the conversation evolves | Reports up to 23% better accuracy than alternatives on the LongMemEval (2025) benchmark | Newer, fewer companies running it in production |
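Whichever library you choose, the integration point in your chat loop looks roughly the same: search memory before the model call, write the exchange back after it. A hypothetical sketch; the class and method names below are illustrative, not any specific library's actual API:

```python
class ChatWithMemory:
    """Wire a memory layer into a plain chat loop (interface is illustrative)."""

    def __init__(self, llm, memory, user_id: str):
        self.llm = llm          # callable: prompt -> reply
        self.memory = memory    # assumed to expose search(...) and add(...)
        self.user_id = user_id

    def turn(self, user_message: str) -> str:
        # 1. Pull a few relevant memories for this user and this message.
        recalled = self.memory.search(user_message, user_id=self.user_id, k=5)
        context = "\n".join(f"- {m}" for m in recalled)

        # 2. Answer from the recalled facts, not the whole raw history.
        reply = self.llm(f"Known facts about the user:\n{context}\n\nUser: {user_message}")

        # 3. Write the exchange back so future turns can recall it.
        self.memory.add(
            [{"role": "user", "content": user_message},
             {"role": "assistant", "content": reply}],
            user_id=self.user_id,
        )
        return reply
```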
A useful finding from the LongMemEval benchmark (Wu et al., ICLR 2025): when teams tried to improve their memory systems, changes to how they retrieved memories helped more than changes to how they stored them. The MemMachine paper (arXiv 2026) measured which tweaks moved the needle most: tuning how many memories to retrieve (+4.2%), how the retrieved context is formatted (+2.0%), the wording of search prompts (+1.8%), and removing query bias (+1.4%). Smarter sentence-splitting only added +0.8%. Lesson: when memory feels broken, look at the retrieval side first.
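Those are exactly the knobs worth exposing as configuration so you can sweep them against a held-out evaluation set instead of hard-coding guesses. A minimal sketch; the parameter names are mine, not from the paper:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RetrievalConfig:
    """Retrieval-side knobs, roughly in order of how much they tend to matter."""
    top_k: int = 5                    # how many memories to pull per query
    context_template: str = "Relevant memories:\n{memories}\n\nQuestion: {question}"
    query_rewrite_prompt: Optional[str] = None  # optionally rephrase the query before searching
    strip_query_bias: bool = True     # drop leading phrases that skew similarity search

# Sweep these against a held-out QA set before touching the storage layer.
candidates = [RetrievalConfig(top_k=k) for k in (3, 5, 10, 20)]
```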
A simple memory agent, in code
```python
from dataclasses import dataclass, field
from typing import Optional
import time
import uuid


@dataclass
class MemoryEntry:
    id: str
    text: str              # consolidated, not raw
    embedding: list         # for semantic retrieval
    source_turn_ids: list   # provenance: which raw turns produced this
    created_at: float = field(default_factory=time.time)
    accessed_count: int = 0
    last_accessed: Optional[float] = None


class MemoryAgent:
    """Active consolidation: digest raw turns into structured memory."""

    def __init__(self, llm, embedder, store):
        self.llm = llm
        self.embedder = embedder
        self.store = store

    def consolidate(self, raw_turns: list) -> list[MemoryEntry]:
        # Ask the LLM to extract durable facts, decisions, and preferences.
        # `self.llm` is assumed to return a parsed list of {"fact": ...} dicts.
        prompt = self._consolidation_prompt(raw_turns)
        extractions = self.llm(prompt, json_mode=True)
        entries = []
        for e in extractions:
            entry = MemoryEntry(
                id=self._mint_id(),
                text=e["fact"],
                embedding=self.embedder(e["fact"]),
                source_turn_ids=[t["id"] for t in raw_turns],
            )
            self.store.add(entry)
            entries.append(entry)
        return entries

    def recall(self, query: str, k: int = 5) -> list[MemoryEntry]:
        q_emb = self.embedder(query)
        hits = self.store.search(q_emb, k=k)
        for h in hits:
            h.accessed_count += 1
            h.last_accessed = time.time()
        return hits

    def forget(self, max_age_days: int = 90, min_access: int = 1):
        # Drop stale, never-accessed entries; preserve provenance to logs
        cutoff = time.time() - max_age_days * 86400
        self.store.delete_where(
            lambda e: e.created_at < cutoff and e.accessed_count < min_access
        )

    def _consolidation_prompt(self, raw_turns: list) -> str:
        # Keep the extraction prompt narrow: durable facts only, returned as JSON
        transcript = "\n".join(f"{t['role']}: {t['content']}" for t in raw_turns)
        return (
            "Extract durable facts, decisions, and user preferences from this "
            "conversation. Return a JSON list of objects, each with a 'fact' field.\n\n"
            + transcript
        )

    def _mint_id(self) -> str:
        return uuid.uuid4().hex
```
Three details from this code that matter in practice:
- Always track where memories came from (the `source_turn_ids` field). Six months from now, when your agent confidently says something wrong, you'll want to trace it back to the conversation that produced the bad memory.
- Forgetting is a feature, not a bug. Without a way to drop old, never-touched memories, your store grows forever and gets noisier. The `forget()` method here is a simple cleanup, but you need something.
- Consolidate on a schedule, not every turn. Pulling out long-term facts is expensive. Most teams run consolidation at the end of a session or as a background job, not after every message (see the sketch after this list).
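One way the "consolidate on a schedule" point can look in practice: queue raw turns cheaply during the session, then digest them in one pass when the session ends or on a timer. A minimal sketch built around the `MemoryAgent` above; the buffering and timer plumbing are assumptions, not a prescribed design:

```python
import threading

class SessionMemoryBuffer:
    """Collect raw turns cheaply during a session; consolidate in batches."""

    def __init__(self, memory_agent, flush_every_seconds: float = 900.0):
        self.memory_agent = memory_agent
        self.pending_turns: list[dict] = []
        self.flush_every_seconds = flush_every_seconds
        self._lock = threading.Lock()

    def record(self, turn: dict) -> None:
        # Cheap: just append; no LLM call on the hot path of a message.
        with self._lock:
            self.pending_turns.append(turn)

    def flush(self) -> None:
        # Expensive: one consolidation pass over everything queued so far.
        with self._lock:
            turns, self.pending_turns = self.pending_turns, []
        if turns:
            self.memory_agent.consolidate(turns)

    def start_background_flush(self) -> threading.Timer:
        # Or run flush() from a cron job / task queue instead of a timer.
        timer = threading.Timer(self.flush_every_seconds, self._periodic)
        timer.daemon = True
        timer.start()
        return timer

    def _periodic(self) -> None:
        self.flush()
        self.start_background_flush()
```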
The other big shift: reasoning models
Memory was one big change in 2025. The other was reasoning models: LLMs that "think out loud" before answering. They're trained to produce long internal chains of thought, weighing options and double-checking themselves, before giving a final response. The most popular ones are DeepSeek-R1, OpenAI's o3, Claude with "extended thinking" mode, and Qwen's QwQ.
For agent builders, two things are different now:
- Less need for hand-built planning loops. A reasoning model can often solve a multi-step problem in one call where you previously had to write a "plan, then act, then check" loop. Your agent code gets simpler.
- The "overthinking" problem. Reasoning models sometimes spiral, writing essay-length deliberations to decide trivial things like which file to open. This shows up in real systems and costs real tokens. WebCoT, arXiv 2025 and Agentic Critical Training, arXiv 2026 are research efforts to fix this by training reasoning models specifically for agent tasks (knowing when to stop thinking, when to back up and try a different approach, etc).
"Reflection": when does an agent fix its own mistakes?
Reflection means the agent looks at its own output and tries to improve it. The original work on this came from two well-known papers: Reflexion (Shinn et al., 2023) and Self-Refine (Madaan et al., 2023), both presented at NeurIPS 2023. The early excitement was that agents could just keep critiquing themselves and get better. The reality, after two more years of research, is more nuanced:
- Pure self-criticism often makes things worse. If you ask a model "is your answer correct?" with no other information, it tends to second-guess itself into bad answers. MAR (arXiv 2025) documented this carefully. The earlier "Large Language Models Cannot Self-Correct Reasoning Yet" line of work (ICLR 2024) made the same point.
- Reflection works when there's something real to check against. Code that either runs or throws an error. Tests that pass or fail. A search that returns results or doesn't. These give the agent something concrete to react to (Yao et al., 2023). Without that grounding, reflection is just guesswork (see the sketch after this list).
- Multiple different agents arguing beats one agent reflecting alone (Liu et al., ICLR 2025). If the agents come from genuinely different model families (or have very different prompts), their disagreements surface mistakes a single agent misses. If they're all the same model, they tend to agree even when wrong.
- Learn from failures, not successes. SAMULE (arXiv 2025) found that systematically studying what went wrong, and updating the agent's instructions accordingly, helps more than reinforcing what went right. Failures contain more signal.
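A minimal sketch of grounded reflection for the code case: revise only when a concrete check fails, and feed the actual error back instead of asking the model whether it's sure. The `llm` and `run_tests` callables are assumed interfaces, not a specific framework:

```python
def reflect_until_grounded(llm, task: str, run_tests, max_rounds: int = 3) -> str:
    """Generate code, then revise only in response to real test failures.

    `run_tests(code)` is assumed to return (passed: bool, error_output: str).
    """
    code = llm(f"Write code for this task:\n{task}")
    for _ in range(max_rounds):
        ok, error_output = run_tests(code)  # concrete signal: pass/fail plus the traceback
        if ok:
            return code                     # grounded success; stop reflecting
        # Reflection is anchored to the actual failure, not "are you sure?"
        code = llm(
            f"Task:\n{task}\n\nYour previous code:\n{code}\n\n"
            f"It failed these tests:\n{error_output}\n\nFix the code."
        )
    return code  # give up after max_rounds; the caller decides what to do next
```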
Practical advice
- "Big context window" isn't the same as "memory". Even a million-token context is a scratchpad for one session. Real memory survives across sessions and you have to build it as a separate system.
- Start with an existing library. Mem0 or MIRIX will get you 80% of what you need. Don't write your own memory layer until you've hit a real wall with one of these.
- Always pair reasoning with something it can check against. Tests, types, schema validation, a tool that returns success or failure. Without these, longer reasoning just produces longer wrong answers.
- Watch how much of the context window is doing real work. If 80% of your input is irrelevant filler, accuracy drops; even at 10% filler, you're still paying for tokens that do nothing. There's a sensible middle: trim the obvious noise without trying to optimize every last token.
- Track where each memory came from. When the agent gets something wrong six months from now, you need to know which conversation taught it that "fact".
This chapter covered what the agent stores about each conversation. The next chapter (Generalists & specialists) covers what the agent intrinsically knows about its domain: how generalist or specialist its scope is, where each piece of its knowledge lives (model weights, fine-tune, system prompt, tool catalog, retrieved chunks), and how the guardrails change as a result.