08 Memory & reasoning · what agents remember and how they think

Bigger context windows didn't fix memory.

The newest models can read up to a million tokens at once, which sounds like more than enough to "remember" anything. In practice, agents still forget, get confused, and contradict themselves across long sessions. Three reasons:

"Agent memory" in 2025 is really about solving these three problems with engineering, not by hoping the next model has a longer context window.

Four ways to give an agent memory

These are the patterns you'll see in production. Most real systems mix two or three.

1 · Just the context window
Stuff everything (system prompt, chat history, retrieved docs) into the model's input each turn. Easy to set up. Works for short conversations. Falls apart past around 100k tokens because of context rot.
Good for: chatbots, single-session tasks
2 · RAG (retrieval)
Keep your documents in a vector database. At query time, pull only the most relevant chunks and put those in the context. Scales to huge knowledge bases. Downside: every query starts from raw text, so the model can't build up understanding over time. A minimal retrieval loop is sketched just after this list.
Good for: searching large, stable document collections
3 · Memory agents
A separate agent reads incoming conversations and writes short, structured notes ("the user prefers Python", "they're working on a fintech app"). Future queries read those notes instead of raw history.
Good for: long-running assistants, ongoing projects
4 · Graph memory
Store memory as a graph of people, things, and relationships ("Alice → works at → Acme"). Lets the agent answer questions that require connecting multiple facts. More work to build and slower to query.
Good for: legal, scientific, or relationship-heavy domains
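
To make pattern 2 concrete, here is a minimal retrieval sketch. The embedder, store, and llm objects are hypothetical stand-ins for whatever embedding model, vector index, and chat model you use; the one idea that matters is that only the top-k chunks ever reach the context, never the whole corpus.

class SimpleRAG:
    """Pattern 2: retrieve only the relevant chunks, then build the prompt."""
    def __init__(self, embedder, store, llm):
        self.embedder = embedder   # text -> vector
        self.store = store         # vector index exposing search(vector, k)
        self.llm = llm             # callable: prompt -> completion

    def answer(self, question: str, k: int = 4) -> str:
        # 1. Embed the query and pull the k most similar chunks
        # (assumed interface: search returns objects with a .text field)
        hits = self.store.search(self.embedder(question), k=k)
        # 2. Put only those chunks in the context, not the whole corpus
        context = "\n\n".join(chunk.text for chunk in hits)
        prompt = (
            "Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return self.llm(prompt)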

"Context rot": why bigger windows aren't a free win

A 2025 study (Hong et al., 2025) measured something practitioners had been complaining about: as the input gets longer, models get worse at finding things in it, even when the answer is right there in the text. Early benchmarks (called "needle in a haystack" tests) were too easy because the answer was a single isolated sentence. Real tasks need the model to connect several pieces of information scattered across a long document, and models start to struggle past roughly 50k tokens.

The takeaway for design: don't dump everything into the context and hope the model finds the right bits. Even with a million-token window, your retrieval and summarization should narrow things down so the model only sees what it actually needs.
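
One way to act on that: assemble the context against an explicit token budget, so retrieval decides what the model sees rather than the window size. A rough sketch, with ranked_chunks assumed to be already sorted by relevance and count_tokens a crude length heuristic you would replace with your tokenizer:

def build_context(ranked_chunks: list[str], budget_tokens: int = 8000,
                  count_tokens=lambda s: len(s) // 4) -> str:
    """Greedily add the most relevant chunks until the token budget is spent."""
    picked, used = [], 0
    for chunk in ranked_chunks:           # assumed sorted, most relevant first
        cost = count_tokens(chunk)        # rough approximation, not a real tokenizer
        if used + cost > budget_tokens:
            break                         # stop well short of the full window
        picked.append(chunk)
        used += cost
    return "\n\n".join(picked)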

Open-source memory libraries you can actually use

You don't have to build memory from scratch. Four libraries have emerged as the popular options. They make different tradeoffs:

Mem0 (Chhikara et al., 2025)
How it stores memory: a vector database with simple add/update/delete operations
What it's good at: easy to start with, sensible defaults, lots of users
Things to know: mostly hand-tuned rules under the hood

Zep (Rasmussen et al., 2025)
How it stores memory: a knowledge graph that tracks how facts change over time
What it's good at: questions like "what did I tell you about X last week?"
Things to know: building the graph takes time; queries can be slower

MIRIX (Wang & Chen, 2025)
How it stores memory: multiple indexes at different levels of detail, then merged
What it's good at: finding the right memory across many past sessions
Things to know: more moving parts; harder to debug

LiCoMemory (arXiv, 2025)
How it stores memory: a lightweight graph that updates as the conversation evolves
What it's good at: reports up to 23% better accuracy than alternatives on the LongMemEval benchmark
Things to know: newer; fewer companies running it in production

A useful finding from the LongMemEval benchmark (Wu et al., ICLR 2025): when teams tried to improve their memory systems, changes to how they retrieved memories helped more than changes to how they stored them. The MemMachine paper (arXiv, 2026) measured which tweaks moved the needle most: tuning how many memories to retrieve (+4.2%), how to format the context (+2.0%), the wording of search prompts (+1.8%), and removing query bias (+1.4%). Smarter sentence-splitting only added +0.8%. Lesson: when memory feels broken, look at the retrieval side first.
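
In practice that means exposing the retrieval-side choices as explicit, tunable settings rather than burying them in code. A sketch of what those knobs might look like; the field names and defaults are illustrative, not taken from any of the papers above:

from dataclasses import dataclass

@dataclass
class RecallConfig:
    # how many memories to retrieve per query (the biggest lever in the ablation above)
    top_k: int = 5
    # how retrieved memories are laid out in the prompt
    context_template: str = "Relevant memories:\n{memories}\n\nUser message:\n{query}"
    # wording used to turn the user's message into a standalone search query
    query_rewrite_prompt: str = "Rewrite the last user message as a short, standalone search query."
    # strip assistant phrasing and boilerplate from the query before searching
    strip_query_bias: bool = True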

A simple memory agent, in code

from dataclasses import dataclass, field
from typing import Optional
import time
import uuid

@dataclass
class MemoryEntry:
    id: str
    text: str           # consolidated, not raw
    embedding: list     # for semantic retrieval
    source_turn_ids: list  # provenance: which raw turns produced this
    created_at: float = field(default_factory=time.time)
    accessed_count: int = 0
    last_accessed: Optional[float] = None

class MemoryAgent:
    """Active consolidation: digest raw turns into structured memory."""
    def __init__(self, llm, embedder, store):
        self.llm = llm
        self.embedder = embedder
        self.store = store

    def consolidate(self, raw_turns: list) -> list[MemoryEntry]:
        # Ask the LLM to extract durable facts, decisions, and preferences
        prompt = self._consolidation_prompt(raw_turns)
        extractions = self.llm(prompt, json_mode=True)
        entries = []
        for e in extractions:
            entry = MemoryEntry(
                id=self._mint_id(),
                text=e["fact"],
                embedding=self.embedder(e["fact"]),
                source_turn_ids=[t["id"] for t in raw_turns],
            )
            self.store.add(entry)
            entries.append(entry)
        return entries

    def _consolidation_prompt(self, raw_turns: list) -> str:
        # Assumes each raw turn is a dict with "id", "role", and "text" keys
        transcript = "\n".join(f"{t['role']}: {t['text']}" for t in raw_turns)
        return (
            "Extract durable facts, decisions, and preferences from this "
            'conversation. Return a JSON list of objects with a "fact" field.\n\n'
            + transcript
        )

    def _mint_id(self) -> str:
        return uuid.uuid4().hex

    def recall(self, query: str, k: int = 5) -> list[MemoryEntry]:
        q_emb = self.embedder(query)
        hits = self.store.search(q_emb, k=k)
        for h in hits:
            h.accessed_count += 1
            h.last_accessed = time.time()
        return hits

    def forget(self, max_age_days: int = 90, min_access: int = 1):
        # Drop stale, never-accessed entries; preserve provenance to logs
        cutoff = time.time() - max_age_days * 86400
        self.store.delete_where(
            lambda e: e.created_at < cutoff and e.accessed_count < min_access
        )

Three details from this code that matter in practice:

What gets stored is consolidated text, not the raw transcript: the LLM distills turns into short, durable facts before anything hits the store.
Every entry keeps provenance (source_turn_ids), so a memory can always be traced back to the turns that produced it and audited or corrected later.
Forgetting is a policy, not an accident: entries that are old and never recalled get dropped, and accessed_count / last_accessed provide the signal for that decision.

The other big shift: reasoning models

Memory was one big change in 2025. The other was reasoning models: LLMs that "think out loud" before answering. They're trained to produce long internal chains of thought, weighing options and double-checking themselves, before giving a final response. The most popular ones are DeepSeek-R1, OpenAI's o3, Claude with "extended thinking" mode, and Qwen's QwQ.
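
How you call these models changes slightly too. If your provider exposes the reasoning trace separately from the answer (APIs differ; the .thinking and .answer fields below are hypothetical), a common pattern is to log the trace for debugging but let only the final answer drive the agent loop:

import logging

logger = logging.getLogger("agent")

def run_reasoning_step(llm, prompt: str) -> str:
    """Call a reasoning model; keep its trace for logs, act only on the answer."""
    result = llm(prompt)                                   # hypothetical client call
    logger.debug("reasoning trace: %s", result.thinking)   # for debugging, never for control flow
    return result.answer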

For agent builders, two things are different now:

"Reflection": when does an agent fix its own mistakes?

Reflection means the agent looks at its own output and tries to improve it. The original work on this came from two well-known papers: Reflexion (Shinn et al., 2023) and Self-Refine (Madaan et al., 2023), both presented at NeurIPS 2023. The early excitement was that agents could just keep critiquing themselves and get better. The reality, after two more years of research, is more nuanced:

Memory and reasoning aren't really separate problems. In the best 2025 systems, structured memory feeds focused reasoning, the agent's reflections get written back to memory, and external checks (tests, types, tool errors) gate both.
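
A minimal shape of that loop, as a sketch: run_tests stands in for whatever external check you have (unit tests, a type checker, tool error codes), llm is any chat callable, and the number of self-critique rounds is capped so the agent can't spin forever.

def reflect_and_fix(llm, task: str, run_tests, max_rounds: int = 3) -> str:
    """Generate, check against an external signal, self-critique, retry."""
    draft = llm(f"Complete this task:\n{task}")
    for _ in range(max_rounds):
        ok, feedback = run_tests(draft)      # external gate: tests, types, tool errors
        if ok:
            break                            # stop as soon as the external check passes
        critique = llm(
            f"Task:\n{task}\n\nAttempt:\n{draft}\n\nIt failed with:\n{feedback}\n\n"
            "Explain what is wrong and how to fix it."
        )
        draft = llm(
            f"Task:\n{task}\n\nPrevious attempt:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Produce an improved attempt."
        )
    return draft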

Practical advice

This chapter covered what the agent stores about each conversation. The next chapter (Generalists & specialists) covers what the agent intrinsically knows about its domain: how generalist or specialist its scope is, where each piece of its knowledge lives (model weights, fine-tune, system prompt, tool catalog, retrieved chunks), and how the guardrails change as a result.