What if agents stopped answering and started watching?
Almost every agent shipped today lives inside a software workflow. It writes code, drafts emails, queries a database, fills a form, escalates a ticket. The whole architecture, from prompt to response, presupposes a conversation: someone asks, the agent answers, the conversation ends, the agent ceases to exist until the next call. This is the only kind of agent that exists in production at scale, and it is the only kind this manual has so far described in detail.
There is another kind, almost entirely unbuilt, that is going to matter more than the first. Think of an agent that does not wait for a question. It sits with a continuous stream of data. It develops a felt sense of what is normal in that stream over hours and days. It notices when something is forming. It interrupts a human, unprompted, to point at something the human would not otherwise have seen. It is not a function. It is a witness.
The reason agents are mostly in software right now is not coincidence; it is that software is the easiest domain for the architecture we have. Software bugs have textual signatures. Software workflows produce text trails. Software problems are mostly recombinations of past software problems. The moment you move into domains where the most important events are happening right now, in continuous signal, with no labeled training data because nobody has named the event yet, the call-and-response architecture stops being adequate. That is most of the actually consequential domains: clinical medicine, neuroscience, biomedical research, ecology, materials science, climate. The frontier is not bigger language models for software. The frontier is a different kind of agent for everywhere else.
What is already real (and what it tells us)
The vision above is not science fiction. It is grounded in research published, and in some cases deployed, in the last twelve to eighteen months. The honest case for the architecture below starts by naming what already works.
In intensive care units, AI early-warning systems for clinical deterioration have moved from retrospective validation to prospective deployment. A 2025 meta-analysis in BMC Medical Informatics and Decision Making reviewed five prospectively validated studies and concluded that AI-based deterioration models significantly reduced in-hospital and thirty-day mortality, with effect sizes large enough to make the case for broader rollout BMC Med Inform 2025. A separate medRxiv preprint reports that under traditional periodic checks, only thirty to fifty percent of patient deteriorations are identified before they become emergencies; an AI system embedded in nursing workflows across three medical centers and 42,759 hospitalizations was able to surface most of the missed cases in real time medRxiv 2025. These are not capability demos. They are clinical trials with mortality endpoints.
A clinical trial registered as NCT07307521 at Zhongshan Hospital in Shanghai, recruiting three hundred ICU patients, takes the architecture a step further. Instead of monitoring waveform data alone, the system uses ceiling-mounted cameras combined with physiologic data and environmental noise levels to detect early signs of agitation, confusion, or sleep disturbance that precede delirium AIME-ICU 2025. Faces are blurred or replaced with avatars to preserve privacy. The agent is, in a meaningful sense, watching: not in response to a question or a query, but continuously, reporting what it deems worth reporting.
In neuroimaging, foundation models trained on hundreds of thousands of subjects have been published this year that do something subtly different from classification: they reconstruct what a healthy brain ought to look like, then flag deviations. Brain Harmony (Dong et al., September 2025) compresses both structural and functional MRI into unified 1D tokens, integrating modalities that were previously analyzed in isolation Brain Harmony 2025. Prima (Lyu et al., September 2025) processes full real-world MRI studies and radiology reports, achieving a mean diagnostic AUROC of 0.92 across health-system-scale patient populations, with explainable reasoning attached Prima 2025. These models are not waiting for a clinician to point at an image and ask. They are processing the image and telling the clinician what to look at.
In neuroscience, the move toward real-time continuous decoding is even more striking. Meta FAIR's work on MEG-based visual perception decoding (Benchetrit, Banville, King) achieves brain-to-image reconstruction from signals sampled at rates approaching 5,000 Hz, three orders of magnitude faster than fMRI Meta FAIR 2024. The reason this matters for our argument is not the decoding accuracy; it is the architectural shift. The system does not produce a single answer to a single question. It produces a continuous stream of inferences about what the brain is doing right now, and it does this fast enough to close a feedback loop with the brain itself.
In biology and chemistry, autonomous self-driving laboratories are now executing end-to-end research cycles without human input. A July 2025 review in Royal Society Open Science catalogs the state of the field: today's most capable systems automate hypothesis generation, experimental design, execution, analysis, and hypothesis revision for the next cycle SDL Review 2025. Novartis's MicroCycle platform autonomously synthesizes new compounds, characterizes them, and chooses the next round. The SAMPLE platform deploys multiple intelligent agents to navigate the protein fitness landscape in parallel, converging on enzymes that are at least 12 °C more thermostable than starting sequences without any human intervention SAMPLE 2024. Periodic Labs, founded in 2025 by Liam Fedus (ex-OpenAI, co-creator of ChatGPT) and Ekin Dogus Cubuk (ex-Google DeepMind materials and chemistry lead), exists for the explicit purpose of building autonomous materials-discovery laboratories Periodic Labs 2025.
A 2025 Frontiers in Artificial Intelligence paper proposes a name for this trajectory: scAInce. The argument: we are no longer talking about AI as a tool inside science; we are talking about AI as a participant in science, with multimodal agentic systems that listen, see, speak, and act, orchestrating cloud software and physical laboratory hardware at a fluency that would have been speculative two years ago scAInce 2025. The progression from text agent to lab agent is a paradigm shift, not a feature update.
The architectural piece that is still missing
All of the systems above are real. Most of them are remarkable. But each, examined closely, is a sophisticated version of the same backward-looking architecture this manual has been describing throughout. The early-warning system runs a classifier over a sliding window. The autonomous lab follows a Bayesian-optimization-driven campaign whose objective was specified by humans. The neuroimaging foundation model produces a likelihood over known classes. The MEG decoder reconstructs perception against a corpus of seen images. None of these systems, as built today, do what a great clinician or scientist actually does: notice the unprecedented. The thing that has no class label yet. The signal that matters precisely because it does not match anything in the training distribution.
The missing piece has a name in cognitive neuroscience: it is the difference between recognition and attention. Recognition is matching against an existing template. Attention is the prior, deeper act of deciding what is worth examining at all. Current agents have astonishing recognition. They have almost no attention of their own. Every shred of attention they exhibit was placed there by a human prompt or a training-time objective.
What follows is what an agent built around attention rather than recognition would need to be. Three properties, one diagram.
The witness architecture
Below is the contrast in one image. On the left, the architecture this manual has spent twenty chapters describing: a function that takes a prompt and returns a response. On the right, the architecture the next decade will need, and that almost no one is building yet: a witness that exists continuously, holds an internal model of the stream it watches, and surfaces things to a human only when something crosses an internally computed threshold of significance.
Three properties make the witness different from any agent currently shipped at scale.
Continuity. The agent exists across time. It is not constructed-on-demand. Its state at minute N+1 is shaped by what it observed during minute N. Whatever working memory and attention it has accumulated over the last hour are part of how it processes the next second of input. This is trivially implementable as a long-running process; it is non-trivial as a model architecture, because no major foundation model today has a continuous internal state that updates faithfully across hours of streaming data.
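The process shape is easy to state in code even though it is hard to realize inside a model. A minimal sketch, with all names hypothetical: the agent is a long-lived object whose state at step N+1 is derived from its state at step N, never reassembled from an external context window.

```python
import collections
from dataclasses import dataclass, field

@dataclass
class WitnessState:
    """Persistent working memory: carried forward, never rebuilt."""
    recent: collections.deque = field(
        default_factory=lambda: collections.deque(maxlen=3600)  # last hour at 1 Hz
    )
    observations_seen: int = 0

class ContinuousAgent:
    """A witness is a process, not a function: observe() mutates state
    that shapes how every later observation is processed."""

    def __init__(self) -> None:
        self.state = WitnessState()

    def observe(self, value: float) -> None:
        self.state.recent.append(value)
        self.state.observations_seen += 1

    def context(self) -> float:
        """What the recent stream looked like, available to every new step."""
        return sum(self.state.recent) / max(1, len(self.state.recent))

agent = ContinuousAgent()
for value in (0.10, 0.20, 0.15):  # stand-in for a live sensor stream
    agent.observe(value)
assert agent.state.observations_seen == 3
```

The point of the sketch is the loop shape, not the data structure: today this continuity lives at the process level, outside the model, which is exactly the limitation the paragraph above names.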
An online model of normal. The agent is constantly building, in the background, a representation of "what this stream usually looks like." Surprise is computed against that model, not against a static training distribution. This is what makes the agent capable of detecting the unprecedented: not by recognizing it (it has no class for it), but by recognizing that whatever is happening right now does not fit the local model of normal. The technical name in the unsupervised-anomaly-detection literature is reconstruction-error-based detection, and recent neuroimaging work (Mahé et al., October 2025) shows it works empirically Mahé 2025. What is missing is the integration of this technique into a continuously running agentic loop rather than a batch analysis tool.
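The simplest possible instance of this idea fits in a few lines. The sketch below uses Welford's streaming mean/variance as the "model of normal" and an absolute z-score as the surprise signal; it is illustrative only, not the reconstruction-error method of the cited work, and every name in it is invented for this example.

```python
import math

class OnlineNormalModel:
    """Online model of 'normal' via Welford's streaming mean/variance.
    Surprise is measured against what this stream has actually looked
    like so far, not against a static training distribution."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x: float) -> None:
        """Fold a new observation into the model (Welford's algorithm)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def surprise(self, x: float) -> float:
        """Absolute z-score of x under the current model (0 if too few samples)."""
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

model = OnlineNormalModel()
for x in (10.0, 10.2, 9.9, 10.1, 10.0):
    model.update(x)
# A value far outside the stream's local normal scores as highly surprising,
# even though no class label for "this event" exists anywhere.
assert model.surprise(14.0) > model.surprise(10.05)
```

A real witness would replace the Gaussian with a learned reconstruction model, but the interface is the same: update continuously, query surprise at every step.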
The will to interrupt. The agent decides, on its own, when to surface something to a human. This is the hardest of the three because it requires the agent to have an internal value function over importance: not all anomalies are worth interrupting a clinician for. A great agent in this role behaves like a great resident: it pages the attending only when the situation actually warrants paging, and it has internalized over months which situations those are. We do not know how to train this. The closest analog in current systems is the constitutional AI literature on harm avoidance, which is shaped externally; what is needed is something more like a learned internal sense of significance, calibrated to the human it works with.
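There is no known training objective for this, but the shape of the calibration loop can be sketched. Everything below is illustrative, not a claim about any deployed system: a surprise threshold nudged by feedback from the human who gets paged, so the interruption rate drifts toward what that human actually finds worth seeing.

```python
class InterruptionPolicy:
    """Toy calibration loop for when to interrupt. A real system would
    learn a value function over importance; this just has the right shape."""

    def __init__(self, threshold: float = 3.0, step: float = 0.2) -> None:
        self.threshold = threshold  # surprise level required to page a human
        self.step = step            # how fast feedback moves the bar

    def should_interrupt(self, surprise: float) -> bool:
        return surprise > self.threshold

    def feedback(self, was_useful: bool) -> None:
        # Dismissed pages raise the bar; useful ones lower it slightly.
        if was_useful:
            self.threshold = max(1.0, self.threshold - self.step)
        else:
            self.threshold += self.step

policy = InterruptionPolicy()
assert policy.should_interrupt(3.5)
policy.feedback(was_useful=False)       # the clinician dismissed the page
assert not policy.should_interrupt(3.1)  # the agent is now harder to trigger
```

Note what the toy leaves out, which is everything hard: the cost asymmetry between missed signal and false alarm, and the fact that feedback arrives slowly, over months of deployment.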
Why this matters in domains the manual has not yet covered
The architecture above changes what is possible in five domains where current call-and-response agents are barely useful.
Clinical medicine. Hospitals are awash in data and starved of pattern-seers. A witness embedded in a hospital floor does not just answer "what's wrong with this patient?" It notices that three patients on different wards have presented with the same unusual lab pattern in the last two weeks; that the medication-error rate at 3 a.m. has been creeping up over six months in a way invisible to any single shift reviewer; that the time-to-treatment for stroke has subtly diverged between two demographic groups. These are exactly the kinds of patterns hospital epidemiologists try to find through chart review six months later. A continuously watching agent would catch them in real time, if its architecture allowed continuous attention with calibrated interruption. The early-warning systems we have today are the first step; they fire on per-patient deterioration. What comes next is per-floor, per-shift, per-population witnessing.
Neuroscience. A witness embedded in an fMRI session would notice that the activation pattern in this subject during this task looks unlike anything in the lab's prior corpus. Not "diagnose disease X" but "this is a new mode, why is this subject doing the task differently from the eighty before them?" A witness embedded in a neural-recording rig would hold representations at multiple timescales simultaneously, milliseconds for spiking and minutes for state transitions, and notice when patterns at one timescale begin to predict events at another. Brain Harmony's unification of structural and functional MRI into one token stream is the first time this kind of cross-timescale modeling is plausible. The witness layer on top of it is the missing piece.
Biomedical research. A witness watching a culture dish over seventy-two hours notices that one colony is doing something the others are not. It does not match the colony to a known phenotype; it flags the dissimilarity itself. A witness sitting in on lab meetings notices that what one researcher is finding in cancer cells looks structurally identical to what another is finding in stem cells, even though the two fields share no vocabulary. Cross-field synthesis at the pace at which science actually unfolds, not the post-hoc pace of survey papers. SAMPLE and MicroCycle are doing closed-loop optimization within a defined search space; the witness is what would notice when the search space itself was the wrong frame.
Field ecology and earth observation. A witness with continuous read access to satellite imagery, weather data, and ground sensors does not answer "is the temperature anomalous?" It notices that a stream-temperature trend, two-week-early vegetation phase shift, and divergent migration pattern, separately unremarkable, together describe a watershed beginning to fail in a way no historical analog matches. Climate science would be transformed by an agent that just sits with the data, all the time, and raises a hand when something is forming.
Materials science and chemistry. A witness embedded in a self-driving lab notices that the data from today's run looks strange in a way that was not what anyone was optimizing for. It suspends the planned campaign. It pursues the strangeness. Possibly it invents a new measurement protocol because the existing one is not sensitive to the weird thing it noticed. This is exactly what graduate students do when they make discoveries; the serendipity is not random but the directed pursuit of a noticed anomaly by an agent free to redirect itself. SDLs as currently built explicitly avoid this: they follow the campaign, they do not notice when the campaign is the wrong question.
Honest constraints on the architecture
Three things have to change before the witness becomes real, and none of them is a small engineering matter.
Continuous internal state in foundation models. The transformers that power current agents have no native concept of state that persists across calls. Every call starts with a context window assembled from outside. Hidden-state-passing architectures like Mamba and the more recent linear-attention work suggest a path, but no production system has built a continuous-attention agent on top of these. State-space models, by their nature, are well-suited to streaming sensor data; the bridge from streaming sensor data to language-mediated reasoning has not been built.
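For intuition only, here is the core recurrence that makes state-space models a plausible substrate. This is a scalar toy, not Mamba's actual parameterization: the hidden state updates in constant memory per step, so it can in principle run over an unbounded stream, which a context-window architecture cannot.

```python
def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Toy linear state-space recurrence:
        h_t = a * h_{t-1} + b * x_t
        y_t = c * h_t
    The state h persists and decays across the stream; nothing is
    rebuilt from scratch between inputs, and memory cost is O(1)."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x  # old state carried forward, new input folded in
        ys.append(c * h)
    return ys

out = ssm_scan([1.0, 0.0, 0.0, 0.0])
# The first input echoes through later steps, decayed by a=0.9 each step.
assert out[0] == 0.1 and abs(out[1] - 0.09) < 1e-9
```

The unsolved part the paragraph above names is the bridge: connecting a recurrence like this, running over raw sensor streams, to language-mediated reasoning about what the state currently contains.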
Online uncertainty-aware models of "normal." The unsupervised-anomaly-detection literature exists. What is missing is its embedding in an agent loop where the model of normal updates continuously and is queried at runtime to ask "how surprising is this current observation against what I have seen in the last six hours?" Conformal prediction, covered earlier in this manual, gives the statistical scaffolding. It has not been combined with the continuous-state machinery.
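As a sketch of how that runtime query might look, here is a sliding-window conformal p-value over recent surprise scores. It is illustrative only: real conformal guarantees rest on exchangeability assumptions that streaming, drifting data can violate, and the window size and scores below are invented.

```python
import collections

def conformal_pvalue(window, score):
    """Fraction of recent nonconformity scores at least as extreme as the
    new one: a distribution-free answer to 'how surprising is this
    observation against what I have seen lately?' (split-conformal style)."""
    ge = sum(1 for s in window if s >= score)
    return (ge + 1) / (len(window) + 1)

# e.g. one score per minute over the recent past, capped at six hours
recent = collections.deque([0.2, 0.5, 0.3, 0.4, 0.1], maxlen=360)

p = conformal_pvalue(recent, 0.9)
assert p == 1 / 6  # nothing in the window was as extreme: maximal surprise
```

Plugged into the witness loop, a small p-value would feed the interruption policy rather than trigger an alert directly; the conformal layer supplies calibration, not judgment.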
A learned sense of when to interrupt. This is the part that resists pure engineering. A witness that interrupts too often becomes noise; a witness that interrupts too rarely is useless. The right rate is calibrated to the human, the situation, and the cost of missed signal versus false alarm. There is no clean training objective for this; the only path forward seems to be long-running deployment with feedback from the humans the witness interrupts, which means we cannot ship the first version and expect it to work. We have to ship a version that learns, and we have to be willing to wait for it to learn.
What this manual does not tell you (and why)
Every chapter before this one has been about how to build a production agent system using techniques that work today. This chapter is different. The techniques described here do not yet work as a single composed system. The component pieces exist: foundation models for neuroimaging, autonomous laboratory loops, continuous patient monitoring, unsupervised anomaly detection, conformal prediction, hidden Markov modeling. The integration into a witness-shaped architecture is the unbuilt thing. If you are looking for a turnkey blueprint, it is not in this chapter, because it does not yet exist outside of research labs.
What this chapter offers instead is a reference architecture and a working hypothesis: the next decade of consequential agent work will be done by people who stop trying to make ChatGPT-shaped agents do science and start building entirely different systems for entirely different domains. The fact that almost no one is doing this, despite all the component pieces being publicly available, is the strangest fact about the current state of the field. The incentives all point toward better backward-looking agents for software workflows where there is revenue. The forward-looking work, the work that watches instead of answers, is almost untouched.
If you are an early-career researcher or engineer reading this looking for an open frontier with low competition and high stakes, that frontier is not in software agents. It is in clinics, neuroimaging suites, autonomous labs, and field stations. The first team to build a real witness will not look anything like an AI lab. It will look like a partnership between a foundation-model team, a domain group with continuous-data access, and a small group of people who have spent enough time with the domain experts to know what would actually be worth interrupting them for. That is the team that is going to matter most in the next ten years. This manual is not the manual for that team. The manual for that team has not been written.
The bottom line. The agents this manual has spent twenty chapters describing are powerful but architecturally backward-looking; they answer questions, they do not watch the world. The unbuilt architecture is a witness: continuous, maintaining an online model of normal, interrupting only with calibrated judgment, embedded in domains where the most consequential events are unfolding right now and have no name yet. The component pieces exist, individually published in the last twelve to eighteen months. The integration is the open problem. The frontier is not in better software agents; it is in moving the agentic stance out of software entirely. Whoever does that first will not just have a better product. They will have built a different kind of mind.