20
Adversarial · collusion, consensus, ground truth
When agents disagree. On purpose, or by accident.
The "everyone is friendly" assumption is comforting and wrong. Real agent systems have to handle agents whose goals don't line up:
- Disagreement on purpose. Red team vs blue team. Proposer vs critic. The whole point is for them to push back on each other.
- Different owners. Your purchasing agent talks to a vendor's sales agent. Each works for its own side.
- Compromised. One of your own agents reads a poisoned document, gets hijacked, and starts working for an attacker.
- Gaming the metric. A coding agent disables a failing test instead of fixing the bug behind it. The metric says "more tests passing", but the code is worse.
- Fighting for resources. Several agents share a budget. With no coordination, they all run themselves out of tokens.
- Goals that pull apart. One agent says "minimize cost", another says "maximize quality". Neither is wrong; they're just pulling in different directions.
Where the classic distributed-systems algorithms fit
Two ideas from regular distributed systems show up here: RAFT (a way to keep multiple servers in sync as long as they don't lie to each other) and BFT, short for Byzantine Fault Tolerance (a way to keep the system working even when some servers do lie). They aren't a perfect fit for AI agents, but they're useful.
RAFT: for the orchestrator infrastructure
- Run several copies of the orchestrator for reliability. RAFT picks one as leader; the others stay in sync as backups.
- Workflow state gets copied to all of them, so no work is lost when one dies.
- Each completed step is logged, so the system can pick up where it left off (a minimal sketch of this replicated step log follows this list).
- What it doesn't help with: the agent making things up, lying, or being prompt-injected. RAFT keeps your servers consistent with each other; it doesn't keep the AI honest.
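A minimal sketch of the replicated step log, under loose assumptions: the names (WorkflowStep, StepLogReplica, Orchestrator) are invented for illustration, there are no elections, terms, heartbeats, or log repair, and a real deployment would use an existing RAFT library or a replicated database rather than hand-rolling this. The only point is the shape of the idea: a step counts as done once a majority of replicas have stored it, so a restarted leader can resume from the last committed step.

from dataclasses import dataclass


@dataclass
class WorkflowStep:
    index: int
    description: str
    result: str


class StepLogReplica:
    # One copy of the orchestrator's log of completed workflow steps.
    def __init__(self, name: str):
        self.name = name
        self.log: list[WorkflowStep] = []

    def append(self, step: WorkflowStep) -> bool:
        self.log.append(step)
        return True  # acknowledge receipt to the leader


class Orchestrator:
    # The current leader. A step counts only once a majority of replicas has it.
    def __init__(self, replicas: list[StepLogReplica]):
        self.replicas = replicas
        self.committed: list[WorkflowStep] = []

    def record_step(self, description: str, result: str) -> bool:
        step = WorkflowStep(len(self.committed), description, result)
        acks = sum(r.append(step) for r in self.replicas)
        if acks >= len(self.replicas) // 2 + 1:  # majority quorum
            self.committed.append(step)
            return True
        return False  # not durable; retry or fail the step

    def resume_point(self) -> int:
        # After a crash, a new leader picks up from the last committed step.
        return len(self.committed)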
BFT: for combining answers from multiple agents
- Ask several agents the same question. Take the answer a supermajority agrees on: to tolerate f compromised agents you need 3f+1 agents in total, and you accept an answer only when at least 2f+1 of them match (see the sketch after this list).
- Useful for cross-checking across providers: GPT vs Claude vs Gemini.
- Protects you from one agent being compromised, since the others will outvote it.
- What it doesn't help with: mistakes that everyone makes the same way. If all your agents share similar training data, they'll often share the same blind spots and confidently agree on the wrong answer.
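A sketch of the exact-match version of that vote, before any of the semantic adaptations below; bft_vote is a made-up helper name, and the 3f+1 / 2f+1 arithmetic is the standard Byzantine quorum math.

from collections import Counter


def bft_vote(answers: list[str], f: int) -> str | None:
    # Exact-match Byzantine vote: tolerate up to f arbitrary (lying) agents.
    # Needs at least 3f + 1 answers; accept a value only if at least 2f + 1
    # agents returned it, so f liars can never outvote the honest majority.
    if len(answers) < 3 * f + 1:
        raise ValueError(f"need at least {3 * f + 1} answers to tolerate {f} faults")
    value, count = Counter(answers).most_common(1)[0]
    return value if count >= 2 * f + 1 else None  # None = no consensus, escalate


# With 4 agents, tolerating 1 compromised one:
# bft_vote(["42", "42", "42", "7"], f=1)  -> "42"
# bft_vote(["42", "42", "7", "7"], f=1)   -> None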
What works better in practice
- Mix different model families, not copies of the same model. A vote between GPT, Claude, and Gemini is far more useful than three copies of the same model voting.
- Group answers by meaning, not exact wording. Convert each answer to an embedding, cluster similar ones together, pick the biggest cluster. This handles "same answer, said differently".
- Use confidence, not just counts. Each agent reports how sure it is and what evidence it used. Weight votes by confidence and evidence, not just headcount.
- Have one agent answer and others check. Checking an answer is often easier than producing one. Use this as a cheap second opinion.
- When you can, check against the real world. The strongest signal isn't another agent; it's reality. Run the test. Query the database. Fetch the URL. Use voting only when there's no real-world check available.
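The sketch below combines the second and third items: cluster answers by embedding similarity, weight each cluster by self-reported confidence and attached evidence, and escalate to a human when the winning cluster's support is thin. The 0.85 similarity threshold and 0.6 escalation cutoff are arbitrary starting points, not recommendations.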
from dataclasses import dataclass

import numpy as np
from sklearn.cluster import AgglomerativeClustering


@dataclass
class AgentResponse:
    agent_id: str
    answer: str
    confidence: float      # 0..1, agent's self-rating
    evidence: list[str]    # urls, db rows, tool outputs
    embedding: np.ndarray  # pre-computed embedding of `answer`


def semantic_consensus(responses: list[AgentResponse], sim_threshold: float = 0.85) -> dict:
    # Cluster answers by meaning: a cosine distance below (1 - sim_threshold)
    # puts two answers in the same cluster even if the wording differs.
    embeddings = np.stack([r.embedding for r in responses])
    clusterer = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=1 - sim_threshold,
        metric="cosine",
        linkage="average",
    )
    labels = clusterer.fit_predict(embeddings)

    # Score each cluster, weighted by confidence × evidence count
    scores: dict[int, float] = {}
    for r, label in zip(responses, labels):
        weight = r.confidence * (1 + len(r.evidence))
        scores[label] = scores.get(label, 0.0) + weight

    winning = max(scores, key=scores.get)
    members = [r for r, l in zip(responses, labels) if l == winning]
    total = sum(scores.values())

    return {
        "answer": members[0].answer,                      # representative answer
        "support": scores[winning] / total,               # weighted share of the vote
        "agreeing_agents": [m.agent_id for m in members],
        "clusters": len(set(labels)),                     # how fragmented the answers were
        "requires_human": scores[winning] / total < 0.6,  # thin support -> escalate
    }
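A toy call, just to show the shape of the output; the three-dimensional embeddings are hard-coded stand-ins for whatever your embedding model actually produces, and the agent names and answers are invented.

# Two agents agree (similar embeddings), one disagrees.
responses = [
    AgentResponse("gpt", "Paris", 0.9, ["https://example.com/a"], np.array([1.0, 0.1, 0.0])),
    AgentResponse("claude", "It's Paris.", 0.8, [], np.array([0.9, 0.2, 0.0])),
    AgentResponse("gemini", "Lyon", 0.4, [], np.array([0.0, 1.0, 0.3])),
]
result = semantic_consensus(responses)
# result["answer"] -> "Paris", result["support"] -> ~0.87, result["requires_human"] -> False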
This is the same idea as classic Byzantine voting, but adapted for natural language: similarity instead of exact match, weighted votes instead of one-per-agent, groups of similar answers instead of identical-answer buckets.
RAFT for keeping your servers in sync. Meaning-based voting for combining agent opinions. Real-world checks for facts. These don't compete; they layer.
Further reading: Du et al. (ICML 2024) introduced the idea of having multiple agents debate. Liu et al. (ICLR 2025) showed that mixing model families in the debate keeps the agents from getting stuck on the same wrong answer. On accidentally adversarial agents: Lee et al. (arXiv 2024) on how a compromise spreads between agents, Shinn et al. (NeurIPS 2023) on agent self-reflection, and MAR (arXiv 2025), a follow-up that extends the idea to teams of agents.