How It Works

REQUEST PIPELINE

flowchart LR
    A["AGENT<br/>your loop"] --> B["PROBE<br/>1 search"]
    B --> C["RETRIEVE<br/>0-N more"]
    C --> D["POLICY<br/>gate"]
    D --> E["ALLOW<br/>answer"]
    D --> F["ESCALATE<br/>block"]
    E --> G["AUDIT<br/>audit_id"]
    F --> G

    classDef blue fill:#3b82f6,stroke:#1e40af,color:#fff
    classDef yellow fill:#eab308,stroke:#854d0e,color:#000
    classDef green fill:#22c55e,stroke:#166534,color:#fff

    class B,C blue
    class D yellow
    class E,F,G green

Legend: cost routing · risk gate · audit trail

Blue steps route spend. The yellow step gates risk. Both paths end with an audit_id and structured policy reasons.

TWO DECISIONS (DO NOT CONFLATE)

Probe / fanout — cost router

Scores the first result batch, then decides whether to fan out. Starting defaults: stop at 1 query when the probe score is above ~78%, run up to 3 queries when it is above ~55%, and run the full fanout below that. These defaults are derived from our 20-task Tavily benchmark corpus and are not tuned for your tenant out of the box.

Thresholds are overridable per deployment. Calibrate on your logs (below) before trusting them in production.
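The routing rule above can be sketched as a small function. This is an illustrative sketch, not the shipped router: the function name `query_budget`, its parameter names, the inclusive comparisons, reading "up to 3" as a total query count, and the full-fanout size of 5 are all assumptions made for the example.

```python
def query_budget(
    probe_score: float,
    *,
    single_above: float = 0.78,   # documented starting default (~78%)
    limited_above: float = 0.55,  # documented starting default (~55%)
    limited_cap: int = 3,         # "up to 3", read here as a total (assumption)
    full_fanout: int = 5,         # assumed fanout size, not a documented value
) -> int:
    """Return the maximum number of searches to run for one request."""
    if not 0.0 <= probe_score <= 1.0:
        raise ValueError("probe_score must be in [0, 1]")
    if probe_score >= single_above:
        return 1             # strong probe: the first query is enough
    if probe_score >= limited_above:
        return limited_cap   # middling probe: allow a limited fanout
    return full_fanout       # weak probe: run the full fanout


if __name__ == "__main__":
    for score in (0.90, 0.60, 0.30):
        print(score, query_budget(score))
```

Keeping the thresholds as keyword arguments makes the per-deployment override explicit at the call site.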

Policy / allow_answer — risk gate

Runs on whatever evidence was retrieved. Checks overall confidence, qualifying source count, authority, and conflicts.

This is what legal and compliance stakeholders review — independent of probe routing.

SYNTHESIS CONTRACT

Retrieval governance is not enough if the agent layer free-formats answers. Every cleared response is wrapped in a synthesis contract (also returned as synthesis_contract JSON on POST /evidence):

REQUIRED	WHAT THE AGENT GETS
Citations	Claim + URL/title + per-source confidence and authority
Uncertainty statement	High / moderate / low band derived from overall confidence
Conflict disclosures	Each detected disagreement with confidence and authority gap
No-answer phrasing	Preset-specific lead when `allow_answer: false` (legal, support, research variants)

Your agent should render synthesis_contract.body or map the structured fields directly — not paraphrase from raw retrieval snippets.
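A minimal sketch of that rendering rule, assuming the contract arrives as a plain dict. Only `body` is named in this document; the `citations`, `claim`, `url`, and `uncertainty` keys below are hypothetical field names chosen for illustration, so check them against the JSON your deployment actually returns.

```python
def render_contract(contract: dict) -> str:
    """Prefer the pre-rendered body; otherwise assemble text from structured fields."""
    if contract.get("body"):
        return contract["body"]

    # Fallback: map structured fields directly (field names are hypothetical).
    parts = [
        f"{c['claim']} (source: {c['url']})"
        for c in contract.get("citations", [])
    ]
    if contract.get("uncertainty"):
        parts.append(f"Uncertainty: {contract['uncertainty']}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_contract({
        "citations": [{"claim": "Refunds take 5 days", "url": "https://example.invalid/a"}],
        "uncertainty": "moderate",
    }))
```

Either path keeps the agent's output tied to the contract rather than to raw retrieval snippets.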

CONFLICT DETECTION & RESOLUTION

Detection: qualifying evidence items are compared pairwise. When two items have high title-token overlap and their snippets diverge in polarity (positive vs negative framing), we record a Conflict with a confidence (the max of the two items' confidences) and an authority gap.
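The detection step can be approximated as follows. This is a simplified sketch under stated assumptions: Jaccard similarity stands in for "title-token overlap", a small negation-word list stands in for polarity, the 0.6 overlap cutoff is arbitrary, and the authority gap is taken as the absolute difference of the two authority scores. None of these details are confirmed by the description above, and the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from itertools import combinations

# Toy polarity cue list (assumption); a real polarity check would be richer.
NEGATIVE_CUES = {"not", "no", "never", "cannot", "without", "denied", "prohibited"}


@dataclass(frozen=True)
class Evidence:
    title: str
    snippet: str
    confidence: float
    authority: float


@dataclass(frozen=True)
class Conflict:
    a: Evidence
    b: Evidence
    confidence: float      # max of the two item confidences
    authority_gap: float   # absolute authority difference (assumption)


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _title_overlap(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def _is_negative(snippet: str) -> bool:
    return bool(_tokens(snippet) & NEGATIVE_CUES)


def detect_conflicts(items: list, min_overlap: float = 0.6) -> list:
    """Compare every pair once; record pairs on the same topic with opposite framing."""
    conflicts = []
    for a, b in combinations(items, 2):
        if _title_overlap(a.title, b.title) < min_overlap:
            continue  # different topics
        if _is_negative(a.snippet) == _is_negative(b.snippet):
            continue  # same framing, no disagreement
        conflicts.append(
            Conflict(a, b, max(a.confidence, b.confidence), abs(a.authority - b.authority))
        )
    return conflicts
```

The pairwise loop is quadratic in the number of qualifying items, which is usually acceptable because low-confidence items are dropped before this stage.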

RULE	DEFAULT THRESHOLD	ACTION
High-confidence conflict	conflict ≥ 65% (legal: 60%)	block — opposing claims too strong to clear
Low-confidence conflict	conflict ≤ 45% (research: 50%)	search_more — retrieve before answering
Authority-tier conflict	authority gap ≥ 20% (support: 15%)	escalate — human review across source tiers

When multiple rules fire, the strictest action wins (block > escalate > search_more). legal also sets block_on_conflicts as a blanket fail-safe.
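The three rules and the strictest-wins ordering can be sketched like this. The thresholds are the defaults from the table; the function name, parameter names, and the choice to return `None` when no rule fires are assumptions for illustration.

```python
from typing import Optional

# Strictness ranking: block > escalate > search_more.
SEVERITY = {"search_more": 1, "escalate": 2, "block": 3}


def conflict_action(
    confidence: float,
    authority_gap: float,
    *,
    block_at: float = 0.65,          # legal preset: 0.60
    search_more_at: float = 0.45,    # research preset: 0.50
    authority_gap_at: float = 0.20,  # support preset: 0.15
) -> Optional[str]:
    """Return the strictest action among the rules that fire, or None if none do."""
    fired = []
    if confidence >= block_at:
        fired.append("block")
    if confidence <= search_more_at:
        fired.append("search_more")
    if authority_gap >= authority_gap_at:
        fired.append("escalate")
    return max(fired, key=SEVERITY.__getitem__) if fired else None


if __name__ == "__main__":
    print(conflict_action(0.40, 0.25))  # two rules fire; the stricter one is returned
```

A conflict with confidence between the two confidence thresholds and a small authority gap fires no rule in this sketch; how the product treats that middle band is not specified here.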

POLICY PRESETS

Start with a named preset. Override thresholds via YAML or API for your workload.

QUALIFYING SOURCES (WHY LEGAL BLOCKS)

Policy does not count raw search hits. It counts qualifying sources after our evidence pass:

Item confidence — our score, not the provider’s alone. We blend provider relevance, query–snippet relevance (hash or embedding re-rank), domain authority heuristics, and cross-query consensus.
Threshold — items below 40% item confidence are dropped before source counting.
Dedup — remaining items deduped by URL/title so duplicate hits don’t inflate counts.

That is why legal can block at “2 qualifying sources; minimum is 3” even when the search API returned more raw results.
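The threshold and dedup steps can be sketched as below. The 40% cutoff comes from the list above; the `Item` class, the normalization of URLs and titles, and the choice to keep the highest-confidence copy of a duplicate are assumptions for illustration. The blended confidence score itself is taken as an input, since its exact formula is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    url: str
    title: str
    confidence: float  # blended item confidence in [0, 1]


def qualifying_sources(items: list, min_item_confidence: float = 0.40) -> list:
    """Drop low-confidence items, then dedupe by URL or title."""
    seen_urls, seen_titles = set(), set()
    kept = []
    # Visit strongest items first so the best copy of a duplicate survives.
    for item in sorted(items, key=lambda i: i.confidence, reverse=True):
        if item.confidence < min_item_confidence:
            continue
        url = item.url.rstrip("/").lower()
        title = item.title.strip().lower()
        if url in seen_urls or title in seen_titles:
            continue
        seen_urls.add(url)
        seen_titles.add(title)
        kept.append(item)
    return kept


if __name__ == "__main__":
    raw = [
        Item("https://example.invalid/a", "Refund policy", 0.82),
        Item("https://example.invalid/a/", "Refund policy (mirror)", 0.70),  # same URL
        Item("https://example.invalid/b", "Forum thread", 0.35),             # below 40%
        Item("https://example.invalid/c", "Terms of service", 0.55),
        Item("https://example.invalid/d", "Terms of Service", 0.50),         # same title
    ]
    print(len(raw), "raw hits ->", len(qualifying_sources(raw)), "qualifying sources")
```

In this toy input, five raw hits reduce to two qualifying sources, which is the kind of gap that produces a "minimum is 3" block under legal.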

PRESET	MIN CONF	MIN SOURCES	MIN AUTH	ON FAIL	USE CASE
`default`	50%	1	—	allow	General agents; returns draft synthesis tagged as not policy-cleared
`support`	60%	2	—	escalate	Customer support; hand off to human when thin evidence
`legal`	70%	3	75%	block	Legal / compliance; blocks on conflict; no answer if bar not met
`research`	55%	2	—	search_more	Internal research; expand retrieval instead of hard-blocking
`calibrated`	55%	2	50%	allow + tag	Draft / internal tools — always answers with confidence band + policy warnings, never hard-blocks
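To make the table concrete, here is a sketch of a gate that applies the preset minimums and returns a response shaped like the block example in this section. The numbers are copied from the table; the `Preset` class, the `evaluate` function, and the wording of the confidence and authority reasons are hypothetical. The `allow + tag` behaviour of `calibrated` is simplified to `allow` with the reasons kept as warnings, and conflict handling and `audit_id` are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Preset:
    min_confidence: float
    min_sources: int
    min_authority: Optional[float]
    on_fail: str


PRESETS = {
    "default": Preset(0.50, 1, None, "allow"),
    "support": Preset(0.60, 2, None, "escalate"),
    "legal": Preset(0.70, 3, 0.75, "block"),
    "research": Preset(0.55, 2, None, "search_more"),
    "calibrated": Preset(0.55, 2, 0.50, "allow"),  # "allow + tag", simplified
}


def evaluate(profile: str, confidence: float, sources: int, authority: float) -> dict:
    """Check the evidence against one preset and build a policy result."""
    p = PRESETS[profile]
    reasons = []
    if confidence < p.min_confidence:
        reasons.append(
            f"overall confidence {confidence:.0%} is below minimum {p.min_confidence:.0%}"
        )
    if sources < p.min_sources:
        reasons.append(f"only {sources} qualifying source(s); minimum is {p.min_sources}")
    if p.min_authority is not None and authority < p.min_authority:
        reasons.append(f"authority {authority:.0%} is below minimum {p.min_authority:.0%}")

    action = p.on_fail if reasons else "allow"
    return {
        "allow_answer": action == "allow",
        "policy": {"profile": profile, "action": action, "reasons": reasons},
    }


if __name__ == "__main__":
    print(evaluate("legal", confidence=0.80, sources=2, authority=0.90))
```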

Thresholds should be tuned on your query mix. Run the offline benchmark locally, or send 3–7 days of JSONL logs for a design-partner autopsy (48h report).


Example block response (legal preset):

{
  "allow_answer": false,
  "audit_id": "99c08409-28ec-4427-...",
  "policy": {
    "profile": "legal",
    "action": "block",
    "reasons": ["only 2 qualifying source(s); minimum is 3"]
  }
}

FAILURE MODES (UPFRONT)

False block (too strict)

Legal preset blocks a helpful support answer because evidence is thin or authority scores are low.

Mitigation: use support + escalate, lower min_sources in YAML, or run an autopsy to measure block rate on real queries.

Over-spend (probe too cautious)

Probe scores low on a factual query and fans out to 5 searches when 1 would suffice — policy may still allow the answer, but you paid extra.

Mitigation: calibrate probe thresholds on your corpus; autopsy quantifies over-search patterns by agent.

Under-search (probe too aggressive)

Probe stays at 1 query; evidence is thin; policy escalates or blocks. Spend is low but allow_answer is false: correct for risk, but frustrating if the preset is wrong for the workflow.

CALIBRATION PATHS

You do not need to send production logs to start evaluating the architecture.

PATH	TRUST REQUIRED	WHAT YOU GET
Offline benchmark	None — runs locally	`python examples/gaia-baseline/run_benchmark.py --demo` — mock backend, instant naive vs governed diff. Wikipedia + legal preset blocks 4/12 tasks in our published demo set.
CLI autopsy	Your machine only	`query-fanout-validate-logs` + `query-fanout-autopsy` on any JSONL export — no hosted upload.
Design-partner autopsy	Send redacted logs	48h report: spend model, block/escalate patterns, recommended preset thresholds for your agents.

Published live benchmark: 20 production-style tasks, 100→26 searches (74% reduction) on Tavily (see landing stats). Numbers vary by corpus; treat them as a reference, not an SLA.

SEARCH PROVIDERS

Tavily is our reference benchmark preset. The control plane sits above your search API:

mock — offline control plane: deterministic fake results for CI, local dev, and running probe/policy/audit without any search API key. Not just a test double; full pipeline works end-to-end.
wikipedia — offline real retrieval for demos and benchmarks
tavily — live paid search (reference benchmark)
Generic http backend in YAML for Exa, Brave, Serper, or internal search

Policy, audit, and allow_answer are provider-agnostic. Swap the adapter; keep the governance layer.
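One way to picture the adapter boundary is a small interface that every backend satisfies, with a deterministic mock behind it. This is a sketch of the pattern only: `SearchBackend`, `MockBackend`, the `search` signature, and the result keys are hypothetical names, not the library's actual adapter API.

```python
import hashlib
from typing import Protocol


class SearchBackend(Protocol):
    """Anything that can turn a query into a list of scored results."""

    def search(self, query: str, limit: int) -> list: ...


class MockBackend:
    """Deterministic fake results: the same query always yields the same hits."""

    def search(self, query: str, limit: int) -> list:
        results = []
        for rank in range(limit):
            digest = hashlib.sha256(f"{query}:{rank}".encode()).hexdigest()
            results.append({
                "url": f"https://example.invalid/{digest[:12]}",
                "title": f"Mock result {rank + 1} for {query}",
                "score": int(digest[:4], 16) / 0xFFFF,  # pseudo-score in [0, 1]
            })
        return results


def run_probe(backend: SearchBackend, query: str, limit: int = 3) -> list:
    """Governance code depends only on the interface, never on a provider."""
    return backend.search(query, limit)


if __name__ == "__main__":
    for hit in run_probe(MockBackend(), "refund policy"):
        print(hit["title"], round(hit["score"], 2))
```

Because the scores are derived from a hash of the query, CI runs are repeatable without an API key.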

CONFIGURATION & PRECEDENCE

METHOD	WHO	NOTES
API param `policy=legal`	App engineers	Named preset on `POST /evidence`
YAML `policy_file`	Platform / compliance	Wins over `policy=` when both are set — deterministic override
Hosted dashboard	Operators	/dashboard — read-only audit metrics; does not change policy

Precedence: policy_file (YAML) > policy API param > preset defaults. Set one source of truth per environment to avoid surprises.
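The precedence rule reads naturally as a first-match lookup. The sketch below only illustrates the ordering; the function name, the returned tuple, and the fallback to a preset named `default` are assumptions rather than documented behaviour.

```python
from typing import Optional, Tuple


def effective_policy(
    policy_file: Optional[str],
    policy_param: Optional[str],
    fallback_preset: str = "default",  # assumed fallback name
) -> Tuple[str, str]:
    """Return (source, value) for the policy that wins under the precedence rule."""
    if policy_file:
        return ("policy_file", policy_file)   # YAML always wins
    if policy_param:
        return ("policy_param", policy_param)
    return ("preset_default", fallback_preset)


if __name__ == "__main__":
    # Both set: the YAML file overrides the API parameter.
    print(effective_policy("policies/compliance.yaml", "support"))
```

Logging the winning source alongside each audit_id is a cheap way to catch environments where both are set by accident.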

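import asyncio

from query_fanout import RetrievalClient


async def main() -> None:
    client = RetrievalClient(preset="tavily", policy="legal", agent_id="support-bot")
    report = await client.retrieve("How do refund policies work?")

    # answer() and escalate() are your application's own handlers.
    if report.allow_answer:
        answer(report.synthesis)
    else:
        # Keep the audit_id and reasons so reviewers can trace the decision.
        escalate(report.audit_id, report.policy.reasons)


asyncio.run(main())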

Need soft answers instead of hard blocks? Use calibrated for draft generation (confidence band + warnings, always allow_answer: true), default for permissive general use, or research to expand retrieval before failing.