HOW IT WORKS

Governance is tuned on your query mix, not shipped as one-size-fits-all thresholds. Each request involves two separate decisions: how much to search (cost) and whether to answer (risk).

REQUEST PIPELINE

cost routing · risk gate · audit trail

Blue steps route spend. Gold steps gate risk. Both paths end with an audit_id and structured policy reasons.

TWO DECISIONS (DO NOT CONFLATE)

Probe / fanout — cost router

Scores the first result batch, then decides whether to fan out. Starting defaults: 1 query above ~78% probe score, up to 3 above ~55%, full fanout below — derived from our 20-task Tavily benchmark corpus, not tuned for your tenant out of the box.

Thresholds are overridable per deployment. Calibrate on your logs (below) before trusting them in production.
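The routing rule above can be sketched as follows. This is a minimal illustration, assuming the probe score is a float in [0, 1] and that "full fanout" means five queries; neither is confirmed here. The names `ProbeThresholds` and `plan_fanout` are hypothetical and are not part of the query_fanout API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeThresholds:
    # Starting defaults quoted above; override per deployment.
    single_query_above: float = 0.78
    limited_fanout_above: float = 0.55
    limited_fanout_queries: int = 3
    full_fanout_queries: int = 5  # assumption: "full fanout" means 5 searches


def plan_fanout(probe_score: float, t: ProbeThresholds = ProbeThresholds()) -> int:
    """Return the maximum number of queries to run for a given probe score."""
    if not 0.0 <= probe_score <= 1.0:
        raise ValueError("probe_score must be within [0, 1]")
    if probe_score > t.single_query_above:
        return 1
    if probe_score > t.limited_fanout_above:
        return t.limited_fanout_queries
    return t.full_fanout_queries
```

Keeping the thresholds in one frozen object makes a per-deployment override a single constructor call, for example `plan_fanout(score, ProbeThresholds(single_query_above=0.85))`.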

Policy / allow_answer — risk gate

Runs on whatever evidence was retrieved. Checks overall confidence, qualifying source count, authority, and conflicts.

This is what legal and compliance stakeholders review — independent of probe routing.

SYNTHESIS CONTRACT

Retrieval governance is not enough if the agent layer free-formats answers. Every cleared response is wrapped in a synthesis contract (also returned as synthesis_contract JSON on POST /evidence):

REQUIRED | WHAT THE AGENT GETS
Citations | Claim + URL/title + per-source confidence and authority
Uncertainty statement | High / moderate / low band derived from overall confidence
Conflict disclosures | Each detected disagreement with confidence and authority gap
No-answer phrasing | Preset-specific lead when allow_answer: false (legal, support, research variants)

Your agent should render synthesis_contract.body or map the structured fields directly — not paraphrase from raw retrieval snippets.
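A rough sketch of that rendering rule, for illustration only. Only `synthesis_contract` and its `body` field are named on this page; the keys `citations`, `uncertainty` and `conflicts`, and the shape of each citation, are assumptions made up for this example, so check the actual POST /evidence response before relying on them.

```python
def render_contract(contract: dict) -> str:
    """Prefer the pre-rendered body; otherwise map structured fields."""
    body = contract.get("body")
    if body:
        return body

    lines = []
    # Hypothetical key and shape: one dict per citation.
    for c in contract.get("citations", []):
        lines.append(f"- {c['claim']} ({c['title']}, {c['url']})")
    # Hypothetical key: the high / moderate / low band.
    if "uncertainty" in contract:
        lines.append(f"Uncertainty: {contract['uncertainty']}")
    # Hypothetical key: one human-readable note per detected conflict.
    for note in contract.get("conflicts", []):
        lines.append(f"Conflict: {note}")
    return "\n".join(lines)
```

Either branch reads only from the contract, so nothing is paraphrased from raw retrieval snippets.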

CONFLICT DETECTION & RESOLUTION

Detection: qualifying evidence items are compared pairwise. When title-token overlap is high and snippet polarity diverges (positive vs negative framing), we record a Conflict with a confidence (the max of the two items) and an authority gap.
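The detection step can be approximated as below. This is a sketch under stated assumptions, not the shipped detector: Jaccard similarity stands in for "title-token overlap", a small cue-word list stands in for polarity analysis, and the 0.5 overlap cutoff is hypothetical. Only the conflict confidence (max of the pair) and the authority gap follow the description above directly.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Set

# Hypothetical cue words; a real polarity check would be more robust.
NEGATIVE_CUES = {"not", "no", "never", "cannot", "denied", "prohibited"}


@dataclass(frozen=True)
class Evidence:
    title: str
    snippet: str
    confidence: float  # 0..1
    authority: float   # 0..1


@dataclass(frozen=True)
class Conflict:
    a: Evidence
    b: Evidence
    confidence: float
    authority_gap: float


def _tokens(text: str) -> Set[str]:
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}


def _title_overlap(a: Evidence, b: Evidence) -> float:
    # Jaccard similarity of title tokens (assumed overlap measure).
    ta, tb = _tokens(a.title), _tokens(b.title)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def _is_negative(e: Evidence) -> bool:
    return bool(_tokens(e.snippet) & NEGATIVE_CUES)


def detect_conflicts(items: List[Evidence], min_overlap: float = 0.5) -> List[Conflict]:
    """Compare every pair; flag similar titles with diverging polarity."""
    conflicts = []
    for a, b in combinations(items, 2):
        if _title_overlap(a, b) >= min_overlap and _is_negative(a) != _is_negative(b):
            conflicts.append(Conflict(a, b,
                                      max(a.confidence, b.confidence),
                                      abs(a.authority - b.authority)))
    return conflicts
```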

RULE | DEFAULT THRESHOLD | ACTION
High-confidence conflict | conflict ≥ 65% (legal: 60%) | block (opposing claims too strong to clear)
Low-confidence conflict | conflict ≤ 45% (research: 50%) | search_more (retrieve before answering)
Authority-tier conflict | authority gap ≥ 20% (support: 15%) | escalate (human review across source tiers)

When multiple rules fire, the strictest action wins (block > escalate > search_more). legal also sets block_on_conflicts as a blanket fail-safe.
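The three rules and the strictest-wins ordering can be sketched like this. The thresholds are the defaults listed above; the names `ConflictRules` and `resolve_conflict` are hypothetical, and returning `None` when no rule fires is an assumption, since this page does not say what happens for conflicts between the two confidence thresholds.

```python
from dataclasses import dataclass
from typing import Optional

# Higher number = stricter action: block > escalate > search_more.
SEVERITY = {"block": 3, "escalate": 2, "search_more": 1}


@dataclass(frozen=True)
class ConflictRules:
    block_at_or_above: float = 0.65         # legal: 0.60
    search_more_at_or_below: float = 0.45   # research: 0.50
    escalate_gap_at_or_above: float = 0.20  # support: 0.15


def resolve_conflict(confidence: float, authority_gap: float,
                     rules: ConflictRules = ConflictRules()) -> Optional[str]:
    """Return the strictest action triggered by one conflict, or None."""
    fired = []
    if confidence >= rules.block_at_or_above:
        fired.append("block")
    if confidence <= rules.search_more_at_or_below:
        fired.append("search_more")
    if authority_gap >= rules.escalate_gap_at_or_above:
        fired.append("escalate")
    return max(fired, key=SEVERITY.__getitem__, default=None)
```

For example, a conflict at 40% confidence with a 25% authority gap fires both search_more and escalate, and the function returns escalate.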

POLICY PRESETS

Start with a named preset. Override thresholds via YAML or API for your workload.

QUALIFYING SOURCES (WHY LEGAL BLOCKS)

Policy does not count raw search hits. It counts qualifying sources after our evidence pass.

That is why legal can block at “2 qualifying sources; minimum is 3” even when the search API returned more raw results.

PRESET | MIN CONF | MIN SOURCES | MIN AUTH | ON FAIL | USE CASE
default | 50% | 1 | - | allow | General agents; returns draft synthesis tagged as not policy-cleared
support | 60% | 2 | - | escalate | Customer support; hand off to human when thin evidence
legal | 70% | 3 | 75% | block | Legal / compliance; blocks on conflict; no answer if bar not met
research | 55% | 2 | - | search_more | Internal research; expand retrieval instead of hard-blocking
calibrated | 55% | 2 | 50% | allow + tag | Draft / internal tools: always answers with confidence band + policy warnings, never hard-blocks
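The preset table above can be read as a small rule set. The sketch below is an illustration, not the shipped policy engine: it treats a missing MIN AUTH value as "no minimum" (an assumption), the function name `evaluate` and the wording of the confidence and authority reasons are hypothetical, and conflict rules are left out for brevity. The source-count reason matches the wording quoted above for the legal preset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Preset:
    min_conf: float
    min_sources: int
    min_auth: Optional[float]  # None where the table lists no minimum
    on_fail: str


PRESETS = {
    "default":    Preset(0.50, 1, None, "allow"),
    "support":    Preset(0.60, 2, None, "escalate"),
    "legal":      Preset(0.70, 3, 0.75, "block"),
    "research":   Preset(0.55, 2, None, "search_more"),
    "calibrated": Preset(0.55, 2, 0.50, "allow + tag"),
}


def evaluate(profile: str, confidence: float, qualifying_sources: int,
             authority: float) -> dict:
    """Apply one preset's minimums and return a policy decision."""
    p = PRESETS[profile]
    reasons = []
    if confidence < p.min_conf:
        reasons.append(f"confidence {confidence:.0%} below minimum {p.min_conf:.0%}")
    if qualifying_sources < p.min_sources:
        reasons.append(f"only {qualifying_sources} qualifying source(s); "
                       f"minimum is {p.min_sources}")
    if p.min_auth is not None and authority < p.min_auth:
        reasons.append(f"authority {authority:.0%} below minimum {p.min_auth:.0%}")
    action = "allow" if not reasons else p.on_fail
    # default and calibrated still answer on failure; the others withhold.
    allow_answer = action.startswith("allow")
    return {"allow_answer": allow_answer,
            "policy": {"profile": profile, "action": action, "reasons": reasons}}
```

Calling `evaluate("legal", 0.82, 2, 0.80)` yields a block with the single reason "only 2 qualifying source(s); minimum is 3", while the same evidence under `calibrated` is allowed.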

Thresholds should be tuned on your query mix. Run the offline benchmark locally, or send 3–7 days of JSONL logs for a design-partner autopsy (48h report).

Example block response (legal preset):

{
  "allow_answer": false,
  "audit_id": "99c08409-28ec-4427-...",
  "policy": {
    "profile": "legal",
    "action": "block",
    "reasons": ["only 2 qualifying source(s); minimum is 3"]
  }
}
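On the consuming side, a response like the one above might be handled as in the following sketch. It uses only the fields shown in the example; the helper name `summarize_decision` is hypothetical, and the truncated audit_id is kept exactly as printed above.

```python
import json

raw = """
{
  "allow_answer": false,
  "audit_id": "99c08409-28ec-4427-...",
  "policy": {
    "profile": "legal",
    "action": "block",
    "reasons": ["only 2 qualifying source(s); minimum is 3"]
  }
}
"""


def summarize_decision(payload: str) -> str:
    """Turn a policy response into a one-line log entry keyed by audit_id."""
    data = json.loads(payload)
    policy = data["policy"]
    if data["allow_answer"]:
        return f"[{data['audit_id']}] cleared under {policy['profile']}"
    reasons = "; ".join(policy["reasons"]) or "no reason given"
    return f"[{data['audit_id']}] {policy['action']} under {policy['profile']}: {reasons}"


print(summarize_decision(raw))
```

Logging the audit_id alongside the structured reasons keeps each refusal traceable without re-running retrieval.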

FAILURE MODES (UPFRONT)

False block (too strict)

Legal preset blocks a helpful support answer because evidence is thin or authority scores are low.

Mitigation: use support + escalate, lower min_sources in YAML, or run an autopsy to measure block rate on real queries.

Over-spend (probe too cautious)

Probe scores low on a factual query and fans out to 5 searches when 1 would suffice — policy may still allow the answer, but you paid extra.

Mitigation: calibrate probe thresholds on your corpus; autopsy quantifies over-search patterns by agent.

Under-search (probe too aggressive)

Probe stays at 1 query; evidence is thin; policy escalates or blocks. Spend is low but allow_answer is false: correct for risk, frustrating if the preset is wrong for the workflow.

CALIBRATION PATHS

You do not need to send production logs to start evaluating the architecture.

PATH | TRUST REQUIRED | WHAT YOU GET
Offline benchmark | None (runs locally) | python examples/gaia-baseline/run_benchmark.py --demo: mock backend, instant naive vs governed diff. Wikipedia + legal preset blocks 4/12 tasks in our published demo set.
CLI autopsy | Your machine only | query-fanout-validate-logs + query-fanout-autopsy on any JSONL export; no hosted upload.
Design-partner autopsy | Send redacted logs | 48h report: spend model, block/escalate patterns, recommended preset thresholds for your agents.

Published live benchmark: 20 production-style tasks, 100→26 searches (74% reduction) on Tavily; see landing stats. Numbers vary by corpus; treat as reference, not an SLA.

SEARCH PROVIDERS

Tavily is our reference benchmark preset. The control plane sits above your search API.

Policy, audit, and allow_answer are provider-agnostic. Swap the adapter; keep the governance layer.
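One way to picture the adapter boundary is shown below. This is a conceptual sketch, not the query_fanout interface: `SearchAdapter`, `StaticAdapter`, `governed_search` and the per-hit `confidence` field are all hypothetical names, and the confidence floor is only a stand-in for the evidence pass that decides which sources qualify.

```python
from typing import Dict, List, Protocol


class SearchAdapter(Protocol):
    """Anything with a search() method can sit below the governance layer."""

    def search(self, query: str) -> List[Dict]: ...


class StaticAdapter:
    """Stand-in provider that returns canned hits; swap in a real adapter."""

    def __init__(self, hits: List[Dict]) -> None:
        self._hits = hits

    def search(self, query: str) -> List[Dict]:
        return list(self._hits)


def governed_search(adapter: SearchAdapter, query: str, min_sources: int,
                    min_hit_confidence: float = 0.5) -> Dict:
    """Governance sees only normalized hits, never the provider itself."""
    hits = adapter.search(query)
    # Stand-in qualification rule: keep hits at or above a confidence floor.
    qualifying = [h for h in hits if h.get("confidence", 0.0) >= min_hit_confidence]
    return {"allow_answer": len(qualifying) >= min_sources,
            "raw_hits": len(hits),
            "qualifying_sources": len(qualifying)}
```

Because `governed_search` depends only on the protocol, replacing `StaticAdapter` with another provider wrapper leaves the policy code untouched.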

CONFIGURATION & PRECEDENCE

METHOD | WHO | NOTES
API param policy=legal | App engineers | Named preset on POST /evidence
YAML policy_file | Platform / compliance | Wins over policy= when both are set; deterministic override
Hosted dashboard | Operators | /dashboard: read-only audit metrics; does not change policy

Precedence: policy_file (YAML) > policy API param > preset defaults. Set one source of truth per environment to avoid surprises.
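The precedence order can be sketched as a simple resolver. Two assumptions are baked in and are not confirmed by this page: the YAML file replaces the API preset wholesale rather than merging field by field, and "preset defaults" means the `default` preset. The function name `resolve_policy` is hypothetical, and the YAML is assumed to be already parsed into a dict.

```python
from typing import Dict, Optional, Tuple

# Two presets taken from the table on this page, for illustration.
PRESETS: Dict[str, Dict] = {
    "default": {"min_conf": 0.50, "min_sources": 1},
    "legal": {"min_conf": 0.70, "min_sources": 3, "min_auth": 0.75},
}


def resolve_policy(policy_file: Optional[Dict] = None,
                   policy_param: Optional[str] = None) -> Tuple[Dict, str]:
    """Return (effective policy, source): file > API param > preset defaults."""
    if policy_file is not None:
        return dict(policy_file), "policy_file"
    if policy_param is not None:
        return dict(PRESETS[policy_param]), "policy"
    return dict(PRESETS["default"]), "preset defaults"
```

Returning the source alongside the policy makes it easy to log which layer won in each environment.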

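import asyncio

from query_fanout import RetrievalClient


async def main() -> None:
    # await is only valid inside a coroutine, so the call lives in main().
    client = RetrievalClient(preset="tavily", policy="legal", agent_id="support-bot")
    report = await client.retrieve("How do refund policies work?")

    if report.allow_answer:
        answer(report.synthesis)  # answer() is your application's render hook
    else:
        # escalate() is your application's hand-off hook
        escalate(report.audit_id, report.policy.reasons)


asyncio.run(main())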

Need soft answers instead of hard blocks? Use calibrated for draft generation (confidence band + warnings, always allow_answer: true), default for permissive general use, or research to expand retrieval before failing.

Ready with logs? Autopsy submission guide · autopsy@queryfanout.dev