Governance is tuned on your query mix — not shipped as one-size-fits-all thresholds. Two separate decisions per request: how much to search (cost) and whether to answer (risk).
flowchart LR
A[AGENT<br/>your loop] --> B[PROBE<br/>1 search]
B --> C[RETRIEVE<br/>0-N more]
C --> D[POLICY<br/>gate]
D --> E[ALLOW<br/>answer]
D --> F[ESCALATE<br/>block]
E --> G[AUDIT<br/>audit_id]
F --> G
classDef blue fill:#3b82f6,stroke:#1e40af,color:#fff
classDef yellow fill:#eab308,stroke:#854d0e,color:#000
classDef green fill:#22c55e,stroke:#166534,color:#fff
class B,C blue
class D yellow
class E,F,G green
Blue steps route spend. The yellow step gates risk. Both paths end with an audit_id and structured policy reasons.
Probe / fanout — cost router
Scores the first result batch, then decides whether to fan out. Starting defaults: 1 query above ~78% probe score, up to 3 above ~55%, full fanout below — derived from our 20-task Tavily benchmark corpus, not tuned for your tenant out of the box.
Thresholds are overridable per deployment. Calibrate on your logs (below) before trusting them in production.
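The routing rule above can be sketched as a small function. This is an illustrative sketch, not the shipped router: the function name `plan_fanout`, its parameter names, the inclusive comparisons, and the full-fanout size of 5 (taken from the over-spend example further down) are all assumptions.

```python
def plan_fanout(
    probe_score: float,
    single_query_min: float = 0.78,    # starting default from the text (~78%)
    limited_fanout_min: float = 0.55,  # starting default from the text (~55%)
    full_fanout: int = 5,              # assumed size of a "full fanout"
) -> int:
    """Return the maximum number of searches to spend on this request."""
    if not 0.0 <= probe_score <= 1.0:
        raise ValueError("probe_score must be in [0, 1]")
    if probe_score >= single_query_min:
        return 1            # the probe result is good enough on its own
    if probe_score >= limited_fanout_min:
        return 3            # moderate score: allow up to 3 searches
    return full_fanout      # weak probe: fan out fully


if __name__ == "__main__":
    for score in (0.90, 0.60, 0.20):
        print(score, plan_fanout(score))
```

Keeping the thresholds as parameters mirrors the per-deployment override described above: calibrating on your own logs then only means passing different numbers.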
Policy / allow_answer — risk gate
Runs on whatever evidence was retrieved. Checks overall confidence, qualifying source count, authority, and conflicts.
This is what legal and compliance stakeholders review — independent of probe routing.
Retrieval governance is not enough if the agent layer free-formats answers. Every cleared response is wrapped in a synthesis contract (also returned as synthesis_contract JSON on POST /evidence):
| REQUIRED | WHAT THE AGENT GETS |
|---|---|
| Citations | Claim + URL/title + per-source confidence and authority |
| Uncertainty statement | High / moderate / low band derived from overall confidence |
| Conflict disclosures | Each detected disagreement with confidence and authority gap |
| No-answer phrasing | Preset-specific lead when allow_answer: false (legal, support, research variants) |
Your agent should render synthesis_contract.body or map the structured fields directly — not paraphrase from raw retrieval snippets.
Detection: qualifying evidence items are compared pairwise. When title-token overlap is high and snippet polarity diverges (positive vs negative framing), we record a Conflict with a confidence (the max of the two items' confidences) and an authority gap.
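A minimal sketch of that pairwise check, under stated assumptions: the `Evidence` and `Conflict` field names, the Jaccard overlap measure, the 0.5 overlap cutoff, and the keyword-based polarity cue lists are hypothetical stand-ins, not the production detector. The authority gap is assumed to be the absolute difference between the two items' authority scores.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List

# Hypothetical polarity cues; a real detector would likely be more robust.
POSITIVE = {"allowed", "eligible", "supported", "yes", "can"}
NEGATIVE = {"not", "no", "denied", "prohibited", "cannot", "unsupported"}


@dataclass
class Evidence:
    title: str
    snippet: str
    confidence: float
    authority: float


@dataclass
class Conflict:
    a: Evidence
    b: Evidence
    confidence: float      # max of the two items' confidences
    authority_gap: float   # assumed: absolute authority difference


def _tokens(text: str) -> set:
    return set(text.lower().split())


def title_overlap(a: Evidence, b: Evidence) -> float:
    """Jaccard overlap of title tokens (an assumed overlap measure)."""
    ta, tb = _tokens(a.title), _tokens(b.title)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def polarity(snippet: str) -> int:
    """Return -1 for negative framing, +1 for positive, 0 if no cue is found."""
    words = _tokens(snippet)
    if words & NEGATIVE:
        return -1
    if words & POSITIVE:
        return 1
    return 0


def detect_conflicts(items: List[Evidence], min_overlap: float = 0.5) -> List[Conflict]:
    conflicts = []
    for a, b in combinations(items, 2):
        opposite = polarity(a.snippet) * polarity(b.snippet) == -1
        if opposite and title_overlap(a, b) >= min_overlap:
            conflicts.append(
                Conflict(a, b, max(a.confidence, b.confidence),
                         abs(a.authority - b.authority))
            )
    return conflicts
```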
| RULE | DEFAULT THRESHOLD | ACTION |
|---|---|---|
| High-confidence conflict | conflict ≥ 65% (legal: 60%) | block — opposing claims too strong to clear |
| Low-confidence conflict | conflict ≤ 45% (research: 50%) | search_more — retrieve before answering |
| Authority-tier conflict | authority gap ≥ 20% (support: 15%) | escalate — human review across source tiers |
When multiple rules fire, the strictest action wins (block > escalate > search_more). legal also sets block_on_conflicts as a blanket fail-safe.
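The table and the strictest-wins rule can be expressed as follows. This is a sketch assuming inclusive comparisons and hypothetical names (`conflict_action`, `DEFAULT_THRESHOLDS`, `PRESET_OVERRIDES`); the threshold values and preset overrides are the ones listed above.

```python
from typing import Optional

# Higher number = stricter action (block > escalate > search_more).
SEVERITY = {"search_more": 1, "escalate": 2, "block": 3}

DEFAULT_THRESHOLDS = {"high_conflict": 0.65, "low_conflict": 0.45, "authority_gap": 0.20}
PRESET_OVERRIDES = {
    "legal": {"high_conflict": 0.60},
    "research": {"low_conflict": 0.50},
    "support": {"authority_gap": 0.15},
}
# Presets with the blanket block_on_conflicts fail-safe.
BLOCK_ON_CONFLICTS = {"legal"}


def conflict_action(confidence: float, authority_gap: float,
                    preset: str = "default") -> Optional[str]:
    """Return the strictest action triggered by one detected conflict, or None."""
    if preset in BLOCK_ON_CONFLICTS:
        return "block"  # any detected conflict blocks under this preset
    t = {**DEFAULT_THRESHOLDS, **PRESET_OVERRIDES.get(preset, {})}
    fired = []
    if confidence >= t["high_conflict"]:
        fired.append("block")
    if confidence <= t["low_conflict"]:
        fired.append("search_more")
    if authority_gap >= t["authority_gap"]:
        fired.append("escalate")
    if not fired:
        return None
    return max(fired, key=SEVERITY.__getitem__)
```

For example, a conflict with 40% confidence and a 25% authority gap fires both search_more and escalate, and the function returns escalate.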
Start with a named preset. Override thresholds via YAML or API for your workload.
Policy does not count raw search hits. It counts qualifying sources after our evidence pass.
That is why legal can block at “2 qualifying sources; minimum is 3” even when the search API returned more raw results.
| PRESET | MIN CONF | MIN SOURCES | MIN AUTH | ON FAIL | USE CASE |
|---|---|---|---|---|---|
| default | 50% | 1 | none | allow | General agents; returns draft synthesis tagged as not policy-cleared |
| support | 60% | 2 | none | escalate | Customer support; hand off to human when thin evidence |
| legal | 70% | 3 | 75% | block | Legal / compliance; blocks on conflict; no answer if bar not met |
| research | 55% | 2 | none | search_more | Internal research; expand retrieval instead of hard-blocking |
| calibrated | 55% | 2 | 50% | allow + tag | Draft / internal tools; always answers with confidence band + policy warnings, never hard-blocks |
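To make the preset table concrete, here is a hedged sketch of a gate that applies those minimums. The `Preset` dataclass, the `evaluate` function, and the idea of testing the top authority score among qualifying sources are assumptions for illustration. Only the source-count reason string follows the wording shown in this document; the other reason strings are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Preset:
    min_conf: float
    min_sources: int
    min_auth: Optional[float]  # None means no authority minimum
    on_fail: str


# Values copied from the preset table.
PRESETS = {
    "default": Preset(0.50, 1, None, "allow"),
    "support": Preset(0.60, 2, None, "escalate"),
    "legal": Preset(0.70, 3, 0.75, "block"),
    "research": Preset(0.55, 2, None, "search_more"),
    "calibrated": Preset(0.55, 2, 0.50, "allow + tag"),
}


def evaluate(profile: str, confidence: float, qualifying_sources: int,
             top_authority: float) -> dict:
    """Apply one preset's minimums and return a response-shaped dict."""
    p = PRESETS[profile]
    reasons = []
    if confidence < p.min_conf:
        reasons.append(f"confidence {confidence:.0%} is below minimum {p.min_conf:.0%}")
    if qualifying_sources < p.min_sources:
        reasons.append(
            f"only {qualifying_sources} qualifying source(s); minimum is {p.min_sources}"
        )
    if p.min_auth is not None and top_authority < p.min_auth:
        reasons.append(f"authority {top_authority:.0%} is below minimum {p.min_auth:.0%}")
    action = "allow" if not reasons else p.on_fail
    return {
        "allow_answer": action.startswith("allow"),
        "policy": {"profile": profile, "action": action, "reasons": reasons},
    }
```

With `evaluate("legal", 0.82, 2, 0.9)` the confidence and authority bars pass, but the source count does not, so the sketch returns a block with the single source-count reason.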
Thresholds should be tuned on your query mix. Run the offline benchmark locally, or send 3–7 days of JSONL logs for a design-partner autopsy (48h report).
Example block response (legal preset):
{
  "allow_answer": false,
  "audit_id": "99c08409-28ec-4427-...",
  "policy": {
    "profile": "legal",
    "action": "block",
    "reasons": ["only 2 qualifying source(s); minimum is 3"]
  }
}
False block (too strict)
Legal preset blocks a helpful support answer because evidence is thin or authority scores are low.
Mitigation: use support + escalate, lower min_sources in YAML, or run an autopsy to measure block rate on real queries.
Over-spend (probe too cautious)
Probe scores low on a factual query and fans out to 5 searches when 1 would suffice — policy may still allow the answer, but you paid extra.
Mitigation: calibrate probe thresholds on your corpus; autopsy quantifies over-search patterns by agent.
Under-search (probe too aggressive)
Probe stays at 1 query; evidence is thin; policy escalates or blocks. Spend is low but allow_answer is false: correct for risk, but frustrating if the preset is wrong for the workflow.
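Before changing presets, it helps to measure how often each action actually fires. The sketch below assumes your logs are JSONL with one record per request, shaped like the example block response above (a `policy.action` field). That log schema and the helper name `action_rates` are assumptions, and this is not the autopsy CLI.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def action_rates(jsonl_lines: Iterable[str]) -> Dict[str, float]:
    """Return the share of requests per policy action (block, escalate, ...)."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        action = record.get("policy", {}).get("action", "unknown")
        counts[action] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {action: n / total for action, n in counts.items()}


if __name__ == "__main__":
    sample = [
        '{"allow_answer": false, "policy": {"profile": "legal", "action": "block"}}',
        '{"allow_answer": true, "policy": {"profile": "legal", "action": "allow"}}',
    ]
    print(action_rates(sample))
```

A high block share on queries that a human would happily answer points to the false-block case; a high escalate or block share after single-query probes points to under-search.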
You do not need to send production logs to start evaluating the architecture.
| PATH | TRUST REQUIRED | WHAT YOU GET |
|---|---|---|
| Offline benchmark | None — runs locally | python examples/gaia-baseline/run_benchmark.py --demo — mock backend, instant naive vs governed diff. Wikipedia + legal preset blocks 4/12 tasks in our published demo set. |
| CLI autopsy | Your machine only | query-fanout-validate-logs + query-fanout-autopsy on any JSONL export — no hosted upload. |
| Design-partner autopsy | Send redacted logs | 48h report: spend model, block/escalate patterns, recommended preset thresholds for your agents. |
Published live benchmark: 20 production-style tasks, 100→26 searches (74% reduction) on Tavily (see landing stats). Numbers vary by corpus; treat them as a reference, not an SLA.
Tavily is our reference benchmark preset. The control plane sits above your search API:
- mock: offline control plane with deterministic fake results for CI, local dev, and running probe/policy/audit without any search API key. Not just a test double; the full pipeline works end-to-end.
- wikipedia: offline real retrieval for demos and benchmarks
- tavily: live paid search (reference benchmark)
- http backend in YAML for Exa, Brave, Serper, or internal search

Policy, audit, and allow_answer are provider-agnostic. Swap the adapter; keep the governance layer.
| METHOD | WHO | NOTES |
|---|---|---|
| API param policy=legal | App engineers | Named preset on POST /evidence |
| YAML policy_file | Platform / compliance | Wins over policy= when both are set (deterministic override) |
| Hosted dashboard | Operators | /dashboard — read-only audit metrics; does not change policy |
Precedence: policy_file (YAML) > policy API param > preset defaults. Set one source of truth per environment to avoid surprises.
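The precedence rule can be sketched as a simple resolver. The function name, argument names, and the choice to treat the winning source as a whole (rather than merging fields across sources) are assumptions; the source only states the ordering.

```python
from typing import Optional, Tuple


def resolve_policy(
    preset_defaults: dict,
    api_policy: Optional[dict] = None,
    policy_file: Optional[dict] = None,
) -> Tuple[str, dict]:
    """Pick the effective policy: policy_file > policy API param > preset defaults."""
    if policy_file is not None:
        return "policy_file", policy_file
    if api_policy is not None:
        return "policy_param", api_policy
    return "preset_defaults", preset_defaults


if __name__ == "__main__":
    defaults = {"min_sources": 1}
    from_api = {"min_sources": 3}
    from_yaml = {"min_sources": 2}
    # The YAML file wins even though the API param is also set.
    print(resolve_policy(defaults, api_policy=from_api, policy_file=from_yaml))
```

Returning the name of the winning source alongside the policy makes it easy to log which layer was in effect for a given request.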
import asyncio
from query_fanout import RetrievalClient

async def main() -> None:
    client = RetrievalClient(preset="tavily", policy="legal", agent_id="support-bot")
    report = await client.retrieve("How do refund policies work?")
    if report.allow_answer:
        answer(report.synthesis)  # answer() is your own rendering hook
    else:
        escalate(report.audit_id, report.policy.reasons)  # escalate() is your own hand-off hook

asyncio.run(main())
Need soft answers instead of hard blocks? Use calibrated for draft generation (confidence band + warnings, always allow_answer: true), default for permissive general use, or research to expand retrieval before failing.
Ready with logs? Autopsy submission guide · autopsy@queryfanout.dev