case study · Feb 2026 – Apr 2026

Evident

An evidence-grounded AI decision system that ranks outreach targets, cites why, and explicitly refuses to recommend when confidence is too low, with bounded cost and a full audit trail.

Python · FastAPI · Claude · Playwright · Docker · AWS ECS


The problem

LLMs tend to produce confident-sounding answers even when the evidence behind them is thin. For a system that decides who is worth contacting, that failure mode isn't just a bad output; it's a bad decision made at scale. Evident is built around the opposite principle: every decision is grounded in retrieved public evidence, the reasoning is exposed, and the system explicitly returns "insufficient evidence" rather than guessing. It also caps API spend so runs stay predictable.

What it does

Takes a faculty/people directory URL plus a research interest as input.
Returns a ranked shortlist of contacts, each with reasoning, cited evidence, and a personalized outreach draft.
Produces one of three explicit outcomes per contact: recommended, not_recommended, or insufficient_evidence (an explicit refusal).
Runs a deterministic pre-filter to remove weak candidates before spending model budget.
Streams live progress to the UI via Server-Sent Events.
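
A minimal sketch of how the three outcomes and the deterministic pre-filter listed above might be expressed. The names Outcome, Contact, and prefilter, the keyword-overlap rule, and the min_chunks threshold are all assumptions made for illustration; none of them are taken from Evident's code.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(str, Enum):
    """The three outcomes named in the case study."""
    RECOMMENDED = "recommended"
    NOT_RECOMMENDED = "not_recommended"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"


@dataclass
class Contact:
    # Hypothetical shape; the real contact record is not shown in the source.
    name: str
    research_text: str = ""
    evidence_chunks: list[str] = field(default_factory=list)


def prefilter(contacts: list[Contact], interest: str, min_chunks: int = 1) -> list[Contact]:
    """Keep contacts with enough evidence and at least one keyword overlap.

    A pure function of its inputs: no model calls and no randomness,
    so the same input always yields the same shortlist.
    """
    # Assumed rule: ignore very short words when matching the interest.
    terms = {t for t in interest.lower().split() if len(t) > 3}
    kept = []
    for c in contacts:
        text = " ".join([c.research_text, *c.evidence_chunks]).lower()
        has_evidence = len(c.evidence_chunks) >= min_chunks
        if has_evidence and any(t in text for t in terms):
            kept.append(c)
    return kept
```

Because the filter runs before any model call, every contact it drops costs nothing to evaluate.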

Approach & architecture

Ingestion is pluggable behind a ContactSource interface. The built-in deterministic faculty parser is one source, and a vendored, profile-driven scraper engine (evidence_scraper) is another, which lets Evident pull contacts from many more site layouts than a single hardcoded parser could. Contacts are cleaned and enriched with evidence chunks and identity scoring, then passed through a deterministic pre-filter, and only the survivors reach the LLM evaluation step. Uncertain contacts enter a bounded agentic loop: at most one adaptive retrieval pass and one re-evaluation, after which the system finalizes. The loop stays explainable rather than open-ended.
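
One way a pluggable source interface like this could look, sketched with typing.Protocol. Only the names ContactSource and RawContact appear in the write-up; the fetch method, the field defaults, the StaticSource stand-in, and the dedup key are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class RawContact:
    # Subset of the fields listed in the architecture diagram.
    name: str
    title: str = ""
    email: str = ""
    research_text: str = ""


class ContactSource(Protocol):
    """Anything with a matching fetch() counts as a source (assumed signature)."""

    def fetch(self, directory_url: str) -> Iterable[RawContact]: ...


class StaticSource:
    """Stand-in source that returns fixed contacts, for illustration only."""

    def __init__(self, contacts: Iterable[RawContact]) -> None:
        self._contacts = list(contacts)

    def fetch(self, directory_url: str) -> Iterable[RawContact]:
        return list(self._contacts)


def ingest(sources: list[ContactSource], directory_url: str) -> list[RawContact]:
    """Collect from every source, de-duplicating on lowercased email (or name)."""
    seen: set[str] = set()
    merged: list[RawContact] = []
    for source in sources:
        for contact in source.fetch(directory_url):
            key = (contact.email or contact.name).strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(contact)
    return merged
```

With this shape, a parser and a scraper can be swapped or combined without touching the downstream cleaning and evaluation steps.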

how it fits together

Evident · architecture

Evidence-grounded decision pipeline

Retrieval-grounded evaluation that ranks outreach targets, cites its evidence, and refuses when support is too thin.

Contact source (pluggable): directory parser or evidence_scraper → RawContact[] (name, title, email, research text, evidence, identity signals).
Clean & enrich: dedup, evidence chunks, identity scoring.
Deterministic pre-filter: drops weak candidates before any model spend.
LLM evaluation · triage model: cheap first pass over the shortlist (Claude, retrieval-backed).
Refuse-when-weak gate: structural floor enforced across every path, so an over-confident model cannot upgrade a thin contact.
Bounded loop · escalate uncertain: ≤1 adaptive retrieval + 1 re-eval on the primary model, then finalize.
Hybrid rank → drafts (top only): AI fit + evidence strength + seniority · persist + full audit trail.

recommended: strong fit + support
not recommended: below threshold
insufficient evidence: explicit refusal

Python · FastAPI · Playwright · Claude · SQLite/Postgres · Docker · AWS ECS/Fargate
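
The hybrid rank stage in the pipeline above combines AI fit, evidence strength, and seniority. The sketch below shows one plausible form, a weighted sum over normalized signals. The weights, the 0..1 scaling, and the names Scored, hybrid_score, and rank are assumptions; the write-up does not say how the three signals are actually combined.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    name: str
    ai_fit: float             # model-assessed fit, assumed to be in 0..1
    evidence_strength: float  # assumed to be in 0..1
    seniority: float          # assumed to be in 0..1


# Hypothetical weights, chosen only so the example is concrete.
WEIGHTS = {"ai_fit": 0.5, "evidence_strength": 0.3, "seniority": 0.2}


def hybrid_score(c: Scored) -> float:
    """Weighted sum of the three signals."""
    return sum(weight * getattr(c, signal) for signal, weight in WEIGHTS.items())


def rank(contacts: list[Scored], top_n: int = 3) -> list[Scored]:
    """Highest score first; only the top_n would go on to get drafts."""
    return sorted(contacts, key=hybrid_score, reverse=True)[:top_n]
```

Keeping the score a plain sum of named terms is what makes a per-contact score breakdown straightforward to persist alongside the decision.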

product screens

Workspace: a ranked shortlist with the selected case file, recommendation, confidence, evidence, and reasoning.

Case file: cited evidence and an audit-style match / gap / evidence breakdown for the decision.

Run insights: confidence mix, evidence quality, and the cost panel (estimated USD, cost per recommended contact, and tiered routing).

key engineering decisions

A single refuse-when-weak gate across every path

The uncertainty gate is applied uniformly across the LLM path, the heuristic fallback, and the second pass, so an over-confident model can't upgrade a thin contact to "recommended."
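
A minimal sketch of such a gate, assuming it is a single function that every decision passes through regardless of which path produced it. The Decision fields, the two floor constants, and the function name apply_gate are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical floors; the real thresholds are not given in the write-up.
MIN_EVIDENCE_CHUNKS = 2
MIN_CONFIDENCE = 0.6


@dataclass(frozen=True)
class Decision:
    outcome: str          # "recommended" | "not_recommended" | "insufficient_evidence"
    confidence: float
    evidence_chunks: int
    source: str           # e.g. "llm", "heuristic", "second_pass"


def apply_gate(d: Decision) -> Decision:
    """Downgrade a recommendation that lacks structural support.

    The check looks only at evidence count and confidence, never at
    d.source, so every path is held to the same floor.
    """
    weak = d.evidence_chunks < MIN_EVIDENCE_CHUNKS or d.confidence < MIN_CONFIDENCE
    if d.outcome == "recommended" and weak:
        return Decision("insufficient_evidence", d.confidence, d.evidence_chunks, d.source)
    return d
```

Because the evidence floor is checked independently of the model's stated confidence, a 0.95-confidence recommendation backed by a single chunk is still refused.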

Cost-safe LLM usage

Explicit timeouts, automatic retries with backoff, a deterministic pre-filter to avoid wasted calls, and crash-safe response parsing. Per-run caps bound evaluations, drafts, retries, and outbound fetches.
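
The sketch below illustrates three of these ideas with the standard library only: a per-run cap, retries with exponential backoff, and parsing that cannot crash the run. All names (RunBudget, call_with_retries, parse_response) and default values are hypothetical, and mapping an unparseable response to a refusal is an assumption about what "crash-safe" means here. Timeouts are omitted because they depend on the HTTP or model client in use.

```python
import json
import time
from dataclasses import dataclass


@dataclass
class RunBudget:
    """Per-run cap on evaluations (the same pattern would cover drafts or fetches)."""
    max_evaluations: int
    used: int = 0

    def spend(self) -> bool:
        """Return True and count the call if budget remains, else False."""
        if self.used >= self.max_evaluations:
            return False
        self.used += 1
        return True


def call_with_retries(fn, retries: int = 2, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again.

    The exception from the final attempt is re-raised to the caller.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


def parse_response(raw) -> dict:
    """Parse model output; malformed output becomes a refusal, not a crash."""
    fallback = {"outcome": "insufficient_evidence", "reason": "unparseable response"}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(data, dict) or "outcome" not in data:
        return fallback
    return data
```

A caller would check budget.spend() before each model call, which is what keeps the worst-case cost of a run bounded in advance.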

Pluggable ingestion engine

A vendored profile-driven Playwright scraper (evidence_scraper) validated against 12+ real directory layouts across universities and law firms, with discovery hardening: timeouts, scope limits, a lean tool schema, anchor-text recovery, and SPA hydration waits.

Bounded agentic loop

At most one adaptive retrieval plus one re-evaluation before finalizing. This keeps the reasoning explainable and the cost predictable instead of allowing an open-ended agent to spiral.
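
The bound can be made structural by writing the loop as straight-line code with no iteration at all. This is a sketch under that assumption; evaluate_bounded and its three callable parameters are hypothetical names, not Evident's API.

```python
from typing import Any, Callable


def evaluate_bounded(
    contact: Any,
    evaluate: Callable[[Any], Any],
    retrieve_more: Callable[[Any], Any],
    is_uncertain: Callable[[Any], bool],
) -> tuple[Any, list[Any]]:
    """One evaluation, then at most one retrieval pass and one re-evaluation.

    Returns the final decision plus the list of every decision made,
    which can serve as a revision history for the audit trail.
    """
    history = [evaluate(contact)]
    if is_uncertain(history[0]):
        contact = retrieve_more(contact)   # the single adaptive retrieval pass
        history.append(evaluate(contact))  # the single re-evaluation
    # No further branch: even if still uncertain, the result is final.
    return history[-1], history
```

With no while loop there is nothing to spiral: the worst case is exactly two evaluations and one retrieval per contact.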

results & outcomes

Ranked targets with cited reasoning, explicit refusals on weak evidence, and roughly 60% fewer unnecessary model calls.
Full auditability per contact: score breakdown, cited evidence, confidence justification, and decision-revision history.
Deployed as a Docker image on AWS ECS/Fargate, run on-demand to control cost, with the API key injected via AWS Secrets Manager.

deep dive

Teaching an AI system to say "I don't know"

How Evident grounds every decision in retrieved evidence, uses a bounded agentic loop, and refuses to recommend when the evidence is too thin.
