case study · Feb 2026 – Apr 2026

Evident

An evidence-grounded AI decision system that ranks outreach targets, cites why, and explicitly refuses to recommend when confidence is too low, with bounded cost and a full audit trail.

Python · FastAPI · Claude · Playwright · Docker · AWS ECS

The problem

LLMs tend to produce confident-sounding answers even when the evidence behind them is thin. For a system that decides who is worth contacting, that failure mode isn't just a bad output; it's a bad decision made at scale. Evident is built around the opposite principle: every decision is grounded in retrieved public evidence, the reasoning is exposed, and the system explicitly returns "insufficient evidence" rather than guessing. It also caps API spend so runs stay predictable.

What it does

  • Takes a faculty/people directory URL plus a research interest as input.
  • Returns a ranked shortlist of contacts, each with reasoning, cited evidence, and a personalized outreach draft.
  • Produces three explicit outcomes per contact: recommended, not_recommended, or insufficient_evidence (an explicit refusal).
  • Runs a deterministic pre-filter to remove weak candidates before spending model budget.
  • Streams live progress to the UI via Server-Sent Events.
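The live-progress stream can be pictured as a generator of Server-Sent Events frames. This is a minimal sketch using only the standard library; the event names (`progress`, `done`) and payload fields are assumptions for illustration, not Evident's actual wire format.

```python
import json
from typing import Iterable, Iterator


def format_sse(event: str, data: dict) -> str:
    """Serialize one SSE frame: an event name, a JSON data line, then a blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def progress_stream(contacts: Iterable[str]) -> Iterator[str]:
    """Yield one frame per processed contact, then a final 'done' frame."""
    total = 0
    for total, name in enumerate(contacts, start=1):
        # Hypothetical payload shape; the real event fields are not documented here.
        yield format_sse("progress", {"contact": name, "processed": total})
    yield format_sse("done", {"processed": total})
```

In a FastAPI app, a generator like this could be returned through `StreamingResponse(progress_stream(...), media_type="text/event-stream")`.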

Approach & architecture

Ingestion is pluggable behind a ContactSource interface. The built-in deterministic faculty parser is one source, and a vendored, profile-driven scraper engine (evidence_scraper) is another, which lets Evident pull contacts from many more site layouts than a single hardcoded parser could handle. Contacts are cleaned and enriched with evidence chunks and identity scoring, then passed through a deterministic pre-filter. Only the survivors reach the LLM evaluation step. Uncertain contacts enter a bounded agentic loop: at most one adaptive retrieval pass and one re-evaluation, after which the system finalizes. The loop stays explainable rather than open-ended.
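The pluggable source and the pre-filter described above might look roughly like the sketch below. Only the names `ContactSource` and `RawContact` come from this writeup; the `fetch` method, `StaticSource`, the field names, and the keyword-overlap rule are hypothetical stand-ins chosen to keep the example self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class RawContact:
    name: str
    title: str = ""
    email: str = ""
    research_text: str = ""
    evidence: list[str] = field(default_factory=list)


class ContactSource(Protocol):
    """Anything that can turn a directory URL into raw contacts."""

    def fetch(self, directory_url: str) -> list[RawContact]: ...


class StaticSource:
    """Stand-in source for illustration; a real one would parse or scrape the URL."""

    def __init__(self, contacts: list[RawContact]) -> None:
        self._contacts = list(contacts)

    def fetch(self, directory_url: str) -> list[RawContact]:
        return list(self._contacts)


def prefilter(contacts: list[RawContact], interest: str, min_evidence: int = 1) -> list[RawContact]:
    """Deterministic rule (assumed): keep contacts that have enough evidence
    and whose research text shares at least one word with the interest."""
    terms = {t.lower() for t in interest.split()}
    kept = []
    for c in contacts:
        words = set(c.research_text.lower().split())
        if len(c.evidence) >= min_evidence and terms & words:
            kept.append(c)
    return kept


def shortlist(source: ContactSource, url: str, interest: str) -> list[RawContact]:
    """Only the survivors of the pre-filter would go on to model evaluation."""
    return prefilter(source.fetch(url), interest)
```

Because `shortlist` depends only on the protocol, swapping the directory parser for the scraper engine would not change the downstream steps.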

how it fits together

Evident · architecture

Evidence-grounded decision pipeline

Retrieval-grounded evaluation that ranks outreach targets, cites its evidence, and refuses when support is too thin.

  1. Contact source (pluggable)
    Directory parser or evidence_scraper → RawContact[]: name, title, email, research text, evidence, identity signals.
  2. Clean & enrich
    Dedup, evidence chunks, identity scoring.
  3. Deterministic pre-filter
    Drops weak candidates before any model spend.
  4. LLM evaluation · triage model
    Cheap first pass over the shortlist (Claude, retrieval-backed).
  5. Refuse-when-weak gate
    Structural floor enforced across every path: an over-confident model cannot upgrade a thin contact.
  6. Bounded loop · escalate uncertain
    ≤1 adaptive retrieval + 1 re-eval on the primary model, then finalize.
  7. Hybrid rank → drafts (top only)
    AI fit + evidence strength + seniority · persist + full audit trail.
recommended: strong fit + support
not recommended: below threshold
insufficient evidence: explicit refusal
Python · FastAPI · Playwright · Claude · SQLite/Postgres · Docker · AWS ECS/Fargate
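Step 7 of the pipeline combines AI fit, evidence strength, and seniority into one rank. A minimal sketch of such a hybrid score follows; the weights, the 0-to-1 scales, and the tie-break by name are assumptions, since the writeup does not state the actual formula.

```python
from dataclasses import dataclass

# Hypothetical weights; the real blend is not documented.
WEIGHTS = {"ai_fit": 0.6, "evidence_strength": 0.3, "seniority": 0.1}


@dataclass(frozen=True)
class Scored:
    name: str
    ai_fit: float             # model-assessed fit, assumed to be in [0, 1]
    evidence_strength: float  # assumed to be in [0, 1]
    seniority: float          # assumed to be in [0, 1]


def hybrid_score(c: Scored) -> float:
    """Weighted sum of the three signals."""
    return (
        WEIGHTS["ai_fit"] * c.ai_fit
        + WEIGHTS["evidence_strength"] * c.evidence_strength
        + WEIGHTS["seniority"] * c.seniority
    )


def rank(contacts: list[Scored], top_n: int) -> list[Scored]:
    """Highest score first; ties broken by name so the order is deterministic."""
    return sorted(contacts, key=lambda c: (-hybrid_score(c), c.name))[:top_n]
```

Blending in evidence strength means a contact with a high model score but thin support can rank below a well-evidenced one.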

product screens

Workspace: a ranked shortlist with the selected case file, recommendation, confidence, evidence, and reasoning.
Case file: cited evidence and an audit-style match / gap / evidence breakdown for the decision.
A personalized outreach draft generated only for recommended contacts, grounded in the retrieved evidence.
Run insights: confidence mix, evidence quality, and the cost panel (estimated USD, cost per recommended contact, and tiered routing).

key engineering decisions

A single refuse-when-weak gate across every path

The uncertainty gate is applied uniformly across the LLM path, the heuristic fallback, and the second pass, so an over-confident model can't upgrade a thin contact to "recommended."
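One way to picture a single gate shared by every path is a pure function that each path must call before finalizing. The three outcome names come from this writeup; the thresholds, the parameter names, and the choice to downgrade any proposal (not only "recommended") when support is thin are assumptions.

```python
from enum import Enum


class Outcome(str, Enum):
    RECOMMENDED = "recommended"
    NOT_RECOMMENDED = "not_recommended"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"


# Hypothetical floors; the real thresholds are not documented.
MIN_EVIDENCE_CHUNKS = 2
MIN_IDENTITY_SCORE = 0.5


def apply_gate(proposed: Outcome, evidence_chunks: int, identity_score: float) -> Outcome:
    """Structural floor: whatever a path proposes, thin support becomes a refusal."""
    if evidence_chunks < MIN_EVIDENCE_CHUNKS or identity_score < MIN_IDENTITY_SCORE:
        return Outcome.INSUFFICIENT_EVIDENCE
    return proposed
```

Because the gate ignores how confident the proposer was, the LLM path, the heuristic fallback, and the second pass are all held to the same floor.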

Cost-safe LLM usage

Every model call uses explicit timeouts, automatic retries with backoff, and crash-safe response parsing, and a deterministic pre-filter avoids wasted calls. Per-run caps bound evaluations, drafts, retries, and outbound fetches.
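The safeguards listed above can be sketched with the standard library alone. All names here (`RunBudget`, `call_with_retries`, `parse_response`), the default limits, and the `decision` field are hypothetical; the callable passed in is assumed to enforce its own timeout and raise `TimeoutError` when it expires.

```python
import json
import time
from dataclasses import dataclass


@dataclass
class RunBudget:
    """Per-run cap on model evaluations."""
    max_evaluations: int
    used: int = 0

    def try_spend(self) -> bool:
        """Reserve one evaluation; return False once the cap is reached."""
        if self.used >= self.max_evaluations:
            return False
        self.used += 1
        return True


def call_with_retries(call, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Retry a flaky callable with exponential backoff; re-raise after the last attempt."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


def parse_response(raw: str) -> dict:
    """Crash-safe parsing: malformed output becomes a refusal instead of an exception."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict) or "decision" not in parsed:
        return {"decision": "insufficient_evidence", "parse_error": True}
    return parsed
```

Mapping a parse failure to a refusal, rather than to a default recommendation, keeps a malformed response from turning into an unsupported decision.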

Pluggable ingestion engine

The vendored, profile-driven Playwright scraper (evidence_scraper) is validated against 12+ real directory layouts across universities and law firms, with discovery hardening: timeouts, scope limits, a lean tool schema, anchor-text recovery, and SPA hydration waits.

Bounded agentic loop

At most one adaptive retrieval plus one re-evaluation before finalizing. This keeps the reasoning explainable and the cost predictable instead of allowing an open-ended agent to spiral.
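The bound can be expressed as a loop with a hard pass limit. In this sketch, `evaluate` and `retrieve_more` are hypothetical callables standing in for the model call and the adaptive retrieval step, and finalizing a still-uncertain contact as `insufficient_evidence` is an assumption about how the loop ends.

```python
MAX_EXTRA_PASSES = 1  # at most one adaptive retrieval plus one re-evaluation


def decide(contact, evidence, evaluate, retrieve_more):
    """Return (final_outcome, history).

    evaluate(contact, evidence) -> (outcome, confident)
    retrieve_more(contact, evidence) -> list of extra evidence items
    """
    history = []
    outcome, confident = evaluate(contact, evidence)
    history.append(outcome)

    passes = 0
    while not confident and passes < MAX_EXTRA_PASSES:
        # One adaptive retrieval, then one re-evaluation on the enlarged evidence.
        evidence = evidence + retrieve_more(contact, evidence)
        outcome, confident = evaluate(contact, evidence)
        history.append(outcome)
        passes += 1

    # Still uncertain after the bounded passes: refuse rather than guess (assumed).
    final = outcome if confident else "insufficient_evidence"
    return final, history
```

The worst case is two evaluations and one retrieval per contact, so cost per run can be bounded before the run starts.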

results & outcomes

  • Ranked targets with cited reasoning, explicit refusals on weak evidence, and roughly 60% fewer unnecessary model calls.
  • Full auditability per contact: score breakdown, cited evidence, confidence justification, and decision-revision history.
  • Deployed as a Docker image on AWS ECS/Fargate, run on-demand to control cost, with the API key injected via AWS Secrets Manager.
deep dive
Teaching an AI system to say “I don't know”

A short writeup on evidence grounding, the single refusal gate, and the bounded agentic loop.