case study · Jun 2025 – Aug 2025

VeriFlow

A QA automation copilot that turns Azure DevOps requirements (or manual user stories) into runnable Playwright browser tests, executes them, and presents structured, reviewable results.

Node.js · Express · Playwright · Claude · Docker · Azure DevOps

The problem

Requirements live in user stories; automation lives in separate test code. VeriFlow closes that gap by generating executable browser tests directly from requirement text and making the outcomes easy to review.

What it does

  • Imports Azure DevOps work items via WIQL to pull user stories and acceptance criteria directly.
  • Generates a test plan, test cases, and a runnable Playwright spec (ES module syntax) from requirements plus acceptance criteria.
  • Executes the generated suite against a target URL.
  • Reports structured results: totals, durations, and per-test pass/fail/skip outcomes.
  • Produces plain-English run summaries (AI-generated, with a deterministic fallback), persists a filterable run history that survives a page refresh, and supports scheduled reruns (every 5 minutes, 30 minutes, or 1 hour) while the server is running.
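The WIQL import in the first bullet can be sketched as two request builders. This is a minimal sketch, not VeriFlow's actual code: the helper names, the query text, and the `api-version` value are assumptions, and `organization`, `project`, and `pat` are placeholders supplied by the caller.

```javascript
// Hypothetical helper: builds the HTTP request for a WIQL query without sending it.
function buildWiqlRequest({ organization, project, pat }) {
  const url =
    `https://dev.azure.com/${encodeURIComponent(organization)}/` +
    `${encodeURIComponent(project)}/_apis/wit/wiql?api-version=7.0`;

  // WIQL returns matching work item IDs only; fields are fetched in a second call.
  // The filter below (user stories, newest first) is an illustrative assumption.
  const query =
    'SELECT [System.Id] FROM WorkItems ' +
    'WHERE [System.TeamProject] = @project ' +
    "AND [System.WorkItemType] = 'User Story' " +
    'ORDER BY [System.ChangedDate] DESC';

  return {
    url,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Personal access tokens use Basic auth with an empty user name.
        Authorization: 'Basic ' + Buffer.from(':' + pat).toString('base64'),
      },
      body: JSON.stringify({ query }),
    },
  };
}

// Hypothetical helper: URL that fetches title, description, and acceptance criteria by ID.
function buildWorkItemsUrl({ organization, project }, ids) {
  const fields = [
    'System.Title',
    'System.Description',
    'Microsoft.VSTS.Common.AcceptanceCriteria',
  ].join(',');
  return (
    `https://dev.azure.com/${encodeURIComponent(organization)}/` +
    `${encodeURIComponent(project)}/_apis/wit/workitems` +
    `?ids=${ids.join(',')}&fields=${fields}&api-version=7.0`
  );
}
```

A caller would pass `url` and `options` to `fetch`, read the IDs from the `workItems` array in the response, and then request `buildWorkItemsUrl(...)` with the same Authorization header.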

Approach & architecture

Claude turns a requirement into a test plan, test cases, and a Playwright spec written to generated.spec.js. The suite runs via Playwright with its JSON reporter, which produces an authoritative structured report. Run success is derived from real pass/fail/skip totals rather than scraped stdout, and history is persisted so reviews survive a refresh.
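To illustrate how per-test outcomes can be read from the JSON report rather than from stdout, here is a minimal sketch that flattens the report's nested suites into one row per test. The function name and row shape are hypothetical, and counting a "flaky" test as a pass is an assumption of this sketch, not a documented VeriFlow rule.

```javascript
// Maps Playwright's per-test status to pass/fail/skip.
// Treating "flaky" (failed, then passed on retry) as a pass is an assumption.
const OUTCOME_BY_STATUS = {
  expected: 'pass',
  flaky: 'pass',
  unexpected: 'fail',
  skipped: 'skip',
};

// Recursively walks suites -> specs -> tests and returns one flat row per test.
function collectOutcomes(report) {
  const rows = [];

  const visit = (suite, parents) => {
    const path = suite.title ? [...parents, suite.title] : parents;

    for (const spec of suite.specs || []) {
      for (const test of spec.tests || []) {
        // Sum durations across retries so the row reflects total time spent.
        const durationMs = (test.results || []).reduce(
          (sum, result) => sum + (result.duration || 0),
          0
        );
        rows.push({
          title: [...path, spec.title].join(' > '),
          // Unknown statuses count as failures so nothing is silently passed.
          outcome: OUTCOME_BY_STATUS[test.status] || 'fail',
          durationMs,
        });
      }
    }

    for (const child of suite.suites || []) visit(child, path);
  };

  for (const suite of report.suites || []) visit(suite, []);
  return rows;
}
```

Each row carries the full title path, so a reviewer can tell which `describe` block a failing test belongs to.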

how it fits together

VeriFlow · architecture

Requirements → sandboxed, self-healing browser tests

Turns user stories into runnable Playwright tests, executes them safely, and reports results you can trust.

  1. Requirement intake
    Azure DevOps work item (WIQL) or a manual user story.
  2. Claude test generation
    Produces a test plan, test cases, and a runnable Playwright spec from acceptance criteria.
  3. Sandboxed execution
    Runs in Docker (dropped capabilities, no secrets) or falls back to a hardened process. LLM-generated code is treated as untrusted and isolated from the host.
  4. Playwright JSON reporter
    Authoritative pass / fail / skip totals, with no brittle stdout scraping.
  5. AI self-heal (on selector failure)
    Capture live DOM → Claude proposes one replacement selector → rewrite spec → re-run once (bounded).
  6. Structured report + traceability matrix
    Each acceptance criterion mapped to its covering test(s) + pass/fail.
  7. Run summary + persisted history
    Plain-English summary (with fallback); refresh-safe history; scheduled reruns.
Node.js · Express · Playwright · Claude · Docker · Azure DevOps · GitHub Actions
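The bounded self-heal in step 5 can be sketched as a single repair attempt followed by exactly one re-run. This is an illustrative sketch only: `runSpec`, `captureDom`, and `proposeSelector` are hypothetical injected functions standing in for the Playwright run, the DOM capture, and the Claude call, and the result shape (`ok`, `failedSelector`) is assumed.

```javascript
// Bounded self-heal: at most one repair attempt, then the second result stands.
// runSpec, captureDom, and proposeSelector are hypothetical injected dependencies.
async function runWithSelfHeal(specSource, { runSpec, captureDom, proposeSelector }) {
  const first = await runSpec(specSource);

  // Only selector failures qualify; other failures are reported as-is.
  if (first.ok || !first.failedSelector) {
    return { result: first, healed: false, attempts: 1, specSource };
  }

  const dom = await captureDom();
  const replacement = await proposeSelector({
    dom,
    failedSelector: first.failedSelector,
  });

  // No usable proposal: keep the original failure rather than guessing.
  if (!replacement || replacement === first.failedSelector) {
    return { result: first, healed: false, attempts: 1, specSource };
  }

  // Rewrite every occurrence of the failing selector, then re-run exactly once.
  const healedSource = specSource.split(first.failedSelector).join(replacement);
  const second = await runSpec(healedSource);

  return {
    result: second,
    healed: second.ok,
    attempts: 2,
    specSource: healedSource,
  };
}
```

Because there is no loop, a bad proposal costs at most one extra run, and the report can show whether a pass came from the original spec or from a healed one.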

product screens

Test cases generated from a user story and its acceptance criteria.
The runnable Playwright spec Claude generated for the requirement.
Execution results: authoritative pass/fail totals, the sandbox mode, and a requirement → test traceability matrix.

key engineering decisions

Trustworthy result reporting

Replaced brittle reporter-text scraping with Playwright's JSON reporter and derived success from authoritative pass/fail/skip totals. This fixed a false positive in which runs with zero tests were reported as "passed." Verified across all-pass, mixed, and zero-test scenarios.
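A minimal sketch of that rule, assuming the totals come from the `stats` object in Playwright's JSON report (`expected`, `unexpected`, `flaky`, `skipped`). The function name, the `'empty'` status label, and the choice to count flaky tests as passes are assumptions made for illustration.

```javascript
// Derives a run status from the JSON report's stats totals.
// A run only counts as "passed" if at least one test actually passed.
function summarizeRun(stats = {}) {
  // Assumption: a flaky test (passed on retry) counts toward passed.
  const passed = (stats.expected || 0) + (stats.flaky || 0);
  const failed = stats.unexpected || 0;
  const skipped = stats.skipped || 0;
  const total = passed + failed + skipped;

  let status;
  if (failed > 0) {
    status = 'failed';
  } else if (passed > 0) {
    status = 'passed';
  } else {
    // Zero tests ran (or every test was skipped): never report this as a pass.
    status = 'empty';
  }

  return { status, total, passed, failed, skipped };
}
```

The zero-test case falls through to its own status instead of defaulting to success, which is the behavior the fix above describes.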

Resilient run pipeline

Parses the structured report on both the success path and the non-zero-exit path, so a run with failing tests still yields per-test results, and persists history for refresh-safe review.
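Playwright exits with a non-zero code when tests fail, so the report has to be read regardless of how the process ended. The sketch below shows one way to do that with Node's `child_process`; the function name, return shape, and the idea of passing the report path in are assumptions, not VeriFlow's actual implementation.

```javascript
const { execFile } = require('node:child_process');
const { readFile } = require('node:fs/promises');

// Runs a command, then reads the JSON report whether or not the exit code was zero.
// Always resolves, so callers handle one result shape instead of try/catch branches.
function runAndParse(command, args, reportPath, options = {}) {
  return new Promise((resolve) => {
    execFile(command, args, options, async (error) => {
      // A numeric error.code is the process exit code; anything else
      // (for example a spawn failure) is treated as a generic failure.
      const exitCode = error
        ? (typeof error.code === 'number' ? error.code : 1)
        : 0;

      try {
        const report = JSON.parse(await readFile(reportPath, 'utf8'));
        resolve({ exitCode, report, reportError: null });
      } catch (readError) {
        // Missing or malformed report: surface it instead of assuming success.
        resolve({ exitCode, report: null, reportError: readError.message });
      }
    });
  });
}
```

A typical call might be `runAndParse('npx', ['playwright', 'test', 'generated.spec.js', '--reporter=json'], reportPath, { env: { ...process.env, PLAYWRIGHT_JSON_OUTPUT_NAME: reportPath } })`, assuming the JSON reporter is configured to write to that file.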

results & outcomes

  • Requirements become executable tests; QA becomes repeatable and verifiable.
  • Eliminated a false positive that reported zero-test runs as passing.
  • Plain-English summaries, persisted history, and scheduled reruns make results easy to review over time.