Catch AI regressions before you ship
Start with the open-source runner. Run the three V1 packs for tool calling, RAG faithfulness, and model migrations locally or in CI, then publish a proof card only when you choose.
Run locally first
Install the open-source runner, create starter packs, and check prompt, agent, RAG, and model changes without uploading code.
Track what could break
Use A2ZAI radar to connect provider, model, API, and SDK changes to the packs your stack should run next.
Publish proof when ready
Local reports stay private by default. Publish a proof card only when you want a README, launch, or social artifact.
Local-first quick start
```shell
npx @a2zai-ai/driftcheck init
npx @a2zai-ai/driftcheck check
npx @a2zai-ai/driftcheck check --pack tool-calling
```
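To run the same checks in CI, a minimal GitHub Actions sketch might look like the following. Only the `npx` commands are from the quick start above; the workflow name, trigger, and setup steps are assumptions, not an official integration:

```yaml
# Hypothetical CI workflow sketch; only the npx command comes from the
# quick start, everything else is an assumed standard Node setup.
name: driftcheck
on: [pull_request]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Fail the job if the pack reports a regression
      - run: npx @a2zai-ai/driftcheck check --pack tool-calling
```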
Clone the repo
Install the standalone utility package without cloning the full A2ZAI app repo.
Open DriftCheck →
Run starter packs
Initialize the three V1 packs, run checks locally, and review the JSON and markdown reports.
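The JSON report schema is not documented here; as a purely hypothetical illustration of the kind of output a local check might produce (field names are assumptions, and the case IDs are borrowed from the scorecard example later on this page):

```json
{
  "pack": "tool-calling",
  "score": 88,
  "dimensions": { "quality": 84, "safety": 90, "latency": 66, "cost": 75 },
  "cases": [
    { "id": "refund-policy", "status": "pass" },
    { "id": "edge-case-promotions", "status": "regressed" }
  ]
}
```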
Pack format docs →
Publish when ready
Local reports stay private. Publish only when you want a proof card for GitHub or social launch posts.
View proof gallery →
Starter packs
Pre-built packs you can run as-is or copy into the workbench and edit. Pick one below to load it.
Tool-Calling Reliability
Catch schema drift, hallucinated tool calls, and weak fallback behavior before agent changes ship.
Use in workbench →
RAG Faithfulness
Catch unsupported claims, missing citations, and weak refusals in retrieval-augmented answers.
Use in workbench →
Model Migration
Compare model migration behavior across quality, cost, latency, and safety before rollout.
Use in workbench →
Live Support Eval
Run a real OpenAI-backed comparison and score the responses with Checks guardrails.
Support bots, billing assistants, live OpenAI comparison
Use in workbench →
Agent Tool-Calling Guard
Protect tool-calling reliability, schema correctness, and fallback behavior for agent APIs.
Agent APIs, tool-calling reliability, schema and fallback behavior
Use in workbench →
Model Latency and Cost Guard
Catch regressions from pricing/latency/model-routing changes before rollout.
Model routing, latency budgets, and token-cost protection
Use in workbench →
Support Bot Guard
Catch regressions in answer quality, escalation behavior, latency, and cost.
Pre-computed support guardrails, escalation and safety
Use in workbench →
Coding Agent PR Pack
Score a coding agent across patch quality, safety, speed, and cost.
Coding agents, PR workflows, patch quality and safety
Use in workbench →
MVP workflow now live
GitHub-connected PR checks, manual runs, proof cards
`DriftCheck` now supports both manual scorecard runs and automatic GitHub App webhook runs on connected pull requests. Builders can connect a repo, let PR activity trigger checks automatically, and still run starter packs manually when they want to test a different workflow. Packs can evaluate actual case outputs with heuristic rules and, when configured, execute OpenAI-backed comparisons directly from the YAML pack.
First real workflow
Sign in to run your first Checks pack
The current MVP supports a manual repo connect flow: paste repo metadata, load a YAML starter pack, optionally execute OpenAI-backed cases from the pack itself, grade the outputs with heuristic rules, generate a PR scorecard, and publish a benchmark card.
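As a rough, hypothetical sketch of what a YAML starter pack with heuristic rules and an optional OpenAI-backed case could contain (the field names here are assumptions, not the real schema; see the pack format docs for the actual format):

```yaml
# Hypothetical pack sketch; field names are assumptions, not the
# actual DriftCheck schema.
name: support-bot-guard
cases:
  - id: refund-policy
    input: "Can I get a refund after 30 days?"
    checks:
      - type: contains         # heuristic rule on the output text
        value: "refund policy"
      - type: max_latency_ms   # latency budget check
        value: 2000
  - id: live-comparison
    provider: openai           # optional OpenAI-backed execution
    models: [gpt-4.1-mini]
```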
Defined viral artifact
PR scorecard first, public proof card second
The artifact A2ZAI creates is not a generic dashboard. It is a visible object that travels naturally through GitHub, founder launches, and social sharing: a PR scorecard backed by a public proof card.
Viral Artifact
GitHub PR scorecard
A2ZAI Checks
Prompt regression check for `support-agent.yaml`
Quality: +8.4%
Latency: +220ms
Cost: -31%
Passing: `refund-policy`, `invoice-lookup`, `cancel-subscription`
Regressed: `edge-case-promotions` on `gpt-4.1-mini`
Recommendation: merge after fixing one retrieval prompt and rerunning the pack.
Public Card
Benchmark card
Repo benchmark: support-agent / checkout-recovery
Best model route: Claude Sonnet + GPT-4.1-mini fallback
Win summary: 12% better success at 29% lower cost
This is the artifact that spreads on X, GitHub, and founder launches: a benchmark card builders can link to when they ship.
Public proof cards
Local runs builders chose to publish
Public showcase
Recent benchmark runs
- krishnaadavi/a2zai
A2ZAI Builder Radar Guard
Overall score improved from 78 to 88. Biggest movement came from cost.
+10 · Score 88
- krishnaadavi/a2zai
Live Execution Smoke Test
Overall score regressed from 72 to 69. Biggest movement came from quality. One dimension still regressed and needs review before merge.
-3 · Score 69
- krishnaadavi/a2zai
Live Execution Smoke Test
Overall score regressed from 83 to 72. Biggest movement came from quality. One dimension still regressed and needs review before merge.
-11 · Score 72
30-day MVP scope
Week 1
Builder radar positioning
- Reframe the site around builders shipping with models, APIs, SDKs, and agents.
- Tighten the river and briefs around releases, benchmarks, pricing, outages, and deprecations.
- Publish the DriftCheck narrative and the benchmark artifact preview.
Week 2
First GitHub workflow
- Support one happy path: repo connect, YAML test pack, PR comment output.
- Score quality, safety, latency, and cost deltas from before vs after.
- Generate a proof card page builders can link in launch posts and READMEs.
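The before-vs-after delta scoring described above can be sketched in a few lines. The dimension names and weights below are illustrative assumptions, not DriftCheck's actual scoring code:

```python
# Hypothetical sketch of per-dimension and overall score deltas from
# before/after eval runs. Weights and dimensions are assumptions.
WEIGHTS = {"quality": 0.4, "safety": 0.3, "latency": 0.15, "cost": 0.15}

def overall(scores: dict) -> float:
    """Weighted overall score on a 0-100 scale."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

def deltas(before: dict, after: dict) -> dict:
    """Per-dimension and overall deltas (positive = improvement)."""
    out = {d: after[d] - before[d] for d in WEIGHTS}
    out["overall"] = overall(after) - overall(before)
    return out

before = {"quality": 80, "safety": 90, "latency": 70, "cost": 60}
after = {"quality": 84, "safety": 90, "latency": 66, "cost": 75}
print(deltas(before, after))
```

A real scorecard would also surface the failing cases behind each regressed dimension, not just the aggregate numbers.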
Week 3
MVP launch loop
- Ship starter packs for support bots, coding agents, and retrieval workflows.
- Create proof-card showcase pages for standout repos and agents using DriftCheck.
- Turn real eval wins and regressions into quick bytes and brief coverage.
Week 4
Distribution and iteration
- Publish example repos, proof cards, and operator writeups.
- Use the agent index as the submission and distribution surface for builders.
- Tune onboarding around the fastest path from install to first PR scorecard.
Distribution loop
Builder connects repo and runs Checks on a prompt or agent PR.
A2ZAI posts a scorecard comment with score deltas and failing examples.
Builder shares the public proof card in X, GitHub, or product launch posts.
A2ZAI features the best cards in the river, briefs, and agent showcase.
New builders arrive from those artifacts and install Checks in their own repos.
How the rest of A2ZAI fits
`Live River` explains the launches, pricing changes, and outages that might move benchmark results.
`Model pages` become compatibility and performance context for builder decisions.
`Agent Index` becomes the showcase surface for builders shipping with Checks.
`Briefs` become the weekly summary of what changed before your app breaks.