GitHub-native evals for prompts and agents
A2ZAI is becoming a builder radar plus one sharp utility. That utility is `Checks`: run evals on every PR, catch regressions before merge, and generate benchmark artifacts builders actually want to share.
Run on every PR
Connect a repo, point A2ZAI at a prompt pack or agent workflow, and get pass/fail deltas before merge.
Track what changed for builders
Use the river and model pages to understand whether a provider launch, pricing shift, or outage changed your results.
Ship a shareable artifact
Every run generates a PR scorecard and a public benchmark card that teams can post, compare, and reuse.
MVP workflow now live
GitHub-connected PR checks, manual runs, benchmark cards
`A2ZAI Checks` now supports both manual scorecard runs and automatic GitHub App webhook runs on connected pull requests. Builders can connect a repo, let PR activity trigger Checks automatically, and still run starter packs manually when they want to test a different workflow.
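For a feel of the connected flow, a repo-side config might look something like the sketch below. The `.a2zai/checks.yaml` path and every field name are assumptions for illustration, not a published schema.

```yaml
# Hypothetical repo-side config (.a2zai/checks.yaml) — illustrative only.
pack: packs/support-agent.yaml   # which starter pack to run
triggers:
  pull_request: true             # GitHub App webhook runs on connected PRs
  manual: true                   # allow on-demand scorecard runs
report:
  pr_comment: true               # post the scorecard as a PR comment
  benchmark_card: public         # publish a public benchmark card per run
```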
First real workflow
Sign in to run your first Checks pack
The current MVP supports a manual repo connect flow: paste repo metadata, load a YAML starter pack, generate a PR scorecard, and publish a benchmark card.
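As a sketch of what a YAML starter pack could contain, here is a minimal example. The schema is assumed for illustration; the case IDs and model name are borrowed from the scorecard example below.

```yaml
# Hypothetical starter pack — schema assumed, not official.
name: support-agent
models:
  - gpt-4.1-mini                  # model under test
cases:
  - id: refund-policy
    prompt: "A customer asks for a refund two days past the window. What happens?"
    expect:
      contains: refund            # simple string assertion, assumed
  - id: edge-case-promotions
    prompt: "Stack a promo code with a loyalty discount on a refunded order."
    expect:
      not_contains: approved
thresholds:
  max_quality_drop: 2%            # fail the check past this regression
```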
Defined viral artifact
PR scorecard first, public benchmark card second
The artifact A2ZAI creates is not a generic dashboard. It is a visible object that travels naturally through GitHub, founder launches, and social sharing: a PR scorecard backed by a public benchmark card.
Viral Artifact
GitHub PR scorecard
A2ZAI Checks
Prompt regression check for `support-agent.yaml`
Quality: +8.4%
Latency: +220ms
Cost: -31%
Passing: `refund-policy`, `invoice-lookup`, `cancel-subscription`
Regressed: `edge-case-promotions` on `gpt-4.1-mini`
Recommendation: merge after fixing one retrieval prompt and rerunning the pack.
Public Card
Benchmark card
Repo benchmark
support-agent / checkout-recovery
Best model route: Claude Sonnet + GPT-4.1-mini fallback
Win summary: 12% better success at 29% lower cost
This is the artifact that spreads across X, GitHub, and founder launches: a benchmark card builders can link to when they ship.
30-day MVP scope
Week 1
Builder radar positioning
- Reframe the site around builders shipping with models, APIs, SDKs, and agents.
- Tighten the river and briefs around releases, benchmarks, pricing, outages, and deprecations.
- Publish the A2ZAI Checks narrative and the benchmark artifact preview.
Week 2
First GitHub workflow
- Support one happy path: repo connect, YAML test pack, PR comment output.
- Score quality, safety, latency, and cost deltas between the before and after runs (see the sketch after this list).
- Generate a benchmark card page builders can link in launch posts and READMEs.
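For concreteness, the before-vs-after delta output might serialize like the sketch below; the field names are assumed, and the values echo the scorecard example above.

```yaml
# Illustrative delta report — shape assumed; numbers reused from the scorecard above.
pack: support-agent.yaml
baseline: main                   # scores from the base branch
candidate: pr-head               # scores from the PR branch
deltas:
  quality: +8.4%                 # higher is better
  latency: +220ms                # slower on the PR branch
  cost: -31%                     # cheaper per run
failing:
  - case: edge-case-promotions
    model: gpt-4.1-mini
```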
Week 3
MVP launch loop
- Ship starter packs for support bots, coding agents, and retrieval workflows.
- Create showcase pages for standout repos and agents using A2ZAI Checks.
- Turn real eval wins and regressions into quick bytes and brief coverage.
Week 4
Distribution and iteration
- Publish example repos, benchmark cards, and operator writeups.
- Use the agent index as the submission and distribution surface for builders.
- Tune onboarding around the fastest path from install to first PR scorecard.
Distribution loop
- Builder connects a repo and runs Checks on a prompt or agent PR.
- A2ZAI posts a scorecard comment with score deltas and failing examples.
- Builder shares the public benchmark card on X, on GitHub, or in product launch posts.
- A2ZAI features the best cards in the river, briefs, and agent showcase.
- New builders arrive from those artifacts and install Checks in their own repos.
How the rest of A2ZAI fits
`Live River` explains the launches, pricing changes, and outages that might move benchmark results.
`Model pages` become compatibility and performance context for builder decisions.
`Agent Index` becomes the showcase surface for builders shipping with Checks.
`Briefs` become the weekly summary of what changed before your app breaks.