Chosen wedge: A2ZAI Checks

GitHub-native evals for prompts and agents

A2ZAI is becoming builder radar plus one sharp utility. That utility is `Checks`: run evals on every PR, catch regressions before merge, and generate benchmark artifacts builders actually want to share.

Run on every PR

Connect a repo, point A2ZAI at a prompt pack or agent workflow, and get pass/fail deltas before merge.

Track what changed for builders

Use the river and model pages to understand whether a provider launch, pricing shift, or outage changed your results.

Ship a shareable artifact

Every run generates a PR scorecard and a public benchmark card that teams can post, compare, and reuse.

MVP workflow now live

GitHub-connected PR checks, manual runs, benchmark cards

`A2ZAI Checks` now supports both manual scorecard runs and automatic GitHub App webhook runs on connected pull requests. Builders can connect a repo, let PR activity trigger Checks automatically, and still run starter packs manually when they want to test a different workflow.

First real workflow

Sign in to run your first Checks pack

The current MVP supports a manual repo connect flow: paste repo metadata, load a YAML starter pack, generate a PR scorecard, and publish a benchmark card.

Defined viral artifact

PR scorecard first, public benchmark card second

The artifact A2ZAI creates is not a generic dashboard. It is a visible object that travels naturally through GitHub, founder launches, and social sharing: a PR scorecard backed by a public benchmark card.

Viral Artifact

GitHub PR scorecard

Shareable PR comment

A2ZAI Checks

Prompt regression check for `support-agent.yaml`

Quality

+8.4%

Latency

+220ms

Cost

-31%

Passing: `refund-policy`, `invoice-lookup`, `cancel-subscription`

Regressed: `edge-case-promotions` on `gpt-4.1-mini`

Recommendation: merge after fixing one retrieval prompt and rerunning the pack.

Public Card

Benchmark card

Linkable showcase

Repo benchmark

support-agent / checkout-recovery

128 eval cases

Best model route

Claude Sonnet + GPT-4.1-mini fallback

Win summary

12% better success at 29% lower cost

Pass rate 94%Safety stable1 flaky case

This is the artifact that spreads on X, GitHub, and founder launches: a benchmark card builders can link to when they ship.

30-day MVP scope

Week 1

Builder radar positioning

  • Reframe the site around builders shipping with models, APIs, SDKs, and agents.
  • Tighten the river and briefs around releases, benchmarks, pricing, outages, and deprecations.
  • Publish the A2ZAI Checks narrative and the benchmark artifact preview.

Week 2

First GitHub workflow

  • Support one happy path: repo connect, YAML test pack, PR comment output.
  • Score quality, safety, latency, and cost deltas from before vs after.
  • Generate a benchmark card page builders can link in launch posts and READMEs.

Week 3

MVP launch loop

  • Ship starter packs for support bots, coding agents, and retrieval workflows.
  • Create showcase pages for standout repos and agents using A2ZAI Checks.
  • Turn real eval wins and regressions into quick bytes and brief coverage.

Week 4

Distribution and iteration

  • Publish example repos, benchmark cards, and operator writeups.
  • Use the agent index as the submission and distribution surface for builders.
  • Tune onboarding around the fastest path from install to first PR scorecard.

Distribution loop

1

Builder connects repo and runs Checks on a prompt or agent PR.

2

A2ZAI posts a scorecard comment with score deltas and failing examples.

3

Builder shares the public benchmark card in X, GitHub, or product launch posts.

4

A2ZAI features the best cards in the river, briefs, and agent showcase.

5

New builders arrive from those artifacts and install Checks in their own repos.

How the rest of A2ZAI fits

`Live River` explains the launches, pricing changes, and outages that might move benchmark results.

`Model pages` become compatibility and performance context for builder decisions.

`Agent Index` becomes the showcase surface for builders shipping with Checks.

`Briefs` become the weekly summary of what changed before your app breaks.