Public benchmark card

krishnaadavi/a2zai

Coding Agent PR Pack

Overall score improved from 74 to 86. The biggest movement came from quality (+17).

Before: 74

After: 86

Delta: +12

Run status: completed

Why this artifact is shareable

Best improvement: quality (+17)

Dimensions improved: 4 out of 4 measured dimensions

Main risk: latency (+7, the smallest gain of the four dimensions)

Public URL ready to share

1200 x 630 social card export ready

GitHub-native eval artifact

No regressed dimensions

Suggested launch post

Copy this text when sharing the benchmark on X or GitHub, in launch posts, or in team chats.

A2ZAI Checks: krishnaadavi/a2zai
Coding Agent PR Pack finished at 86 (+12 vs baseline).
Best gain: quality +17.
No failing examples were detected in this run.
https://a2zai.ai/checks/benchmarks/krishnaadavi-a2zai-coding-agent-pr-pack

Benchmark URL: https://a2zai.ai/checks/benchmarks/krishnaadavi-a2zai-coding-agent-pr-pack

Social card: https://a2zai.ai/checks/benchmarks/krishnaadavi-a2zai-coding-agent-pr-pack/opengraph-image

Add to README

Link to this benchmark from your repo README so visitors see your eval results.

Badge (markdown)

[![A2ZAI Checks](https://a2zai.ai/checks/benchmarks/krishnaadavi-a2zai-coding-agent-pr-pack/opengraph-image)](https://a2zai.ai/checks/benchmarks/krishnaadavi-a2zai-coding-agent-pr-pack)

Link (markdown)

[Benchmark: Coding Agent PR Pack](https://a2zai.ai/checks/benchmarks/krishnaadavi-a2zai-coding-agent-pr-pack)
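Both README snippets above follow one URL pattern visible on this page: `https://a2zai.ai/checks/benchmarks/<slug>` for the card and `<card URL>/opengraph-image` for the image. The following is a minimal Python sketch that builds both snippets. It assumes the slug is formed by lowercasing the repo and pack name and joining the words with hyphens. That rule is inferred from this single example and is not a documented A2ZAI rule. `make_slug` and `readme_snippets` are hypothetical helpers.

```python
import re

BASE_URL = "https://a2zai.ai/checks/benchmarks"  # taken from the URLs on this page


def make_slug(repo: str, pack: str) -> str:
    """Hypothetical slug rule inferred from one example:
    'krishnaadavi/a2zai' + 'Coding Agent PR Pack'
    -> 'krishnaadavi-a2zai-coding-agent-pr-pack'."""
    text = f"{repo} {pack}".lower()
    # Collapse each run of non-alphanumeric characters ("/", spaces) into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def readme_snippets(repo: str, pack: str) -> tuple[str, str]:
    """Return (badge_markdown, link_markdown) for a benchmark card."""
    card_url = f"{BASE_URL}/{make_slug(repo, pack)}"
    image_url = f"{card_url}/opengraph-image"  # the social card doubles as the badge image
    badge = f"[![A2ZAI Checks]({image_url})]({card_url})"
    link = f"[Benchmark: {pack}]({card_url})"
    return badge, link


if __name__ == "__main__":
    badge, link = readme_snippets("krishnaadavi/a2zai", "Coding Agent PR Pack")
    print(badge)
    print(link)
```

If the real slug rule differs (for example, for pack names with punctuation), pass the slug in directly instead of deriving it.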

Dimension scorecard

quality: 71 -> 88 (+17)

safety: 80 -> 89 (+9)

latency: 76 -> 83 (+7)

cost: 66 -> 78 (+12)
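The summary figures earlier on the card ("Best improvement", "Main risk", "Dimensions improved", "No regressed dimensions") can be recomputed from this scorecard. The following is a minimal Python sketch. Its definitions are an interpretation, not documented A2ZAI behavior: "Best improvement" is taken as the largest delta, "Main risk" as the smallest delta, and a dimension counts as regressed when its delta is negative. The overall 74 -> 86 is not the plain mean of the four dimensions (73.25 -> 84.5), so the sketch does not try to derive it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    before: int
    after: int

    @property
    def delta(self) -> int:
        return self.after - self.before


def summarize(dims: list[Dimension]) -> dict:
    """Derive card-style summary fields from per-dimension scores."""
    best = max(dims, key=lambda d: d.delta)     # largest gain
    weakest = min(dims, key=lambda d: d.delta)  # assumed meaning of "Main risk"
    improved = [d for d in dims if d.delta > 0]
    regressed = [d for d in dims if d.delta < 0]
    return {
        "best_improvement": (best.name, best.delta),
        "main_risk": (weakest.name, weakest.delta),
        "improved": f"{len(improved)} of {len(dims)}",
        "regressed": [d.name for d in regressed],
    }


scorecard = [
    Dimension("quality", 71, 88),
    Dimension("safety", 80, 89),
    Dimension("latency", 76, 83),
    Dimension("cost", 66, 78),
]

for d in scorecard:
    print(f"{d.name}: {d.before} -> {d.after} ({d.delta:+d})")
print(summarize(scorecard))
```

With these scores the sketch reports quality (+17) as the best improvement, latency (+7) as the main risk, 4 of 4 dimensions improved, and no regressions, which matches the card.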

PR scorecard output

## A2ZAI Checks Scorecard

Repo: `krishnaadavi/a2zai` • PR #2
Pack: `Coding Agent PR Pack`

Overall: **74 -> 86** (+12)

### Dimension deltas
- quality: 71 -> 88 (+17)
- safety: 80 -> 89 (+9)
- latency: 76 -> 83 (+7)
- cost: 66 -> 78 (+12)

Public benchmark card: https://a2zai.ai/checks/benchmarks/krishnaadavi-a2zai-coding-agent-pr-pack
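The PR comment above is plain Markdown assembled from the run's scores. The following is a minimal Python sketch of a renderer that reproduces it. `render_pr_comment` and its parameters are hypothetical names, not part of any published A2ZAI API. Posting the text to GitHub is out of scope here.

```python
def render_pr_comment(
    repo: str,
    pr_number: int,
    pack: str,
    overall: tuple[int, int],
    dims: list[tuple[str, int, int]],
    card_url: str,
) -> str:
    """Build the Markdown scorecard body for a PR comment (hypothetical helper)."""
    before, after = overall
    lines = [
        "## A2ZAI Checks Scorecard",
        "",
        f"Repo: `{repo}` • PR #{pr_number}",
        f"Pack: `{pack}`",
        "",
        f"Overall: **{before} -> {after}** ({after - before:+d})",
        "",
        "### Dimension deltas",
    ]
    # One bullet per dimension: "name: before -> after (+delta)".
    for name, b, a in dims:
        lines.append(f"- {name}: {b} -> {a} ({a - b:+d})")
    lines += ["", f"Public benchmark card: {card_url}"]
    return "\n".join(lines)


comment = render_pr_comment(
    repo="krishnaadavi/a2zai",
    pr_number=2,
    pack="Coding Agent PR Pack",
    overall=(74, 86),
    dims=[("quality", 71, 88), ("safety", 80, 89), ("latency", 76, 83), ("cost", 66, 78)],
    card_url="https://a2zai.ai/checks/benchmarks/krishnaadavi-a2zai-coding-agent-pr-pack",
)
print(comment)
```

The `+d` format spec always prints a sign, so a regressed dimension would render as, for example, `(-3)` without any extra branching.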

Run context

Repo: krishnaadavi/a2zai

Branch: main -> checks-writeback-test-1

PR: #2

Created: 3/12/2026, 5:29:18 PM

GitHub comment: posted successfully

Cases to review

No failing examples were detected in this run.