What 295 Agentic PRs Taught Us About Code Review

Building AI-native team · Yi Zhang (CEO & Founder)

When AI writes more of the code, code review becomes more important, not less. The failure mode is rarely dramatic. It is usually ordinary and expensive: plausible patches, incomplete fixes, brittle edge cases, weak self-review, and premature “done” states.

Reading every generated line does not scale. Skipping review is how AI slop becomes production debt.

That is why we built Crosscheck: an open-source assistant workflow for independent AI code review. It watches PRs, runs review, applies targeted fixes when configured, re-checks the result, and handles conflict-resolution steps when the automated path is safe.


Crosscheck is open source. Install @humanbased/crosscheck from npm, connect GitHub plus Codex or Claude Code, and run review loops from your own machine or server.

The product choice is deliberately pragmatic. Crosscheck drives Codex and Claude Code directly for review, fix, re-check, and conflict-resolution steps. Teams can use the CLI tools and subscriptions they already pay for, instead of moving every review into a new hosted service or a separate per-review API bill.

This analysis asks a practical question: which workflow choices protect quality without slowing the team down?

What we analyzed

We joined GitHub PR metadata for humanbased-ai/monorepo with retained local Crosscheck logs. The repository was created on 2026-04-27T10:08:41Z; the first PR in the dataset was opened on April 30; the analysis window runs through June 5, 2026.

The full PR population covers 295 pull requests. The retained Crosscheck logs cover May 24 to June 5, giving us a smaller workflow-observed subset of 96 PRs.
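
As a rough illustration of that join, the sketch below matches PR metadata to retained log entries by PR number and keeps only the workflow-observed PRs. The record shapes and field names (`number`, `pr`, `minutes`) are assumptions made for the example, not Crosscheck's actual log schema.

```python
from collections import defaultdict


def join_prs_with_logs(prs, log_entries):
    """Attach log totals to PR metadata; return only PRs that appear in the logs."""
    entries_by_pr = defaultdict(list)
    for entry in log_entries:
        entries_by_pr[entry["pr"]].append(entry)

    observed = []
    for pr in prs:
        entries = entries_by_pr.get(pr["number"])
        if not entries:
            continue  # PR predates the retained log window or was never reviewed
        observed.append({
            **pr,
            "workflow_minutes": sum(e["minutes"] for e in entries),
            "calls": len(entries),
        })
    return observed


# Toy records with hypothetical fields, not real data from the analysis.
prs = [{"number": 1, "title": "a"}, {"number": 2, "title": "b"}, {"number": 3, "title": "c"}]
logs = [{"pr": 2, "minutes": 4.5}, {"pr": 2, "minutes": 1.5}, {"pr": 3, "minutes": 2.0}]
observed = join_prs_with_logs(prs, logs)
```

An inner join like this is what produces a smaller observed subset: PRs without retained log entries drop out rather than showing up with zeroed metrics.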

Figure: Cumulative PRs from monorepo creation to June 5, 2026. The repo went from zero to 295 PRs in 40 calendar days.

The core measurements:

| Measure | Value |
| --- | --- |
| PRs since repo creation | 295 |
| Non-doc PRs | 286 |
| PRs in retained Crosscheck logs | 96 |
| Recorded workflow minutes | 2,948.8 |
| Recorded review hours | 49.1 |
| 24-hour days of workflow | 2.0 |
| Recorded tokens | 1,798,577 |
| Unique ticket refs | 157 |
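
The hours and days rows follow directly from the recorded workflow minutes. A minimal check of that arithmetic, using only the figure reported above:

```python
workflow_minutes = 2948.8  # recorded workflow minutes from the measurements above

hours = workflow_minutes / 60  # minutes to hours
days = hours / 24              # hours to 24-hour days

summary = f"{hours:.1f} hours, {days:.1f} days"
print(summary)  # 49.1 hours, 2.0 days
```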

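We separate the goal from the implementation. Problem complexity describes the demand the issue brings; solution complexity describes the shape of the patch the coding agent produced.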

Figure: Problem complexity versus solution complexity for Crosscheck-observed PRs. Each dot is a Crosscheck-observed non-doc PR. Problem complexity and solution complexity are related, but not interchangeable.

Finding 1: route review from demand and PR shape

The strongest product lesson is not “which telemetry field explains cost afterward.” The useful question is: which setup choices make a PR likely to need more review before we spend the review budget?

In this sample, ticket bundling was the clearest controllable cost signal. Problem complexity and solution complexity also mattered, but they describe different layers. The issue brings demand complexity. The coding agent creates one solution shape.

Crosscheck should route review strength from both.

Figure: Directional correlations between actionable workflow levers, CR cost, and review risk. Directional correlations from the retained Crosscheck window. These are routing signals, not causal estimates.

A practical router should classify the issue, estimate PR shape, capture coding-agent and model provenance, then choose a review lane.

Crosscheck already has workflow tiers such as balanced and thorough. The next product step is to make review strength an automatic routing decision, not a manual habit.
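
A minimal sketch of what such a router could look like, assuming the signals are already available as numbers. The tier names `balanced` and `thorough` come from Crosscheck; the `fast` lane, the `PullRequestSignals` type, and every threshold are illustrative assumptions, not Crosscheck's implementation. Only the "4 or more tickets" cutoff is taken from the bundling data in this post.

```python
from dataclasses import dataclass


@dataclass
class PullRequestSignals:
    ticket_refs: int                # ticket references found in the PR
    problem_score: float            # demand complexity, from the issue
    solution_score: float           # shape of the patch the agent produced
    coding_agent: str = "unknown"   # provenance of the original attempt


def choose_review_lane(pr: PullRequestSignals) -> str:
    """Pick a review lane before spending review budget (thresholds are made up)."""
    # 4+ tickets was the warning sign in the retained window.
    if pr.ticket_refs >= 4 or pr.problem_score >= 50 or pr.solution_score >= 80:
        return "thorough"
    # Without provenance, do not take the cheapest path.
    if pr.coding_agent == "unknown":
        return "balanced"
    if pr.ticket_refs <= 1 and pr.problem_score < 20 and pr.solution_score < 40:
        return "fast"
    return "balanced"


lane = choose_review_lane(PullRequestSignals(5, 30.0, 40.0, "claude"))
```

The point of the sketch is the ordering: risk signals are checked first, missing provenance blocks the cheapest lane, and everything else falls through to the default tier.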

Finding 2: bundle tickets up to a point, then split

The most practical process question is simple: should we ship a larger PR covering multiple tickets, or keep PRs small and run more review loops?

The answer is not “always split.” In the retained Crosscheck window, PRs with 2-3 ticket refs looked comparable to one-ticket PRs on median computer time and CR rounds, while preserving approve rate. That is a useful bundle size when the tickets form one coherent unit.

The warning sign is 4+ tickets. Those PRs became a different operating mode: higher median computer time, more review/fix time, more rounds, and a much lower approve rate. At that point, bundling stops saving coordination cost and starts hiding risk.

Figure: Ticket bundling trade-off across workflow minutes, approve rate, and risk proxy. Bundling can reduce coordination overhead up to a point. Past that point, the workflow becomes harder to stabilize.

Median per-PR comparison:

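| Bundle | Sample | Problem / solution | Workflow / PR | Review + fix / PR | CR rounds / PR | Wall-clock / PR | Tokens / PR | Approve |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 ticket | 19 PRs / 19 tickets | 28.5 / 43.3 | 34.6m | 19.2m + 1.6m | 9 | 27.4h | 13,609 | 63% |
| 2-3 tickets | 10 PRs / 20 tickets | 32.3 / 43.5 | 38.7m | 11.6m + 1.1m | 9 | 57.1h | 9,592 | 60% |
| 4+ tickets | 11 PRs / 69 tickets | 52.6 / 82.0 | 86.4m | 46.5m + 10.7m | 13 | 41.5h | 20,567 | 18% |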
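
A comparison like this can be produced by bucketing PRs on ticket-ref count and taking a median per bucket. The sketch below shows the idea on toy records; the field names and values are invented for illustration and are not the analysis data.

```python
from collections import defaultdict
from statistics import median


def bundle_bucket(ticket_refs):
    """Map a ticket-ref count to the bucket labels used in the table."""
    if ticket_refs >= 4:
        return "4+ tickets"
    if ticket_refs >= 2:
        return "2-3 tickets"
    if ticket_refs == 1:
        return "1 ticket"
    return None  # PRs without ticket refs are left out of the comparison


def median_by_bucket(prs, field):
    groups = defaultdict(list)
    for pr in prs:
        bucket = bundle_bucket(pr["ticket_refs"])
        if bucket is not None:
            groups[bucket].append(pr[field])
    return {bucket: median(values) for bucket, values in groups.items()}


# Toy records, not real measurements.
toy_prs = [
    {"ticket_refs": 0, "workflow_minutes": 5.0},
    {"ticket_refs": 1, "workflow_minutes": 10.0},
    {"ticket_refs": 1, "workflow_minutes": 30.0},
    {"ticket_refs": 3, "workflow_minutes": 20.0},
    {"ticket_refs": 4, "workflow_minutes": 70.0},
    {"ticket_refs": 5, "workflow_minutes": 80.0},
    {"ticket_refs": 6, "workflow_minutes": 90.0},
]
medians = median_by_bucket(toy_prs, "workflow_minutes")
```

Medians are used rather than means so that a single runaway review loop does not dominate a bucket with only ten or so PRs.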

The operating rule we took from this:

Bundle tickets when they are one evidence unit: same domain, same rollout, shared migration, or one acceptance flow.

Split when they cross domains, need different reviewers, or deserve separate rollback stories.
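
The rule can be written down as a small check. This is one possible encoding, assuming each ticket carries hypothetical `domain`, `reviewer`, and `rollback_unit` fields; the `reconsider` outcome for 4+ tickets reflects the warning sign in the data above rather than a hard policy.

```python
def bundling_advice(tickets):
    """Suggest whether a set of tickets should ship as one PR (illustrative rule)."""
    if len(tickets) <= 1:
        return "single"
    # One evidence unit: every ticket agrees on domain, reviewer, and rollback story.
    one_unit = all(
        len({ticket[key] for ticket in tickets}) == 1
        for key in ("domain", "reviewer", "rollback_unit")
    )
    if not one_unit:
        return "split"
    if len(tickets) >= 4:
        return "reconsider"  # coherent, but in the size range where approve rate dropped
    return "bundle"


same = {"domain": "billing", "reviewer": "payments", "rollback_unit": "migration-a"}
advice = bundling_advice([dict(same), dict(same), dict(same)])
```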

Finding 3: do not rank coding agents without provenance

The observed agent comparison is useful, but not fair enough to crown a winner.

Codex-origin PRs in this sample were mostly smaller site and product-polish changes. Claude-origin PRs carried more ticket refs, higher problem scores, and more backend/domain-risk work. The measured outcomes reflect task assignment as much as agent ability.

Figure: Average problem and solution complexity by inferred PR origin. Origin is inferred, not authoritative. The product opportunity is to make this metadata first-class.

The origin proxy table shows the bias:

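| Origin proxy | PRs | Ticket refs | Avg problem | Avg solution | Avg workflow | Avg rounds | Approve |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude | 30 | 89 | 33.4 | 55.7 | 72.2m | 11.6 | 33% |
| Codex | 28 | 8 | 13.8 | 35.9 | 12.2m | 4.5 | 79% |
| Human | 11 | 11 | 27.8 | 40.7 | 37.2m | 8.7 | 82% |
| Unknown | 20 | 4 | 21.1 | 40.4 | 0.0m | 0.0 | 0% |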

The conclusion is not “Codex is better” or “Claude is worse.” The conclusion is that Crosscheck needs first-class provenance for the original PR attempt: coding agent, model, effort, context strategy, and verification behavior.

Without that, task assignment bias overwhelms model choice.
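
The provenance fields named in this post (coding agent, model, effort, context strategy, verification behavior) could be captured as one record per original PR attempt. The sketch below is a hypothetical shape for that record, not an existing Crosscheck schema; the `inferred` flag is an added assumption to keep guessed origins distinguishable from recorded ones.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AttemptProvenance:
    pr_number: int
    coding_agent: str       # e.g. "codex", "claude-code", "human"
    model: str
    effort: str
    context_strategy: str
    verification: str       # what the agent did to check its own work
    inferred: bool = False  # True when guessed from PR text instead of recorded


# Placeholder values for illustration only.
record = AttemptProvenance(
    pr_number=0,
    coding_agent="codex",
    model="unknown",
    effort="unknown",
    context_strategy="unknown",
    verification="tests-run",
)
line = json.dumps(asdict(record), sort_keys=True)  # one JSON line per attempt
```

Writing the record at PR creation time, rather than inferring it afterward, is what would turn the origin proxy table into an authoritative comparison.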

What model comparison can tell us today

Reviewer model routing is more actionable than original coding-agent ranking because Crosscheck owns that part of the workflow.

In retained review-call logs, Claude Sonnet review calls were much faster and used fewer recorded tokens than Claude Opus calls. Opus appeared on somewhat harder PRs, so this is not a quality ranking. It is a routing hint.

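| Reviewer setup | Calls | PRs | Avg problem | Avg duration | Avg tokens | Approve | Needs work | Block |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude / Sonnet 4.6 | 61 | 30 | 17.0 | 63.2s | 2,194 | 62% | 36% | 2% |
| Claude / Opus 4.7 | 14 | 13 | 24.4 | 162.4s | 9,106 | 50% | 36% | 14% |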

The practical policy: use smaller/faster models by default, then escalate to stronger models for high-risk PRs, reviewer disagreement, or repeated fix loops.
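
That policy fits in a few lines. The sketch below uses generic `fast` and `strong` labels instead of specific model names, and the `max_fix_loops` default of 2 is an arbitrary assumption for what counts as "repeated".

```python
def choose_reviewer_model(high_risk, reviewers_disagree, fix_loops, max_fix_loops=2):
    """Default to the faster reviewer; escalate on any of the three triggers."""
    if high_risk or reviewers_disagree or fix_loops > max_fix_loops:
        return "strong"
    return "fast"


default_choice = choose_reviewer_model(high_risk=False, reviewers_disagree=False, fix_loops=0)
escalated_choice = choose_reviewer_model(high_risk=False, reviewers_disagree=False, fix_loops=3)
```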

What this means for Crosscheck

Crosscheck does not need to magically make generated code good. Its leverage is more practical: make review reliable, observable, cheap enough to run often, and smart enough to spend effort where risk is highest.

The next product opportunities are clear:

  1. Issue-ticket enrichment. Join Linear/GitHub issue fields so problem complexity is measured from the goal, not inferred from the PR.
  2. Agent/model provenance. Record coding agent, model, effort, context strategy, and verification behavior for the original PR attempt.
  3. Complexity-aware workflow lanes. Route low-risk PRs through fast review, complex PRs through review plus targeted fix, and high-risk PRs through human-plan or domain-owner gates.
  4. Reliability preflight. Check GitHub auth, vendor CLI auth, repo access, and tunnel state before accepting PR work.
  5. Original vs. fixed state tracking. Separate original PR quality from Crosscheck-applied fix deltas.
  6. Real cost accounting. Persist provider, model, input/output/cached tokens, duration, and computed USD per reviewer/fixer call.
  7. Post-merge quality joins. Track reverts, follow-up fixes, bug issues, incidents, and CI failures after merge.
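
As one example, the cost-accounting item could start from a per-call record like the sketch below. The `CallCost` type, the price-table layout, and the prices themselves are placeholders for illustration; real per-token prices would have to come from each provider.

```python
from dataclasses import dataclass


@dataclass
class CallCost:
    provider: str
    model: str
    role: str            # "reviewer" or "fixer"
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    duration_s: float


def usd_cost(call, prices):
    """Compute USD for one call; prices are USD per million tokens."""
    rate = prices[(call.provider, call.model)]
    total = (
        call.input_tokens * rate["input"]
        + call.output_tokens * rate["output"]
        + call.cached_tokens * rate["cached"]
    )
    return total / 1_000_000


# Placeholder prices, not real provider pricing.
prices = {("example-provider", "example-model"): {"input": 1.0, "output": 2.0, "cached": 0.5}}
call = CallCost("example-provider", "example-model", "reviewer", 1000, 500, 2000, 63.2)
cost = usd_cost(call, prices)  # (1000*1.0 + 500*2.0 + 2000*0.5) / 1e6 = 0.003
```

Keeping cached tokens as a separate field matters because they are usually billed at a different rate than fresh input tokens, so a single token total would misstate cost.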

The practice we want

AI-native engineering should not mean “let agents ship whatever they produce.” It should mean faster loops with better instrumentation.

Small goals should move quickly. Risky goals should get stronger gates. PRs should stay shaped for review. Models should be routed by expected risk, not habit. Every run should leave data behind so the workflow gets easier to improve.

That is how agentic teams can improve cadence and quality together.