Task Success Rate (TSR)
Task Success Rate measures the percentage of test items where a reasoning model reaches the correct final outcome under clearly defined success criteria. In a reasoning AI test, success should be evaluated against deterministic answers (math, logic), rubric-scored solutions (planning, analysis), or verifiable actions (tool calls, executed steps). TSR is most valuable when segmented by difficulty tiers, domain categories, and required reasoning depth (single-step vs multi-step). For rigorous benchmarking, define “success” at multiple granularities: final answer correct, intermediate constraints satisfied, and solution format adherence. Track TSR under standard conditions and under perturbations (noisy inputs, distracting facts) to quantify robustness.
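A minimal sketch of segmented TSR, assuming each graded item is a record with a boolean correct flag plus metadata such as difficulty or domain (the field names here are illustrative, not a fixed schema):

```python
from collections import defaultdict

def task_success_rate(results, segment_key=None):
    """Compute overall TSR, or TSR per segment (e.g. difficulty, domain).

    `results` is assumed to be a list of dicts with a boolean `correct`
    field plus optional metadata such as `difficulty` or `domain`.
    """
    if segment_key is None:
        return sum(r["correct"] for r in results) / len(results)

    buckets = defaultdict(list)
    for r in results:
        buckets[r.get(segment_key, "unknown")].append(r["correct"])
    return {seg: sum(v) / len(v) for seg, v in buckets.items()}

# Example: overall TSR plus a breakdown by difficulty tier.
results = [
    {"correct": True,  "difficulty": "easy", "domain": "math"},
    {"correct": False, "difficulty": "hard", "domain": "logic"},
    {"correct": True,  "difficulty": "hard", "domain": "logic"},
]
print(task_success_rate(results))                            # ~0.67 overall
print(task_success_rate(results, segment_key="difficulty"))  # per-tier TSR
```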
Exact Match and Structured Output Accuracy
Exact Match is strict string-level correctness, essential for tasks where formatting is part of the requirement: code outputs, JSON schemas, symbolic proofs, or equation forms. For structured outputs, compute schema validity (parsable JSON, correct keys, types), field-level accuracy, and constraint compliance (e.g., units included, citations present). When multiple correct surface forms exist, normalize outputs (case, whitespace, canonical ordering) or use equivalence checks (AST comparison for code, symbolic simplification for math). This metric prevents models from “nearly” answering while still failing real integration requirements.
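The sketch below illustrates normalized exact match plus a simple structured-output check against a hypothetical two-field schema; a production harness would swap in real schema definitions and stronger equivalence checks (AST or symbolic comparison):

```python
import json

REQUIRED_FIELDS = {"answer": str, "units": str}  # illustrative schema only

def normalize(text: str) -> str:
    """Canonicalize surface form: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def structured_output_score(raw_output: str, gold: dict) -> dict:
    """Return schema validity plus field-level accuracy for a JSON output."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"schema_valid": False, "field_accuracy": 0.0}

    valid = all(isinstance(parsed.get(k), t) for k, t in REQUIRED_FIELDS.items())
    correct = sum(
        normalize(str(parsed.get(k, ""))) == normalize(str(gold[k]))
        for k in gold
    )
    return {"schema_valid": valid, "field_accuracy": correct / len(gold)}

print(structured_output_score('{"answer": "42 ", "units": "kg"}',
                              {"answer": "42", "units": "kg"}))
```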
Reasoning Step Validity (Intermediate Accuracy)
Reasoning Step Validity evaluates whether intermediate steps are logically sound, not just whether the final answer is correct. Score each step against a reference chain-of-thought, a set of accepted transformations, or a verifier that checks entailment between steps. This helps detect lucky guesses, shallow heuristics, or brittle shortcuts. In multi-hop QA, track hop-level correctness (retrieved fact A supports inference B) and compute a step-level F1. In planning, validate preconditions/effects per action. Step validity correlates strongly with reliability in high-stakes settings where process integrity matters.
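A rough sketch of step-level F1 against a reference chain, using normalized string equality as a stand-in for a real entailment or transformation verifier (the matcher is pluggable):

```python
def step_f1(predicted_steps, reference_steps, is_match=None):
    """Step-level F1: precision over predicted steps, recall over reference steps.

    `is_match` is a pluggable verifier; normalized string equality is used
    here only as a placeholder for an entailment or transformation checker.
    """
    if is_match is None:
        is_match = lambda a, b: " ".join(a.lower().split()) == " ".join(b.lower().split())

    matched_ref, tp = set(), 0
    for p in predicted_steps:
        for i, r in enumerate(reference_steps):
            if i not in matched_ref and is_match(p, r):
                matched_ref.add(i)
                tp += 1
                break

    precision = tp / len(predicted_steps) if predicted_steps else 0.0
    recall = tp / len(reference_steps) if reference_steps else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(step_f1(["x = 3", "y = x + 1", "answer: 4"],
              ["x = 3", "y = 4", "answer: 4"]))  # 2 of 3 steps match -> ~0.67
```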
Calibration and Confidence Quality
Calibration measures how well predicted confidence aligns with actual correctness. Use Expected Calibration Error (ECE), Brier score, and reliability diagrams on bins of predicted probability. A reasoning AI test should also report selective accuracy: accuracy when the model answers only above a confidence threshold, plus coverage (fraction answered). Well-calibrated systems enable safer deployment via deferrals, human review, or tool-assisted verification. Include calibration per domain and difficulty; models are often overconfident in adversarial logic puzzles and underconfident in routine arithmetic.
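The core calibration quantities can be computed in a few lines; the sketch below assumes per-item confidence scores and correctness flags and uses equal-width probability bins:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

def brier_score(confidences, correct):
    return sum((c - int(ok)) ** 2 for c, ok in zip(confidences, correct)) / len(correct)

def selective_accuracy(confidences, correct, threshold=0.8):
    """Accuracy on items answered above the confidence threshold, plus coverage."""
    answered = [(c, ok) for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(answered) / len(correct)
    acc = sum(ok for _, ok in answered) / len(answered) if answered else 0.0
    return acc, coverage

conf = [0.9, 0.6, 0.95, 0.4]
ok = [True, False, True, True]
print(expected_calibration_error(conf, ok), brier_score(conf, ok),
      selective_accuracy(conf, ok))
```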
Consistency Under Rephrasing and Permutations
Reasoning models should remain stable when the same problem is presented with paraphrases, reordered premises, or equivalent representations. Measure consistency as agreement rate across variants, and compute worst-case accuracy across paraphrase sets. For logical tasks, permute the order of facts; for math word problems, rephrase context; for code reasoning, rename variables. High consistency indicates the model is using underlying structure rather than memorized cues. Combine this with a variance metric to identify prompts where performance swings widely.
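A small sketch of per-item consistency over a paraphrase set, assuming the answers for all variants of one problem are collected into a single list:

```python
from collections import Counter

def consistency_metrics(variant_answers, gold):
    """Agreement rate (majority share), worst-case correctness, and mean
    accuracy across paraphrase variants of a single item."""
    counts = Counter(variant_answers)
    majority_share = counts.most_common(1)[0][1] / len(variant_answers)
    per_variant_correct = [a == gold for a in variant_answers]
    return {
        "agreement_rate": majority_share,
        "worst_case_correct": all(per_variant_correct),
        "mean_accuracy": sum(per_variant_correct) / len(per_variant_correct),
    }

# Three paraphrases of the same problem; one variant flips the answer.
print(consistency_metrics(["18", "18", "21"], gold="18"))
```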
Robustness to Adversarial and Distractor Inputs
Robustness metrics quantify performance under targeted stress: irrelevant distractors, misleading statements, typos, and adversarially crafted contradictions. Report robustness drop (Δ accuracy) from clean to perturbed sets and stratify by perturbation type. For reasoning AI, include “distractor density” (number of irrelevant facts) and “contradiction sensitivity” (whether the model detects inconsistency and appropriately abstains or flags uncertainty). Robustness directly measures real-world resilience, where inputs are messy and may contain deceptive cues.
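A sketch of the clean-vs-perturbed comparison, assuming each perturbed record carries an illustrative perturbation label such as 'distractor' or 'typo':

```python
def robustness_report(clean_results, perturbed_results):
    """Accuracy drop from clean to perturbed sets, stratified by perturbation type."""
    clean_acc = sum(r["correct"] for r in clean_results) / len(clean_results)
    by_type = {}
    for r in perturbed_results:
        by_type.setdefault(r["perturbation"], []).append(r["correct"])
    report = {"clean_accuracy": clean_acc}
    for ptype, vals in by_type.items():
        acc = sum(vals) / len(vals)
        report[ptype] = {"accuracy": acc, "delta_vs_clean": clean_acc - acc}
    return report

clean = [{"correct": True}, {"correct": True}, {"correct": False}]
perturbed = [
    {"correct": False, "perturbation": "distractor"},
    {"correct": True,  "perturbation": "typo"},
]
print(robustness_report(clean, perturbed))
```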
Faithfulness and Grounding Score
Faithfulness measures whether the model's reasoning and answers are grounded in provided evidence rather than hallucinated facts. In retrieval-augmented tests, compute citation precision (the fraction of cited passages that actually support the claims) and citation recall (the fraction of necessary supporting passages that were cited). For non-retrieval tasks with given premises, use entailment-based scoring to check whether statements are supported by the input. Track the unsupported-claim rate and a severity-weighted hallucination score (minor detail vs critical error). Faithfulness is essential when reasoning depends on specific constraints.
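Once cited and gold supporting passages are identified by id, citation precision and recall reduce to set overlap, as in this sketch (the passage ids are hypothetical):

```python
def citation_precision_recall(cited_ids, supporting_ids):
    """Citation precision: cited passages that truly support the claim.
    Citation recall: required supporting passages that were cited."""
    cited, gold = set(cited_ids), set(supporting_ids)
    tp = len(cited & gold)
    precision = tp / len(cited) if cited else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# The model cited p1 and p4; only p1 and p2 actually support the claim.
print(citation_precision_recall(["p1", "p4"], ["p1", "p2"]))  # (0.5, 0.5)
```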
Tool-Use Accuracy and Action Validity
Many reasoning systems rely on tools (calculators, search, code execution). Tool-use metrics assess whether the model selects the right tool, issues correct calls, and correctly interprets results. Measure tool selection accuracy, call success rate, parameter correctness, and post-tool answer correctness. For multi-tool chains, compute action validity per step and end-to-end toolchain success. Also track “tool avoidance errors” (failures caused by declining to use an available tool) and “over-tooling” (unnecessary calls that add latency and cost without accuracy gains).
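A sketch of per-step tool metrics over logged agent episodes, assuming each step records the expected tool, the chosen tool, whether the call executed, and whether a call was actually needed (all field names are illustrative):

```python
def tool_use_metrics(episodes):
    """Per-step tool metrics plus end-to-end toolchain success over logged episodes."""
    steps = [s for ep in episodes for s in ep]
    n = len(steps)
    return {
        "tool_selection_accuracy": sum(s["chosen"] == s["expected"] for s in steps) / n,
        "call_success_rate": sum(s["call_ok"] for s in steps) / n,
        "over_tooling_rate": sum(not s["needed"] for s in steps) / n,
        "end_to_end_success": sum(all(s["call_ok"] for s in ep) for ep in episodes) / len(episodes),
    }

episodes = [
    [{"expected": "calculator", "chosen": "calculator", "call_ok": True,  "needed": True},
     {"expected": "search",     "chosen": "calculator", "call_ok": False, "needed": True}],
    [{"expected": "search",     "chosen": "search",     "call_ok": True,  "needed": False}],
]
print(tool_use_metrics(episodes))
```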
Efficiency: Latency, Token Economy, and Reasoning Cost
Performance is not only correctness; it is also efficiency. Measure time-to-first-token, end-to-end latency, tokens generated, and compute cost per solved task. For agentic reasoning, track number of steps, tool calls, and backtracks. Report accuracy-at-budget curves: how accuracy changes under token caps or time limits. An effective reasoning model should deliver strong results within operational constraints. Efficiency metrics help compare models that reach similar accuracy but differ substantially in speed and cost.
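An accuracy-at-budget curve only needs per-item correctness and token counts; the sketch below assumes both are logged for each result:

```python
def accuracy_at_budget(results, budgets):
    """Accuracy under token caps: an item counts as solved only if it was
    correct and stayed within the budget."""
    curve = {}
    for cap in budgets:
        solved = sum(r["correct"] and r["tokens_used"] <= cap for r in results)
        curve[cap] = solved / len(results)
    return curve

results = [
    {"correct": True,  "tokens_used": 220},
    {"correct": True,  "tokens_used": 900},
    {"correct": False, "tokens_used": 150},
]
print(accuracy_at_budget(results, budgets=[256, 512, 1024]))
# roughly {256: 0.33, 512: 0.33, 1024: 0.67}
```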
Generalization Across Domains and Difficulty
A strong reasoning AI should generalize beyond narrow benchmarks. Measure cross-domain performance by holding out entire task families (e.g., syllogisms, combinatorics, causal reasoning) and reporting macro-averaged accuracy to prevent dominant categories from masking weaknesses. Use difficulty-conditioned metrics such as accuracy by depth (number of inference steps), by abstraction level, and by compositionality (novel combinations of familiar primitives). Report worst-domain accuracy and tail performance on the hardest decile to understand failure modes.
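A sketch of macro-averaged and worst-domain accuracy, assuming each result is tagged with an illustrative domain label:

```python
def generalization_report(results):
    """Macro-averaged accuracy across domains plus worst-domain accuracy,
    so a dominant category cannot mask a weak one."""
    by_domain = {}
    for r in results:
        by_domain.setdefault(r["domain"], []).append(r["correct"])
    per_domain = {d: sum(v) / len(v) for d, v in by_domain.items()}
    macro = sum(per_domain.values()) / len(per_domain)
    worst_domain = min(per_domain, key=per_domain.get)
    return {"per_domain": per_domain, "macro_accuracy": macro,
            "worst_domain": worst_domain, "worst_accuracy": per_domain[worst_domain]}

results = [
    {"domain": "syllogisms",    "correct": True},
    {"domain": "syllogisms",    "correct": True},
    {"domain": "combinatorics", "correct": False},
    {"domain": "causal",        "correct": True},
]
print(generalization_report(results))
```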
Error Taxonomy and Severity-Weighted Scoring
Raw accuracy hides what went wrong. Build an error taxonomy: arithmetic slip, invalid inference, misread constraint, missing edge case, contradiction oversight, tool misinterpretation, and format violation. Score errors with severity weights aligned to application risk. For example, a minor formatting error differs from a logically invalid conclusion. Track frequency and severity-weighted loss, enabling targeted improvements. Pair this with confusion matrices for categorical reasoning tasks and per-rule breakdowns for logic suites.
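A minimal sketch of severity-weighted scoring; the weights below are purely illustrative and should be tuned to application risk:

```python
# Illustrative severity weights; calibrate them to application risk.
SEVERITY = {
    "format_violation": 0.1,
    "arithmetic_slip": 0.5,
    "misread_constraint": 0.7,
    "invalid_inference": 1.0,
}

def severity_weighted_loss(error_log):
    """Average severity-weighted penalty per item; items with no errors cost 0.

    `error_log` maps item ids to lists of error-type labels from the taxonomy."""
    total = sum(sum(SEVERITY.get(e, 1.0) for e in errs) for errs in error_log.values())
    return total / len(error_log)

errors = {
    "item_1": [],
    "item_2": ["format_violation"],
    "item_3": ["invalid_inference", "arithmetic_slip"],
}
print(severity_weighted_loss(errors))  # (0 + 0.1 + 1.5) / 3
```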
Human Alignment: Helpfulness, Harmlessness, and Refusal Quality
Reasoning tests often include safety-relevant prompts. Measure refusal precision (the fraction of refusals that were actually warranted) and refusal recall (the fraction of disallowed requests that were refused), plus the “over-refusal rate” on benign prompts. Evaluate helpfulness on allowed tasks using rubrics for completeness, correctness, and clarity. For borderline cases, score refusal quality: does the model explain constraints, offer safe alternatives, and avoid leaking actionable harm? Alignment metrics ensure that stronger reasoning does not amplify unsafe capability.
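Refusal precision, refusal recall, and the over-refusal rate follow directly from policy labels and observed behavior, as in this sketch with hypothetical field names:

```python
def refusal_metrics(records):
    """Refusal precision, refusal recall, and over-refusal rate.

    Each record is assumed to carry `should_refuse` (policy label) and
    `refused` (observed model behavior)."""
    refused = [r for r in records if r["refused"]]
    disallowed = [r for r in records if r["should_refuse"]]
    benign = [r for r in records if not r["should_refuse"]]

    precision = (sum(r["should_refuse"] for r in refused) / len(refused)) if refused else 0.0
    recall = (sum(r["refused"] for r in disallowed) / len(disallowed)) if disallowed else 0.0
    over_refusal = (sum(r["refused"] for r in benign) / len(benign)) if benign else 0.0
    return {"refusal_precision": precision, "refusal_recall": recall,
            "over_refusal_rate": over_refusal}

records = [
    {"should_refuse": True,  "refused": True},
    {"should_refuse": True,  "refused": False},
    {"should_refuse": False, "refused": True},   # over-refusal
    {"should_refuse": False, "refused": False},
]
print(refusal_metrics(records))
```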
Statistical Reliability: Confidence Intervals and Significance
Benchmark results should include uncertainty. Report 95% confidence intervals via bootstrap resampling and conduct significance tests for model comparisons. Track inter-rater reliability (Cohen’s kappa, Krippendorff’s alpha) when human grading is used, especially for open-ended reasoning. Provide dataset-level and slice-level intervals to avoid overinterpreting small gains. Statistical reliability metrics make performance claims credible and prevent optimization on noise.
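A percentile bootstrap for an accuracy confidence interval can be done with the standard library alone, as sketched below:

```python
import random

def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    """95% bootstrap CI for accuracy: resample items with replacement and
    take the empirical 2.5th and 97.5th percentiles of the resampled accuracy."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = accs[int((alpha / 2) * n_resamples)]
    hi = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, (lo, hi)

outcomes = [True] * 68 + [False] * 32  # 68% observed accuracy
print(bootstrap_ci(outcomes))          # point estimate plus (lower, upper) bound
```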
Composite Scorecards and Pareto Fronts
Because reasoning AI is multi-objective, combine metrics into scorecards rather than a single number. Use weighted composites aligned to business goals, but also present Pareto fronts showing trade-offs among accuracy, faithfulness, latency, and safety. Include minimum thresholds (e.g., schema validity must exceed 99%) and report constraint violations. Composite evaluation encourages balanced optimization, ensuring improvements in reasoning do not degrade robustness, grounding, or operational efficiency.
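A sketch of hard threshold checks plus a simple Pareto front over accuracy, faithfulness, and latency, using illustrative model records:

```python
def pareto_front(models):
    """Models not dominated on (accuracy, faithfulness, -latency): a model is
    dominated if another is at least as good on every objective and strictly
    better on at least one."""
    def objectives(m):
        return (m["accuracy"], m["faithfulness"], -m["latency_s"])

    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [m["name"] for m in models
            if not any(dominates(objectives(o), objectives(m))
                       for o in models if o is not m)]

def passes_thresholds(m, min_schema_validity=0.99):
    """Hard constraints checked before any weighted composite is computed."""
    return m["schema_validity"] >= min_schema_validity

models = [
    {"name": "A", "accuracy": 0.82, "faithfulness": 0.90, "latency_s": 3.1, "schema_validity": 0.995},
    {"name": "B", "accuracy": 0.85, "faithfulness": 0.86, "latency_s": 5.4, "schema_validity": 0.970},
    {"name": "C", "accuracy": 0.80, "faithfulness": 0.88, "latency_s": 3.5, "schema_validity": 0.999},
]
print([m["name"] for m in models if passes_thresholds(m)], pareto_front(models))
```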
