Agent Bench Results

Successful benchmark runs, sorted most recent first

Scoring Model
Score = clamp(0, 100 − iteration_penalty − duration_penalty − token_penalty, 100)

// Penalty weights
iteration_penalty = iterations × 5
duration_penalty  = duration_secs × 0.5
token_penalty     = tokens_used × 0.0001
Factor         Weight         Rationale
iterations     -5 / iter      Fewer iterations means the model solved the task more directly
duration_secs  -0.5 / sec     Faster completion indicates better efficiency
tokens_used    -0.0001 / tok  Lower token usage reflects more concise reasoning
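The scoring model above can be sketched as a small function; the name bench_score is illustrative, but the weights and the clamp to [0, 100] follow the formula as stated.

```python
def bench_score(iterations: int, duration_secs: float, tokens_used: int) -> float:
    """Score = clamp(0, 100 - iteration_penalty - duration_penalty - token_penalty, 100)."""
    iteration_penalty = iterations * 5          # -5 per iteration
    duration_penalty = duration_secs * 0.5      # -0.5 per second
    token_penalty = tokens_used * 0.0001        # -0.0001 per token
    raw = 100.0 - iteration_penalty - duration_penalty - token_penalty
    return max(0.0, min(100.0, raw))            # clamp to the 0..100 range

# Example: 3 iterations, 20 s, 50,000 tokens
# penalties: 15 + 10 + 5 = 30, so the score is 70 (the "High" band boundary)
print(bench_score(3, 20.0, 50_000))
```

A run that is slow or token-heavy enough for penalties to exceed 100 clamps to a score of 0 rather than going negative.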
Score bands: High (≥ 70) · Mid (≥ 40) · Low (< 40)
Task Model Iterations Duration (s) Tokens Score Version Timestamp