Agent Bench Results

Successful benchmark runs, sorted most recent first

Scoring Model
Score = clamp(0, 100 − iteration_penalty − duration_penalty − token_penalty, 100)

// Penalty weights
iteration_penalty = iterations × 5
duration_penalty  = duration_secs × 0.5
token_penalty     = tokens_used × 0.0001
Factor         Weight         Rationale
iterations     -5 / iter      Fewer iterations means the model solved the task more directly
duration_secs  -0.5 / sec     Faster completion indicates better efficiency
tokens_used    -0.0001 / tok  Lower token usage reflects more concise reasoning
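The scoring model above can be sketched as a small function; the name bench_score is illustrative, but the weights and the clamp to [0, 100] follow the formula as stated.

```python
def bench_score(iterations: int, duration_secs: float, tokens_used: int) -> float:
    """Score = clamp(0, 100 - iteration_penalty - duration_penalty - token_penalty, 100)."""
    iteration_penalty = iterations * 5          # -5 per iteration
    duration_penalty = duration_secs * 0.5      # -0.5 per second
    token_penalty = tokens_used * 0.0001        # -0.0001 per token
    raw = 100.0 - iteration_penalty - duration_penalty - token_penalty
    return max(0.0, min(100.0, raw))            # clamp to the 0..100 range

# Example: 3 iterations, 20 s, 50,000 tokens
# penalties: 15 + 10 + 5 = 30, so the score is 70 (the "High" band boundary)
print(bench_score(3, 20.0, 50_000))
```

A run that is slow or token-heavy enough for penalties to exceed 100 clamps to a score of 0 rather than going negative.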
Score bands: High (≥ 70) · Mid (≥ 40) · Low (< 40)
Task Model Iterations Duration (s) Tokens Score Version Timestamp