Successful benchmark runs — sorted by most recent first
| Factor | Weight | Rationale |
|---|---|---|
| iterations | -5 / iter | Fewer iterations means the model solved the task more directly |
| duration_secs | -0.5 / sec | Faster completion indicates better efficiency |
| tokens_used | -0.0001 / tok | Lower token usage reflects more concise reasoning |
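
As a rough illustration of how the weights in the table above could combine into a single score, here is a minimal Python sketch. The `Run` dataclass, the `score` function, and the `base` parameter are hypothetical names introduced for this example. The table does not state a baseline that the penalties are applied to, so `base` defaults to 0 here and is an assumption.

```python
from dataclasses import dataclass

# Weights copied from the table above (points per unit).
WEIGHTS = {
    "iterations": -5.0,       # per iteration
    "duration_secs": -0.5,    # per second
    "tokens_used": -0.0001,   # per token
}


@dataclass
class Run:
    """Hypothetical container for the three scored factors of one run."""
    iterations: int
    duration_secs: float
    tokens_used: int


def score(run: Run, base: float = 0.0) -> float:
    """Return `base` plus the weighted sum of the three factors.

    `base` is an assumed starting value; the source does not specify one.
    """
    return base + sum(w * getattr(run, name) for name, w in WEIGHTS.items())


# Example: 3 iterations, 42 seconds, 12,000 tokens.
run = Run(iterations=3, duration_secs=42.0, tokens_used=12_000)
print(score(run))
```

Because every weight is negative, a run that uses fewer iterations, less time, or fewer tokens always scores higher under this sketch, which matches the rationale column.
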

| Task | Model | Iterations | Duration (s) | Tokens | Score | Version | Timestamp |
|---|---|---|---|---|---|---|---|