Overview

Game Agent Coding League (GACL) is a benchmarking framework designed for LLMs in which models are tasked with generating code for game-playing agents. These agents compete in games such as Battleship, Tic-Tac-Toe variants, and others. At present, the league supports five games, with additional titles planned.

Models produce code that can play games rather than playing the games themselves. Their sole interaction occurs at the stage of code creation. As a result, this benchmark is hybrid in nature, combining elements of both code generation and reasoning.

Methodology

For each game, a base implementation is provided, including the core game logic and a stubbed-out agent class: the function signatures are shared with the model, but their bodies are left empty. Models receive this code along with the game rules, scoring details, and other relevant information. After a model generates its agent implementation, the output is parsed and inserted back into the game environment. A matchmaking system then schedules matches between all agent pairs, producing a final leaderboard.
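For instance, the stub handed to a model might resemble the following sketch (the class name, method, and parameters are hypothetical illustrations, not GACL's actual interface):

```python
class BattleshipAgent:
    """Illustrative agent stub: the signature is provided, the body is not."""

    def __init__(self):
        # Models may initialize any internal state here.
        self.shots_fired = []

    def choose_move(self, board):
        """Return the next (row, col) to fire at, given the current board."""
        raise NotImplementedError  # left empty for the model to implement
```

The model's job is to replace the `NotImplementedError` body with a working strategy; everything else in the environment stays fixed.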

Each model is prompted twice, meaning every model submits two separate agents. During matchmaking, agents from the same model do not compete against each other. Instead, each agent plays against all non-friendly agents.
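The pairing rule can be sketched as follows (the tuple format and function name are our assumptions, not the framework's actual API):

```python
from itertools import combinations

def schedule_matchups(agents):
    """Pair every two agents except those submitted by the same model.

    `agents` is a list of (model_name, agent_id) tuples; this shape is
    an illustrative assumption, not GACL's actual data format.
    """
    return [
        (a, b)
        for a, b in combinations(agents, 2)
        if a[0] != b[0]  # skip same-model ("friendly") pairings
    ]

agents = [
    ("mistral", 1), ("mistral", 2),
    ("qwen", 1), ("qwen", 2),
]
# Each agent meets both agents of the other model: 4 matchups in total.
matchups = schedule_matchups(agents)
```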

When compiling performance charts, only the stronger of a model's two agents is considered; the weaker one is excluded. You can find links to the raw scoreboards—where both agents appear—on the Leaderboard page.
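A minimal sketch of that selection step, assuming points are stored per agent (the data shape and function name are ours):

```python
def chart_points(model_agents):
    """Given {agent_id: points} for one model's two agents, return the
    points of the stronger agent; the weaker one is excluded from charts."""
    return max(model_agents.values())

# If a model's agents scored 240 and 210 points, the charts use 240.
```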

Point System

Each matchup between two agents is made up of multiple individual games, all using the same point allocation system. A win awards 3 points, a loss gives 0, and a draw grants 1 point to each side. The way a win, loss, or draw is determined varies by game—refer to the full game prompts for detailed scoring mechanics.
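In code, the per-game allocation amounts to a simple lookup (the function name is ours):

```python
def head_to_head_points(result):
    """Map one agent's outcome in a single game to points:
    a win earns 3, a draw 1, a loss 0."""
    return {"win": 3, "draw": 1, "loss": 0}[result]
```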

Exception: The A3 Wizard game differs from the others. Each match requires six participants, so there are no win/draw/loss outcomes. Instead, the first-place agent earns 5 points, the second-place agent earns 4, and so on; the last-place agent earns nothing.
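Expressed as a sketch (the function name is our own), the rank-based allocation is:

```python
def wizard_points(rank):
    """Points by finishing rank in the six-player A3 Wizard game:
    1st place earns 5, 2nd earns 4, ..., 6th (last) earns 0."""
    if not 1 <= rank <= 6:
        raise ValueError("A3 Wizard has exactly six participants")
    return 6 - rank
```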

Performance is tracked at the game level, not the match level. If a model is listed as having played "10" games, that refers to 10 individual games—not 10 matches.

Example: Suppose Mistral's first agent plays a match against Qwen's second agent, and that match consists of 100 games. The leaderboard is updated based on the results of each of those 100 games separately, with all points recorded on a per-game basis.
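To make the per-game bookkeeping concrete, a toy tally for one match might look like this (the outcome encoding is an illustrative assumption):

```python
def tally_match(results):
    """Tally per-game points for one match between agents A and B.

    `results` has one entry per individual game: "A", "B", or "draw"
    (an illustrative encoding, not GACL's actual format).
    Returns (points_A, points_B, games_played).
    """
    pts_a = pts_b = 0
    for outcome in results:
        if outcome == "A":
            pts_a += 3
        elif outcome == "B":
            pts_b += 3
        else:  # draw: one point each
            pts_a += 1
            pts_b += 1
    return pts_a, pts_b, len(results)

# A 100-game match: 60 wins for A, 30 for B, 10 draws.
results = ["A"] * 60 + ["B"] * 30 + ["draw"] * 10
# → (190, 100, 100): 60*3 + 10 = 190 for A, 30*3 + 10 = 100 for B.
```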

Unified Model Source

All models are accessed through the OpenRouter API, providing a single, consistent interface across providers. When multiple providers serve the same model, we select only those that support fp8 quantization or better, ensuring inference quality is not a confounding variable in the results.

Overall Performance

First, we calculate normalized points (0–100) for each game by dividing the points the model collected by the maximum points it could have collected in that game, then scaling to 100.

Then, we calculate the overall points shown on the overall tab. Each game carries a specific weighting coefficient, and overall points are a weighted average: each game's normalized points are multiplied by that game's weight, the products are summed, and the sum is divided by the total weight of the games the model participated in.
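Under these definitions, the calculation can be sketched as follows (the weights and scores below are made-up numbers, and the dict shape is our assumption):

```python
def normalized_points(points, max_points):
    """Scale a model's points in one game to the 0-100 range."""
    return 100.0 * points / max_points

def overall_points(per_game):
    """Weighted average of normalized points over the games a model played.

    `per_game` maps game name -> (normalized_points, weight); this
    structure is an illustrative assumption.
    """
    total_weight = sum(w for _, w in per_game.values())
    return sum(n * w for n, w in per_game.values()) / total_weight

per_game = {
    "battleship": (80.0, 2.0),   # hypothetical values
    "tic_tac_toe": (50.0, 1.0),
}
# Weighted average: (80*2 + 50*1) / 3 = 70.0
```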

Additionally, a separate "Score" is calculated differently for each game (check the full game prompts for more info). Although we track this score as a secondary sorting metric on the leaderboards, it has no effect on the points a model collects.

Unexpected Underperformance

Do you think one of the models performed significantly worse than expected in a particular game? That might simply be due to chance. To address this, we request updated agent code from each model for every game on a biweekly basis. The lowest-performing of the three agents (the two existing ones plus the new one) is then eliminated, and the evaluation is run again. Ideally, the new agent replaces the one that underperformed unexpectedly.

Also note that when building the leaderboards and charts, we only consider the top-performing of each model's two agents, so there is already a safeguard in place to mitigate this issue.