Demonstrating Flaws in the Illusion of Thinking¶
Abstract¶
This notebook attempts to demonstrate a few problems with the methodology in the paper titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" (Shojaee et al. 2025).
The Tower of Hanoi experiment is replicated, and the observed collapse is reproduced with various models.
Variations of the experiment demonstrate that the complexity level where the collapse occurs is sensitive to the prompt instructions regarding the output and to the token utilization of the chosen output format.
It is also confirmed that at high complexity levels, the output quality starts to degrade early into the output, long before the token budget would be exhausted.
Revision (2025-06-18):
the collapse of the length of the thinking text is also related to the token budget: the models realize that the output could get too long, so they resort to alternative plans like using simpler, pattern-based approaches or assuming a user mistake and deciding to show only segments of the moves list.
Gemini 2.5 Pro Preview 06-05 thinking lengths seem to be constant across all complexity levels and experiment variants.
The raw model responses and the processed data are available on GitHub for further investigation.
Note: Others have also pointed out problems with the experiments in a separate paper. I was unaware of that paper when I started working on this notebook. Revision (2025-06-19): though that paper was initially intended as a joke by naming Claude Opus as the main author, it actually does point out legitimate problems with the original experiments.
Revision (2025-06-18)¶
Ran the experiments for more complexity levels around the point where the apparent collapse occurs.
Plots now include the length of the thoughts of the reasoning models.
Reasoning tokens are now archived along with the failed answers in the GitHub repository.
Introduction¶
The Illusion of Thinking (Shojaee et al. 2025) paper claims that above a certain problem complexity, Large Language Models (LLMs) and Large Reasoning Models (LRMs) collapse and fail to produce accurate solutions. The observations are based on experiments where the AI models are instructed to generate a solution to various puzzle problems with different complexity levels (e.g. Tower of Hanoi with varying number of disks, etc.). A solution is represented as a sequence of moves that are to be performed in the game environment. The moves are specified in a format similar to nested JSON lists.
I suspect there might be multiple problems with the presented method which may contribute to the observed collapse, independently from the true capabilities of the language models:
Truncated responses: preliminary investigation seems to indicate that despite the clear instructions in the system prompt, some models will often still truncate the list of moves at high complexity levels where the required number of moves exceeds certain limits. Sometimes they only provide the first and the last handful of moves, accompanied by a piece of Python or other program code to generate the rest, or by a confirmation message asking whether the user really wants to have the entire long list generated in a subsequent conversation message. An iterative, paginated or streamed approach could potentially overcome this issue; however, it could significantly increase the token costs, and it would also complicate the response processing framework.
Token wasting: moreover, partly due to the token-inefficient representation of the solution moves, some of the puzzles at high complexity levels require so many tokens (numbers, commas, spaces, brackets, etc.) that they are close to the maximum number of tokens allowed for some models and APIs, especially with the presence of chain-of-thought (CoT) reasoning tokens (thinking).
Randomness: the relatively high temperature setting ($T=1.0$), though required by some models (e.g. Anthropic Claude models), might increase the chances of the sampling process selecting a problematic token which then could have a cascading effect due to the autoregressive nature of the text generation algorithm.
Additionally, allowing models to utilize techniques that humans use when solving similarly complex puzzles with pen and paper may improve the performance of the models, while the lack of such techniques may contribute to the observed collapse:
Record keeping: e.g. in the Tower of Hanoi experiment, the models are not instructed explicitly to keep track of the game state corresponding to their moves. Solving the proposed puzzles without maintaining a written record of the game state in each step could cause a similar collapse to occur even for humans, possibly at much lower complexity levels. However, this kind of record keeping would further increase the output token count, and possibly confuse some models. Also, it would further increase the willingness of the models to truncate the output.
Mistake correction: the models are not provided with any means to revise a mistake once a problematic token has been added to the output sequence. Especially for models where reasoning is not in use, being instructed to regularly review the emitted moves and append corrections if necessary may improve the accuracy.
I am going to investigate the Tower of Hanoi problem, and attempt to improve the model performance past the reported collapsing complexity levels. However, the token limits and the conditioning of the models in this regard still significantly constrain the possibilities.
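For reference, the scale of the task can be illustrated with a minimal recursive solver. This is only a sketch, not part of the experiments: the function name `solve_hanoi` and its parameters are made up for illustration, and the `[disk_id, from_peg, to_peg]` move format with pegs numbered 0 to 2 is assumed from the parser tests in the Appendix. The minimal solution for $N$ disks has $2^N - 1$ moves, so the list that a model has to print roughly doubles with every added disk.

```python
def solve_hanoi(n, source=0, target=2, spare=1):
    """Return the minimal move list for n disks as [disk_id, from_peg, to_peg]."""
    if n == 0:
        return []
    return (
        # Move the n-1 smaller disks out of the way, onto the spare peg.
        solve_hanoi(n - 1, source, spare, target)
        # Move the largest disk to its final position.
        + [[n, source, target]]
        # Move the n-1 smaller disks on top of the largest one.
        + solve_hanoi(n - 1, spare, target, source)
    )


# The number of moves grows as 2**n - 1.
for n in (3, 10, 14):
    print(n, len(solve_hanoi(n)))
```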
Models¶
claude-3-7-sonnet-20250219 by Anthropic (with and without CoT),
deepseek-chat (DeepSeek V3 as of June 2025) by DeepSeek (without CoT),
gemini-2.5-pro-preview-06-05 by Google (with CoT),
gpt-4.1-2025-04-14 by OpenAI (without CoT),
sonar-reasoning-pro by Perplexity AI (with CoT; powered by DeepSeek R1).
Tower of Hanoi Experiments¶
Each experiment will be repeated 10 times. (The original paper used 25 repeats.)
Baseline¶
A reproduction of the original experiment with a lower temperature for reduced randomness. Models and APIs will use a temperature of T=0.5 where possible. (Anthropic Claude models require T=1.0 when thinking is used.)
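The effect of the temperature on sampling can be sketched with a toy softmax. The logits below are made-up numbers, not taken from any model, and `softmax_with_temperature` is a hypothetical helper; the point is only that dividing the logits by a smaller $T$ concentrates the probability mass on the most likely token, which should make an unlucky token choice less frequent.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical values for illustration only

for temperature in (1.0, 0.5):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```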
Adjusted Prompt¶
The output format remains the same as in the original experiment, but the prompt will emphasize the requirement for the complete list of moves, as well as encourage more economical output token usage (e.g. by omitting indentation and spacing).
CSV Moves¶
The moves must be specified as rows of comma-separated values (CSV), reducing the output token usage by eliminating bracket characters and whitespace, while maintaining automatic parsability.
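A rough feel for the savings can be obtained by comparing the two representations of the same moves. The sketch below counts characters only, which is a crude proxy: the actual token counts depend on the tokenizer of each model, and the exact layout that a model chooses for the nested list may differ from the `json.dumps` output assumed here.

```python
import json

# The 7-move solution of the 3-disk puzzle as [disk_id, from_peg, to_peg].
moves = [[1, 0, 2], [2, 0, 1], [1, 2, 1], [3, 0, 2], [1, 1, 0], [2, 1, 2], [1, 0, 2]]

# Nested list representation, similar to the original output format.
nested_list_text = "moves = " + json.dumps(moves)

# CSV representation: one move per row, no brackets, no spaces.
csv_text = "\n".join(",".join(str(value) for value in move) for move in moves)

print(len(nested_list_text), len(csv_text))
```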
State Tracking, Mistake Correction¶
The moves will be specified as rows of comma-separated values (CSV), and instead of specifying only the disk to be moved, the models will be instructed to maintain the list of disks on all three pegs after each move. This information will not be validated in its entirety, but it will be used for converting the moves into the same format as the original so that the same validations can be performed.
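What such state-tracking rows look like can be sketched by replaying a list of known moves. The column names follow the header that appears in the parser tests in the Appendix; the helper `render_state_rows` itself is hypothetical and only illustrates the row layout, it is not the prompt or the code used in the experiments.

```python
def render_state_rows(n, moves):
    """Replay [disk_id, from_peg, to_peg] moves and emit one CSV row per move."""
    pegs = [list(range(n, 0, -1)), [], []]  # bottom disk first, top disk last
    rows = ["move_id,from_peg,to_peg,peg0,peg1,peg2"]
    for move_id, (disk_id, from_peg, to_peg) in enumerate(moves, start=1):
        pegs[to_peg].append(pegs[from_peg].pop())
        # Each peg is written as a space-separated list of disks.
        state = [" ".join(str(d) for d in peg) for peg in pegs]
        rows.append(",".join([str(move_id), str(from_peg), str(to_peg)] + state))
    return rows


print("\n".join(render_state_rows(3, [[1, 0, 2], [2, 0, 1], [1, 2, 1]])))
```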
The models will also be provided with an error correction mechanism by introducing a sequential identifier for each move: if a CSV row is emitted with an already used identifier, then this row will overwrite any previous ones with the same sequence number, allowing the models to change their minds after emitting a wrong move.
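The overwrite rule can be sketched in isolation as follows. This is a simplified illustration with a hypothetical helper name (`apply_corrections`) and pre-parsed `(move_id, from_peg, to_peg)` tuples; the parser in the Appendix works on raw CSV text and also derives the moved disk from the peg state.

```python
def apply_corrections(rows):
    """Collapse (move_id, from_peg, to_peg) rows; a reused move_id overwrites."""
    moves = []
    first_correction = -1  # index of the first overwritten move, -1 if none
    for move_id, from_peg, to_peg in rows:
        index = move_id - 1  # move identifiers start at 1
        if index < len(moves):
            moves[index] = (from_peg, to_peg)
            if first_correction == -1:
                first_correction = index
        else:
            moves.append((from_peg, to_peg))
    return moves, first_correction


# Moves 5 and 6 are emitted twice; the later rows replace the earlier ones.
rows = [
    (1, 0, 2), (2, 0, 1), (3, 2, 1), (4, 0, 2),
    (5, 2, 0), (6, 1, 0),
    (5, 1, 0), (6, 1, 2), (7, 0, 2),
]
print(apply_corrections(rows))
```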
Results¶
The baseline experiment successfully reproduced the collapse reported in the paper.
The adjusted prompt reduced the average accuracy of Sonar Reasoning Pro and the non-thinking version of Claude Sonnet 3.7, but it deferred the first invalid move for most models at high complexity levels. It also caused a noticeable improvement in the performance of the Gemini 2.5 Pro Preview and the DeepSeek V3 models.
Using the more token-efficient CSV format seems to further improve the performance of most models, both in terms of accuracy and the position of the first invalid move.
Game state tracking and its increased token usage had a negative effect on the performance of all models, while the provided self-correction mechanism remained under-utilized.
Manual inspection of some of the model responses which failed to produce a complete list of moves and to achieve the goal state indicates the occurrence of different problems:
Move lists often still get truncated at high complexity levels. The emphasis in the prompt on requiring the move list in its entirety often fails to achieve its goal, and instead encourages the models to include lengthy apologies for truncating the output despite the instructions.
In multiple cases, the models emit invalid moves early in their final answers at high complexity levels, long before the length of the output sequence would approach the allowed token budget. This is observed both in the presence and absence of CoT tokens.
Revision (2025-06-18): the collapse of the length of the thinking text is also related to the token budget: the models realize that the output could get too long if they try to manually trace the algorithm (especially when they choose to take somewhat lengthy notes along each step), so they abandon that plan and resort to alternative approaches, like using simpler, pattern-based approaches or assuming a user mistake and deciding to show only segments of the moves list.
Gemini 2.5 Pro Preview 06-05 however seems to use a constant thinking length across all complexity levels and experiment variants.
Revision (2025-06-19): once the answer length pushes models to cut corners and use a pattern-based approach rather than risking token budget exhaustion via tracing the correct algorithm, the parity of the number of disks starts to affect the results.
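Why parity matters for a pattern-based approach can be illustrated with the well-known iterative solution, where the smallest disk moves in a fixed cyclic direction on every other step, and that direction depends on whether the number of disks is odd or even. The sketch below is only an illustration of this dependence; it is an assumption that the shortcuts taken by the models resemble it, and the names `solve_iteratively` and `flip_direction` are made up.

```python
def solve_iteratively(n, flip_direction=False):
    """Pattern-based solver; returns (moves, final_pegs)."""
    pegs = [list(range(n, 0, -1)), [], []]
    # Even n: the smallest disk cycles 0 -> 1 -> 2; odd n: 0 -> 2 -> 1.
    step = 1 if n % 2 == 0 else -1
    if flip_direction:
        step = -step  # simulate picking the pattern for the wrong parity
    moves = []
    small = 0  # the peg currently holding disk 1
    total = 2 ** n - 1
    while len(moves) < total:
        # Odd steps: move the smallest disk in its fixed direction.
        target = (small + step) % 3
        pegs[target].append(pegs[small].pop())
        moves.append([1, small, target])
        small = target
        if len(moves) == total:
            break
        # Even steps: make the only legal move between the other two pegs.
        a, b = [p for p in range(3) if p != small]
        if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
            a, b = b, a
        disk = pegs[a].pop()
        pegs[b].append(disk)
        moves.append([disk, a, b])
    return moves, pegs


print(solve_iteratively(3)[1])                       # tower ends on peg 2
print(solve_iteratively(3, flip_direction=True)[1])  # tower ends on peg 1
```

With the direction chosen for the wrong parity, every individual move is still legal, but the tower is rebuilt on the wrong peg, so the goal state is not reached.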
Formal testing of the statistical significance of the observed performance differences, maybe on a larger sample size, is left for the reader.
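As a possible starting point, the success counts of two experiment variants at a given complexity level could be compared with a one-sided Fisher's exact test. The sketch below uses only the standard library; the function name is hypothetical, and the counts are made-up placeholders rather than measured results. It assumes the same number of repeats for both variants.

```python
from math import comb


def one_sided_fisher_p(successes_a, successes_b, repeats):
    """
    Probability of variant B having at least successes_b of the pooled
    successes if the variant had no effect (hypergeometric upper tail).
    """
    total_successes = successes_a + successes_b
    total_runs = 2 * repeats
    upper = min(repeats, total_successes)
    tail = sum(
        comb(total_successes, k) * comb(total_runs - total_successes, repeats - k)
        for k in range(successes_b, upper + 1)
    )
    return tail / comb(total_runs, repeats)


# Hypothetical counts, NOT measured results: 2/10 vs. 8/10 valid solutions.
p_value = one_sided_fisher_p(successes_a=2, successes_b=8, repeats=10)
print(f"{p_value:.4f}")
```

With only 10 repeats per configuration, even large differences yield modest p-values, and comparing many models and complexity levels would also call for a multiple-comparison correction.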
The raw model responses and the processed data are available in the GitHub repository, in the cache and the data folders respectively.
try:
    plot_hanoi_results(
        hanoi_base_results,
        hanoi_adj_results,
        hanoi_csv_results,
        hanoi_trck_results,
    )
except NameError:
    # The plotting function and the result tables are defined in the Appendix.
    print("Run all the blocks in the Appendix: Code section first.")
Conclusion¶
It has been demonstrated that the problem complexity level where the performance collapse occurs is sensitive to the format specification of the output, and that it can be pushed higher by using a more token-efficient output format. Therefore, the collapse is not likely to mark the limit of the capacity of the models; instead, it marks the limit of the output length conditioning of the models and of the token budget.
Appendix: Code¶
Dependencies¶
# !pip install matplotlib==3.10.0
# !pip install numpy==2.2.3
# !pip install pandas==2.2.3
# !pip install requests==2.32.3
import collections as coll
import collections.abc as collabc
import functools
import gzip
import hashlib
import html.parser
import json
import os
import os.path
import random
import re
import sys
import time
import typing
import urllib.parse
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
API keys¶
My personal API keys are not included in the public repository, so generating new model responses will require setting these up. See the api-keys.json.example file for the details. Note however that the notebook can be run without any API keys if the cache directory from the GitHub repository is available.
api_keys_filename = "api-keys.json"

if not os.path.isfile(api_keys_filename):
    raise RuntimeError(f"API keys file not found: {api_keys_filename!r}")

with open(api_keys_filename, "r") as f:
    api_keys = json.load(f)

print("API keys: " + ", ".join(sorted(api_keys.keys())))
API keys: anthropic, deepseek, google, openai, perplexity
Common Utilities¶
This block contains a convenience function for sending the same system and user prompts to all the models, as well as various cached HTTP request related utilities.
Caching all the requests and responses makes debugging and re-running the notebook easier and quicker, but sensitive and potentially sensitive data like API keys and various identifiers need to be removed from the cached data so that they are safe to be published on GitHub.
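The caching idea on its own can be sketched as follows: the result of an expensive call is stored as gzip-compressed JSON, and later runs read the file instead of repeating the call. The helper name `cached_call` and the demo values are hypothetical; the actual utilities below additionally strip sensitive headers and body fields before anything is written to disk.

```python
import gzip
import json
import os
import tempfile


def cached_call(cache_filename, compute_fn):
    """Return the cached JSON result if present, else compute and store it."""
    if os.path.isfile(cache_filename):
        with gzip.open(cache_filename, "rt") as f:
            return json.load(f)
    result = compute_fn()
    os.makedirs(os.path.dirname(cache_filename), exist_ok=True)
    with gzip.open(cache_filename, "wt") as f:
        json.dump(result, f)
    return result


calls = []  # records how many times the expensive function actually ran


def expensive_request():
    calls.append(1)
    return {"answer": 42}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cache", "demo.json.gz")
    first = cached_call(path, expensive_request)
    second = cached_call(path, expensive_request)  # served from the file

print(first == second, len(calls))
```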
MAX_OUT_TOKENS = 64000
MAX_REASONING_TOKENS = 32000
TEMPERATURE = 0.5
MODELS = {
    "sonnet": "claude-3-7-sonnet-20250219",
    "deepseek": "deepseek-chat",    # DeepSeek-V3 as of Jun 2025
    "gemini": "gemini-2.5-pro-preview-06-05",
    "gpt4": "gpt-4.1-2025-04-14",
    "perplexity": "sonar-reasoning-pro",
}

# Query functions are registered by the API client blocks below.
MODEL_FN = {}

# Reasoning budgets to test for each model (0 means no reasoning).
MODEL_R = {
    "sonnet": [0, MAX_REASONING_TOKENS],
    "deepseek": [0],
    "gemini": [MAX_REASONING_TOKENS],
    "gpt4": [0],
    "perplexity": [MAX_REASONING_TOKENS],
}


def query_all(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
):
    for model_name, query_fn in MODEL_FN.items():
        for reasoning_budget in MODEL_R[model_name]:
            response, thoughts = query_fn(
                experiment_name,
                system_prompt,
                user_prompt,
                temperature,
                max_out_tokens,
                reasoning_budget,
            )

            yield MODELS[model_name], reasoning_budget, response, thoughts


def send_cached_post_request(
        cache_filename: str,
        url: str,
        request_headers: collabc.Mapping,
        request_body: collabc.Mapping,
        sensitive_headers: collabc.Container=(),
        sensitive_body_fields: collabc.Container=(),
):
    sensitive_headers = {h.lower() for h in sensitive_headers}
    sensitive_body_fields = {f.lower() for f in sensitive_body_fields}

    cache_dir = os.path.dirname(cache_filename)
    os.makedirs(cache_dir, exist_ok=True)

    if os.path.isfile(cache_filename):
        with gzip.open(cache_filename, "rt") as f:
            return json.load(f)

    try:
        response = requests.post(url, headers=request_headers, json=request_body)
        response.raise_for_status()

        # Only the sanitized request and response are written to the cache.
        result = {
            "request": {
                "headers": del_items(request_headers, sensitive_headers),
                "body": del_items(request_body, sensitive_body_fields),
            },
            "response": {
                "headers": del_items(response.headers, sensitive_headers),
                "body": del_items(response.json(), sensitive_body_fields),
            }
        }

        with gzip.open(cache_filename, "wt", compresslevel=9) as f:
            json.dump(result, f, indent=2)

        return result

    except Exception as exc:
        print(f"Exception: ({type(exc)}) {exc}")

        if hasattr(exc, "response") and exc.response is not None:
            print(f"Response status code: {exc.response.status_code}")
            print(f"Response body: {exc.response.text}")

        raise


def build_cache_filename(experiment_name: str, model_name: str, temperature: float):
    experiment_name = experiment_name.strip()
    experiment_dir = os.path.dirname(experiment_name)
    experiment_file = os.path.basename(experiment_name)

    if experiment_dir == "":
        experiment_dir = experiment_file

    return os.path.join(
        "cache",
        experiment_dir,
        (f"{experiment_file}-{model_name}-t{temperature:.3f}".replace(".", "_")) + ".json.gz",
    )


def get_item(container, path: str, default=None):
    """
    Extract data from nested dicts and lists based on a dot-separated
    path string. See test_get_item() for examples.
    """

    if path == "." or path == "":
        return container

    path = path.split(".")

    for key in path:
        if isinstance(container, collabc.Mapping):
            if key in container:
                container = container[key]
            else:
                return default
        elif isinstance(container, collabc.Sequence):
            if int(key) < len(container):
                container = container[int(key)]
            else:
                return default
        else:
            return default

    return container


def del_items(container, patterns: typing.Iterable[str]):
    """
    Return a copy of a nested dicts and lists object with the
    values matching the given set of dot-separated paths removed.
    The "*" character acts as a wildcard. See test_del_items()
    for examples.
    """

    def should_include(path: list, exclude_patterns: typing.List[tuple]) -> bool:
        return not any(path_matches_pattern(path, ptrn) for ptrn in exclude_patterns)

    def copy_recursive(obj, path: list, exclude_patterns: typing.List[tuple]):
        if isinstance(obj, str):
            return obj

        if isinstance(obj, collabc.Mapping):
            copy = {}

            for k, v in obj.items():
                path_ext = path + [k]

                if should_include(path_ext, exclude_patterns):
                    copy[k] = copy_recursive(v, path_ext, exclude_patterns)

            return copy

        if isinstance(obj, collabc.Sequence):
            copy = []

            for k, v in enumerate(obj):
                path_ext = path + [str(k)]

                if should_include(path_ext, exclude_patterns):
                    copy.append(copy_recursive(v, path_ext, exclude_patterns))

            return copy

        return obj

    # Materialize the patterns so that any iterable can be passed in.
    patterns = list(patterns)

    for pattern in patterns:
        if pattern == "." or pattern == "":
            # The error must be raised; returning it would silently hand
            # back an exception object instead of the sanitized copy.
            raise ValueError(f"Invalid pattern; {pattern=!r}")

    patterns = [tuple(pattern.lower().split(".")) for pattern in patterns]

    return copy_recursive(container, [], patterns)


def path_matches_pattern(path: collabc.Sequence, pattern: collabc.Sequence) -> bool:
    if len(path) != len(pattern):
        return False

    for path_component, pattern_component in zip(path, pattern):
        matches = (
            pattern_component == "*"
            or pattern_component == path_component.lower()
        )

        if not matches:
            return False

    return True


def split_lines(text: str) -> list:
    """
    Normalize line-breaks (Windows, Linux, Mac, etc.) then split
    the given text into separate lines.
    """

    return (
        text.replace("\r\n", "\n")
            .replace("\r", "\n")
            .strip()
            .split("\n")
    )


def test_get_item():
    container = {"aaa": [{"bbb": "42", "ccc": "123"}]}

    assert_eq("42", get_item(container, "aaa.0.bbb"))
    assert_eq(None, get_item(container, "aaa.2.zzz"))


def test_del_items():
    container = {"aaa": [{"bbb": "42", "ccc": "123", "ddd": "hello"}]}

    assert_eq({"aaa": [{"ddd": "hello"}]}, del_items(container, ["aaa.*.ccc", "*.*.bbb", "zzz"]))


def assert_eq(a, b):
    assert a == b, f"Failed to assert that a = b; {a=!r}, {b=!r}"


test_get_item()
test_del_items()
API Clients¶
Anthropic Claude Client¶
def query_claude_sonnet(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=MAX_REASONING_TOKENS,
):
    # https://docs.anthropic.com/en/api/messages

    model_name = MODELS["sonnet"]
    suffix = "-nothink"
    thinking = {"type": "disabled"}

    # https://console.anthropic.com/settings/limits
    max_out_tokens = min(64000, max_out_tokens)

    if reasoning_budget > 0:
        # Thinking requires temperature to be exactly 1.
        temperature = 1

        reasoning_budget = min(int(max_out_tokens * 0.7) + 1, reasoning_budget)
        suffix = "-think"
        thinking = {
            "type": "enabled",
            "budget_tokens": reasoning_budget,
        }

    cache_filename = build_cache_filename(experiment_name, model_name + suffix, temperature)

    request_headers = {
        "x-api-key": api_keys["anthropic"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json"
    }
    request_body = {
        "model": model_name,
        "max_tokens": max_out_tokens,
        "temperature": temperature,
        "stream": False,
        "system": system_prompt,
        "thinking": thinking,
        "messages": [
            {"role": "user", "content": user_prompt}
        ]
    }

    result = send_cached_post_request(
        cache_filename,
        "https://api.anthropic.com/v1/messages",
        request_headers,
        request_body,
        sensitive_headers=["x-api-key", "anthropic-organization-id", "request-id", "CF-RAY"],
        sensitive_body_fields=["id"],
    )

    text = None
    thoughts = None

    for content in get_item(result, "response.body.content"):
        content_type = get_item(content, "type")

        if content_type == "text":
            text = content["text"]
        elif content_type == "thinking":
            thoughts = content["thinking"]

    return text, thoughts


MODEL_FN["sonnet"] = query_claude_sonnet
DeepSeek Client¶
def query_deepseek(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=MAX_REASONING_TOKENS,
):
    # https://api-docs.deepseek.com/api/create-chat-completion

    if reasoning_budget > 0:
        raise NotImplementedError()

    max_out_tokens = min(8192, max_out_tokens)
    model_name = MODELS["deepseek"]
    cache_filename = build_cache_filename(experiment_name, model_name + "-nothink", temperature)

    request_headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_keys["deepseek"],
    }
    request_body = {
        "model": model_name,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_out_tokens,
        "response_format": {"type": "text"},
        "stream": False,
        "temperature": temperature,
    }

    result = send_cached_post_request(
        cache_filename,
        "https://api.deepseek.com/chat/completions",
        request_headers,
        request_body,
        sensitive_headers=["Authorization", "Set-Cookie", "x-ds-trace-id", "CF-RAY"],
        sensitive_body_fields=["id"],
    )

    text = None
    thoughts = None

    for choice in get_item(result, "response.body.choices"):
        if get_item(choice, "message.role") == "assistant":
            text = get_item(choice, "message.content")

    return text, thoughts


MODEL_FN["deepseek"] = query_deepseek
Google Gemini Client¶
def query_gemini(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=MAX_REASONING_TOKENS,
        system_prompt_key: str="systemInstruction",
):
    # https://ai.google.dev/gemini-api/docs/text-generation
    # https://ai.google.dev/api/generate-content#method:-models.generatecontent

    reasoning_budget = min(32768, reasoning_budget)
    max_out_tokens = max(reasoning_budget + 128, max_out_tokens)

    model_name = MODELS["gemini"]
    suffix = "-nothink"
    thinking = {
        "includeThoughts": False,
        "thinkingBudget": 0,
    }

    if reasoning_budget > 0:
        suffix = "-think"
        thinking = {
            "includeThoughts": True,
            "thinkingBudget": reasoning_budget,
        }

    cache_filename = build_cache_filename(experiment_name, model_name, temperature)

    request_headers = {
        "Content-Type": "application/json",
    }
    request_body = {
        system_prompt_key: {
            "parts": [{"text": system_prompt}],
        },
        "contents": [
            {"parts": [{"text": user_prompt}]},
        ],
        "generationConfig": {
            "temperature": temperature,
            "maxOutputTokens": max_out_tokens,
            "responseModalities": ["text"],
            "thinkingConfig": thinking,
        },
    }
    url = "".join(
        (
            "https://generativelanguage.googleapis.com/v1beta/models/",
            urllib.parse.quote_plus(model_name),
            ":generateContent?key=",
            urllib.parse.quote_plus(api_keys["google"]),
        )
    )

    result = send_cached_post_request(
        cache_filename,
        url,
        request_headers,
        request_body,
        sensitive_headers=[],
        sensitive_body_fields=[],
    )

    text = None
    thoughts = None

    for candidate in get_item(result, "response.body.candidates"):
        if get_item(candidate["content"], "role") == "model":
            for part in get_item(candidate, "content.parts"):
                part_text = get_item(part, "text")

                if part_text is not None:
                    if get_item(part, "thought"):
                        thoughts = part_text
                    else:
                        text = part_text

    return text, thoughts


MODEL_FN["gemini"] = query_gemini
As of June 2025, some parts of the Gemini API documentation use snake_case for the system prompt field, while other parts use camelCase. The code below tries both in order to see whether either or both are actually accepted by the API.
print("# system_instruction:")
print(
    query_gemini(
        'pirate-snake_case',
        "Talk like a pirate.",
        "Explain in one brief sentence why the sky is blue.",
        system_prompt_key="system_instruction",
    )[0]
)
print("")
print("# systemInstruction:")
print(
    query_gemini(
        'pirate-camelCase',
        "Talk like a pirate.",
        "Explain in one brief sentence why the sky is blue.",
        system_prompt_key="systemInstruction",
    )[0]
)
# system_instruction:
Arrr, the air o' this world scatters the sun's golden light and sends the blue bits a-flyin' all across the sky

# systemInstruction:
Arrr, the air above scatters the sun's blue light 'round the heavens more than the other colors, and that be what yer eyes catch, matey
OpenAI Client¶
def query_openai(
        model_name: str,
        accepts_temperature: bool,
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=MAX_REASONING_TOKENS,
):
    # https://platform.openai.com/docs/guides/text?api-mode=responses
    # https://platform.openai.com/docs/api-reference/responses/create

    if reasoning_budget > 0:
        raise NotImplementedError()

    cache_filename = build_cache_filename(experiment_name, model_name + "-nothink", temperature)

    request_headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_keys["openai"],
    }
    request_body = {
        "model": model_name,
        "max_output_tokens": max_out_tokens,
        "input": [
            {"role": "developer", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
    }

    if accepts_temperature:
        request_body["temperature"] = temperature

    result = send_cached_post_request(
        cache_filename,
        "https://api.openai.com/v1/responses",
        request_headers,
        request_body,
        sensitive_headers=["Authorization", "openai-organization", "openai-project", "x-request-id", "Set-Cookie", "CF-RAY"],
        sensitive_body_fields=["id", "output.*.id"],
    )

    text = None
    thoughts = None

    for output in get_item(result, "response.body.output"):
        if get_item(output, "type") == "message" and get_item(output, "role") == "assistant":
            for content in get_item(output, "content", []):
                if get_item(content, "type") == "output_text":
                    text = get_item(content, "text")

    return text, thoughts


query_gpt4 = functools.partial(query_openai, MODELS["gpt4"], True)

MODEL_FN["gpt4"] = query_gpt4
Perplexity AI Client¶
def query_perplexity(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=MAX_REASONING_TOKENS,
):
    # https://docs.perplexity.ai/guides/getting-started
    # https://docs.perplexity.ai/api-reference/chat-completions

    model_name = MODELS["perplexity"]
    cache_filename = build_cache_filename(experiment_name, model_name + "-think", temperature)

    request_headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "Authorization": "Bearer " + api_keys["perplexity"],
    }
    request_body = {
        "model": model_name,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_out_tokens,
        "temperature": temperature,
        "return_related_questions": False,
        "stream": False,
        "web_search_options": {
            "search_context_size": "low",
        },
    }

    result = send_cached_post_request(
        cache_filename,
        "https://api.perplexity.ai/chat/completions",
        request_headers,
        request_body,
        sensitive_headers=["Authorization", "Set-Cookie", "CF-RAY", ],
        sensitive_body_fields=["id"],
    )

    text = None
    thoughts = None

    for choice in get_item(result, "response.body.choices"):
        if get_item(choice, "message.role") == "assistant":
            # The reasoning is returned inline, wrapped in <think>...</think>.
            response = get_item(choice, "message.content").split("</think>", 1)

            if len(response) == 1:
                text = response[0]
            elif len(response) == 2:
                thoughts = response[0]

                if thoughts.startswith("<think>"):
                    thoughts = thoughts[7:]

                text = response[1]

    return text, thoughts


MODEL_FN["perplexity"] = query_perplexity
Tower of Hanoi¶
def solve_tower_of_hanoi(experiment_name_tpl, system_prompt, user_prompt_tpl, temperature, repeats, parser_fn):
    backlog = []
    results = {
        "experiment": [],
        "n": [],
        "i": [],
        "model": [],
        "reasoning_budget": [],
        "temperature": [],
        "response_chars": [],
        "thoughts_chars": [],
        "first_correction": [],
        "valid_prefix_length": [],
        "is_valid_solution": [],
    }

    for n in [14, 12, 11, 10, 9, 8, 7, 5, 3]:
        for i in range(repeats):
            experiment_name = os.path.join(experiment_name_tpl, f"{experiment_name_tpl}-{n}-{i}")
            backlog.append((experiment_name, 0, n, i))

    while len(backlog) > 0:
        experiment_name, tries, n, i = backlog.pop(0)

        try:
            user_prompt = user_prompt_tpl.replace("$N$", str(n))
            responses = query_all(experiment_name, system_prompt, user_prompt, temperature=temperature)

            for model_name, reasoning_budget, response, thoughts in responses:
                response = str(response)
                thoughts = str(thoughts)

                log_1 = f"# {model_name=!r}, {reasoning_budget=}, {tries=}, {n=}, {i=}"
                print(f"# {len(backlog)=}")
                print(log_1)

                response_chars = len(response)
                thoughts_chars = len(thoughts)
                moves, first_correction = parser_fn(response)
                valid_prefix_length, is_valid_solution = evaluate_moves(n, moves)

                log_2 = f"# {valid_prefix_length=}, {is_valid_solution=}, {first_correction=}"
                print(log_2)
                print("")

                if not is_valid_solution:
                    # Archive the failed answer along with the reasoning text.
                    think = "think" if reasoning_budget > 0 else "nothink"
                    filename_base = f"invalid_answer-{experiment_name_tpl}-{model_name}-{think}-n{n}-i{i}-t{temperature:.1f}"
                    filename_base = filename_base.replace(".", "_") + ".txt"
                    invalid_answer_filename = os.path.join(
                        "data",
                        f"hanoi-invalid-answers-{experiment_name_tpl}",
                        filename_base,
                    )
                    os.makedirs(os.path.dirname(invalid_answer_filename), exist_ok=True)

                    with open(invalid_answer_filename, "w") as f:
                        print(log_1, file=f)
                        print(log_2, file=f)
                        print("", file=f)
                        print("<think>", file=f)
                        print(thoughts, file=f)
                        print("</think>", file=f)
                        print("", file=f)
                        print(response, file=f)
                        print("", file=f)

                results["experiment"].append(experiment_name_tpl)
                results["n"].append(n)
                results["i"].append(i)
                results["model"].append(model_name)
                results["reasoning_budget"].append(reasoning_budget)
                results["temperature"].append(temperature)
                results["response_chars"].append(response_chars)
                results["thoughts_chars"].append(thoughts_chars)
                results["first_correction"].append(first_correction)
                results["valid_prefix_length"].append(valid_prefix_length)
                results["is_valid_solution"].append(is_valid_solution)

        except AssertionError:
            raise

        except Exception as exc:
            print(f"  Exception ({tries=}): ({type(exc)}) {exc}")

            if hasattr(exc, "response") and exc.response is not None:
                print(f"    Response status code: {exc.response.status_code}")
                print(f"    Response body: {exc.response.text}")

            # Put the failed experiment back to the end of the queue.
            backlog.append((experiment_name, tries + 1, n, i))
            time.sleep(max(3, min(5, tries)))

    results_df = pd.DataFrame(results)
    results_df.to_csv(os.path.join("data", f"{experiment_name_tpl}.csv"), index=False)

    return results_df


def parse_tower_of_hanoi_moves_orig(response):
    moves = []
    parts = re.split(
        r" *moves *= *\[ *",
        response
            .replace("\t", " ")
            .replace("\r", "\n")
    )

    # Only the last "moves = [...]" list in the response is kept.
    for part in parts[1:]:
        moves = []
        stack = [moves]
        num = ""

        for line in part.split("\n"):
            line = line.split("#", 1)[0]

            for char in line:
                if len(stack) < 1:
                    break

                stack_top = stack[-1]

                if char == "[":
                    new_list = []
                    stack_top.append(new_list)
                    stack.append(new_list)
                    num = ""
                elif char == "]":
                    if num != "":
                        stack_top.append(int(num))
                        num = ""

                    stack.pop()
                elif char.isnumeric():
                    num += char
                else:
                    if num != "":
                        stack_top.append(int(num))
                        num = ""

            if len(stack) < 1:
                break

        if len(stack) > 0 and num != "":
            stack_top = stack[-1]
            stack_top.append(int(num))
            num = ""

    return moves, -1


def parse_tower_of_hanoi_moves_csv(response):
    parts = response.lower().replace("\r", "\n").split("```")
    moves = []
    is_code_block = len(parts) != 1

    for part in parts:
        is_code_block = not is_code_block

        if not is_code_block:
            continue

        moves_in_part = []

        for line in part.split("\n"):
            cols = [c.strip() for c in line.strip().split(",")]

            if len(cols) != 3 or not all(c.strip().isnumeric() for c in cols):
                continue

            moves_in_part.append([int(c.strip()) for c in cols])

        if len(moves_in_part) > 0:
            moves = moves_in_part

    return moves, -1


def parse_tower_of_hanoi_moves_csv_state(response):
    parts = response.lower().replace("\r", "\n").split("```")
    moves = []
    first_correction = -1
    is_code_block = len(parts) != 1

    for part in parts:
        is_code_block = not is_code_block

        if not is_code_block:
            continue

        moves_in_part = []

        for line in part.split("\n"):
            cols = [c.strip() for c in line.strip().split(",")]

            if len(cols) != 6 or not all(c.strip().isnumeric() for c in cols[0:3]):
                continue

            move_id = int(cols[0].strip()) - 1
            from_peg = int(cols[1].strip())
            to_peg = int(cols[2].strip())
            pegs = [
                [int(d.strip()) for d in c.split(" ") if d.strip().isnumeric()]
                for c in cols[3:]
            ]

            # Only using from the puzzle state what's necessary - it is meant
            # mostly to help the model. However, invalid state tracking will
            # produce invalid moves anyways.
            disk_id = (
                pegs[to_peg][-1]
                if to_peg >= 0 and to_peg <= 2 and len(pegs[to_peg]) > 0
                else 0
            )

            if move_id < len(moves_in_part):
                moves_in_part[move_id] = [disk_id, from_peg, to_peg]

                if first_correction == -1:
                    first_correction = move_id
            else:
                moves_in_part.append([disk_id, from_peg, to_peg])

        if len(moves_in_part) > 0:
            moves = moves_in_part

    return moves, first_correction


def evaluate_moves(n, moves):
    pegs = [
        list(range(n, 0, -1)),
        [],
        [],
    ]
    valid_prefix_length = 0

    for move in moves:
        if not isinstance(move, list):
            continue

        if not (isinstance(move, list) and len(move) == 3):
            break

        disk_id, from_peg, to_peg = move

        if from_peg < 0 or from_peg > 2:
            break

        if to_peg < 0 or to_peg > 2:
            break

        if len(pegs[from_peg]) < 1:
            break

        if pegs[from_peg][-1] != disk_id:
            break

        if from_peg == to_peg:
            # Nobody said explicitly to use the minimal number of moves.
            continue

        if len(pegs[to_peg]) > 0 and pegs[to_peg][-1] < disk_id:
            break

        pegs[from_peg].pop()
        pegs[to_peg].append(disk_id)
        valid_prefix_length += 1

    is_valid_solution = (
        len(pegs[0]) == 0
        and len(pegs[1]) == 0
        and pegs[2] == list(range(n, 0, -1))
    )

    return valid_prefix_length, 1 if is_valid_solution else 0
def test_parse_tower_of_hanoi_moves():
    response_0 = "nope"
    moves_0 = []
    first_correction_0 = -1
    response_1 = """\
Here's a list of moves:
```
moves = [[1, 0, 2], [2, 0, 1], [1, 2, 1], [3, 0, 2],
[1, 1, 0], [2, 1, 2], [1, 0, 99]]
```
No, wait, I made a mistake, here's a better one:
```
moves = [[1, 0, 2], [2, 0, 1], [1, 2, 1], [3, 0, 2],
[1, 1, 0], [2, 1, 2], [1, 0, 2]]
```
"""
    moves_1 = [
        [1, 0, 2], [2, 0, 1], [1, 2, 1], [3, 0, 2],
        [1, 1, 0], [2, 1, 2], [1, 0, 2],
    ]
    first_correction_1 = -1
    response_2 = """\
```csv
move_id,from_peg,to_peg,peg0,peg1,peg2
1,0,2,3 2,,1
2,0,1,3,2,1
3,2,1,3,2 1,
4,0,2,,2 1,3
5,2,0,1,2,3
6,1,0,1,,3 2
5,1,0,1,2,3
6,1,2,1,,3 2
7,0,2,,,3 2 1
```
"""
    moves_2 = [
        [1, 0, 2], [2, 0, 1], [1, 2, 1], [3, 0, 2],
        [1, 1, 0], [2, 1, 2], [1, 0, 2],
    ]
    first_correction_2 = 4
    response_3 = """\
```
disk_id,from_peg,to_peg
1,0,2
2,0,1
1,2,1
3,0,2
1,1,0
2,1,2
1,0,2
```
"""
    moves_3 = [
        [1, 0, 2], [2, 0, 1], [1, 2, 1], [3, 0, 2],
        [1, 1, 0], [2, 1, 2], [1, 0, 2],
    ]
    first_correction_3 = -1
    response_4 = """\
```
move_id,from_peg,to_peg,peg0,peg1,peg2
1,0,2,14 13 12 11 10 9 8 7 6 5 4 3 2,,1
2,0,1,14 13 12 11 10 9 8 7 6 5 4 3,2,1
3,2,1,14 13 12 11 10 9 8 7 6 5 4 3,2 1,
```
"""
    moves_4 = [
        [1, 0, 2],
        [2, 0, 1],
        [1, 2, 1],
    ]
    first_correction_4 = -1
    assert_eq((moves_0, first_correction_0), parse_tower_of_hanoi_moves_orig(response_0))
    assert_eq((moves_1, first_correction_1), parse_tower_of_hanoi_moves_orig(response_1))
    assert_eq((moves_2, first_correction_2), parse_tower_of_hanoi_moves_csv_state(response_2))
    assert_eq((moves_3, first_correction_3), parse_tower_of_hanoi_moves_csv(response_3))
    assert_eq((moves_4, first_correction_4), parse_tower_of_hanoi_moves_csv_state(response_4))
def test_evaluate_moves():
    moves_0 = []
    moves_1 = [
        [1, 0, 2], [2, 0, 1], [1, 2, 1], [3, 0, 2],
        [1, 1, 0], [2, 1, 2], [1, 0, 2],
    ]
    moves_2 = [
        [1, 0, 2],
        [2, 0, 1],
        123,  # ignored
        [1, 2, 1],
        [3, 0, 2],
        [1, 1, 0],
        [2, 1, 0],
    ]
    moves_4 = [
        [1, 0, 2],
        [2, 0, 1],
        [1, 2, 1],
    ]
    assert_eq((0, False), evaluate_moves(3, moves_0))
    assert_eq((7, True), evaluate_moves(3, moves_1))
    assert_eq((5, False), evaluate_moves(3, moves_2))
    assert_eq((3, False), evaluate_moves(14, moves_4))
test_parse_tower_of_hanoi_moves()
test_evaluate_moves()
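As a complementary sanity check for the `[disk id, from peg, to peg]` move format, a known-good move list can be generated with the textbook recursive algorithm and replayed against the puzzle rules. The sketch below is illustrative only: `reference_hanoi_moves` and `replay` are hypothetical helpers that are not part of this notebook, and `replay` is a simplified stand-in for `evaluate_moves` so that the block runs on its own.

```python
def reference_hanoi_moves(n, source=0, target=2, spare=1):
    """Return a minimal solution as [disk_id, from_peg, to_peg] triples.

    Hypothetical helper (not from the notebook): classic recursion that moves
    n - 1 disks out of the way, moves disk n, then moves the n - 1 disks back.
    """
    if n == 0:
        return []
    return (
        reference_hanoi_moves(n - 1, source, spare, target)
        + [[n, source, target]]
        + reference_hanoi_moves(n - 1, spare, target, source)
    )


def replay(n, moves):
    """Apply moves to a fresh n-disk puzzle.

    Returns the final pegs, or None as soon as a move breaks the rules.
    Simplified stand-in for evaluate_moves; assumes pegs are in 0..2.
    """
    pegs = [list(range(n, 0, -1)), [], []]
    for disk_id, from_peg, to_peg in moves:
        # The moved disk must be on top of the source peg.
        if not pegs[from_peg] or pegs[from_peg][-1] != disk_id:
            return None
        # A larger disk may not be placed on a smaller one.
        if pegs[to_peg] and pegs[to_peg][-1] < disk_id:
            return None
        pegs[to_peg].append(pegs[from_peg].pop())
    return pegs


moves = reference_hanoi_moves(3)
print(moves)
print(replay(3, moves))
```

For three disks this reproduces the seven-move example used in the prompts, which makes it a convenient fixture when checking a parser or an evaluator against larger `n`.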
Baseline
hanoi_base_sys = """\
You are a helpful assistant. Solve this puzzle for me.
There are three pegs and $n$ disks of different sizes stacked on the first peg. \
The disks are numbered from 1 (smallest) to $n$ (largest). Disk moves in this \
puzzle should follow:
1. Only one disk can be moved at a time.
2. Each move consists of taking the upper disk from one stack and placing it on \
top of another stack.
3. A larger disk may not be placed on top of a smaller disk.
The goal is to move the entire stack to the third peg.
**Example**: With 3 disks numbered 1 (smallest), 2, and 3 (largest), the initial \
state is `[[3, 2, 1], [], []]`, and a solution might be:
```
moves = [[1, 0, 2], [2, 0, 1], [1, 2, 1], [3, 0, 2],
[1, 1, 0], [2, 1, 2], [1, 0, 2]]
```
This means: Move disk 1 from peg 0 to peg 2, then move disk 2 from peg 0 to peg 1, \
and so on.
**Requirements**:
- When exploring potential solutions in your thinking process, always include the \
corresponding complete list of moves.
- The positions are 0-indexed (the leftmost peg is 0).
- Ensure your final answer includes the complete list of moves in the format:
```
moves = [[disk id, from peg, to peg], ...]
```
"""
hanoi_base_usr_tpl = """\
I have a puzzle with $N$ disks of different sizes with
**Initial configuration**:
- Peg 0: $N$ (bottom), ... 2, 1 (top)
- Peg 1: (empty)
- Peg 2: (empty)
**Goal configuration**:
- Peg 0: (empty)
- Peg 1: (empty)
- Peg 2: $N$ (bottom), ... 2, 1 (top)
**Rules**:
- Only one disk can be moved at a time.
- Only the top disk from any stack can be moved.
- A larger disk may not be placed on top of a smaller disk.
Find the sequence of moves to transform the initial configuration into the \
goal configuration.
"""
hanoi_base_results = solve_tower_of_hanoi(
    "hanoi-base",
    hanoi_base_sys,
    hanoi_base_usr_tpl,
    temperature=0.5,
    repeats=10,
    parser_fn=parse_tower_of_hanoi_moves_orig,
)
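When reading the log produced by this call, it helps to keep in mind that a minimal solution for n disks has 2^n - 1 moves, so a fully valid minimal answer shows `valid_prefix_length=1023` at n = 10 and `valid_prefix_length=2047` at n = 11. The sketch below tabulates that count for a few complexity levels, together with a rough character count of a minimal answer in the baseline `moves = [[...], ...]` format. The helper names and the character-based size estimate are assumptions made for illustration; the actual token count depends on each model's tokenizer.

```python
def minimal_move_count(n):
    """Number of moves in a minimal Tower of Hanoi solution for n disks."""
    return 2**n - 1


def baseline_answer_chars(n):
    """Rough size, in characters, of a minimal answer in the baseline format.

    Hypothetical estimate: builds the minimal move list and measures the
    length of its Python list representation, e.g. "moves = [[1, 0, 2]]".
    """
    def solve(k, src, dst, via):
        if k == 0:
            return []
        return (
            solve(k - 1, src, via, dst)
            + [[k, src, dst]]
            + solve(k - 1, via, dst, src)
        )

    return len("moves = " + str(solve(n, 0, 2, 1)))


for n in (10, 11, 12, 14):
    print(n, minimal_move_count(n), baseline_answer_chars(n))
```

Because the move count doubles with every added disk, the size of the required answer grows exponentially with `n`, independently of how hard each individual move is to determine.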
# len(backlog)=89 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=0 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=0 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=0 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=1 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=1 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=1 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=1 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=1 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=2 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=2 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=2 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=2 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=3 # valid_prefix_length=175, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=3 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=3 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=3 # valid_prefix_length=31, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=4 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=4 # valid_prefix_length=191, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=4 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=4 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=5 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=5 # valid_prefix_length=1279, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=5 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=6 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=6 # valid_prefix_length=67, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=6 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=6 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=6 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=6 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=7 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=7 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=7 # valid_prefix_length=1279, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=7 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=7 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=8 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=8 # valid_prefix_length=191, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=8 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=9 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=9 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=9 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=9 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=9 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=9 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=1 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=1 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=2 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=2 # valid_prefix_length=31, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=2 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=2 # valid_prefix_length=20, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=2 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=3 # valid_prefix_length=639, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=3 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=3 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=3 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=3 # valid_prefix_length=335, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=4 # valid_prefix_length=319, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=4 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=4 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=4 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=4 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=4 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=5 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=5 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=6 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=6 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=6 # valid_prefix_length=191, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=6 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=7 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=7 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=7 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=7 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=7 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=8 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=8 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=9 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=9 # valid_prefix_length=191, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=9 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=9 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=0 # valid_prefix_length=19, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=0 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=1 # valid_prefix_length=1151, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=1 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=1 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=1 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=1 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=2 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=2 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=2 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=2 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=2 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=3 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=3 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=3 # valid_prefix_length=64, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=4 # valid_prefix_length=1151, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=4 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=4 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=65 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=4 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=4 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=5 # valid_prefix_length=63, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=5 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=5 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=6 # valid_prefix_length=20, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=6 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=7 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=7 # valid_prefix_length=31, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=7 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=62 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=7 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=7 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=8 # valid_prefix_length=1407, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=8 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=8 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=8 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=8 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=9 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=9 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=9 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=9 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=9 # valid_prefix_length=63, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=0 # valid_prefix_length=447, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=0 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=0 # valid_prefix_length=1023, is_valid_solution=1, first_correction=-1 # len(backlog)=59 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=0 # valid_prefix_length=128, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=1 # valid_prefix_length=95, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=1 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=2 # valid_prefix_length=70, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=2 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=2 # valid_prefix_length=1023, is_valid_solution=1, first_correction=-1 # len(backlog)=57 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=2 # valid_prefix_length=20, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=2 # valid_prefix_length=16, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=3 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=4 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=4 # valid_prefix_length=479, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=4 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=4 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=4 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=5 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=5 # valid_prefix_length=17, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=5 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=6 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=6 # valid_prefix_length=41, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=6 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=6 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=7 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=7 # valid_prefix_length=831, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=7 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=7 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=7 # valid_prefix_length=64, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=7 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=8 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=8 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=9 # valid_prefix_length=42, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=9 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=9 # valid_prefix_length=64, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=9 # valid_prefix_length=32, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=287, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=49 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=48 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=388, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=48 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=287, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=47 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=31, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=447, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=46 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=45 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=44 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=44 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=43 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=21, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=42 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=479, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=41 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=319, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=175, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=63, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=39 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=128, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=47, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=38 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=64, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=37 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=32, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=36 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=32, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=35 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=34 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=10, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=175, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=32 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=95, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=31 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=47, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=30 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=32, is_valid_solution=0, first_correction=-1 # len(backlog)=29 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=29 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=95, is_valid_solution=0, first_correction=-1 # len(backlog)=26 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=26 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=25 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=23 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=64, is_valid_solution=0, first_correction=-1 # len(backlog)=22 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=22 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=19 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=18 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=17 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=17 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=16 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=12 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=10 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=10 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='claude-3-7-sonnet-20250219', 
reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=4 # 
valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, 
first_correction=-1 # len(backlog)=3 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1
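To help read the log above: for the runs marked `is_valid_solution=1`, `valid_prefix_length` equals 7, 31, 127, and 255 for n = 3, 5, 7, and 8, which is the optimal move count 2^n - 1. The sketch below shows one way such a prefix check could be computed by replaying the moves on a simulated board. It is only an illustration under stated assumptions: the helper names `replay`, `optimal_move_count`, and `is_solved` are hypothetical, they are not the notebook's own validator, and the `first_correction` field from the log is not modeled.

```python
def optimal_move_count(n):
    """Minimum number of moves needed to solve Tower of Hanoi with n disks."""
    return 2 ** n - 1


def replay(n, moves):
    """Replay [disk, from_peg, to_peg] moves on a fresh n-disk board.

    Hypothetical helper (not the notebook's validator). Returns a tuple
    (valid_prefix_length, pegs): the number of leading moves that were
    legal, and the board state after applying exactly those moves.
    """
    # Peg 0 holds all disks, largest at the bottom (start of the list).
    pegs = [list(range(n, 0, -1)), [], []]
    for count, move in enumerate(moves):
        if len(move) != 3:
            return count, pegs
        disk, src, dst = move
        if src not in (0, 1, 2) or dst not in (0, 1, 2):
            return count, pegs
        # The named disk must be on top of the source peg.
        if not pegs[src] or pegs[src][-1] != disk:
            return count, pegs
        # A larger disk may not be placed on a smaller one.
        if pegs[dst] and pegs[dst][-1] < disk:
            return count, pegs
        pegs[dst].append(pegs[src].pop())
    return len(moves), pegs


def is_solved(n, moves):
    """True if every move is legal and all disks end up on peg 2."""
    prefix, pegs = replay(n, moves)
    return prefix == len(moves) and pegs[2] == list(range(n, 0, -1))


# The 3-disk example solution quoted in the prompts.
example = [[1, 0, 2], [2, 0, 1], [1, 2, 1], [3, 0, 2],
           [1, 1, 0], [2, 1, 2], [1, 0, 2]]
print(replay(3, example)[0], is_solved(3, example), optimal_move_count(3))
```

Under this reading, a truncated answer still earns a non-zero prefix length but is not counted as a solution, which is consistent with log entries such as `valid_prefix_length=15, is_valid_solution=0`.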
Adjusted Prompt¶
hanoi_adj_sys = """\
You are a deterministic and highly obedient assistant which outputs complete, \
untruncated solutions, regardless of length. Please solve this puzzle for me.
There are three pegs and $n$ disks of different sizes stacked on the first peg. \
The disks are numbered from 1 (smallest) to $n$ (largest). Disk moves in this \
puzzle should follow:
1. Only one disk can be moved at a time.
2. Each move consists of taking the upper disk from one stack and placing it on \
top of another stack.
3. A larger disk may not be placed on top of a smaller disk.
The goal is to move the entire stack to the third peg.
**Example**: With 3 disks numbered 1 (smallest), 2, and 3 (largest), the initial \
state is `[[3,2,1],[],[]]`, and a solution might be:
```
moves=[[1,0,2],[2,0,1],[1,2,1],[3,0,2],[1,1,0],[2,1,2],[1,0,2]]
```
This means: Move disk 1 from peg 0 to peg 2, then move disk 2 from peg 0 to peg 1, \
and so on.
**Requirements**:
- When exploring potential solutions in your thinking process, always include the \
corresponding complete list of moves.
- The positions are 0-indexed (the leftmost peg is 0).
- Ensure your final answer includes the complete list of moves in the format:
```
moves=[[disk_id,from_peg,to_peg],...]
```
Your output will be parsed automatically, so nice identation, whitespace, and \
human readability of the list don't matter. Do not truncate or summarize the output. \
Output the entire move list in full, even if it is long. This is required for task \
success.
"""
hanoi_adj_usr_tpl = """\
I have a puzzle with $N$ disks of different sizes with
**Initial configuration**:
- Peg 0: $N$ (bottom), ... 2, 1 (top)
- Peg 1: (empty)
- Peg 2: (empty)
**Goal configuration**:
- Peg 0: (empty)
- Peg 1: (empty)
- Peg 2: $N$ (bottom), ... 2, 1 (top)
**Rules**:
- Only one disk can be moved at a time.
- Only the top disk from any stack can be moved.
- A larger disk may not be placed on top of a smaller disk.
Find the sequence of moves to transform the initial configuration into the \
goal configuration.
**Note**:
- I really do want you to enumerate all the moves. Please do not truncate the list, \
however long it may be. I do insist. Do not ask for confirmation, just output the \
complete list of moves please.
- You don't have to format and indent the list nicely for readability, because \
it will be parsed automatically. Feel free to omit any whitespace from the list.
- I don't need any explanation, just the raw and complete list of moves in the \
specified format.
- This is a benchmark test where the first invalid or incomplete move you generate \
matters. The more valid and parsable moves you generate from the optimal solution, \
the better your score will be. Truncating the list in any way risks a very low \
score. Therefore your best strategy is to attempt to output the complete list, \
even if you risk running out of output tokens.
"""
hanoi_adj_results = solve_tower_of_hanoi(
    "hanoi-adj",
    hanoi_adj_sys,
    hanoi_adj_usr_tpl,
    temperature=0.1,
    repeats=10,
    parser_fn=parse_tower_of_hanoi_moves_orig,
)
# len(backlog)=89
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=0
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=89
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=0
# valid_prefix_length=1279, is_valid_solution=0, first_correction=-1
# len(backlog)=89
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=0
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=89
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=0
# valid_prefix_length=1535, is_valid_solution=0, first_correction=-1
# len(backlog)=89
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=0
# valid_prefix_length=87, is_valid_solution=0, first_correction=-1
# len(backlog)=89
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=0
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=88
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=1
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=88
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=1
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=88
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=1
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=88
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=1
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=88
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=1
# valid_prefix_length=351, is_valid_solution=0, first_correction=-1
# len(backlog)=88
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=1
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=87
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=2
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=87
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=2
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=87
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=2
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=87
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=2
# valid_prefix_length=1535, is_valid_solution=0, first_correction=-1
# len(backlog)=87
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=2
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=87
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=2
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=86
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=3
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=86
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=3
# valid_prefix_length=11, is_valid_solution=0, first_correction=-1
# len(backlog)=86
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=3
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=86
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=3
# valid_prefix_length=1535, is_valid_solution=0, first_correction=-1
# len(backlog)=86
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=3
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=86
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=3
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=85
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=4
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=85
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=4
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=85
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=4
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=85
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=4
# valid_prefix_length=735, is_valid_solution=0, first_correction=-1
# len(backlog)=85
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=4
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=85
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=4
# valid_prefix_length=127, is_valid_solution=0, first_correction=-1
# len(backlog)=84
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=5
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=84
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=5
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=84
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=5
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=84
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=5
# valid_prefix_length=4, is_valid_solution=0, first_correction=-1
# len(backlog)=84
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=5
# valid_prefix_length=767, is_valid_solution=0, first_correction=-1
# len(backlog)=84
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=5
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=83
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=6
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=83
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=6
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=83
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=6
# valid_prefix_length=479, is_valid_solution=0, first_correction=-1
# len(backlog)=83
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=6
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=83
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=6
# valid_prefix_length=351, is_valid_solution=0, first_correction=-1
# len(backlog)=83
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=6
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=82
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=7
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=82
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=7
# valid_prefix_length=2559, is_valid_solution=0, first_correction=-1
# len(backlog)=82
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=7
# valid_prefix_length=479, is_valid_solution=0, first_correction=-1
# len(backlog)=82
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=7
# valid_prefix_length=1535, is_valid_solution=0, first_correction=-1
# len(backlog)=82
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=7
# valid_prefix_length=351, is_valid_solution=0, first_correction=-1
# len(backlog)=82
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=7
# valid_prefix_length=31, is_valid_solution=0, first_correction=-1
# len(backlog)=81
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=8
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=81
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=8
# valid_prefix_length=13, is_valid_solution=0, first_correction=-1
# len(backlog)=81
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=8
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=81
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=8
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=81
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=8
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=81
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=8
# valid_prefix_length=31, is_valid_solution=0, first_correction=-1
# len(backlog)=80
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=9
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=80
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=9
# valid_prefix_length=13, is_valid_solution=0, first_correction=-1
# len(backlog)=80
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=9
# valid_prefix_length=287, is_valid_solution=0, first_correction=-1
# len(backlog)=80
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=9
# valid_prefix_length=1535, is_valid_solution=0, first_correction=-1
# len(backlog)=80
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=9
# valid_prefix_length=351, is_valid_solution=0, first_correction=-1
# len(backlog)=80
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=9
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=79
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=0
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=79
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=0
# valid_prefix_length=11, is_valid_solution=0, first_correction=-1
# len(backlog)=79
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=0
# valid_prefix_length=479, is_valid_solution=0, first_correction=-1
# len(backlog)=79
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=0
# valid_prefix_length=1791, is_valid_solution=0, first_correction=-1
# len(backlog)=79
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=0
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=79
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=0
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=78
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=1
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=78
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=1
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=78
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=1
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=78
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=1
# valid_prefix_length=1791, is_valid_solution=0, first_correction=-1
# len(backlog)=78
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=1
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=78
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=1
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=77
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=2
# valid_prefix_length=2559, is_valid_solution=0, first_correction=-1
# len(backlog)=77
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=2
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=77
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=2
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=77
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=2
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=77
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=2
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=77
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=2
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=76
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=3
# valid_prefix_length=2559, is_valid_solution=0, first_correction=-1
# len(backlog)=76
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=3
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=76
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=3
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=76
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=3
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=76
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=3
# valid_prefix_length=703, is_valid_solution=0, first_correction=-1
# len(backlog)=76
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=3
# valid_prefix_length=95, is_valid_solution=0, first_correction=-1
# len(backlog)=75
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=4
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=75
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=4
# valid_prefix_length=23, is_valid_solution=0, first_correction=-1
# len(backlog)=75
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=4
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=75
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=4
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=75
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=4
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=75
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=4
# valid_prefix_length=11, is_valid_solution=0, first_correction=-1
# len(backlog)=74
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=5
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=74
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=5
# valid_prefix_length=13, is_valid_solution=0, first_correction=-1
# len(backlog)=74
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=5
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=74
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=5
# valid_prefix_length=1535, is_valid_solution=0, first_correction=-1
# len(backlog)=74
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=5
# valid_prefix_length=55, is_valid_solution=0, first_correction=-1
# len(backlog)=74
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=5
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=73
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=6
# valid_prefix_length=2559, is_valid_solution=0, first_correction=-1
# len(backlog)=73
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=6
# valid_prefix_length=2559, is_valid_solution=0, first_correction=-1
# len(backlog)=73
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=6
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=73
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=6
# valid_prefix_length=1535, is_valid_solution=0, first_correction=-1
# len(backlog)=73
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=6
# valid_prefix_length=351, is_valid_solution=0, first_correction=-1
# len(backlog)=73
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=6
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=72
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=7
# valid_prefix_length=2559, is_valid_solution=0, first_correction=-1
# len(backlog)=72
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=7
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=72
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=7
# valid_prefix_length=287, is_valid_solution=0, first_correction=-1
# len(backlog)=72
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=7
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=72
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=7
# valid_prefix_length=351, is_valid_solution=0, first_correction=-1
# len(backlog)=72
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=7
# valid_prefix_length=5, is_valid_solution=0, first_correction=-1
# len(backlog)=71
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=8
# valid_prefix_length=2559, is_valid_solution=0, first_correction=-1
# len(backlog)=71
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=8
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=71
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=8
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=71
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=8
# valid_prefix_length=1535, is_valid_solution=0, first_correction=-1
# len(backlog)=71
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=8
# valid_prefix_length=767, is_valid_solution=0, first_correction=-1
# len(backlog)=71
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=8
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=70
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=9
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=70
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=9
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=70
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=9
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=70
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=9
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=70
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=9
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=70
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=9
# valid_prefix_length=127, is_valid_solution=0, first_correction=-1
# len(backlog)=69
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=0
# valid_prefix_length=1919, is_valid_solution=0, first_correction=-1
# len(backlog)=69
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=0
# valid_prefix_length=703, is_valid_solution=0, first_correction=-1
# len(backlog)=69
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=0
# valid_prefix_length=287, is_valid_solution=0, first_correction=-1
# len(backlog)=69
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=0
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=69
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=0
# valid_prefix_length=55, is_valid_solution=0, first_correction=-1
# len(backlog)=69
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=0
# valid_prefix_length=22, is_valid_solution=0, first_correction=-1
# len(backlog)=68
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=1
# valid_prefix_length=1919, is_valid_solution=0, first_correction=-1
# len(backlog)=68
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=1
# valid_prefix_length=319, is_valid_solution=0, first_correction=-1
# len(backlog)=68
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=1
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=68
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=1
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=68
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=1
# valid_prefix_length=703, is_valid_solution=0, first_correction=-1
# len(backlog)=68
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=1
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=67
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=2
# valid_prefix_length=1919, is_valid_solution=0, first_correction=-1
# len(backlog)=67
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=2
# valid_prefix_length=31, is_valid_solution=0, first_correction=-1
# len(backlog)=67
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=2
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=67
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=2
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=67
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=2
# valid_prefix_length=0, is_valid_solution=0, first_correction=-1
# len(backlog)=67
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=2
# valid_prefix_length=22, is_valid_solution=0, first_correction=-1
# len(backlog)=66
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=3
# valid_prefix_length=2047, is_valid_solution=1, first_correction=-1
# len(backlog)=66
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=3
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=66
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=3
# valid_prefix_length=160, is_valid_solution=0, first_correction=-1
# len(backlog)=66
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=3
# valid_prefix_length=4, is_valid_solution=0, first_correction=-1
# len(backlog)=66
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=3
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=66
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=3
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=65
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=4
# valid_prefix_length=1919, is_valid_solution=0, first_correction=-1
# len(backlog)=65
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=4
# valid_prefix_length=11, is_valid_solution=0, first_correction=-1
# len(backlog)=65
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=4
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=65
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=4
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=65
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=4
# valid_prefix_length=87, is_valid_solution=0, first_correction=-1
# len(backlog)=65
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=4
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=64
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=5
# valid_prefix_length=2047, is_valid_solution=1, first_correction=-1
# len(backlog)=64
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=5
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=64
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=5
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=64
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=5
# valid_prefix_length=895, is_valid_solution=0, first_correction=-1
# len(backlog)=64
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=5
# valid_prefix_length=87, is_valid_solution=0, first_correction=-1
# len(backlog)=64
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=5
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=63
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=6
# valid_prefix_length=2047, is_valid_solution=1, first_correction=-1
# len(backlog)=63
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=6
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=63
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=6
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=63
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=6
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=63
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=6
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=63
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=6
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=62
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=7
# valid_prefix_length=1919, is_valid_solution=0, first_correction=-1
# len(backlog)=62
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=7
# valid_prefix_length=2047, is_valid_solution=1, first_correction=-1
# len(backlog)=62
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=7
# valid_prefix_length=287, is_valid_solution=0, first_correction=-1
# len(backlog)=62
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=7
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=62
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=7
# valid_prefix_length=87, is_valid_solution=0, first_correction=-1
# len(backlog)=62
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=7
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=61
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=8
# valid_prefix_length=1919, is_valid_solution=0, first_correction=-1
# len(backlog)=61
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=8
# valid_prefix_length=639, is_valid_solution=0, first_correction=-1
# len(backlog)=61
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=8
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=61
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=8
# valid_prefix_length=2047, is_valid_solution=1, first_correction=-1
# len(backlog)=61
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=8
# valid_prefix_length=87, is_valid_solution=0, first_correction=-1
# len(backlog)=61
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=8
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=60
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=9
# valid_prefix_length=2047, is_valid_solution=1, first_correction=-1
# len(backlog)=60
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=9
# valid_prefix_length=13, is_valid_solution=0, first_correction=-1
# len(backlog)=60
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=9
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=60
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=9
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=60
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=9
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=60
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=9
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=59
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=0
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=59
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=0
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=59
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=0
# valid_prefix_length=287, is_valid_solution=0, first_correction=-1
# len(backlog)=59
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=0
# valid_prefix_length=1023, is_valid_solution=1, first_correction=-1
# len(backlog)=59
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=0
# valid_prefix_length=87, is_valid_solution=0, first_correction=-1
# len(backlog)=59
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=0
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=58
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=1
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=58
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=1
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=58
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=1
# valid_prefix_length=287, is_valid_solution=0, first_correction=-1
# len(backlog)=58
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=1
# valid_prefix_length=511, is_valid_solution=0, first_correction=-1
# len(backlog)=58
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=1
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=58
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=1
# valid_prefix_length=127, is_valid_solution=0, first_correction=-1
# len(backlog)=57
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=2
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=57
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=2
# valid_prefix_length=23, is_valid_solution=0, first_correction=-1
# len(backlog)=57
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=2
# valid_prefix_length=511, is_valid_solution=0, first_correction=-1
# len(backlog)=57
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=2
# valid_prefix_length=1023, is_valid_solution=1, first_correction=-1
# len(backlog)=57
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=2
# valid_prefix_length=55, is_valid_solution=0, first_correction=-1
# len(backlog)=57
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=2
# valid_prefix_length=175, is_valid_solution=0, first_correction=-1
# len(backlog)=56
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=3
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=56
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=3
# valid_prefix_length=13, is_valid_solution=0, first_correction=-1
# len(backlog)=56
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=3
# valid_prefix_length=287, is_valid_solution=0, first_correction=-1
# len(backlog)=56
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=3
# valid_prefix_length=1023, is_valid_solution=1, first_correction=-1
# len(backlog)=56
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=3
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=56
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=3
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=55
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=4
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=55
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=4
# valid_prefix_length=79, is_valid_solution=0, first_correction=-1
# len(backlog)=55
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=4
# valid_prefix_length=287, is_valid_solution=0, first_correction=-1
# len(backlog)=55
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=4
# valid_prefix_length=1023, is_valid_solution=1, first_correction=-1
# len(backlog)=55
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=4
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=55
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=4
# valid_prefix_length=31, is_valid_solution=0, first_correction=-1
# len(backlog)=54
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=5
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=54
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=5
# valid_prefix_length=1023, is_valid_solution=1, first_correction=-1
# len(backlog)=54
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=5
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=54
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=5
# valid_prefix_length=4, is_valid_solution=0, first_correction=-1
# len(backlog)=54
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=5
# valid_prefix_length=87, is_valid_solution=0, first_correction=-1
# len(backlog)=54
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=5
# valid_prefix_length=479, is_valid_solution=0, first_correction=-1
# len(backlog)=53
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=6
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=53
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=6
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=53
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=6
# valid_prefix_length=223, is_valid_solution=0, first_correction=-1
# len(backlog)=53
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=6
# valid_prefix_length=1023, is_valid_solution=1, first_correction=-1
# len(backlog)=53
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=6
# valid_prefix_length=87, is_valid_solution=0, first_correction=-1
# len(backlog)=53
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=6
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=52
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=7
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=52
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=7
# valid_prefix_length=175, is_valid_solution=0, first_correction=-1
# len(backlog)=52
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=7
# valid_prefix_length=287, is_valid_solution=0, first_correction=-1
# len(backlog)=52
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=7
# valid_prefix_length=1023, is_valid_solution=1, first_correction=-1
# len(backlog)=52
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=7
# valid_prefix_length=351, is_valid_solution=0, first_correction=-1
# len(backlog)=52
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=7
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=51
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=8
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=51
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=8
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=51
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=8
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=51
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=8
# valid_prefix_length=511, is_valid_solution=0, first_correction=-1
# len(backlog)=51
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=8
# valid_prefix_length=351, is_valid_solution=0, first_correction=-1
# len(backlog)=51
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=8
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=50
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=9
# valid_prefix_length=1023, is_valid_solution=0, first_correction=-1
# len(backlog)=50
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=9
# valid_prefix_length=15, is_valid_solution=0, first_correction=-1
# len(backlog)=50
# model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=9
# valid_prefix_length=383, is_valid_solution=0, first_correction=-1
# len(backlog)=50
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=9
# valid_prefix_length=1023, is_valid_solution=1, first_correction=-1
# len(backlog)=50
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=9
# valid_prefix_length=351, is_valid_solution=0, first_correction=-1
# len(backlog)=50
# model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=9
# valid_prefix_length=39, is_valid_solution=0, first_correction=-1
# len(backlog)=49 #
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=49 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=49 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=49 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=48 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=48 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=287, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=48 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=47 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=47 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=47 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=47 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=46 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=46 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=46 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=46 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=45 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=45 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=45 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=44 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=175, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=44 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=44 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=175, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=43 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=175, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=43 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=43 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=42 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=42 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=42 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=41 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=40 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=40 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=40 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=39 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=38 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=37 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=21, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=36 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=31, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=35 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=34 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=175, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=33 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=191, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=32 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=32 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=31 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=30 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=29 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=29 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=22, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=27 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=26 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=22, is_valid_solution=0, first_correction=-1 # len(backlog)=25 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=25 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=25 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=23 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=22 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=19 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=18 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=17 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=16 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=16 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=15 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=14 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=13 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=12 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=11 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=10 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='claude-3-7-sonnet-20250219', 
reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=4 # 
valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, 
first_correction=-1 # len(backlog)=3 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1
CSV Moves
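The prompts in this variant ask for one move per line as `disk_id,from_peg,to_peg` rows instead of a nested list. As a rough illustration of how such output can be checked, here is a minimal sketch that parses the rows and replays them on a simulated board. The names `parse_moves_csv` and `replay` are hypothetical and are not the notebook's `parse_tower_of_hanoi_moves_csv`. The sketch assumes the input is the bare CSV text and checks move legality only; the evaluation used for the results in this notebook may apply additional criteria.

```python
import csv
import io


def parse_moves_csv(text):
    """Parse 'disk_id,from_peg,to_peg' rows; stop at the first unparsable row.

    Hypothetical helper for illustration, not the notebook's parser.
    """
    moves = []
    for row in csv.reader(io.StringIO(text)):
        row = [cell.strip() for cell in row]
        if not any(row) or row == ["disk_id", "from_peg", "to_peg"]:
            continue  # skip blank lines and the header line
        if len(row) != 3 or not all(cell.isdigit() for cell in row):
            break  # anything after the first malformed row is ignored
        moves.append(tuple(int(cell) for cell in row))
    return moves


def replay(n, moves):
    """Replay moves from the initial state.

    Returns (number of legal moves before the first illegal one,
             whether the whole list is legal and ends in the goal state).
    """
    goal = list(range(n, 0, -1))
    pegs = [list(goal), [], []]  # peg 0 holds n (bottom) ... 1 (top)
    for count, (disk, src, dst) in enumerate(moves):
        legal = (
            src in (0, 1, 2)
            and dst in (0, 1, 2)
            and src != dst
            and bool(pegs[src])
            and pegs[src][-1] == disk  # only the top disk may be moved
            and (not pegs[dst] or pegs[dst][-1] > disk)  # no larger on smaller
        )
        if not legal:
            return count, False
        pegs[dst].append(pegs[src].pop())
    return len(moves), pegs[2] == goal


example = """disk_id,from_peg,to_peg
1,0,2
2,0,1
1,2,1
3,0,2
1,1,0
2,1,2
1,0,2
"""
print(replay(3, parse_moves_csv(example)))
```

Replaying the three-disk example from the system prompt this way should report all seven moves as legal and the goal state as reached.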
hanoi_csv_sys = """\
You are a deterministic and highly obedient assistant which outputs complete, \
untruncated solutions, regardless of length. Please solve this puzzle for me.
There are three pegs and $n$ disks of different sizes stacked on the first peg. \
The disks are numbered from 1 (smallest) to $n$ (largest). Disk moves in this \
puzzle should follow:
1. Only one disk can be moved at a time.
2. Each move consists of taking the upper disk from one stack and placing it on \
top of another stack.
3. A larger disk may not be placed on top of a smaller disk.
The goal is to move the entire stack to the third peg.
**Example**: With 3 disks numbered 1 (smallest), 2, and 3 (largest), the initial \
state is `[3,2,1],[],[]`, and a solution might be:
```
disk_id,from_peg,to_peg
1,0,2
2,0,1
1,2,1
3,0,2
1,1,0
2,1,2
1,0,2
```
This means: Move disk 1 from peg 0 to peg 2, then move disk 2 from peg 0 to peg 1, \
and so on.
**Requirements**:
- When exploring potential solutions in your thinking process, always include the \
corresponding complete list of moves.
- The positions are 0-indexed (the leftmost peg is 0).
- Ensure your final answer includes the complete CSV list of moves with the \
following columns:
```
disk_id,from_peg,to_peg
```
The meaning of the columns:
- **disk_id**: the number of the disk that is being moved. (Goes from 1 to $n$.)
- **from_peg**: from which peg the disk is being removed. (0, 1, or 2.)
- **to_peg**: to which peg the disk is being placed. (0, 1, or 2.)
Your output will be parsed automatically. Do not truncate or summarize the moves. \
Output the entire move list in full, even if it is long. This is required for task \
success.
"""
hanoi_csv_usr_tpl = """\
I have a puzzle with $N$ disks of different sizes with
**Initial configuration**:
- Peg 0: $N$ (bottom), ... 2, 1 (top)
- Peg 1: (empty)
- Peg 2: (empty)
**Goal configuration**:
- Peg 0: (empty)
- Peg 1: (empty)
- Peg 2: $N$ (bottom), ... 2, 1 (top)
**Rules**:
- Only one disk can be moved at a time.
- Only the top disk from any stack can be moved.
- A larger disk may not be placed on top of a smaller disk.
Find the sequence of moves to transform the initial configuration into the \
goal configuration.
**Note**:
- I really do want you to enumerate all the moves. Please do not truncate the list, \
however long it may be. I do insist. Do not ask for confirmation, just output the \
complete list of moves please.
- I don't need any explanation, just the raw and complete list of moves in the \
specified format.
- This is a benchmark test where the first invalid move you generate matters. \
The more valid and parsable moves you generate from the optimal solution, the \
better your score will be. Truncating the list in any way risks a low score. \
Therefore your best strategy is to attempt to output the complete list, even \
if you risk running out of output tokens.
"""
hanoi_csv_results = solve_tower_of_hanoi(
    "hanoi-csv",
    hanoi_csv_sys,
    hanoi_csv_usr_tpl,
    temperature=0.1,
    repeats=10,
    parser_fn=parse_tower_of_hanoi_moves_csv,
)
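The call above prints one log record per model response. To tally these records without opening the processed data, something like the following sketch could be used. It assumes the record layout seen in the printed output (`model_name=..., reasoning_budget=..., tries=..., n=..., i=...` followed by `valid_prefix_length=..., is_valid_solution=..., first_correction=...`, with ` # ` as the separator). `parse_log` and `success_rates` are hypothetical helpers, and the processed data in the GitHub repository remains the authoritative source.

```python
import re
from collections import defaultdict

# Assumed record layout, taken from the printed log lines; whitespace is
# matched loosely because records may wrap across lines.
RECORD_RE = re.compile(
    r"model_name='(?P<model>[^']+)',\s*reasoning_budget=(?P<budget>\d+),\s*"
    r"tries=(?P<tries>\d+),\s*n=(?P<n>\d+),\s*i=(?P<i>\d+)\s*#\s*"
    r"valid_prefix_length=(?P<prefix>\d+),\s*"
    r"is_valid_solution=(?P<valid>[01]),\s*"
    r"first_correction=(?P<correction>-?\d+)"
)


def parse_log(text):
    """Turn the raw log text into a list of dicts (hypothetical helper)."""
    records = []
    for match in RECORD_RE.finditer(text):
        fields = match.groupdict()
        records.append(
            {k: (v if k == "model" else int(v)) for k, v in fields.items()}
        )
    return records


def success_rates(records):
    """Fraction of valid solutions per (model, reasoning_budget, n)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [valid count, total count]
    for rec in records:
        key = (rec["model"], rec["budget"], rec["n"])
        totals[key][0] += rec["valid"]
        totals[key][1] += 1
    return {key: valid / total for key, (valid, total) in totals.items()}


sample = (
    "# len(backlog)=79 # model_name='gemini-2.5-pro-preview-06-05', "
    "reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=4095, "
    "is_valid_solution=1, first_correction=-1 # len(backlog)=77 # "
    "model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, "
    "tries=0, n=12, i=2 # valid_prefix_length=3583, is_valid_solution=0, "
    "first_correction=-1 #"
)
records = parse_log(sample)
print(success_rates(records))
```

In the records shown in this section, every run marked `is_valid_solution=1` has a `valid_prefix_length` of $2^n - 1$ (for example 4095 for `n=12`), which is the length of the optimal solution.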
# len(backlog)=89 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=0 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=0 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=0 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=0 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=1 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=1 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=1 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=1 # valid_prefix_length=639, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=1 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # 
len(backlog)=87 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=2 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=2 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=2 # valid_prefix_length=639, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=2 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=3 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=3 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=3 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=3 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=3 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=85 
# model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=4 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=4 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=4 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=4 # valid_prefix_length=639, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=4 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=5 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=5 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=6 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=6 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=6 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=6 # valid_prefix_length=512, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=6 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=7 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=7 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=7 # valid_prefix_length=512, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=7 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=8 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=8 # valid_prefix_length=21, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=8 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=8 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=9 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=9 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=9 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=9 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=9 # valid_prefix_length=31, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=0 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=4095, is_valid_solution=1, first_correction=-1 # len(backlog)=79 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=0 # valid_prefix_length=671, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=1 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=1 # valid_prefix_length=4095, is_valid_solution=1, first_correction=-1 # len(backlog)=78 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=1 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=2 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=2 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=2 # valid_prefix_length=3583, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=2 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=2 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=3 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=3 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=3 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=3 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=3 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=4 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=4 # valid_prefix_length=1359, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=4 # valid_prefix_length=4095, is_valid_solution=1, first_correction=-1 # len(backlog)=75 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=4 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=4 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=5 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=5 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=5 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=5 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=6 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=6 # valid_prefix_length=447, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=6 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=6 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=6 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=7 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=7 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=7 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=7 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=7 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=8 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=8 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=8 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=8 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=9 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=9 # valid_prefix_length=1535, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=9 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=9 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=0 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=0 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=0 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=69 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=0 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=0 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=1 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=1 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=1 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=68 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=1 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=1 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=2 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=2 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=2 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=67 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=2 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=2 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=3 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=3 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=3 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=66 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=3 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=3 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=4 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=65 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=4 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=4 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=65 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=4 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=4 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=5 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=5 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=64 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=5 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=6 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=6 # valid_prefix_length=895, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=6 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=6 # valid_prefix_length=287, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=7 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=7 # valid_prefix_length=223, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=7 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=62 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=7 # valid_prefix_length=639, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=7 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=8 # valid_prefix_length=447, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=8 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=61 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=8 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=8 # valid_prefix_length=703, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=9 # valid_prefix_length=31, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=9 # valid_prefix_length=447, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=9 # valid_prefix_length=2047, is_valid_solution=1, first_correction=-1 # len(backlog)=60 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=9 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=9 # valid_prefix_length=287, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=0 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=0 # valid_prefix_length=1023, is_valid_solution=1, first_correction=-1 # len(backlog)=59 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=0 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=0 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=1 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=1 # valid_prefix_length=639, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=1 # valid_prefix_length=639, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=1 # valid_prefix_length=319, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=2 # valid_prefix_length=447, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=2 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=2 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=2 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=3 # valid_prefix_length=383, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=3 # valid_prefix_length=1023, is_valid_solution=1, first_correction=-1 # len(backlog)=56 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=3 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=3 # valid_prefix_length=319, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=4 # valid_prefix_length=47, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=4 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=4 # valid_prefix_length=1023, is_valid_solution=1, first_correction=-1 # len(backlog)=55 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=4 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=4 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=5 # valid_prefix_length=21, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=5 # valid_prefix_length=639, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=5 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=6 # valid_prefix_length=47, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=6 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=6 # valid_prefix_length=1023, is_valid_solution=1, first_correction=-1 # len(backlog)=53 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=6 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=6 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=7 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=7 # valid_prefix_length=1023, is_valid_solution=1, first_correction=-1 # len(backlog)=52 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=7 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=7 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=8 # valid_prefix_length=767, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=8 # valid_prefix_length=1023, is_valid_solution=1, first_correction=-1 # len(backlog)=51 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=8 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=8 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=9 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=9 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=9 # valid_prefix_length=1023, is_valid_solution=1, first_correction=-1 # len(backlog)=50 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=9 # valid_prefix_length=511, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=9 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=479, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=49 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=48 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=447, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=48 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=48 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=479, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=47 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=47 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=479, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=46 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=45 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=45 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=45 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=45 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=44 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=44 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=44 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=44 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=43 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=43 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=43 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=42 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=42 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=42 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=41 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=41 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=351, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=479, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=40 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=511, is_valid_solution=1, first_correction=-1 # len(backlog)=40 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=79, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=39 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=38 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=38 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=87, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=37 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=37 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=36 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=35 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=35 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=34 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=179, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=34 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=33 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=151, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=32 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=32 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=31 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=31 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=30 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=255, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=255, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=29 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=27 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=26 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=25 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=23 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=22 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=20 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=19 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=18 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=17 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=16 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=15 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=14 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=13 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=12 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=11 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=10 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='claude-3-7-sonnet-20250219', 
reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=4 # 
valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, 
first_correction=-1 # len(backlog)=3 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1
State Tracking, Mistake Correction¶
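# System prompt for the state-tracking variant: every CSV row reports a move
# together with the resulting peg states, and a row may be emitted again later
# to correct an earlier mistake (only the last instance of a move id counts).
hanoi_trck_sys = """\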
You are a deterministic and highly obedient assistant which outputs complete, \
untruncated solutions, regardless of length. Please solve this puzzle for me.
There are three pegs and $n$ disks of different sizes stacked on the first peg. \
The disks are numbered from 1 (smallest) to $n$ (largest). Disk moves in this \
puzzle should follow:
1. Only one disk can be moved at a time.
2. Each move consists of taking the upper disk from one stack and placing it on \
top of another stack.
3. A larger disk may not be placed on top of a smaller disk.
The goal is to move the entire stack to the third peg.
**Example**: With 3 disks numbered 1 (smallest), 2, and 3 (largest), the initial \
state is `[3, 2, 1], [], []`, and a solution might be:
```
move_id,from_peg,to_peg,peg0,peg1,peg2
1,0,2,3 2,,1
2,0,1,3,2,1
3,2,1,3,2 1,
4,0,2,,2 1,3
5,1,0,1,2,3
6,1,2,1,,3 2
7,0,2,,,3 2 1
```
This means: Move disk 1 from peg 0 to peg 2, then move disk 2 from peg 0 to peg 1, \
and so on. The peg0, peg1, and peg2 columns represent the state of each peg after \
the move.
**Requirements**:
- When exploring potential solutions in your thinking process, always include the \
corresponding complete list of moves.
- The positions are 0-indexed (the leftmost peg is 0).
- Ensure your final answer includes the complete list of moves and the resulting \
peg states in the format:
```
move_id,from_peg,to_peg,peg0,peg1,peg2
```
The meaning of the columns:
- **move_id**: the ordinal number of the move starting from 1. For each move id, only \
its last instance is considered, so if you notice any previous mistakes while generating \
the list, you can fix the incorrect rows anytime by outputting the same move ids again \
with the corrected values.
- **from_peg**: from which peg the disk is being removed. (0, 1, or 2.)
- **to_peg**: to which peg the disk is being placed. (0, 1, or 2.)
- **peg0**: the state of the first peg after the move, represented as a space separated \
list of disk numbers going from the bottom to the top.
- **peg1**: the state of the second peg after the move, with the same format as above.
- **peg2**: the state of the third peg after the move, with the same format as above.
**Example with error correction**:
```
move_id,from_peg,to_peg,peg0,peg1,peg2
1,0,2,3 2,,1
2,0,1,3,2,1
3,2,1,3,2 1,
4,0,2,,2 1,3
5,2,0,1,2,3
6,1,0,1,,3 2
5,1,0,1,2,3
6,1,2,1,,3 2
7,0,2,,,3 2 1
```
Your output will be parsed automatically. Make sure that you strictly stick to the \
specified format. Regularly check if a correction is necessary, and if so, output the \
appropriate correction rows. Do not truncate or summarize the output. Output the \
entire move and state list in full, even if it is long. This is required for task \
success.
"""
# User prompt template for the same variant; `$N$` stands for the number of disks.
hanoi_trck_usr_tpl = """\
I have a puzzle with $N$ disks of different sizes with
**Initial configuration**:
- Peg 0: $N$ (bottom), ... 2, 1 (top)
- Peg 1: (empty)
- Peg 2: (empty)
**Goal configuration**:
- Peg 0: (empty)
- Peg 1: (empty)
- Peg 2: $N$ (bottom), ... 2, 1 (top)
**Rules**:
- Only one disk can be moved at a time.
- Only the top disk from any stack can be moved.
- A larger disk may not be placed on top of a smaller disk.
Find the sequence of moves to transform the initial configuration into the \
goal configuration.
**Note**:
- I really do want you to enumerate all the moves. Please do not truncate the list, \
however long it may be. I do insist. Do not ask for confirmation, just output the \
complete list of moves please.
- I don't need any explanation, just the raw and complete list of moves in the \
specified format.
- This is a benchmark test where the first invalid move you generate matters (not \
counting those that you fix subsequently). The more valid and parsable moves you generate \
from the optimal solution, the better your score will be. Truncating the list in any \
way risks a low score. Therefore your best strategy is to attempt to output the \
complete list, even if you risk running out of output tokens.
"""
hanoi_trck_results = solve_tower_of_hanoi(
    "hanoi-trck",
    hanoi_trck_sys,
    hanoi_trck_usr_tpl,
    temperature=0.1,
    repeats=10,
    parser_fn=parse_tower_of_hanoi_moves_csv_state,
)
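The correction rule stated in the system prompt (for each move id, only its last instance is considered) together with the per-row peg states can be checked mechanically. The sketch below is only an illustration of that idea and is not the notebook's `parse_tower_of_hanoi_moves_csv_state`: the function names `parse_moves_with_corrections` and `count_valid_prefix` are hypothetical, and the way a "valid prefix" is counted here (legal move plus matching reported state) is an assumption that may differ from the metric printed in the logs.

```python
import csv
import io


def parse_moves_with_corrections(text):
    """Parse rows of move_id,from_peg,to_peg,peg0,peg1,peg2 (hypothetical helper).

    Later rows with the same move_id overwrite earlier ones, which mirrors
    the "only the last instance is considered" rule from the prompt.
    """
    rows = {}
    for rec in csv.reader(io.StringIO(text)):
        # Skip the header, code fences, blank lines, and stray prose.
        if len(rec) != 6 or not rec[0].strip().isdigit():
            continue
        try:
            move_id, src, dst = int(rec[0]), int(rec[1]), int(rec[2])
            pegs = [[int(d) for d in col.split()] for col in rec[3:6]]
        except ValueError:
            continue  # malformed row: ignore it in this sketch
        rows[move_id] = (src, dst, pegs)
    return rows


def count_valid_prefix(rows, n):
    """Replay moves 1, 2, ... and stop at the first illegal or misreported one.

    Returns (number of valid leading moves, whether the puzzle is solved).
    """
    pegs = [list(range(n, 0, -1)), [], []]
    valid = 0
    for move_id in range(1, len(rows) + 1):
        if move_id not in rows:
            break
        src, dst, reported = rows[move_id]
        legal = (
            src in (0, 1, 2) and dst in (0, 1, 2) and src != dst
            and pegs[src]
            and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1])
        )
        if not legal:
            break
        pegs[dst].append(pegs[src].pop())
        # The state columns must agree with the replayed state.
        if reported != pegs:
            break
        valid += 1
    solved = valid == len(rows) and pegs == [[], [], list(range(n, 0, -1))]
    return valid, solved


# The "example with error correction" from the system prompt.
example = """\
move_id,from_peg,to_peg,peg0,peg1,peg2
1,0,2,3 2,,1
2,0,1,3,2,1
3,2,1,3,2 1,
4,0,2,,2 1,3
5,2,0,1,2,3
6,1,0,1,,3 2
5,1,0,1,2,3
6,1,2,1,,3 2
7,0,2,,,3 2 1
"""

rows = parse_moves_with_corrections(example)
print(count_valid_prefix(rows, n=3))  # prints (7, True)
```

Keeping the rows in a dictionary keyed by move id makes the last-instance-wins rule a plain overwrite, so the replay step never needs to know whether a correction happened.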
# len(backlog)=89 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=0 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=0 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=0 # valid_prefix_length=14, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=89 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=1 # valid_prefix_length=20, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=1 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=1 # valid_prefix_length=7, is_valid_solution=0, first_correction=444 # len(backlog)=88 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=1 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=88 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=1 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=2 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=2 # valid_prefix_length=31, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=2 # valid_prefix_length=111, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=2 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=87 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=2 # valid_prefix_length=19, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=3 # valid_prefix_length=2, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=3 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=3 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=3 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=86 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=3 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=4 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=4 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=4 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=4 # valid_prefix_length=2, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=4 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=85 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=4 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=5 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=5 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=5 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=5 # valid_prefix_length=22, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=5 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=84 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=6 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=6 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=6 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=6 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=83 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=7 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=7 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=7 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=82 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=7 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=8 # valid_prefix_length=21, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=8 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=8 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=81 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=14, i=9 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=14, i=9 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=14, i=9 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=14, i=9 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=14, i=9 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=80 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=14, i=9 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=0 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=79 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=1 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=1 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=1 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=1 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=1 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=78 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=2 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=2 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=2 # valid_prefix_length=12, is_valid_solution=0, first_correction=50 # len(backlog)=77 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=2 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=77 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=2 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=3 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=3 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=3 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=3 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=3 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=76 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=3 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=4 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=4 # valid_prefix_length=2, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=4 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=4 # valid_prefix_length=17, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=4 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=75 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=4 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=5 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=5 # valid_prefix_length=2, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=5 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=5 # valid_prefix_length=19, is_valid_solution=0, first_correction=351 # len(backlog)=74 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=5 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=74 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=5 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=6 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=6 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=6 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=6 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=6 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=73 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=6 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=7 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=7 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=7 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=7 # valid_prefix_length=95, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=72 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=7 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=8 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=8 # valid_prefix_length=20, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=8 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=8 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=71 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=8 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=12, i=9 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=12, i=9 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=12, i=9 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=12, i=9 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=12, i=9 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=70 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=12, i=9 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=0 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=0 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=0 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=0 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=69 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=0 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=1 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=1 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=1 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=1 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=68 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=1 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=2 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=2 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=2 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=2 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=2 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=67 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=3 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=3 # valid_prefix_length=18, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=3 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=3 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=3 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=66 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=3 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=4 # valid_prefix_length=5, is_valid_solution=0, first_correction=11 # len(backlog)=65 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=4 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=4 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=4 # valid_prefix_length=14, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=4 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=65 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=4 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=5 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=5 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=5 # valid_prefix_length=47, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=5 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=64 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=5 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=6 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=6 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=6 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=6 # valid_prefix_length=15, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=6 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=63 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=6 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=7 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=7 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=7 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=7 # valid_prefix_length=192, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=7 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=62 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=7 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=8 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=8 # valid_prefix_length=20, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=8 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=61 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=8 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=11, i=9 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=11, i=9 # valid_prefix_length=20, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=11, i=9 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=11, i=9 # valid_prefix_length=21, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=11, i=9 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=60 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=11, i=9 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=0 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=0 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=0 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=59 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=0 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=1 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=1 # valid_prefix_length=2, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=1 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=1 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=1 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=58 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=1 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=2 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=2 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=2 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=2 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=57 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=2 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=3 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=3 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=3 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=3 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=3 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=56 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=3 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=4 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=4 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=4 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=4 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=4 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=55 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=4 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=5 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=5 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=5 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=5 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=5 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=54 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=6 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=6 # valid_prefix_length=20, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=6 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=6 # valid_prefix_length=17, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=6 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=53 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=7 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=7 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=7 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=7 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=52 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=8 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=51 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=10, i=9 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=10, i=9 # valid_prefix_length=2, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=10, i=9 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=10, i=9 # valid_prefix_length=14, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=10, i=9 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=50 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=10, i=9 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=19, is_valid_solution=0, first_correction=62 # len(backlog)=49 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=17, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=0 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=49 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=0 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=95, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=1 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=48 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=1 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='claude-3-7-sonnet-20250219', 
reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=43, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=2 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=47 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=2 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=94, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=3 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=46 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=3 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=4 # 
valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=8, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=14, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=4 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=45 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=4 # valid_prefix_length=19, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=5 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=44 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=5 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=3, 
is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=6 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=43 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=6 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=20, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=159, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=42 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=7 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 
# len(backlog)=41 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=14, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=41 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=8 # valid_prefix_length=5, is_valid_solution=0, first_correction=4 # len(backlog)=40 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=9, i=9 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=40 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=9, i=9 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # 
model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=2, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=0 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=39 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=0 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=35, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=12, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=1 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=38 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=1 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='claude-3-7-sonnet-20250219', 
reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=46, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=2 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=37 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=39, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=3 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=36 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=3 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=4 # 
valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=25, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=4 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=35 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=4 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=19, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=132, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=5 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=34 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=5 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=5, 
is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=6 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=33 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=6 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=18, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=7 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=32 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=7 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # 
len(backlog)=31 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=2, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=31 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=23, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=127, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=8, i=9 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=30 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=8, i=9 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=29 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=5, is_valid_solution=0, first_correction=4 # len(backlog)=29 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=29 # 
model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=29 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=29 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=29 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=0 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=1, is_valid_solution=0, first_correction=4 # len(backlog)=28 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=28 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=1 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=28 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=1 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=27 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=5, is_valid_solution=0, first_correction=4 # len(backlog)=27 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='deepseek-chat', 
reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=27 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=27 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=2 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=27 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=2 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=26 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=26 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=51, is_valid_solution=0, first_correction=-1 # len(backlog)=26 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=26 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=26 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=3 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=26 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=3 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=25 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=5, is_valid_solution=0, first_correction=4 # len(backlog)=25 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=25 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=4 # 
valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=25 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=25 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=4 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=25 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=4 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=5, is_valid_solution=0, first_correction=4 # len(backlog)=24 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=24 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=5 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=24 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=5 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=23 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=5, is_valid_solution=0, first_correction=4 # len(backlog)=23 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=29, is_valid_solution=0, first_correction=-1 # len(backlog)=23 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=3, is_valid_solution=0, 
first_correction=-1 # len(backlog)=23 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=23 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=6 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=23 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=6 # valid_prefix_length=1, is_valid_solution=0, first_correction=-1 # len(backlog)=22 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=5, is_valid_solution=0, first_correction=4 # len(backlog)=22 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=29, is_valid_solution=0, first_correction=-1 # len(backlog)=22 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=22 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=22 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=7 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=22 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=7 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=5, is_valid_solution=0, first_correction=4 # len(backlog)=21 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # 
model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=127, is_valid_solution=1, first_correction=-1 # len(backlog)=21 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=8 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=21 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=8 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=47, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=7, i=9 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=20 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=7, i=9 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=19 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=19 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=19 # 
model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=19 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=0 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=19 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=0 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=18 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=18 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=18 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=1 # valid_prefix_length=11, is_valid_solution=0, first_correction=-1 # len(backlog)=18 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=1 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=17 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=17 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=17 # 
model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=17 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=2 # valid_prefix_length=7, is_valid_solution=0, first_correction=-1 # len(backlog)=17 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=2 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=16 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=16 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=16 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=16 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=3 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=16 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=3 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=15 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=15 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=15 # model_name='gemini-2.5-pro-preview-06-05', 
reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=15 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=4 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=15 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=4 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=0, is_valid_solution=0, first_correction=-1 # len(backlog)=14 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=14 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=14 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=5 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=14 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=5 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=13 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=13 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, 
i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=13 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=6 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=13 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=6 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=4, is_valid_solution=0, first_correction=-1 # len(backlog)=12 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=12 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=12 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=7 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=12 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=7 # valid_prefix_length=9, is_valid_solution=0, first_correction=-1 # len(backlog)=11 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=11 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=11 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=31, 
is_valid_solution=1, first_correction=-1 # len(backlog)=11 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=8 # valid_prefix_length=6, is_valid_solution=0, first_correction=-1 # len(backlog)=11 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=8 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=10 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=5, is_valid_solution=0, first_correction=-1 # len(backlog)=10 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=3, is_valid_solution=0, first_correction=-1 # len(backlog)=10 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=31, is_valid_solution=1, first_correction=-1 # len(backlog)=10 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=5, i=9 # valid_prefix_length=13, is_valid_solution=0, first_correction=-1 # len(backlog)=10 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=5, i=9 # valid_prefix_length=27, is_valid_solution=0, first_correction=4 # len(backlog)=9 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # 
len(backlog)=9 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=9 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=0 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=8 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=1 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='gpt-4.1-2025-04-14', 
reasoning_budget=0, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=7 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=2 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=6 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=3 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=4 # 
valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=5 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=4 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=4 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=5 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=3 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, 
first_correction=-1 # len(backlog)=3 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=6 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=2 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=7 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=1 # 
model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=8 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='claude-3-7-sonnet-20250219', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='deepseek-chat', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1 # len(backlog)=0 # model_name='sonar-reasoning-pro', reasoning_budget=32000, tries=0, n=3, i=9 # valid_prefix_length=7, is_valid_solution=1, first_correction=-1
Plotting¶
def plot_hanoi_results(base_results_df, adj_results_df, csv_results_df, trck_results_df):
    """Draw one row of plots per experiment variant, one subplot per metric."""
    for i, (title, results_df) in enumerate(
        (
            ("Baseline", base_results_df),
            ("Adjusted", adj_results_df),
            ("CSV", csv_results_df),
            ("State, Corr.", trck_results_df),
        )
    ):
        plot_cols = ["mean_accuracy", "log_mean_valid_prefix_length", "log_mean_thoughts_chars"]
        plot_height = 2.7

        # first_correction is -1 when no correction was made, so the extra
        # column is plotted only for variants where a correction occurred.
        if results_df["first_correction"].max() > -1.0:
            plot_cols.append("mean_first_correction")
            plot_height = 5

        axs = plt.subplots(1, len(plot_cols), figsize=(14, plot_height))[1]

        # Average the per-try metrics for each (experiment, n, model, budget, temperature).
        scores = (
            results_df
            .groupby(["experiment", "n", "model", "reasoning_budget", "temperature"])
            .mean()
            .reset_index()
            .rename(
                columns={
                    "is_valid_solution": "mean_accuracy",
                    "valid_prefix_length": "mean_valid_prefix_length",
                    "first_correction": "mean_first_correction",
                    "thoughts_chars": "mean_thoughts_chars",
                },
            )
        )

        # log1p keeps zero-valued means finite on the log-scaled plots.
        scores["log_mean_valid_prefix_length"] = np.log1p(scores["mean_valid_prefix_length"])
        scores["log_mean_thoughts_chars"] = np.log1p(scores["mean_thoughts_chars"])

        for j, col in enumerate(plot_cols):
            ax = axs[j]

            # One line per model and reasoning mode (thinking vs. non-thinking).
            for model in sorted(scores["model"].unique()):
                model_scores = scores[scores["model"] == model]

                for reasoning_budget in sorted(model_scores["reasoning_budget"].unique()):
                    model_variant_scores = model_scores[model_scores["reasoning_budget"] == reasoning_budget]
                    suffix = "_think" if reasoning_budget > 0 else "_nothink"
                    x = []
                    y = []

                    for n in sorted(model_variant_scores["n"].unique()):
                        x.append(n)
                        y.append(model_variant_scores[model_variant_scores["n"] == n][col].mean())

                    ax.plot(x, y, label=f"{model}{suffix}")

            ax.set_xticks(x)
            ax.set_title(f"{title} ({col})")
            ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.2))

        plt.tight_layout()
        plt.show()
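The aggregation step inside plot_hanoi_results can be tried in isolation. The following is a minimal sketch, assuming a results table with the same column names as the one used above; the toy rows, the model name "toy-model", the temperature value, and the variable names toy_results_df and toy_scores are hypothetical and serve only as illustration. It shows how the per-try flags are averaged into an accuracy per complexity level and how log1p compresses the prefix lengths.

```python
import numpy as np
import pandas as pd

# Hypothetical toy results: two tries per complexity level for one made-up model.
toy_results_df = pd.DataFrame(
    {
        "experiment": ["baseline"] * 4,
        "n": [3, 3, 10, 10],
        "model": ["toy-model"] * 4,
        "reasoning_budget": [0] * 4,
        "temperature": [1.0] * 4,
        "is_valid_solution": [1, 1, 0, 1],
        "valid_prefix_length": [7, 7, 100, 1023],
        "first_correction": [-1, -1, -1, -1],
        "thoughts_chars": [0, 0, 0, 0],
    }
)

# Same grouping keys and renames as in plot_hanoi_results.
group_cols = ["experiment", "n", "model", "reasoning_budget", "temperature"]
toy_scores = (
    toy_results_df
    .groupby(group_cols)
    .mean()
    .reset_index()
    .rename(
        columns={
            "is_valid_solution": "mean_accuracy",
            "valid_prefix_length": "mean_valid_prefix_length",
            "first_correction": "mean_first_correction",
            "thoughts_chars": "mean_thoughts_chars",
        }
    )
)

# log1p(x) = log(1 + x), so a mean prefix length of 0 maps to 0 instead of -inf.
toy_scores["log_mean_valid_prefix_length"] = np.log1p(toy_scores["mean_valid_prefix_length"])

print(toy_scores[["n", "mean_accuracy", "mean_valid_prefix_length", "log_mean_valid_prefix_length"]])
```

With these toy rows, the mean accuracy is 1.0 for n=3 and 0.5 for n=10, because one of the two tries at n=10 is marked invalid.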