Obfuscation Improves AI Code Comprehension?!¶

Abstract¶

  • State-of-the-art LLMs and LRMs are tasked with explaining code snippets that implement various algorithms. To prevent the models from relying on surface-level and structural similarities to the textbook versions of the algorithms, which they may have encountered in their training corpora, the code examples employ various overcomplications, e.g. using Newton's method to calculate a square root or Euclid's algorithm to check divisibility. In many cases, AI assistants (especially non-reasoning models) struggle to interpret the presented code snippets correctly.

  • The exact same code snippets are processed again with the same AI models and prompts, but this time all function and variable names are replaced with single letters which hold no information about their intended purpose or usage.

  • The lack of descriptive names forces the AI models to analyze the code more thoroughly, often resulting in more accurate explanations.

Introduction¶

Does Artificial Intelligence (AI) understand software source code by reading it thoroughly, or does it merely skim the text and rely on statistics, surface-level similarities to its training corpora, and perhaps educated guessing to assess how the code is supposed to work, what it is supposed to do, and whether it contains any bugs?

Can surface-level, syntactic obfuscation improve the source code comprehension of AI models by forcing them to carefully analyze the code's structure and formulas, thereby making them less likely to hallucinate non-existent bugs or to overlook important behaviors?

Exploring questions like these is becoming more and more important with the increasing use of AI in software development.

Experiments¶

State of the art large language models (LLMs) and large reasoning models (LRMs) will be tasked with understanding and explaining Python implementations of various algorithms and calculations:

  • longest palindromic substring,
  • twin primes,
  • statistical standardization,
  • HighLife, a cellular automaton which is a variant of Conway's Game of Life,
  • and an approximation of $\pi$.

In order to force the models to actually reason about the code (rather than pattern-matching and reciting explanations which they may have memorized from their respective training corpora), the implementations will employ various modifications compared to their standard, textbook-like versions. These modifications may affect their computational efficiency and their time complexity, but not their correctness, as demonstrated by unit tests.
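
To give a sense of what such a modification might look like, here is a minimal sketch of a primality check that computes its loop bound with an integer Newton iteration and tests divisibility with Euclid's algorithm instead of the usual `%` check. This is not the code used in the experiments, and all function names are hypothetical.

```python
def integer_sqrt_newton(n: int) -> int:
    """Floor of the square root of n, computed with Newton's iteration."""
    if n < 2:
        return n

    x = n
    y = (x + 1) // 2

    # The iterates decrease monotonically until they reach floor(sqrt(n)).
    while y < x:
        x = y
        y = (x + n // x) // 2

    return x


def divides_via_euclid(d: int, n: int) -> bool:
    """For positive d and n: d divides n exactly when gcd(n, d) == d."""
    a, b = n, d

    while b:
        a, b = b, a % b

    return a == d


def is_prime(n: int) -> bool:
    if n < 2:
        return False

    for d in range(2, integer_sqrt_newton(n) + 1):
        if divides_via_euclid(d, n):
            return False

    return True


print([n for n in range(30) if is_prime(n)])
```

The result is the same as that of the textbook trial-division test, but the familiar `n % d == 0` and `math.sqrt` landmarks are gone, so a reader has to follow the arithmetic to confirm that the code is correct.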

On top of these semantic obfuscations, each implementation will be presented to the AIs in three variants:

  1. Meaningful names: descriptive function and variable names which reflect their purpose and intended usage throughout the algorithm.
  2. Meaningless names: single letter function and variable names which have nothing to do with their respective purpose and usage. The main function is called f(), parameters and local variables take their name from the alphabet in order of appearance.
  3. Ignore names: descriptive function and variable names which reflect their purpose and intended usage throughout the algorithm, but the AI is asked not to rely on them due to their potentially misleading nature.
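
As an illustration of the first two variants, the sketch below shows the same toy function once with descriptive names and once renamed according to the scheme described above (`f()` for the function, then `a`, `b`, `c` in order of appearance). The function itself is a made-up example and is not one of the experiments.

```python
# Variant 1: meaningful names.
def count_vowels(text):
    vowel_count = 0

    for character in text:
        if character.lower() in "aeiou":
            vowel_count += 1

    return vowel_count


# Variant 2: meaningless names, assigned in order of appearance.
def f(a):
    b = 0

    for c in a:
        if c.lower() in "aeiou":
            b += 1

    return b


# Both variants compute the same value; only the names differ.
print(count_vowels("Obfuscation"), f("Obfuscation"))
```

The third variant would reuse the first listing unchanged and differ only in the prompt.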

After their detailed explanations, the models will summarize their understanding in a single sentence which will be matched against patterns that indicate a correct understanding and against patterns which indicate misunderstanding. An explanation will be considered correct if the summary matches the acceptance pattern and does not match any of the misunderstanding patterns.
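
The acceptance rule can be sketched roughly as follows. The regular expressions are placeholders made up for this illustration; the patterns used in the actual experiments may differ.

```python
import re

# Hypothetical patterns for a "longest palindromic substring" experiment.
ACCEPT_PATTERN = re.compile(r"longest palindrom", re.IGNORECASE)
MISUNDERSTANDING_PATTERNS = [
    re.compile(r"\b(bug|flaw|incorrect)", re.IGNORECASE),
]


def is_accurate(summary: str) -> bool:
    """
    A summary counts as correct only if it matches the acceptance
    pattern and none of the misunderstanding patterns.
    """

    if ACCEPT_PATTERN.search(summary) is None:
        return False

    return not any(p.search(summary) for p in MISUNDERSTANDING_PATTERNS)


print(is_accurate("Finds the longest palindromic substring of the input."))
print(is_accurate("Tries to find the longest palindrome but contains a bug."))
```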

Due to their complexity, the explanations of the $\pi$ approximation algorithm will be classified blindly into 3 categories by GPT 4.1:

  1. Approximates $\pi$: the code is supposed to approximate $\pi$ and it does so.
  2. Approximates $\frac{1}{4}\pi$: the code is supposed to approximate $\frac{1}{4}\pi$ and it does so (however, the name of the function may be unclear about this).
  3. The code contains a bug: the code is not doing what it is supposed to do due to a programming mistake.

To avoid self-enhancement bias, the evaluator will not be told which category corresponds to correct answers, nor will it be given the evaluated code. Its task will be merely to classify each answer into one of the 3 categories above. Category 1 answers will be considered accurate, and the rest will be considered inaccurate.

Each experiment will be repeated 10 times.

Models¶

  • claude-opus-4-20250514 by Anthropic (with and without CoT reasoning),
  • deepseek-chat (DeepSeek-V3-0324 as of July 2025) by DeepSeek (without CoT reasoning),
  • gemini-2.5-pro-preview-06-05 by Google (with CoT reasoning),
  • gpt-4.1-2025-04-14 by OpenAI (without CoT reasoning),
  • sonar-reasoning-pro by Perplexity AI (with CoT reasoning; powered by DeepSeek R1).

Results¶

  • Regardless of the semantic and syntactic obfuscations, all models were able to correctly summarize the Longest Palindromic Substring algorithm.

  • Claude Opus 4 with reasoning turned off had trouble understanding the semantically obfuscated Twin Primes and the HighLife algorithms, usually due to suspected bugs, but the model's accuracy improved dramatically when both semantic and syntactic obfuscations were in effect. The rest of the models performed quite well, with only occasional mistakes.

  • The semantic obfuscations in the Statistical Standardization calculation successfully made all non-reasoning models suspect bugs in the formulas, but they regained high accuracy when both semantic and syntactic obfuscations came into effect.

  • The semantic obfuscations in the $\pi$ Approximation algorithm were buried deep in the math, and they caused several misunderstandings. A common theme was that the algorithm would approximate $\frac{1}{4}\pi$ instead of $\pi$, which was sometimes assumed to be the intended behavior of the code, other times it was reported as a bug.

    Despite the semantic obfuscations, DeepSeek-V3-0324 performed surprisingly well in this challenge when the code used descriptive names, but its performance collapsed when the names were replaced with meaningless letters. This, together with the fact that its explanations tend to be rather vague about the $\times 4$ scaling issue, suggests that the good performance might come from a combination of the model's trust in the function and variable names and the surface-level similarities to the standard Monte Carlo approximation method. Vagueness around the scaling factor was also common in those explanations by non-reasoning models which were classified into the "approximates $\pi$" (accurate) category.

    Claude Opus 4 (with and without reasoning), GPT 4.1, and Sonar Reasoning Pro performed rather poorly in the $\pi$ Approximation experiment. Syntactic obfuscation confused all non-reasoning models even further, while it improved the performance of reasoning models. Manual review of the explanations of the latter group indicates that they usually successfully figure out the trick with the scaling.

  • Google Gemini 2.5 Pro performed flawlessly in all of the experiments, with all source code variants. This may be related to the fact that this model gave the longest, most detailed explanations for each analyzed code snippet.

  • Keeping the original function and variable names while asking the AI not to rely on them (because they may be misleading), and emphasizing careful analysis and the verification of assumptions, can improve accuracy, but this method can also increase the probability of false-positive bug reports.

The raw model answers are available in the GitHub repository.

The average accuracies and the most common mistakes are shown in the plots below. Models are highlighted in bold where the accuracy change (compared to the experiments with descriptive names) is statistically significant ($p \le 5\%$) according to Fisher's exact test.
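
The plotting code in the appendix relies on `scipy.stats.fisher_exact` for this test. For readers who want to see what is being computed, the following is a pure-Python sketch of the two-sided test on a 2×2 table of accurate and inaccurate counts. The function name and the example counts are made up for illustration.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """
    Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    the sum of the probabilities of all tables with the same margins
    that are at most as likely as the observed one.
    """

    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)

    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerate rounding errors
    )


# Hypothetical counts: 2 of 10 accurate before, 9 of 10 accurate after.
p_value = fisher_exact_two_sided(2, 8, 9, 1)
print(f"p = {p_value:.4f}, significant: {p_value <= 0.05}")
```

With only 10 repetitions per experiment, a change has to be fairly large to pass the 5% threshold: in this sketch, going from 8 to 9 accurate answers out of 10 yields a p-value of 1.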

Longest Palindromic Substring¶

In [43]:
try:
    plot_results(
        "Longest Palindromes",
        longest_palindromic_susbtring_meaningful_names_results,
        longest_palindromic_susbtring_meaningless_names_results,
        longest_palindromic_susbtring_ignore_names_results,
    )
except:
    print("Run all the blocks in the Appendix: Code section first.")

Twin Primes¶

All models figured out the purpose of the algorithm in this experiment (even without the clue in the function's name), but some claimed that the primality testing logic contained a critical flaw which would prevent the algorithm from correctly identifying prime numbers. Since the algorithm is actually correct, these responses were considered inaccurate.

In [44]:
try:
    plot_results(
        "Twin Primes",
        twin_primes_meaningful_names_results,
        twin_primes_meaningless_names_results,
        twin_primes_ignore_names_results,
    )
except:
    print("Run all the blocks in the Appendix: Code section first.")

Standardization¶

In [45]:
try:
    plot_results(
        "Standardization",
        std_meaningful_names_results,
        std_meaningless_names_results,
        std_ignore_names_results,
    )
except:
    print("Run all the blocks in the Appendix: Code section first.")

HighLife¶

In [46]:
try:
    plot_results(
        "HighLife",
        highlife_meaningful_names_results,
        highlife_meaningless_names_results,
        highlife_ignore_names_results,
    )
except:
    print("Run all the blocks in the Appendix: Code section first.")

$\pi$ Approximation¶

Again, all models figured out the main purpose of the algorithm, but they often failed to recognize that the implementation, though somewhat unusual, is actually correct. The most common misunderstanding was that the code would actually approximate $\frac{\pi}{4}$ instead of $\pi$, and it was often reported as a bug.

Interestingly, non-reasoning models had more success with the human-readable version, while reasoning models improved their performance when the code was using meaningless variable names. (Gemini 2.5 Pro Preview 06-05 aced this challenge in both variants.)

The apparent successes of non-reasoning models with the code using descriptive names were usually a result of the models' trust in the provided function name, and a vague, superficial explanation for the implicit scaling.
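
The $\pi$ approximation code used in the experiment is not reproduced in this part of the notebook. As a purely hypothetical illustration of how a scaling factor can be implicit, the sketch below counts integer lattice points inside a full circle and divides by $r^2$ rather than by the area $(2r)^2$ of the bounding square. A reader who expects the familiar "fraction of points inside the square, times 4" pattern might conclude that the `4` is missing, yet the result approximates $\pi$ and not $\frac{\pi}{4}$.

```python
import math


def approximate_pi(radius: int) -> float:
    """
    Count the integer points (x, y) with x^2 + y^2 <= radius^2.
    Their number is close to the area of the circle, pi * radius^2,
    so dividing by radius^2 (not by the (2 * radius)^2 area of the
    bounding square) approximates pi without an explicit factor of 4.
    """

    inside = 0

    for x in range(-radius, radius + 1):
        # Points in this column have |y| <= floor(sqrt(r^2 - x^2)).
        half_height = math.isqrt(radius * radius - x * x)
        inside += 2 * half_height + 1

    return inside / (radius * radius)


print(approximate_pi(1000))
```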

In [47]:
try:
    plot_results(
        r"$\pi$ Approximation",  # raw string: "\p" is not a valid escape sequence
        pi_meaningful_names_results,
        pi_meaningless_names_results,
        pi_ignore_names_results,
    )
except:
    print("Run all the blocks in the Appendix: Code section first.")

Further Research¶

  • Does the presence or the lack of descriptive names affect the correctness of code modifications made by AI?

Appendix: Code¶

Dependencies¶

In [1]:
# !pip install matplotlib==3.10.0
# !pip install numpy==2.2.3
# !pip install pandas==2.2.3
# !pip install requests==2.32.3
# !pip install scipy==1.15.2
In [2]:
import collections.abc as collabc
import functools
import gzip
import inspect
import json
import math
import os
import os.path
import re
import time
import typing
import urllib.parse

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import scipy.stats

API keys¶

My personal API keys are not included in the public repository, so generating new model responses will require setting these up. See the api-keys.json.example file for the details. Note however that the notebook can be run without any API keys if the cache directory from the GitHub repository is available.

In [3]:
api_keys_filename = "api-keys.json"

if not os.path.isfile(api_keys_filename):
    raise RuntimeError(f"API keys file not found: {api_keys_filename!r}")

with open(api_keys_filename, "r") as f:
    api_keys = json.load(f)


print("API keys: " + ", ".join(sorted(api_keys.keys())))
API keys: anthropic, deepseek, google, openai, perplexity

Common Utilities¶

This block contains a convenience function for sending the same system and user prompts to all the models, as well as various cached HTTP request related utilities.

Caching all the requests and responses makes debugging and re-running the notebook easier and quicker, but sensitive and potentially sensitive data like API keys and various identifiers need to be removed from the cached data so that they are safe to be published on GitHub.
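
The idea can be sketched in a few lines: look for a gzipped JSON file first, and only call the expensive function (and store its scrubbed result) on a cache miss. The names `cached_call` and `fake_request` are hypothetical and serve only this illustration. The notebook's own implementation follows in the next cell and also handles nested fields.

```python
import gzip
import json
import os
import tempfile


def cached_call(cache_filename, compute, sensitive_keys=()):
    """Return the cached result if present, else compute, scrub and store it."""

    if os.path.isfile(cache_filename):
        with gzip.open(cache_filename, "rt") as f:
            return json.load(f)

    result = compute()

    # Drop top-level keys that must not end up in a public repository.
    sensitive = {key.lower() for key in sensitive_keys}
    scrubbed = {k: v for k, v in result.items() if k.lower() not in sensitive}

    os.makedirs(os.path.dirname(cache_filename), exist_ok=True)

    with gzip.open(cache_filename, "wt") as f:
        json.dump(scrubbed, f, indent=2)

    return scrubbed


calls = []


def fake_request():
    # Stand-in for an HTTP request; records how often it is invoked.
    calls.append(1)
    return {"Authorization": "Bearer secret", "body": "hello"}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache", "demo.json.gz")
    first = cached_call(path, fake_request, ["authorization"])
    second = cached_call(path, fake_request, ["authorization"])

print(first, second, len(calls))
```

The second call is served from the file, so `fake_request` runs only once, and the `Authorization` value never reaches the disk.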

In [4]:
MAX_OUT_TOKENS = 32000
MAX_REASONING_TOKENS = 16000
TEMPERATURE = 1.0

MODELS = {
    "sonnet": "claude-opus-4-20250514",
    "deepseek": "deepseek-chat",  # DeepSeek-V3 as of Jun 2025
    "gemini": "gemini-2.5-pro-preview-06-05",
    "gpt4": "gpt-4.1-2025-04-14",
    "perplexity": "sonar-reasoning-pro",
}

MODEL_FN = {}

MODEL_R = {
    "sonnet": [0, MAX_REASONING_TOKENS],
    "deepseek": [0],
    "gemini": [MAX_REASONING_TOKENS],
    "gpt4": [0],
    "perplexity": [MAX_REASONING_TOKENS],
}


def query_all(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
):
    for model_name, query_fn in MODEL_FN.items():
        for reasoning_budget in MODEL_R[model_name]:
            response, thoughts = query_fn(
                experiment_name,
                system_prompt,
                user_prompt,
                temperature,
                max_out_tokens,
                reasoning_budget,
            )

            yield MODELS[model_name], reasoning_budget, response, thoughts


def send_cached_post_request(
        cache_filename: str,
        url: str,
        request_headers: collabc.Mapping,
        request_body: collabc.Mapping,
        sensitive_headers: collabc.Container=(),
        sensitive_body_fields: collabc.Container=(),
):
    sensitive_headers = {h.lower() for h in sensitive_headers}
    sensitive_body_fields = {f.lower() for f in sensitive_body_fields}

    cache_dir = os.path.dirname(cache_filename)

    os.makedirs(cache_dir, exist_ok=True)
    
    if os.path.isfile(cache_filename):
        with gzip.open(cache_filename, "rt") as f:
            return json.load(f)
    
    try:
        response = requests.post(url, headers=request_headers, json=request_body)
        response.raise_for_status()

        result = {
            "request": {
                "headers": del_items(request_headers, sensitive_headers),
                "body": del_items(request_body, sensitive_body_fields),
            },
            "response": {
                "headers": del_items(response.headers, sensitive_headers),
                "body": del_items(response.json(), sensitive_body_fields),
            }
        }

        with gzip.open(cache_filename, "wt", compresslevel=9) as f:
            json.dump(result, f, indent=2)

        return result

    except Exception as exc:
        print(f"Exception: ({type(exc)}) {exc}")

        if hasattr(exc, "response") and exc.response is not None:
            print(f"Response status code: {exc.response.status_code}")
            print(f"Response body: {exc.response.text}")

        raise


def build_cache_filename(experiment_name: str, model_name: str, temperature: float):
    experiment_name = experiment_name.strip()
    experiment_dir = os.path.dirname(experiment_name)
    experiment_file = os.path.basename(experiment_name)

    if experiment_dir == "":
        experiment_dir = experiment_file

    return os.path.join(
        "cache",
        experiment_dir,
        (f"{experiment_file}-{model_name}-t{temperature:.3f}".replace(".", "_")) + ".json.gz",
    )


def get_item(container, path: str, default=None):
    """
    Extract data from nested dicts and lists based on a dot-separated
    path string. See test_get_item() for examples.
    """

    if path == "." or path == "":
        return container

    path = path.split(".")

    for key in path:
        if isinstance(container, collabc.Mapping):
            if key in container:
                container = container[key]
            else:
                return default
        elif isinstance(container, collabc.Sequence):
            if int(key) < len(container):
                container = container[int(key)]
            else:
                return default
        else:
            return default

    return container


def del_items(container, patterns: typing.List[str]):
    """
    Return a copy of a nested dicts and lists object with the
    values matching the given set of dot-separated paths removed.
    The "*" character acts as a wildcard. See test_del_items()
    for examples.
    """

    def should_include(path: list, exclude_patterns: typing.List[tuple]) -> bool:
        return not any(path_matches_pattern(path, ptrn) for ptrn in exclude_patterns)

    def copy_recursive(obj, path: list, exclude_patterns: typing.List[tuple]):
        if isinstance(obj, str):
            return obj

        if isinstance(obj, collabc.Mapping):
            copy = {}

            for k, v in obj.items():
                path_ext = path + [k]

                if should_include(path_ext, exclude_patterns):
                    copy[k] = copy_recursive(v, path_ext, exclude_patterns)

            return copy

        if isinstance(obj, collabc.Sequence):
            copy = []

            for k, v in enumerate(obj):
                path_ext = path + [str(k)]

                if should_include(path_ext, exclude_patterns):
                    copy.append(copy_recursive(v, path_ext, exclude_patterns))

            return copy

        return obj

    for pattern in patterns:
        if pattern == "." or pattern == "":
            raise ValueError(f"Invalid pattern; {pattern=!r}")

    patterns = [tuple(pattern.lower().split(".")) for pattern in patterns]
    
    return copy_recursive(container, [], patterns)


def path_matches_pattern(path: collabc.Sequence, pattern: collabc.Sequence) -> bool:
    if len(path) != len(pattern):
        return False

    for path_component, pattern_component in zip(path, pattern):
        matches = (
            pattern_component == "*"
            or pattern_component == path_component.lower()
        )

        if not matches:
            return False

    return True


def split_lines(text: str) -> list:
    """
    Normalize line-breaks (Windows, Linux, Mac, etc.) then split
    the given text into separate lines.
    """

    return (
        text.replace("\r\n", "\n")
            .replace("\r", "\n")
            .strip()
            .split("\n")
    )


def test_get_item():
    container = {"aaa": [{"bbb": "42", "ccc": "123"}]}

    assert_eq("42", get_item(container, "aaa.0.bbb"))
    assert_eq(None, get_item(container, "aaa.2.zzz"))


def test_del_items():
    container = {"aaa": [{"bbb": "42", "ccc": "123", "ddd": "hello"}]}

    assert_eq({"aaa": [{"ddd": "hello"}]}, del_items(container, ["aaa.*.ccc", "*.*.bbb", "zzz"]))


def assert_eq(a, b):
    assert a == b, f"Failed to assert that a = b; {a=!r}, {b=!r}"


test_get_item()
test_del_items()

API Clients¶

Anthropic Claude Client¶

In [5]:
def query_claude_sonnet(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=MAX_REASONING_TOKENS,
):
    # https://docs.anthropic.com/en/api/messages

    model_name = MODELS["sonnet"]
    suffix = "-nothink"
    thinking = {"type": "disabled"}

    # https://console.anthropic.com/settings/limits
    max_out_tokens = min(64000, max_out_tokens)

    if reasoning_budget > 0:
        # Thinking requires temperature to be exactly 1.
        temperature = 1
        reasoning_budget = min(int(max_out_tokens * 0.7) + 1, reasoning_budget)
        suffix = "-think"
        thinking = {
            "type": "enabled",
            "budget_tokens": reasoning_budget,
        }

    cache_filename = build_cache_filename(experiment_name, model_name + suffix, temperature)
    request_headers = {
        "x-api-key": api_keys["anthropic"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json"
    }
    request_body = {
        "model": model_name,
        "max_tokens": max_out_tokens,
        "temperature": temperature,
        "stream": False,
        "system": system_prompt,
        "thinking": thinking,
        "messages": [
            {"role": "user", "content": user_prompt}
        ]
    }
    result = send_cached_post_request(
        cache_filename,
        "https://api.anthropic.com/v1/messages",
        request_headers,
        request_body,
        sensitive_headers=["x-api-key", "anthropic-organization-id", "request-id", "CF-RAY"],
        sensitive_body_fields=["id"],
    )

    text = None
    thoughts = None
    
    for content in get_item(result, "response.body.content"):
        content_type = get_item(content, "type")

        if content_type == "text":
            text = content["text"]
        elif content_type == "thinking":
            thoughts = content["thinking"]

    return text, thoughts


MODEL_FN["sonnet"] = query_claude_sonnet

DeepSeek Client¶

In [6]:
def query_deepseek(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=MAX_REASONING_TOKENS,
):
    # https://api-docs.deepseek.com/api/create-chat-completion

    if reasoning_budget > 0:
        raise NotImplementedError()
    
    max_out_tokens = min(8192, max_out_tokens)
    model_name = MODELS["deepseek"]
    cache_filename = build_cache_filename(experiment_name, model_name + "-nothink", temperature)
    request_headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_keys["deepseek"],
    }
    request_body = {
        "model": model_name,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_out_tokens,
        "response_format": {"type": "text"},
        "stream": False,
        "temperature": temperature,
    }
    result = send_cached_post_request(
        cache_filename,
        "https://api.deepseek.com/chat/completions",
        request_headers,
        request_body,
        sensitive_headers=["Authorization", "Set-Cookie", "x-ds-trace-id", "CF-RAY"],
        sensitive_body_fields=["id"],
    )

    text = None
    thoughts = None
    
    for choice in get_item(result, "response.body.choices"):
        if get_item(choice, "message.role") == "assistant":
            text = get_item(choice, "message.content")

    return text, thoughts


MODEL_FN["deepseek"] = query_deepseek

Google Gemini Client¶

In [7]:
def query_gemini(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=MAX_REASONING_TOKENS,
        system_prompt_key: str="systemInstruction",
):
    # https://ai.google.dev/gemini-api/docs/text-generation
    # https://ai.google.dev/api/generate-content#method:-models.generatecontent

    reasoning_budget = min(32768, reasoning_budget)
    max_out_tokens = max(reasoning_budget + 128, max_out_tokens)
    
    model_name = MODELS["gemini"]
    suffix = "-nothink"
    thinking = {
        "includeThoughts": False,
        "thinkingBudget": 0,
    }

    if reasoning_budget > 0:
        suffix = "-think"
        thinking = {
            "includeThoughts": True,
            "thinkingBudget": reasoning_budget,
        }

    cache_filename = build_cache_filename(experiment_name, model_name, temperature)
    request_headers = {
        "Content-Type": "application/json",
    }
    request_body = {
        system_prompt_key: {
            "parts": [{"text": system_prompt}],
        },
        "contents": [
            {"parts": [{"text": user_prompt}]},
        ],
        "generationConfig": {
            "temperature": temperature,
            "maxOutputTokens": max_out_tokens,
            "responseModalities": ["text"],
            "thinkingConfig": thinking,
        },
    }
    url = "".join(
        (
            "https://generativelanguage.googleapis.com/v1beta/models/",
            urllib.parse.quote_plus(model_name),
            ":generateContent?key=",
            urllib.parse.quote_plus(api_keys["google"]),
        )
    )
    result = send_cached_post_request(
        cache_filename,
        url,
        request_headers,
        request_body,
        sensitive_headers=[],
        sensitive_body_fields=[],
    )

    text = None
    thoughts = None
    
    for candidate in get_item(result, "response.body.candidates"):
        if get_item(candidate["content"], "role") == "model":
            for part in get_item(candidate, "content.parts"):
                part_text = get_item(part, "text")

                if part_text is not None:
                    if get_item(part, "thought"):
                        thoughts = part_text
                    else:
                        text = part_text

    return text, thoughts


MODEL_FN["gemini"] = query_gemini

As of June 2025, some parts of the Gemini API documentation use snake_case for the system prompt field, while other parts use camelCase. The code below tries both in order to see whether either or both are actually accepted by the API.

In [8]:
print("# system_instruction:")
print(
    query_gemini(
        'pirate-snake_case',
        "Talk like a pirate.",
        "Explain in one brief sentence why the sky is blue.",
        system_prompt_key="system_instruction",
    )[0]
)
print("")
print("# systemInstruction:")
print(
    query_gemini(
        'pirate-camelCase',
        "Talk like a pirate.",
        "Explain in one brief sentence why the sky is blue.",
        system_prompt_key="systemInstruction",
    )[0]
)
# system_instruction:
Arrr, the sky be blue 'cause the air scatters the sun's blue light about more than all the other colors in its treasure chest

# systemInstruction:
Arrr, the sky be blue 'cause the air scatters the sun's blue light more than all the other colors, savvy?

OpenAI Client¶

In [9]:
def query_openai(
        model_name: str,
        accepts_temperature: bool,
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=0,
):
    # https://platform.openai.com/docs/guides/text?api-mode=responses
    # https://platform.openai.com/docs/api-reference/responses/create

    if reasoning_budget > 0:
        raise NotImplementedError()

    cache_filename = build_cache_filename(experiment_name, model_name + "-nothink", temperature)
    request_headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_keys["openai"],
    }
    request_body = {
        "model": model_name,
        "max_output_tokens": max_out_tokens,
        "input": [
            {"role": "developer", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
    }

    if accepts_temperature:
        request_body["temperature"] = temperature
    
    result = send_cached_post_request(
        cache_filename,
        "https://api.openai.com/v1/responses",
        request_headers,
        request_body,
        sensitive_headers=["Authorization", "openai-organization", "openai-project", "x-request-id", "Set-Cookie", "CF-RAY"],
        sensitive_body_fields=["id", "output.*.id"],
    )

    text = None
    thoughts = None
    
    for output in get_item(result, "response.body.output"):
        if get_item(output, "type") == "message" and get_item(output, "role") == "assistant":
            for content in get_item(output, "content", []):
                if get_item(content, "type") == "output_text":
                    text = get_item(content, "text")

    return text, thoughts


query_gpt4 = functools.partial(query_openai, MODELS["gpt4"], True)

MODEL_FN["gpt4"] = query_gpt4

Perplexity AI Client¶

In [10]:
def query_perplexity(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_budget: int=MAX_REASONING_TOKENS,
):
    # https://docs.perplexity.ai/guides/getting-started
    # https://docs.perplexity.ai/api-reference/chat-completions

    model_name = MODELS["perplexity"]

    cache_filename = build_cache_filename(experiment_name, model_name + "-think", temperature)
    request_headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "Authorization": "Bearer " + api_keys["perplexity"],
    }
    request_body = {
        "model": model_name,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_out_tokens,
        "temperature": temperature,
        "return_related_questions": False,
        "stream": False,
        "web_search_options": {
            "search_context_size": "low",
        },
    }
    result = send_cached_post_request(
        cache_filename,
        "https://api.perplexity.ai/chat/completions",
        request_headers,
        request_body,
        sensitive_headers=["Authorization", "Set-Cookie", "CF-RAY", ],
        sensitive_body_fields=["id"],
    )

    text = None
    thoughts = None
    response = []  # stays empty if no assistant message is found
    
    for choice in get_item(result, "response.body.choices"):
        if get_item(choice, "message.role") == "assistant":
            response = get_item(choice, "message.content").split("</think>", 1)

    if len(response) == 1:
        text = response[0]
    elif len(response) == 2:
        thoughts = response[0]

        if thoughts.startswith("<think>"):
            thoughts = thoughts[7:]

        text = response[1]

    return text, thoughts


MODEL_FN["perplexity"] = query_perplexity
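
The reasoning extraction in the client above assumes that the model's chain of thought arrives inline, wrapped in `<think>...</think>` at the start of the message content. A standalone sketch of that splitting step (the helper name `split_reasoning` is hypothetical) could look like this:

```python
def split_reasoning(content: str):
    """
    Split a response of the form "<think>...</think>answer" into
    (answer, thoughts). If there is no closing tag, the whole
    content is treated as the answer and thoughts is None.
    """

    head, separator, tail = content.partition("</think>")

    if not separator:
        return content, None

    # str.removeprefix() requires Python 3.9 or newer.
    return tail, head.removeprefix("<think>")


print(split_reasoning("<think>Check the loop bounds.</think>It finds twin primes."))
print(split_reasoning("It finds twin primes."))
```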

Plotting¶

In [11]:
def plot_results(
        title,
        meaningful_names_results_df,
        meaningless_names_results_df,
        ignore_names_results_df,
        include_lengths=False,
        significance=0.05,
):
    group_by_cols = ["experiment", "model", "reasoning_budget", "temperature"]
    models = sorted(meaningful_names_results_df["model"].unique())

    response_length_ylim = (
        int(
            min(
                meaningful_names_results_df["response_length"].min(),
                meaningless_names_results_df["response_length"].min(),
                ignore_names_results_df["response_length"].min(),
            ) * 0.9
        ),
        int(
            max(
                meaningful_names_results_df["response_length"].max(),
                meaningless_names_results_df["response_length"].max(),
                ignore_names_results_df["response_length"].max(),
            ) * 1.1
        ),
    )
    reasoning_length_ylim = (
        int(
            min(
                meaningful_names_results_df["reasoning_length"].min(),
                meaningless_names_results_df["reasoning_length"].min(),
                ignore_names_results_df["reasoning_length"].min(),
            ) * 0.9
        ),
        int(
            max(
                meaningful_names_results_df["reasoning_length"].max(),
                meaningless_names_results_df["reasoning_length"].max(),
                ignore_names_results_df["reasoning_length"].max(),
            ) * 1.1
        ),
    )

    rows = 3 if include_lengths else 1
    fig, axs = plt.subplots(rows, 3, figsize=(12, rows * 5))

    for i, (variant, results_df) in enumerate(
            [
                ("Baseline", meaningful_names_results_df),
                ("f(a,b,c)", meaningless_names_results_df),
                ("Ignore Names", ignore_names_results_df),
            ]
    ):
        grouped = (
            results_df
                .drop(columns=["summary", "reasoning_length", "response_length"])
                .groupby(group_by_cols)
                .mean()
                .reset_index()
                .rename(
                    columns={
                        "accurate": "Accuracy",
                        "suspects_bug": "Suspects Bug",
                        "pi_quarter": r"$\pi/4$",
                    },
                )
        )
        grp_all_columns = set(grouped.columns)
        grp_err_columns = list(sorted(grp_all_columns - (set(group_by_cols).union({"Accuracy", "Suspects Bug", "i"}))))
        err_colors = ["tab:orange"]
        colors = ["tab:blue"] + err_colors[:len(grp_err_columns)] + ["tab:red"]

        significant_changes = set()

        for model in models:
            before_mask = meaningful_names_results_df["model"] == model
            before_total = np.sum(before_mask)
            before_accurate = np.sum(meaningful_names_results_df[before_mask]["accurate"])
            before_inaccurate = before_total - before_accurate

            after_mask = results_df["model"] == model
            after_total = np.sum(after_mask)
            after_accurate = np.sum(results_df[after_mask]["accurate"])
            after_inaccurate = after_total - after_accurate

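            # Fisher's exact test on the 2x2 table of accurate/inaccurate counts
            # (baseline vs. this variant); significant models get bold tick labels.
            ft = scipy.stats.fisher_exact(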
                [
                    [before_accurate, before_inaccurate],
                    [after_accurate, after_inaccurate],
                ],
                alternative="two-sided",
            )

            if ft.pvalue <= significance:
                significant_changes.add(model)
        
        bottom = np.zeros(len(models))
        ax_idx = (0, i) if rows > 1 else i

        for j, col in enumerate(["Accuracy"] + grp_err_columns + ["Suspects Bug"]):
            if col not in grp_all_columns:
                continue

            values = np.array(
                [grouped[grouped["model"] == model][col].values[0] for model in models]
            )
            axs[ax_idx].bar(models, values, bottom=bottom, color=colors[j], label=col)
            bottom += values

        axs[ax_idx].set_ylim((0.0, 1.05))
        axs[ax_idx].set_title(f"{title} Avg. Accuracy ({variant})")
        axs[ax_idx].tick_params("x", rotation=60)
        axs[ax_idx].legend()

        for label in axs[ax_idx].get_xticklabels():
            if label.get_text() in significant_changes:
                label.set_fontweight("bold")

        plt.setp(axs[ax_idx].get_xticklabels(), horizontalalignment="right")

        if include_lengths:
            response_lengths = []
            reasoning_lengths = []
    
            for model in models: 
                model_results_df = results_df[results_df["model"] == model]
                response_lengths.append(model_results_df["response_length"])
                reasoning_lengths.append(model_results_df["reasoning_length"])
    
            axs[1, i].boxplot(response_lengths, tick_labels=models)
    
            axs[1, i].set_title(f"{title} Expl. Len. ({variant})")
            axs[1, i].tick_params("x", rotation=60)
            axs[1, i].set_ylim(response_length_ylim)
    
            plt.setp(axs[1, i].get_xticklabels(), horizontalalignment="right")
    
            axs[2, i].boxplot(reasoning_lengths, tick_labels=models)
    
            axs[2, i].set_title(f"{title} Reas. Len. ({variant})")
            axs[2, i].tick_params("x", rotation=60)
            axs[2, i].set_ylim(reasoning_length_ylim)
    
            plt.setp(axs[2, i].get_xticklabels(), horizontalalignment="right")

    plt.tight_layout()
    plt.show()
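
The bold tick labels produced by `plot_results` depend on a two-sided Fisher's exact test over a 2x2 table of accurate and inaccurate counts. The sketch below is a standard-library re-implementation of that test for illustration only; the notebook itself calls `scipy.stats.fisher_exact`. The function name `fisher_exact_two_sided` and the counts in the usage lines are hypothetical, not measured results.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 contingency table.

    Illustrative sketch: sums the hypergeometric probabilities of all
    tables (with the same margins) that are at most as likely as the
    observed one.
    """
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_observed = prob(a)
    p_value = sum(
        p for p in map(prob, range(lo, hi + 1))
        if p <= p_observed * (1 + 1e-7)  # small tolerance for float ties
    )

    return min(1.0, p_value)


# Hypothetical counts: 10 of 10 accurate before, 5 of 10 accurate after.
p = fisher_exact_two_sided([[10, 0], [5, 5]])
print(f"{p:.4f}", p <= 0.05)
```

With 10 repeats per model, a change has to be this large before it falls under the default `significance=0.05` threshold.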

Experiments¶

In [12]:
REPEATS = 10


ignore_names = """

Please break down the code step by step. Try to understand what each instruction \
is doing and how they fit into the big picture. But most importantly: **avoid \
relying on function and variable names** because they can sometimes be deceiving. \
Instead, try to figure out how this code really works. Verify your assumptions and \
pay attention to the fine details; your first impressions might be misleading.
"""


def explain_code(
        experiment_name_tpl: str,
        source_code: str,
        validity_check: typing.Union[re.Pattern, collabc.Callable],
        invalid_ptrns: typing.Dict[str, re.Pattern],
        repeats=REPEATS,
        temperature: float=TEMPERATURE,
        expertise: str="Python",
        extra_instructions: str="",
) -> pd.DataFrame:
    backlog = []
    results = {
        "experiment": [],
        "i": [],
        "model": [],
        "reasoning_budget": [],
        "reasoning_length": [],
        "response_length": [],
        "temperature": [],
        "summary": [],
        "accurate": [],
    }

    for key in invalid_ptrns.keys():
        results[key] = []
    
    for i in range(repeats):
        experiment_name = os.path.join(experiment_name_tpl, f"{experiment_name_tpl}-{i}")
        backlog.append((experiment_name, 0, i))

    system_prompt = f"Please act as a helpful AI assistant who is also an expert in {expertise}."
    user_prompt = f"""\
Can you please tell me what does the function below do?

```
{source_code}
```

Please give a detailed explanation first, and after that, close with a single sentence executive summary in the following format:

**Summary**: single sentence summary here.{extra_instructions}
"""
    
    summary_re = re.compile(r"#* *\**Summary\**:* *(.*)$", re.IGNORECASE)
    
    while len(backlog) > 0:
        experiment_name, tries, i = backlog.pop(0)

        try:
            responses = query_all(experiment_name, system_prompt, user_prompt, temperature=temperature)

            for model_name, reasoning_budget, response, thoughts in responses:
                response = str(response)
                thoughts = str(thoughts) if thoughts is not None else ""
                think = "think" if reasoning_budget > 0 else "nothink"

                log_1 = f"{model_name=!r}, {reasoning_budget=}, {tries=}, {i=}"

                print(f"# {len(backlog)=}")
                print(f"# {log_1}")

                summary = ""
                next_line_is_summary = False

                for line in response.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
                    line = line.strip()

                    if match := summary_re.match(line.strip()):
                        summary = match[1].strip().strip("**: ")
                        next_line_is_summary = summary == ""

                    elif next_line_is_summary and line != "":
                        next_line_is_summary = False
                        summary = line.strip()

                if isinstance(validity_check, re.Pattern):
                    invalid = {key: ptrn.search(summary) for key, ptrn in invalid_ptrns.items()}
                    is_valid = validity_check.search(summary) and not any(match for match in invalid.values())
                else:
                    eval_experiment_filename = os.path.join(
                        f"eval-{experiment_name_tpl}",
                        f"eval-{experiment_name_tpl}-{model_name}-{think}-i{i}-t{temperature:.1f}",
                    )
                    eval_experiment_filename = eval_experiment_filename.replace(".", "_")
                    is_valid, invalid = validity_check(response, eval_experiment_filename)

                log_2 = f"  {is_valid=} {invalid=}"

                print(f"# {log_2}")
                print(f"  {summary=}")
                print("")

                response_filename_base = f"response-{experiment_name_tpl}-{model_name}-{think}-i{i}-t{temperature:.1f}"
                response_filename_base = response_filename_base.replace(".", "_") + ".txt"
                response_filename = os.path.join(
                    "data",
                    f"responses-{experiment_name_tpl}",
                    response_filename_base,
                )
                os.makedirs(os.path.dirname(response_filename), exist_ok=True)

                with open(response_filename, "w") as f:
                    print(f"    {log_1}", file=f)
                    print(f"    {log_2}", file=f)
                    print("", file=f)
                    print("    System prompt:", file=f)
                    print("    " + system_prompt.replace("\n", "\n    "), file=f)
                    print("", file=f)
                    print("    User prompt:", file=f)
                    print("    " + user_prompt.replace("\n", "\n    "), file=f)
                    print("", file=f)
                    print("<think>", file=f)
                    print(thoughts, file=f)
                    print("</think>", file=f)
                    print("", file=f)
                    print(response, file=f)
                    print("", file=f)

                results["experiment"].append(experiment_name_tpl)
                results["i"].append(i)
                results["model"].append(f"({think}) {model_name}")
                results["reasoning_budget"].append(reasoning_budget)
                results["reasoning_length"].append(len(thoughts))
                results["response_length"].append(len(response))
                results["temperature"].append(temperature)
                results["summary"].append(summary)
                results["accurate"].append(1 if is_valid else 0)

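                # Attribute each response to at most one error category (the
                # first matching one), so the per-category rates do not
                # double count a response.
                problem_found = False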

                for key, match in invalid.items():
                    if match and not problem_found:
                        results[key].append(1)
                        problem_found = True
                    else:
                        results[key].append(0)

        except AssertionError:
            raise
            
        except Exception as exc:
            print(f"  Exception ({tries=}): ({type(exc)}) {exc}")

            if hasattr(exc, "response") and exc.response is not None:
                print(f"    Response status code: {exc.response.status_code}")
                print(f"    Response body: {exc.response.text}")

            backlog.append((experiment_name, tries + 1, i))
            time.sleep(max(3, min(5, tries)))

    results_df = pd.DataFrame(results)
    results_df.to_csv(os.path.join("data", f"{experiment_name_tpl}.csv"), index=False)

    return results_df
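
The summary extraction inside `explain_code` can be tried in isolation. The sketch below lifts the same regular expression and line loop into a standalone helper; the name `extract_summary` and the sample responses are hypothetical and exist only for this illustration.

```python
import re

# Same pattern as in explain_code: optional Markdown heading marks and
# asterisks around the word "Summary", then the rest of the line.
summary_re = re.compile(r"#* *\**Summary\**:* *(.*)$", re.IGNORECASE)


def extract_summary(response: str) -> str:
    summary = ""
    next_line_is_summary = False

    for line in response.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
        line = line.strip()

        if match := summary_re.match(line):
            # Drop leftover emphasis marks such as the "**" in "**Summary:** ...".
            summary = match[1].strip().strip("*: ")
            # A bare "Summary" heading means the text is on a later line.
            next_line_is_summary = summary == ""
        elif next_line_is_summary and line != "":
            next_line_is_summary = False
            summary = line

    return summary


print(extract_summary("Details...\n\n**Summary**: It sorts a list."))
print(extract_summary("Details...\n\n**Summary:** It sorts a list."))
print(extract_summary("Details...\n\n## Summary\n\nIt sorts a list."))
```

All three variants yield the same summary text, which is what lets the validity patterns be applied uniformly across models with different formatting habits.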

Longest Palindromic Substring¶

Find the longest palindromic substring within a string. (A palindrome is a string which reads the same both forwards and backwards.)
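
To make the problem statement concrete, here is a brute-force reference sketch that simply tries every substring. It is not one of the implementations shown to the models, and the names `is_palindrome` and `longest_palindromic_substring_brute_force` are made up for this illustration.

```python
def is_palindrome(s: str) -> bool:
    # A string is a palindrome if it equals its own reverse.
    return s == s[::-1]


def longest_palindromic_substring_brute_force(text: str) -> str:
    best = ""

    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]

            # Only strictly longer palindromes replace the current best,
            # so the leftmost one wins among equally long candidates.
            if len(candidate) > len(best) and is_palindrome(candidate):
                best = candidate

    return best


print(longest_palindromic_substring_brute_force("abcbd"))  # bcb
```

This runs in cubic time, so it only serves as a specification to compare faster solutions against; its tie-breaking between equally long palindromes may differ from other implementations.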

A textbook solution to this problem might look like the following:

In [13]:
def find_longest_palindromic_substring_basic(text: str, palindromic_substring_positions=None) -> str:
    text_end = len(text)

    if text_end < 2:
        return text

    if palindromic_substring_positions is None:
        palindromic_substring_positions = (
            [(pos, 1) for pos in range(text_end)]
            + [
                (pos, 2)
                for pos in range(text_end - 1)
                if text[pos] == text[pos + 1]
            ]
        )

    extended_palindromic_substring_positions = []

    for pos, length in palindromic_substring_positions:
        if pos == 0:
            continue

        prev_pos = pos - 1
        next_pos = pos + length

        if next_pos < text_end and text[prev_pos] == text[next_pos]:
            extended_palindromic_substring_positions.append(
                (prev_pos, length + 2)
            )

    if len(extended_palindromic_substring_positions) == 0:
        pos, length = max(palindromic_substring_positions, key=lambda e: e[1])

        return text[pos:pos + length]

    return find_longest_palindromic_substring_basic(text, extended_palindromic_substring_positions)


def test_find_longest_palindromic_substring(func):
    assert_eq("", func(""))
    assert_eq("a", func("abcdef"))

    assert_eq("a", func("a"))
    assert_eq("aa", func("aa"))
    assert_eq("aaa", func("aaa"))
    assert_eq("aaaa", func("aaaa"))
    assert_eq("aaaaa", func("aaaaa"))

    assert_eq("a", func("abc"))
    assert_eq("aa", func("baac"))
    assert_eq("aaa", func("baaac"))
    assert_eq("aaaa", func("baaaac"))
    assert_eq("aaaaa", func("baaaaac"))

    assert_eq("bab", func("bab"))
    assert_eq("baab", func("baab"))
    assert_eq("cc", func("abccdef"))
    assert_eq("aba", func("ababcdef"))
    assert_eq("efe", func("abcdefe"))
    assert_eq("aa", func("aabcde"))
    assert_eq("ee", func("abcdee"))
    assert_eq("abba", func("abbacde"))
    assert_eq("effe", func("abcdeffe"))
    assert_eq("baab", func("abaabcdef"))
    assert_eq("bcb", func("abcbd"))
    assert_eq("deed", func("abcdeedf"))
    assert_eq("BACACAB", func("abcdBABefghBABABijklmnopqBACACABrst"))


test_find_longest_palindromic_substring(find_longest_palindromic_substring_basic)

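This algorithm starts from all trivial palindromic substrings (single characters and pairs of identical adjacent characters), then repeatedly extends each of them by one character on both sides for as long as the two new characters match. When no palindrome can be extended any further, it returns the longest one from the last round.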
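
The rounds of expansion can be made visible with a small tracing sketch. It is an iterative restatement of the same idea rather than the recursive code above, and the helper name `palindrome_layers` is hypothetical.

```python
def palindrome_layers(text: str) -> list:
    """Return the palindromes found in each expansion round."""
    n = len(text)

    # Round 0: every single character, plus every pair of equal neighbours.
    layer = (
        [(pos, 1) for pos in range(n)]
        + [(pos, 2) for pos in range(n - 1) if text[pos] == text[pos + 1]]
    )
    layers = []

    while layer:
        layers.append([text[pos:pos + length] for pos, length in layer])

        # Grow each palindrome by one character on both sides, keeping it
        # only if it stays inside the text and the new characters match.
        layer = [
            (pos - 1, length + 2)
            for pos, length in layer
            if pos > 0 and pos + length < n and text[pos - 1] == text[pos + length]
        ]

    return layers


print(palindrome_layers("abcbd"))
# [['a', 'b', 'c', 'b', 'd'], ['bcb']]
```

The last non-empty round holds the longest palindromes, which corresponds to the point where the recursive version stops and returns.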

Complication: instead of checking the boundary characters while expanding, the code first collects and stores every candidate expansion regardless of whether it is palindromic, and then filters out the non-palindromic ones in a separate step.

In [14]:
def find_longest_palindromic_substring(text: str, palindromic_substring_positions=None) -> str:
    text_end = len(text)

    if text_end < 2:
        return text

    if palindromic_substring_positions is None:
        palindromic_substring_positions = (
            [(pos, 1) for pos in range(text_end)]
            + [(pos, 2) for pos in range(text_end - 1)]
        )
        palindromic_substring_positions = [
            (pos, length)
            for pos, length in palindromic_substring_positions
            if text[pos] == text[pos + length - 1]
        ]

    extended_palindromic_substring_positions = []

    for pos, length in palindromic_substring_positions:
        if pos == 0:
            continue

        if pos + length < text_end:
            extended_palindromic_substring_positions.append(
                (pos - 1, length + 2)
            )

    extended_palindromic_substring_positions = [
        (pos, length)
        for pos, length in extended_palindromic_substring_positions
        if text[pos] == text[pos + length - 1]
    ]

    if len(extended_palindromic_substring_positions) == 0:
        pos, length = max(palindromic_substring_positions, key=lambda e: e[1])

        return text[pos:pos + length]

    return find_longest_palindromic_substring(text, extended_palindromic_substring_positions)


def f(a: str, b=None) -> str:
    c = len(a)

    if c < 2:
        return a

    if b is None:
        b = (
            [(d, 1) for d in range(c)]
            + [(d, 2) for d in range(c - 1)]
        )
        b = [
            (d, e)
            for d, e in b
            if a[d] == a[d + e - 1]
        ]

    g = []

    for d, e in b:
        if d == 0:
            continue

        if d + e < c:
            g.append(
                (d - 1, e + 2)
            )

    g = [
        (d, e)
        for d, e in g
        if a[d] == a[d + e - 1]
    ]

    if len(g) == 0:
        d, e = max(b, key=lambda e: e[1])

        return a[d:d + e]

    return f(a, g)


test_find_longest_palindromic_substring(find_longest_palindromic_substring)
test_find_longest_palindromic_substring(f)

longest_palindromic_substring_valid_re = re.compile(
    (
        r"(longest( (contiguous|valid))? (palindrom(ic|e)))"
        r"|(palindromic substrings*.*longest)"
        r"|(palindromic.*longest candidate)"
        r"|(longest substring.*reads.*same.*forward.*backward)"
        r"|(longest substring.*palindrome)"
    ),
    re.IGNORECASE,
)

longest_palindromic_substring_invalid_ptrns = {
}

longest_palindromic_substring_meaningful_names = inspect.getsource(find_longest_palindromic_substring)
longest_palindromic_substring_meaningless_names = inspect.getsource(f)

Meaningful Names¶

In [15]:
longest_palindromic_susbtring_meaningful_names_results = explain_code(
    "lps-meaningful-names",
    longest_palindromic_substring_meaningful_names,
    longest_palindromic_substring_valid_re,
    longest_palindromic_substring_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with 1 and 2-character palindromes and recursively expanding them outward until the longest possible palindrome is found.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with all 1 and 2 character palindromes and iteratively expanding them outward until no valid expansions remain.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function recursively finds and extends palindromic substrings to identify the longest palindrome in a given string using dynamic programming principles.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring by recursively identifying all palindromes of increasing length, starting with single and double characters and expanding them outwards until no longer palindromes can be found.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={}
  summary="This function finds the longest palindromic substring in a given string by iteratively expanding all current palindromic substrings until it can't expand any further and then returns the longest one found."

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={}
  summary='The function uses center-expansion recursion to efficiently discover the longest palindromic substring by iteratively extending valid candidates.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by recursively extending smaller palindromes outward until no further extension is possible, then returning the longest palindrome found.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a string by starting with all single characters and matching adjacent pairs as seed palindromes, then iteratively extending them outward until no further valid extensions are possible.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given string by dynamically expanding and validating potential palindromes from smaller substrings.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all 1 and 2-character palindromes and repeatedly expanding them outwards until no further expansion is possible.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={}
  summary='The function recursively finds and expands all palindromic substrings within a string to return the longest palindromic substring present.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function uses recursive center expansion to efficiently find the longest palindromic substring by iteratively extending valid palindromic seeds (single characters or matched pairs) and returning the maximal expansion.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a text by starting with single characters and adjacent pairs, then recursively expanding valid palindromes outward until the longest possible palindrome is found.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a text by recursively expanding from all single and two-character palindromes outward until no further valid expansions are possible.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given string by dynamically extending and validating smaller palindromic substrings.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring using a recursive "expand from center" method, starting with all 1 and 2-character palindromes and repeatedly extending them until no longer palindromes can be formed.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function recursively finds and expands palindromic substrings to return the longest palindromic substring within a given input string.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function recursively expands validated palindromic centers to efficiently identify the longest palindromic substring in a given string.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a text by starting with all 1-2 character palindromes and recursively expanding them outward until finding the maximum length palindrome.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring by starting with all 1 and 2 character palindromes and recursively expanding them outward until no further valid expansions are possible.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given text by progressively expanding valid palindrome candidates from length 1 and 2 seeds.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring by recursively expanding from the "center" of all known shorter palindromes until no further expansion is possible.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function recursively finds and expands seed palindromic substrings to efficiently determine the longest palindromic substring within a given text.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={}
  summary='The function recursively expands palindromic seeds (single characters and pairs) outward until no further symmetric expansion is possible, then returns the longest valid palindrome found.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all single characters and adjacent pairs as seed palindromes, then repeatedly expanding them outward until no further valid expansions are possible.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring by starting with all 1 and 2 character palindromes and recursively expanding them outward until no valid extensions remain.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given string by dynamically expanding and validating potential palindromes from single and double-character seeds.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all 1 and 2-character palindromes and repeatedly expanding them outwards until no further expansion is possible.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within a given input string by iteratively expanding all possible palindromic substrings.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={}
  summary='The function recursively expands small palindromic cores to efficiently locate the longest contiguous palindrome in a string.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by recursively expanding valid palindromes from their centers outward until no further expansion is possible.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a text by recursively expanding from smaller palindromes (starting with 1 and 2 character palindromes) outward until no further valid extensions are possible.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given string by dynamically extending and validating potential palindromes from smaller substrings.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all single and two-character palindromes and repeatedly expanding them outwards until no longer palindromes can be found.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring within a given string by expanding all initially possible palindromic centers and selecting the largest found.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={}
  summary='The function recursively expands initially valid palindromic centers (single characters or identical pairs) until no further symmetric growth is possible, then returns the longest palindrome found.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all single characters and adjacent pairs as seeds, then expanding valid palindromes outward by one character on each side until no further expansion is possible.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with all 1 and 2-character palindromes and iteratively expanding them outward until no further valid expansions are possible.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given string by dynamically extending smaller palindromic substrings and checking for validity.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={}
  summary='The function recursively finds the longest palindromic substring by identifying all base palindromes of length one and two and then repeatedly expanding them outwards until they can no longer form a valid palindrome.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in a given string by iteratively expanding potential palindromic centers.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function recursively expands palindromic substrings from single/dual-character centers outward, returning the longest valid palindrome found during expansion.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring by starting with all 1 and 2-character palindromes and recursively expanding them outward until no longer palindromes can be formed, then returns the longest one found.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by recursively expanding from all possible palindrome centers (single characters and matching pairs) outward until no further expansion is possible.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in a given string by dynamically expanding and validating potential palindromes.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all single and double-character palindromes and repeatedly expanding them outwards until they can no longer form a valid palindrome.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring within a given string by expanding potential palindromic centers outward until no further expansion is possible.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function recursively expands valid palindromic substrings from length 1/2 seeds to efficiently identify the longest palindrome.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all 1 and 2-character palindromes and iteratively expanding them outward until no further expansion is possible.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with all 1 and 2-character palindromes and recursively expanding them outward until the longest possible palindrome is found.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given string by progressively extending valid smaller palindromes and returning the maximum length one found.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all base palindromes of length one and two and repeatedly attempting to expand them outwards until no larger palindromes can be formed.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in a given input string by expanding initial single- and double-character palindromes outward as far as possible.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={}
  summary='The function recursively expands palindromic substrings from smallest units to find the longest palindrome by checking boundary character matches at each step.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by recursively expanding smaller palindromes outward until the maximum length is reached.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with all 1 and 2 character palindromes and recursively extending them outward until no further valid extensions are possible.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by progressively expanding valid palindrome candidates from single/dual character centers until no further expansions are possible.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={}
  summary='The function recursively identifies the longest palindromic substring by starting with all base palindromes of length one and two and iteratively expanding them outwards as long as they remain palindromic.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within a given string by recursively expanding potential palindromes from the center outwards.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function efficiently finds the longest palindromic substring by iteratively expanding validated smaller palindromes.'

Meaningless Names¶
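The notebook does not show how the single-letter variant of the snippet was produced, and it may well have been written by hand. As an illustration only, the renaming could be automated roughly as follows. The helper name `rename_to_single_letters` is hypothetical, and the sketch assumes Python 3.9 or later for `ast.unparse`.

```python
import ast
import builtins
import string


def rename_to_single_letters(source):
    """Hypothetical helper: replace names defined in `source` with single letters."""
    tree = ast.parse(source)

    # Collect the names the snippet itself defines: functions, parameters and
    # assignment targets. Builtins such as len() or range() are left alone.
    defined = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            name = node.name
        elif isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name not in defined and not hasattr(builtins, name):
            defined.append(name)

    if len(defined) > len(string.ascii_lowercase):
        raise ValueError("more names than single letters available")
    mapping = dict(zip(defined, string.ascii_lowercase))

    # Apply the mapping to every definition and every reference.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.name = mapping.get(node.name, node.name)
        elif isinstance(node, ast.arg):
            node.arg = mapping.get(node.arg, node.arg)
        elif isinstance(node, ast.Name):
            node.id = mapping.get(node.id, node.id)

    return ast.unparse(tree)
```

This sketch has known gaps: keyword arguments at call sites are not renamed, comments are dropped by `ast.unparse`, docstrings are kept verbatim, and a chosen letter may collide with a name the snippet uses but does not define. It is meant to show the idea, not to serve as a general-purpose obfuscator.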
In [16]:
longest_palindromic_susbtring_meaningless_names_results = explain_code(
    "lps-meaningless-names",
    longest_palindromic_substring_meaningless_names,
    longest_palindromic_substring_valid_re,
    longest_palindromic_substring_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in a given string using an iterative expansion approach with recursion.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in the input string using a recursive expansion algorithm that starts with 1 and 2-character palindromes and iteratively expands them outward.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring within the input string `a`.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={}
  summary="This function recursively identifies the longest palindromic substring within an input string using an 'expand from center' strategy."

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function returns the longest palindromic substring within the given string `a`.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={}
  summary='The function returns the longest palindromic substring in the input string.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in a given string by starting with all possible 1 and 2-character palindromes and expanding them outward until no further expansion is possible.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in the input string using a recursive center-expansion algorithm.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in the input string by progressively expanding from single/double-character seeds.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={}
  summary='The function recursively finds the longest palindromic substring by starting with all length-1 and length-2 palindromes and iteratively expanding them until no longer palindromes can be found.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within a given string.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={}
  summary='The function recursively expands palindrome centers in a string to find the longest palindromic substring.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in a given string using a recursive dynamic programming approach that expands from smaller palindromes to larger ones.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function implements a center-expansion algorithm to find and return the longest palindromic substring in the input string.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in the input string.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring within an input string by recursively expanding from an initial set of all length-1 and length-2 palindromes.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring of the input string.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={}
  summary='The function efficiently computes the longest palindromic substring via center expansion and recursive candidate validation.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in a given string using an iterative expansion approach that starts with single characters and pairs, then recursively expands valid palindromes outward until no further expansion is possible.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in a given string by starting with small valid substrings and iteratively expanding them outward while maintaining the palindrome property.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in the input string by expanding from single and double-character seeds.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring of a given string by starting with all 1 and 2-character palindromes and progressively expanding them outwards until no further expansion is possible.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within the given string `a`.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={}
  summary='The function recursively identifies the longest palindromic substring by dynamically expanding and filtering candidate substrings from initial seeds of length 1 and 2.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function implements a dynamic programming algorithm to find and return the longest palindromic substring within a given input string.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function implements a recursive algorithm to find the longest palindromic substring in a string by starting with small palindromes and attempting to expand them outward until no further expansion is possible.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in the input string `a` by progressively expanding and validating potential palindromes from single and double-character seeds.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={}
  summary='The function finds the longest palindromic substring within a given string by recursively identifying all palindromes of length 1 and 2 and then repeatedly trying to expand them outwards.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in a given string.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function returns the longest palindromic substring in the input string by iteratively expanding validated palindrome seeds.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in the input string using an iterative expansion algorithm with recursion.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function implements an iterative center-expansion algorithm to find and return the longest palindromic substring within the input string.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in the input string.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={}
  summary='The function finds the longest palindromic substring in an input string by recursively identifying and expanding all smaller palindromes from their centers outward until no further expansion is possible.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={}
  summary='The function finds and returns the longest palindromic substring in the input string.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function identifies the longest contiguous palindromic substring within the input string.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within the input string using an iterative expansion approach.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in the input string using a recursive expansion algorithm that grows palindromes from their centers outward.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={}
  summary='The function recursively finds and returns the longest palindromic substring in the input string by progressively expanding valid palindromic seeds.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={}
  summary='The function finds the longest palindromic substring within a given string by recursively expanding from initial one and two-character palindromes.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function returns the longest palindromic substring of its input string.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={}
  summary='The function returns the longest palindromic substring in the input string.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in a given string using an iterative expansion approach with recursion.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in the input string using an iterative expansion algorithm that starts with single characters and pairs, then repeatedly extends valid palindromes outward until no further expansion is possible.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring within the input string.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={}
  summary='The function finds the longest palindromic substring by recursively identifying all palindromes of a certain length and using them as seeds to find larger ones, stopping when no further expansion is possible.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring of the input string.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function identifies the longest contiguous palindromic substring in a given string using recursive center-expansion with endpoint validation.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within the input string using an iterative expansion algorithm.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function implements a center-expansion algorithm to find and return the longest palindromic substring within the input string.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring within the input string.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring of a given string by starting with all palindromes of length 1 and 2 and repeatedly expanding them outwards until no longer palindrome can be found.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={}
  summary='The function finds and returns the longest palindromic substring within the given string.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={}
  summary='The function uses recursive center-expansion to find the longest palindromic substring in the input string.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function implements a recursive algorithm that finds and returns the longest palindromic substring in the input string by expanding from smaller palindromes outward.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in the input string using an iterative center-expansion algorithm.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring within the input string by expanding from single/dual-character seeds.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={}
  summary='The function recursively finds the longest palindromic substring within a given string by starting with all single-character and two-character palindromes and progressively expanding them outwards until no further expansion is possible.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function returns the longest palindromic substring within the given string.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={}
  summary='The function implements a recursive expansion strategy to identify the longest palindromic substring by validating and growing candidate substrings from single/double-character seeds.'
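
Most of the summaries above describe the same strategy: seed the search with every palindrome of length 1 and 2, grow each candidate by one character on both sides, and recurse until nothing can grow. A minimal sketch of that strategy, reconstructed from the summaries alone, might look like this. It is not the snippet that was given to the models; the name `longest_palindrome_sketch` and the `(start, end)` candidate representation are assumptions made for illustration.

```python
def longest_palindrome_sketch(text, candidates=None):
    """Hypothetical reconstruction of the strategy the summaries describe."""
    if candidates is None:
        # Seeds: every single character, plus every pair of equal neighbours.
        # Each candidate is a half-open (start, end) index pair into text.
        candidates = [(i, i + 1) for i in range(len(text))]
        candidates += [
            (i, i + 2) for i in range(len(text) - 1) if text[i] == text[i + 1]
        ]

    # Try to grow every candidate by one character on both sides.
    expanded = [
        (start - 1, end + 1)
        for start, end in candidates
        if start > 0 and end < len(text) and text[start - 1] == text[end]
    ]
    if expanded:
        # Keep only the newly formed, longer palindromes and recurse.
        return longest_palindrome_sketch(text, expanded)

    # Nothing could grow, so the current candidates are maximal.
    if not candidates:
        return ""
    start, end = max(candidates, key=lambda c: c[1] - c[0])
    return text[start:end]


print(longest_palindrome_sketch("forgeeksskeegfor"))
```

Discarding the candidates that fail to grow is safe here because every palindrome longer than two characters contains a shorter palindrome at the same center, so the last non-empty generation always holds the longest one.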

Ignore Names¶
In [17]:
longest_palindromic_susbtring_ignore_names_results = explain_code(
    "lps-ignore-names",
    longest_palindromic_substring_meaningful_names,
    longest_palindromic_substring_valid_re,
    longest_palindromic_substring_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
    extra_instructions=ignore_names,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within a given text by starting with small palindromes and recursively extending them outward while checking if the first and last characters match.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by iteratively expanding from all possible centers (single characters and adjacent pairs) outward, checking if characters match at each extension step.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given string by expanding outwards from all possible 1-length and 2-length palindromic substrings.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={}
  summary='The function recursively finds the longest palindromic substring by starting with all length-1 and length-2 palindromes and repeatedly expanding them outwards in each recursive step until no further expansion is possible.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={}
  summary='This function recursively expands all possible palindromic substrings from every center in the input string, ultimately returning the longest palindromic substring found.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={}
  summary='The function uses recursive center-expansion to efficiently find the longest palindromic substring in a string, prioritizing character matching and boundary checks.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by iteratively extending valid palindromes from their centers outward until no further extension is possible.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in the input text using an iterative expansion approach that starts with small palindromes and extends them outward until no further extensions are possible.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given string by recursively expanding and validating potential palindrome centers.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all palindromes of length 1 and 2 and repeatedly expanding them outwards until no further expansion is possible.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring within a given string by expanding all possible centers and comparing edge characters outward.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={}
  summary='The function recursively expands palindromic substrings from single/double-character bases, returning the longest validated palindrome when no further expansions are possible.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within the input text using an iterative expansion approach that starts from single characters and matching pairs, then repeatedly extends them outward while checking if the extended substrings remain palindromes.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in a given string by starting with small palindromes and recursively extending them outward until no valid extensions remain.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given string by iteratively expanding smaller palindromic substrings and checking for validity.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring by starting with all length-1 and length-2 palindromes and recursively expanding them outwards, at each step keeping only the newly formed, longer palindromes, until no more can be found.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function finds and returns the longest substring within a given string that reads the same forward and backward by iteratively expanding candidate palindromic substrings until they can no longer be grown.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={}
  summary='This function implements a recursive center-expansion algorithm to identify the longest palindromic substring by progressively extending valid palindrome centers until no further expansions are possible, then selecting the longest found substring.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within the input text by iteratively expanding from palindrome centers outward.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with single characters and identical pairs, then iteratively extending them outward while checking that the outermost characters match, using recursion to continue until no further extensions are possible.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given string by progressively extending potential palindrome candidates and verifying their validity.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring by first identifying all single and double-character palindromes, and then recursively expanding them outwards until they can no longer form a valid palindrome.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest palindromic substring in a given string by iteratively expanding all current candidate palindromic substrings outward in parallel until no further expansions are possible.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={}
  summary='The function finds the longest palindromic substring by recursively expanding valid initial palindromic seeds (singles/pairs) outward until no expansions are possible.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within the input text by iteratively expanding from center positions outward and checking character matches.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in the input text by starting with small palindromes and iteratively extending them outward until no further valid extensions are possible.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring in a given string by progressively expanding potential palindrome centers and checking for valid extensions.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by first identifying all 1 and 2-character palindromes and then repeatedly expanding them outwards until no further valid expansion is possible.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within a given string using a recursive approach that iteratively extends candidate palindromic centers outward as far as possible.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={}
  summary='This function recursively identifies the longest palindromic substring by progressively expanding valid palindromic seeds from single/double-character bases until no further symmetric expansions are possible.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with single characters and matching pairs, then iteratively extending them outward while maintaining the palindrome property until no further extensions are possible.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with single characters and equal adjacent pairs, then recursively extending them outward while the boundary characters match.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given string by recursively expanding smaller palindromic substrings and checking for matching outer characters.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all single-character and two-character palindromes and repeatedly expanding them outwards until no further expansion is possible.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in a given string by iteratively expanding all potential centers outwards and selecting the maximal valid palindrome when no further expansion is possible.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={}
  summary='This function recursively expands validated palindromic substrings outward until no expansions are viable, then returns the longest candidate.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with single characters and matching pairs, then iteratively extending them outward while the characters match, ultimately returning the longest valid palindrome found.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by iteratively extending smaller palindromes outward and checking if the extended substrings remain palindromes.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function finds the longest contiguous palindromic substring in a given text by recursively extending potential palindromes from single and double-character seeds while verifying matching end characters at each expansion step.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring by recursively identifying all palindromes of a certain length and then attempting to expand them outwards by two characters until no further valid expansion is possible.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={}
  summary='This function recursively expands all potential palindromic substrings by growing from the center(s) outward, repeatedly filtering for valid palindromes, until it finds and returns the longest palindromic substring within the given input string.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={}
  summary='The function finds the longest palindromic substring by recursively expanding valid palindromic seeds (single characters or adjacent pairs) outward until no further symmetric character matches are possible.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given string by starting with all single characters and matching pairs, then iteratively expanding them outward while the boundary characters match, ultimately returning the longest palindrome found.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a text by starting with small palindromes (single characters and identical pairs) and recursively extending them outward until no more valid extensions exist.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given string by recursively expanding potential palindrome centers while verifying matching outer characters.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring by recursively identifying all short palindromes of length 1 and 2 and then repeatedly expanding them outwards until no further palindromic extensions are possible.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function finds and returns the longest contiguous substring of the input text that reads the same forwards and backwards (i.e., the longest palindromic substring).'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={}
  summary='This function uses recursive expansion from initial palindrome seeds to find the longest palindromic substring but has critical limitations in handling odd-length palindromes and boundary cases, potentially missing valid results.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring in the input text by iteratively extending smaller palindromes outward until no further valid extensions are possible.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with single characters and matching pairs, then recursively extending them outward while the boundary characters match.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given string by iteratively extending smaller palindromes and checking for validity.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all 1 and 2-character palindromes and iteratively expanding them outwards until no further expansion is possible.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within a given input string by recursively expanding all possible palindromic centers and selecting the largest result.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={}
  summary='The function implements a bottom-up, recursive center-expansion strategy to efficiently discover the longest palindromic substring by validating endpoint symmetry at each expansion step.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function finds and returns the longest palindromic substring within the input text by iteratively extending palindrome candidates outward from their centers.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function finds the longest palindromic substring in a given text by starting with all 1 and 2 character palindromes and iteratively extending them outward until no further valid extensions are possible.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={}
  summary='The function finds the longest palindromic substring in `text` by dynamically extending small palindromic seeds outward until no further valid extensions are possible.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function recursively finds the longest palindromic substring by starting with all single and double-character palindromes and repeatedly expanding them outwards until no further expansion is possible.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={}
  summary='This function recursively finds and returns the longest substring of the input that is a palindrome by successively expanding matching-character substrings outwards as far as possible.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={}
  summary='The function finds the longest palindromic substring by recursively expanding validated palindromic seeds (single characters and adjacent pairs) until no further expansions are possible.'

Results¶
In [18]:
plot_results(
    "Longest Palindromes",
    longest_palindromic_susbtring_meaningful_names_results,
    longest_palindromic_susbtring_meaningless_names_results,
    longest_palindromic_susbtring_ignore_names_results,
    include_lengths=True,
)
[Figure: "Longest Palindromes" results plot; image not available]

Twin Primes¶

Find all twin primes (pairs of prime numbers $p, q$ where $q = p + 2$) below the given limit, and return the smaller member of each pair.

A possible implementation:

In [19]:
def find_smaller_twin_primes_basic(limit):
    prev = 2
    result = []

    for candidate in range(3, limit, 2):
        for modulus in range(3, int(candidate ** 0.5) + 2, 2):
            if candidate % modulus == 0:
                break
        else:
            if candidate == prev + 2:
                result.append(prev)

            prev = candidate

    return result


def test_find_smaller_twin_primes(func):
    assert_eq([3, 5, 11, 17, 29, 41, 59, 71, 101], func(105))


test_find_smaller_twin_primes(find_smaller_twin_primes_basic)

Complication: if $p, q \in \mathbb{N}$ and $q \le \left\lfloor \sqrt{p} \right\rfloor$ and $\text{gcd}(p, q) > 1$ (where $\text{gcd}(p, q)$ denotes the greatest common divisor of $p$ and $q$), then $p$ cannot be a prime number. Conversely, given $p \in \mathbb{N}$ with $p \ge 2$, if $\text{gcd}(p, q) = 1$ holds for all $q \in \mathbb{N}$ where $q \le \left\lfloor \sqrt{p} \right\rfloor$, then $p$ must be a prime number. Therefore, instead of testing possible divisors of a candidate number, the subtraction-based form of Euclid's algorithm can be used for the primality test, without performing any divisions.
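
To make the claim above easier to check in isolation, the following minimal sketch separates the subtraction-only GCD from the primality test built on it. The helper names `gcd_by_subtraction` and `is_prime_via_gcd` are hypothetical and do not appear elsewhere in the notebook; the initial swap is an added safeguard so that the helper also accepts arguments in either order.

```python
import math


def gcd_by_subtraction(a, b):
    """Greatest common divisor of two positive integers using only subtraction."""
    # Added safeguard (assumption): make sure a >= b before the loop starts.
    if a < b:
        a, b = b, a

    while b > 0:
        a -= b

        # Keep the larger value in a so that the subtraction never goes negative.
        if b > a:
            a, b = b, a

    return a


def is_prime_via_gcd(p):
    """Hypothetical helper: p >= 2 is prime iff gcd(p, q) == 1 for every 2 <= q <= floor(sqrt(p))."""
    if p < 2:
        return False

    return all(gcd_by_subtraction(p, q) == 1 for q in range(2, math.isqrt(p) + 1))


# Cross-check against the standard library and list the primes below 30.
assert all(
    gcd_by_subtraction(a, b) == math.gcd(a, b)
    for a in range(1, 40)
    for b in range(1, 40)
)
print([n for n in range(2, 30) if is_prime_via_gcd(n)])
```

For instance, with $p = 9$ and $q = 3$ the loop subtracts 3 three times and ends with a greatest common divisor of 3, so 9 is rejected, while for $p = 7$ and $q = 3$ it ends with 1.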

In [20]:
def find_smaller_twin_primes(limit):
    smaller_twin_primes = []
    previous_prime = 2

    for candidate in range(3, limit, 2):
        for number in range(3, int(candidate ** 0.5) + 2):
            greatest_common_divisor = candidate

            while number > 0:
                greatest_common_divisor -= number

                if number > greatest_common_divisor:
                    number, greatest_common_divisor = greatest_common_divisor, number

            if greatest_common_divisor > 1:
                break
        else:
            if candidate == 2 + previous_prime:
                smaller_twin_primes.append(previous_prime)

            previous_prime = candidate

    return smaller_twin_primes


def f(a):
    b = []
    c = 2

    for d in range(3, a, 2):
        for e in range(3, int(d ** 0.5) + 2):
            g = d

            while e > 0:
                g -= e

                if e > g:
                    e, g = g, e

            if g > 1:
                break
        else:
            if d == 2 + c:
                b.append(c)

            c = d

    return b


test_find_smaller_twin_primes(find_smaller_twin_primes)
test_find_smaller_twin_primes(f)

twin_primes_valid_re = re.compile(
    r"(twin[ -]primes*)|(prime twins*)|(all primes.*[+] *2.*both.*primes*)",
    re.IGNORECASE,
)

twin_primes_invalid_ptrns = {
    "suspects_bug": re.compile(
        (
            r"(contains a( critical)? bug)"
            r"|((error|bug)s* (that|which) prevents*)"
            r"|((mathematical|calculation|implementation) (error|bug)s*)"
            r"|(bug in (the|its) (calculation|primality testing))"
            r"|(appears to be incorrect)"
            r"|(flawed(| implementation of| method for checking) (primality|prime-checking))"
            r"|(somewhat flawed method)"
            r"|(critical bugs* in its primality testing)"
            r"|(prevents* it from working correctly)"
            r"|(it has a bug)"
            r"|(unreliable for larger numbers)"
            r"|(with a flawed method)"
            r"|(incorrect custom primality test)"
            r"|(which makes it unreliable)"
            r"|(will produce incorrect results)"
        ),
        re.IGNORECASE,
    ),
}

twin_primes_meaningful_names = inspect.getsource(find_smaller_twin_primes)
twin_primes_meaningless_names = inspect.getsource(f)
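
The logged `is_valid` and `invalid` fields below are consistent with a simple rule: a summary counts as valid when the validity pattern matches and none of the invalidating patterns do. The sketch below illustrates that rule under this assumption; `classify_summary` is a hypothetical name, the real scoring logic inside `explain_code` may differ, and the patterns are abridged copies of the ones defined above.

```python
import re

# Abridged copies of the patterns above; only a few alternatives are kept.
valid_re = re.compile(r"(twin[ -]primes*)|(prime twins*)", re.IGNORECASE)
invalid_ptrns = {
    "suspects_bug": re.compile(
        r"(flawed(| implementation of| method for checking) (primality|prime-checking))"
        r"|(will produce incorrect results)",
        re.IGNORECASE,
    ),
}


def classify_summary(summary, valid_re, invalid_ptrns):
    """Hypothetical scoring rule (assumption): valid pattern found, no invalid pattern found."""
    # Record the match object (or None) for every invalidating pattern.
    invalid = {name: ptrn.search(summary) for name, ptrn in invalid_ptrns.items()}
    is_valid = valid_re.search(summary) is not None and not any(invalid.values())

    return is_valid, invalid


print(classify_summary(
    "The function finds all smaller primes in twin prime pairs below a given limit.",
    valid_re,
    invalid_ptrns,
))
```

Under this rule, a summary that correctly mentions twin primes is still rejected as soon as it claims that the primality test is flawed.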
Meaningful Names¶
In [21]:
twin_primes_meaningful_names_results = explain_code(
    "tp-meaningful-names",
    twin_primes_meaningful_names,
    twin_primes_valid_re,
    twin_primes_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(106, 127), match='flawed prime-checking'>}
  summary='This function attempts to find the smaller primes in twin prime pairs up to a given limit, but contains a flawed prime-checking algorithm that incorrectly implements GCD calculation instead of proper divisibility testing.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies and returns a list of the smaller prime numbers from all twin prime pairs (primes that differ by 2) less than the specified limit, using a GCD-based primality test.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of smaller primes from all twin prime pairs (primes differing by 2) up to a given limit.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies all twin prime pairs below a given `limit` and returns a list containing the smaller prime from each pair, using a highly inefficient method for its primality test.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of the smaller numbers in all twin prime pairs (where both `p` and `p+2` are prime) below a specified limit, using a convoluted method for checking primality.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies the smaller primes in all twin prime pairs below a given limit using an inefficient GCD-based primality test.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(109, 125), match='flawed primality'>}
  summary='This function attempts to find the smaller prime in each twin prime pair up to a given limit, but contains a flawed primality test that uses GCD calculations instead of proper divisibility checks, making it unreliable for correctly identifying prime numbers.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all smaller members of twin prime pairs below a given limit using an inefficient GCD-based primality test.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all smaller primes in twin prime pairs (primes differing by 2) below a given limit using an unconventional prime-checking method.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies all twin prime pairs below a given limit by iterating through odd numbers, using a convoluted subtraction-based algorithm to test for primality, and returns a list containing the smaller prime from each pair found.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(117, 153), match='flawed method for checking primality'>}
  summary='The function attempts to return a list of all smaller primes in twin prime pairs less than a given limit, but uses a flawed method for checking primality.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies the smaller primes in twin prime pairs (e.g., 3, 5) below a given limit using a primality test based on GCD subtraction.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(110, 126), match='flawed primality'>}
  summary='This function attempts to find the smaller number in each twin prime pair up to a given limit, but contains a flawed primality test using an incorrect GCD calculation that will miss many twin primes.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs (consecutive primes that differ by 2) up to a given limit and returns a list containing the smaller prime from each pair.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of all smaller primes in twin prime pairs (primes differing by 2) up to a given `limit`.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all twin prime pairs below a specified limit by iteratively testing odd numbers for primality using an inefficient GCD algorithm and returns a list of the smaller prime from each identified pair.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns all primes less than a given limit that are the smaller member of a twin prime pair (i.e., all `p` such that both `p` and `p+2` are primes), using a custom primality test based on a subtraction-based GCD algorithm.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies smaller primes in twin prime pairs (e.g., 3 in (3,5)) below a given limit using an unconventional primality test.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(174, 208), match='flawed implementation of primality'>}
  summary='This function attempts to find all the smaller primes in twin prime pairs (consecutive primes differing by 2) up to a given limit, though it uses an inefficient and somewhat flawed implementation of primality testing via GCD calculations.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns a list of the smaller members of all twin prime pairs (primes that differ by 2) up to a given limit, using an inefficient primality test based on GCD calculations.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all smaller primes in twin prime pairs (primes differing by 2) below a given limit using a basic primality test.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies all twin prime pairs below a given `limit` by using an inefficient primality test and returns a list of the smaller prime from each discovered pair.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of the smaller prime in each twin prime pair less than the given limit, using a custom (but inefficient) prime check approach.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns all smaller primes from consecutive twin prime pairs (p, p+2) where p+2 is below the input limit.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(109, 125), match='flawed primality'>}
  summary='This function attempts to find the smaller prime in each twin prime pair up to a given limit, but contains a flawed primality test that uses an incorrectly implemented GCD algorithm instead of proper divisibility checking.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies and returns the smaller prime number from each twin prime pair (primes that differ by 2) up to a specified limit, using an unusual GCD-based primality test.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all smaller primes in twin prime pairs (primes differing by 2) below a given limit using a GCD-based primality check.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds the smaller prime of each twin prime pair up to a specified limit by inefficiently testing odd numbers for primality using a subtraction-based GCD algorithm and then checking if consecutive primes differ by two.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of all smaller primes from twin prime pairs less than the given limit, though it uses an extremely inefficient and non-standard method to check for primality.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns the smaller primes of all twin prime pairs (p, p+2) with p+2 < limit.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(109, 125), match='flawed primality'>}
  summary='This function attempts to find the smaller prime in each twin prime pair up to a given limit, but contains a flawed primality test that uses an incorrect GCD-based approach instead of proper divisibility checking.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies and returns all prime numbers that are the smaller member of a twin prime pair (two primes differing by 2) below a specified limit, using a custom primality test based on GCD calculations.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all smaller primes in twin prime pairs (primes differing by 2) up to a given limit using an unconventional GCD-based primality check.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs below a given `limit` by using an inefficient primality test based on the Euclidean algorithm and returns a list containing the smaller prime of each pair.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The `find_smaller_twin_primes(limit)` function returns a list of the smaller numbers in all twin prime pairs below `limit`, using a nonstandard method to check for primes.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function generates smaller primes from twin prime pairs below a limit by identifying consecutive primes with a gap of 2 using a GCD-based primality test.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all prime numbers that are the smaller member of a twin prime pair (two primes that differ by 2) up to a given limit, though it uses an unconventional and inefficient primality testing algorithm.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies and returns the smaller member of each twin prime pair (consecutive primes differing by 2) up to a specified limit, using an inefficient but functional primality test based on GCD calculations.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all smaller primes in twin prime pairs (primes differing by 2) below a given limit using a subtraction-based GCD method for primality testing.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function iterates through odd numbers up to a specified `limit`, determines if they are prime using an inefficient method, and returns a list containing the smaller prime from every twin prime pair found.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns all smaller members of twin prime pairs (primes p for which both p and p+2 are prime) up to a given limit.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes all smaller primes in twin prime pairs below a given limit using a GCD-based primality test and pairwise difference checks[1][2][3][4][5].'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(117, 133), match='flawed primality'>}
  summary='This function attempts to find the smaller primes in twin prime pairs up to a given limit, but contains a critically flawed primality test that will produce incorrect results.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns a list of the smaller prime numbers from all twin prime pairs (primes that differ by 2) up to a given limit, though it uses an inefficient primality testing method.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of smaller primes in all twin prime pairs (primes differing by 2) up to a given limit.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs below a given limit by iterating through odd numbers, using a convoluted GCD algorithm to test for primality, and returns a list containing the smaller prime from each pair discovered.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function attempts to return a list of the smaller prime from each twin prime pair below a given limit, using an unusual and inefficient method for primality checking.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns the smaller primes from consecutive twin prime pairs (p, p+2) below a given limit using an inefficient primality test.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(139, 155), match='flawed primality'>}
  summary='This function attempts to find the smaller prime in each pair of twin primes (primes that differ by 2) up to a given limit, but contains a flawed primality checking algorithm that incorrectly uses a malformed GCD calculation instead of proper divisibility testing.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies and returns the smaller member of each twin prime pair (primes that differ by 2) up to a specified limit by using a GCD-based primality test.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all smaller primes in twin prime pairs (primes differing by 2) up to a given limit using an unconventional primality test.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function iterates through numbers up to a given `limit` to find all twin prime pairs, returning a list containing the smaller prime of each pair, though it does so using a highly inefficient and unusual algorithm for primality testing.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns a list of the smaller primes from each twin prime pair less than the specified limit, using a non-standard method to check for primality.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(84, 105), match='flawed prime-checking'>}
  summary='The function aims to find smaller primes in twin prime pairs below a limit, but its flawed prime-checking algorithm may produce incorrect results.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(119, 135), match='flawed primality'>}
  summary='This function attempts to find and return the smaller primes from twin prime pairs up to a given limit, but contains a flawed primality test implementation using an incorrect GCD calculation method.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs (primes that differ by 2) up to a given limit and returns a list containing the smaller prime from each pair.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all smaller primes in twin prime pairs (primes differing by 2) below a given limit.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies all twin prime pairs `(p, p+2)` where the larger prime is less than a given `limit` by using an inefficient primality test, and then returns a list containing only the smaller prime `p` from each discovered pair.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function inefficiently finds and returns a list of all "smaller" twin primes (the lower member of each twin prime pair) less than a given limit.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of the smaller primes in all twin prime pairs \\((p, p+2)\\) where \\(p+2 < \\text{limit}\\).'
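
Several of the summaries above describe the primality test as "GCD-based", "unconventional", or even "flawed". The following is a minimal sketch of that kind of test, not the snippet that was actually given to the models. The names `gcd_by_subtraction`, `is_prime_via_gcd`, and `is_prime_trial_division` are hypothetical, and the sketch assumes that divisibility is decided by checking whether `gcd(n, d) == d`.

```python
def gcd_by_subtraction(a, b):
    """Euclid's algorithm in its subtraction form; expects positive integers."""
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a


def is_prime_via_gcd(n):
    """Hypothetical overcomplicated primality test: d divides n iff gcd(n, d) == d."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if gcd_by_subtraction(n, d) == d:
            return False  # d is a proper divisor of n
        d += 1
    return True


def is_prime_trial_division(n):
    """Textbook reference implementation used only for comparison."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


# Brute-force comparison of the two tests on small inputs.
mismatches = [
    n for n in range(300)
    if is_prime_via_gcd(n) != is_prime_trial_division(n)
]
print(f"{mismatches=}")
```

Under these assumptions the two tests agree on every input checked, so a summary that calls such a test "flawed" is likely reporting a non-existent bug rather than a mere inefficiency.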

Meaningless Names¶
In [22]:
twin_primes_meaningless_names_results = explain_code(
    "tp-meaningless-names",
    twin_primes_meaningless_names,
    twin_primes_valid_re,
    twin_primes_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all the smaller members of twin prime pairs (consecutive odd primes that differ by 2) less than the input value `a`.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs (consecutive primes differing by 2) where both primes are less than the input value `a`, and returns a list containing the smaller prime from each pair.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin primes (pairs of primes differing by 2) below a given number `a` and returns the smaller prime of each pair.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies all twin prime pairs up to a given number `a` and returns a list containing the first prime of each pair.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of the smaller primes from all twin prime pairs less than the input value `a`.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of the smaller primes in twin prime pairs within the range `[3, a-2]`.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of all prime numbers less than `a` that are the first element of a twin prime pair (primes where the next prime is exactly 2 greater).'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns the smaller prime in each twin prime pair (pairs of primes that differ by 2) up to a given limit.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of the smaller primes in all twin prime pairs (primes separated by 2) below the given input number `a`.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function identifies all twin prime pairs below the input number 'a' and returns a list containing the first prime number from each of those pairs."

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of all primes less than `a` that are the smaller number in each twin prime pair.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of the smaller primes in every twin prime pair within the range `[3, a)`.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(112, 126), match='contains a bug'>}
  summary='This function attempts to find all smaller members of twin prime pairs less than the input value `a`, though it contains a bug in its primality testing logic.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs where both primes are less than the input value `a`, and returns a list containing the first (smaller) prime from each pair.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs below a given number `a` and returns a list of the smaller primes in each pair.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all numbers `p` that are the first in a twin prime pair (where `p` and `p+2` are both prime) below the given input limit `a`.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of the first primes in each twin prime pair (pairs of primes that differ by 2) less than the given number `a`.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of primes that are the smaller members of consecutive twin prime pairs in the range [3, a).'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all prime numbers less than `a` that are the first member of a twin prime pair (two consecutive odd primes that differ by 2).'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all the smaller members of twin prime pairs (consecutive primes that differ by 2) less than the input value a.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of the smaller primes in twin prime pairs (primes separated by 2) below the input number `a`.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies all twin prime pairs less than the input `a` and returns a list containing the first prime number of each pair.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all numbers less than `a` that are the lower member of a twin prime pair and returns them in a list.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of the smaller primes in twin prime pairs `(p, p+2)` for primes `p` in the range `[3, a)`.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of all prime numbers less than `a` where the prime + 2 is also prime (the first element of twin prime pairs).'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns the smaller prime from each twin prime pair (consecutive primes differing by 2) where both primes are less than the input value a.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of all primes below `a` that are the smaller member of a twin prime pair (where p and p+2 are both prime).'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs up to a given limit `a` and returns a list containing the first prime number of each discovered pair.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of all lower twin primes less than a given number `a`, i.e., primes `p` such that both `p` and `p+2` are prime and `p < a`.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of the smaller primes in consecutive twin prime pairs (p, p+2) where both primes are less than `a`.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(101, 117), match='flawed primality'>}
  summary='This function attempts to find the smaller members of twin prime pairs less than `a`, but contains a flawed primality test that uses GCD calculations instead of proper divisibility checks.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns a list of all prime numbers p where both p and p+2 are prime (twin primes), with p+2 < a.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns all smaller primes in twin prime pairs (primes separated by 2) up to a given number `a`.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function searches for twin primes up to a given limit `a` and returns a list containing the first prime of each discovered pair.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of the smaller prime number in each pair of consecutive primes (called "twin primes") that differ by 2, up to a specified limit.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of the smaller primes in twin prime pairs for odd numbers between 3 and the input `a`.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns a list of all prime numbers that are the smaller member of a twin prime pair (primes that differ by 2) less than the input value `a`.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns all the smaller numbers in twin prime pairs (consecutive odd primes differing by 2) that are less than the input parameter `a`.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs below a given number `a` and returns a list of the smaller primes in each pair.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies all twin prime pairs up to a given limit `a` and returns a list containing the first prime of each discovered pair.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns a list of all primes less than `a` that are the smaller member of a twin prime pair (i.e., it finds all twin primes below `a`).'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of the smaller primes in consecutive twin prime pairs within the range `[3, a-1]`.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of all smaller primes in twin prime pairs (primes that differ by 2) less than the input value `a`.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns all prime numbers less than `a` that form twin prime pairs (primes where p and p+2 are both prime).'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns all smaller primes that form twin prime pairs (primes differing by 2) below the given input number `a`.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies all twin prime pairs `(p, p+2)` where `p+2` is less than the input `a`, and returns a list containing the first prime `p` from each of those pairs.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of the lower member of each twin prime pair less than the input value `a`.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of smaller primes from consecutive twin prime pairs within the range `[3, a-1]`.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns all prime numbers p where both p and p+2 are prime (twin primes), returning the smaller prime from each pair for all such pairs where both primes are less than the input value a.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns a list of all prime numbers p where both p and p+2 are prime (the first element of each twin prime pair) for all such primes less than the input value a.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of all smaller primes in twin prime pairs (primes separated by 2) up to the input number.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all twin prime pairs up to a given limit `a` and returns a list containing the first prime number of each pair.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns the lower numbers (the first elements) of all twin prime pairs less than the input `a`.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of primes that are the smaller of consecutive twin prime pairs (differing by 2) below the input value `a`.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of all prime numbers p where both p and p+2 are prime (the smaller members of twin prime pairs) for all primes less than the input value a.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs (consecutive primes that differ by 2) less than the input value `a` and returns a list of the smaller prime from each pair.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns all smaller primes that form twin prime pairs (primes differing by 2) below the given input number.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies all twin prime pairs where both primes are less than the input `a`, and returns a list containing the first prime of each of those pairs.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function `f(a)` returns a list of all primes less than `a` that are the smaller member of a twin prime pair.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of smaller primes from twin prime pairs (differing by 2) found between 3 and `a-1`.'
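
The `is_valid` and `invalid={'suspects_bug': ...}` fields in the log lines above look like the result of regular expression searches over each summary. The sketch below shows one way such a check could work. The patterns and the `check_summary` helper are assumptions made for illustration; the real `twin_primes_valid_re` and `twin_primes_invalid_ptrns` are defined elsewhere in the notebook and are probably broader.

```python
import re

# Hypothetical stand-ins for twin_primes_valid_re and twin_primes_invalid_ptrns.
valid_re = re.compile(r"twin prime", re.IGNORECASE)
invalid_ptrns = {
    "suspects_bug": re.compile(
        r"contains a bug|has a bug|critical bugs"
        r"|flawed (?:primality|prime-checking|method)",
        re.IGNORECASE,
    ),
}


def check_summary(summary, valid_re, invalid_ptrns):
    """Return (is_valid, invalid), where invalid maps pattern names to matches."""
    invalid = {
        name: ptrn.search(summary) for name, ptrn in invalid_ptrns.items()
    }
    mentions_topic = valid_re.search(summary) is not None
    is_valid = mentions_topic and all(m is None for m in invalid.values())
    return is_valid, invalid


# Two summaries copied from the "Meaningless Names" output above.
flagged = (
    "This function attempts to find all smaller members of twin prime pairs "
    "less than the input value `a`, though it contains a bug in its "
    "primality testing logic."
)
clean = (
    "This function returns a list of the smaller primes from all twin prime "
    "pairs less than the input value `a`."
)

for summary in (flagged, clean):
    is_valid, invalid = check_summary(summary, valid_re, invalid_ptrns)
    print(f"{is_valid=} {invalid=}")
```

For the flagged summary, the match span agrees with the `span=(112, 126)` reported in the log. A summary is rejected as soon as it claims a bug, even when the rest of the description is accurate.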

Ignore Names¶
In [23]:
twin_primes_ignore_names_results = explain_code(
    "tp-ignore-names",
    twin_primes_meaningful_names,
    twin_primes_valid_re,
    twin_primes_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
    extra_instructions=ignore_names,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns the smaller prime from each pair of twin primes (consecutive primes that differ by 2) up to the given limit, using a GCD-based primality test.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all smaller members of twin prime pairs (consecutive primes differing by 2) less than the given limit by using a GCD-based primality test on odd numbers.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all pairs of prime numbers (twin primes) where they differ by 2, up to a given limit, and returns the smaller prime from each pair.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies all twin prime pairs below a specified limit by iterating through odd numbers, checking for primality using an unconventional GCD calculation, and returns a list containing the smaller prime from each discovered pair.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns all the smaller numbers of twin prime pairs less than the given limit, using an unconventional and inefficient method to check for primality.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of smaller primes from twin prime pairs (p, p+2) where p+2 is a prime below the input `limit`, using an inefficient primality test.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies twin prime pairs (primes that differ by 2) up to a given limit and returns a list of the smaller prime from each pair, using an inefficient GCD-based primality test.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns all prime numbers that are the smaller member of a twin prime pair (primes that differ by exactly 2) up to a given limit, using a GCD-based primality test.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs below a given limit and returns the smaller prime from each pair.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies all twin prime pairs up to a specified limit by iterating through odd numbers, using an unconventional GCD-based primality test, and returns a list containing the smaller prime of each pair found.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function inefficiently finds and returns all the smaller members of twin prime pairs less than a given limit.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns smaller primes from twin prime pairs below a given limit through iterative primality checks and consecutive prime tracking.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns the smaller prime from each twin prime pair (consecutive primes differing by 2) below the given limit, using an inefficient primality test based on GCD calculations.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all the smaller members of twin prime pairs below a given limit by implementing primality testing through GCD calculations and checking if consecutive primes differ by exactly 2.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a list of smaller primes from twin prime pairs (primes differing by 2) below a given limit, using an unconventional GCD-based prime-checking method.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies twin prime pairs up to a given limit by iterating through odd numbers, confirming their primality using an inefficient subtraction-based algorithm to find the greatest common divisor, and then stores the smaller prime of each identified pair in a list.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(200, 222), match='somewhat flawed method'>}
  summary='This function attempts to find all primes up to a given limit where the next prime is exactly two larger (so-called "twin primes") and returns a list of their smaller members, using a nonstandard and somewhat flawed method for prime testing.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies smaller primes in twin prime pairs below a given limit using an inefficient GCD-based primality test.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(103, 141), match='critical bugs in its primality testing'>}
  summary='This function attempts to find the smaller prime in twin prime pairs up to a given limit, but contains critical bugs in its primality testing algorithm that prevent it from working correctly.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all smaller members of twin prime pairs (consecutive primes differing by 2) below a given limit using a GCD-based primality test.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all twin prime pairs (primes with a difference of 2) below a given `limit` and returns a list of the smaller primes in each pair, using an inefficient prime-checking method.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs up to a specified limit and returns a list containing the smaller prime from each pair.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of the smaller numbers from each pair of twin primes (primes that differ by 2) below the given limit.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function iterates through odd candidates, uses a GCD-based primality test to find twin prime pairs, and returns the smaller prime from each pair below the specified limit.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(132, 144), match='it has a bug'>}
  summary='This function finds all prime numbers p where p+2 is also prime (twin primes) and returns the smaller number from each pair, though it has a bug that incorrectly includes 2 in the results.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all prime numbers p where p+2 is also prime (twin primes) and returns a list of these smaller primes p up to the given limit.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all twin prime pairs below a given limit and returns a list of the smaller prime in each pair, using an unconventional prime-checking algorithm.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies all twin prime pairs up to a specified `limit` and returns a list containing the smaller prime of each pair, notably using a highly inefficient, subtraction-based version of the Euclidean algorithm to perform its primality tests.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function attempts to find and return all prime numbers less than a given limit for which both the number and the number plus two are prime, i.e., the smaller number in each twin prime pair.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns all smaller primes in twin-prime pairs (p, p+2) below a given limit using repeated subtraction for GCD-based primality checks.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(106, 122), match='flawed primality'>}
  summary='This function attempts to find the smaller primes in twin prime pairs up to a given limit, but contains a flawed primality test that uses GCD calculations instead of proper divisibility checking, making it unreliable for larger numbers.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs (primes that differ by 2) up to a given limit and returns a list containing the smaller prime from each pair.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all smaller primes in twin prime pairs (primes differing by 2) up to a given limit, using an inefficient GCD-based primality test.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies the smaller prime in each twin prime pair up to a given limit by iterating through odd numbers and using an inefficient, subtraction-based GCD algorithm to perform primality tests.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of the smaller values in each pair of twin primes (i.e., consecutive primes differing by 2) less than the given limit, using a convoluted method to check for primality.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns all smaller primes in consecutive twin prime pairs (p, p+2) below a given limit, using a GCD-based primality test and tracking immediate prime predecessors.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns the smaller prime in each twin prime pair (primes p where p+2 is also prime) up to the given limit, using an inefficient primality test that employs repeated subtraction instead of the modulo operator.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies twin prime pairs (consecutive primes differing by 2) up to a given limit and returns the smaller prime from each pair.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all twin prime pairs below a given limit and returns the smaller prime from each pair.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies all twin prime pairs below a given limit by iteratively checking odd numbers for primality and then returns a list containing only the smaller prime from each discovered pair.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(21, 41), match='with a flawed method'>}
  summary='This function tries (with a flawed method) to collect all numbers less than the given limit that are the smaller number in a pair of twin primes (primes that differ by 2), but it uses an incorrect custom primality test which makes it unreliable.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the smaller primes of twin prime pairs up to a specified limit using GCD-based primality checks and twin-pair validation.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(146, 162), match='flawed primality'>}
  summary='This function attempts to find the smaller primes in twin prime pairs (primes that differ by 2) up to a given limit, but contains a fundamentally flawed primality testing algorithm that will produce incorrect results.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all twin prime pairs (p, p+2) where both numbers are prime and less than the given limit, returning a list of the smaller prime from each pair.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all twin prime pairs below a given limit and returns the smaller prime from each pair.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies all twin prime pairs up to a given limit by using an inefficient, subtraction-based GCD algorithm to test for primality and then appends the smaller prime of each discovered pair to a list.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of all odd numbers less than a given limit that, together with the next odd number above them, form twin primes—that is, it gives the smaller member of each twin prime pair found below the limit.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies smaller primes in twin prime pairs (p, p+2) where p+2 < limit, using a primality test based on GCD-by-subtraction.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(124, 138), match='contains a bug'>}
  summary='This function attempts to find the smaller prime in each twin prime pair (primes that differ by 2) up to a given limit, but contains a bug that skips checking divisibility by 2, leading to incorrect results for even numbers.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all pairs of prime numbers that differ by exactly 2 (twin primes) up to a given limit and returns a list containing only the smaller prime from each pair.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all pairs of twin primes (primes differing by 2) below a given limit and returns the smaller prime from each pair.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies all twin prime pairs below a given `limit` by iterating through odd numbers, checking for primality using a subtraction-based GCD algorithm, and returns a list containing the smaller prime of each discovered pair.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all numbers less than `limit` which are the smaller in a pair of "twin primes" (pairs of primes that differ by 2), using an inefficient nonstandard primality check based on repeated subtraction and coprimality.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all primes \\( p \\) below `limit` where \\( p \\) and \\( p+2 \\) are both prime, returning \\( p \\) in a list.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds and returns the smaller prime in each twin prime pair (consecutive odd primes differing by 2) up to a given limit, using a subtraction-based GCD algorithm for primality testing.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function returns a list of all prime numbers less than the limit that have another prime exactly 2 units larger, effectively finding the smaller members of all twin prime pairs below the given limit.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function finds all smaller primes in twin prime pairs (primes with a difference of 2) up to a given limit by manually checking primality using a subtraction-based GCD algorithm.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function identifies the smaller prime of each twin prime pair below a specified limit by sequentially testing odd numbers for primality using a highly inefficient, subtraction-based algorithm to find common divisors.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function finds all the smaller numbers in pairs of "twin primes" less than a given limit, using an unusual GCD-based method to check for primes.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function identifies smaller primes in twin-prime pairs below a limit using a GCD-based primality test, but is computationally inefficient.'

Results¶
In [24]:
plot_results(
    "Twin Primes",
    twin_primes_meaningful_names_results,
    twin_primes_meaningless_names_results,
    twin_primes_ignore_names_results,
    include_lengths=True,
)
[Figure: "Twin Primes" results plot produced by plot_results]

Standardization¶

A common method of normalizing data samples in statistics is standardization (z-score normalization): the samples are shifted and scaled so that the population mean becomes 0 and the standard deviation becomes 1.
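In symbols, the transformation can be written as below. This is a sketch that assumes the population (divide-by-n) convention; the symbols n, \mu, \sigma and z_i are notation introduced here for illustration and do not appear elsewhere in the notebook.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2
         = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu^2,
\qquad
z_i = \frac{x_i - \mu}{\sigma}
```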

A pure Python implementation might look like the one below. Non-finite values (NaNs and infinities) are excluded from the statistics but kept in the output, and handling empty lists or zero variance is left to the caller:

In [25]:
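def standardize_basic(x):
    # Only finite values contribute to the statistics.
    x_finite = [value for value in x if math.isfinite(value)]
    x_finite_sqr = [value * value for value in x_finite]
    mean = sum(x_finite) / len(x_finite)
    # Population variance computed as E[x^2] - E[x]^2.
    var = sum(x_finite_sqr) / len(x_finite) - mean * mean
    std = var ** 0.5

    # Non-finite inputs are transformed as well, so they stay non-finite.
    return [(value - mean) / std for value in x]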


def test_standardization(func):
    assert_eq(
        [-1.5, 1.5, -0.5, float("-inf"), 0.5, 0.0],
        func([-1, 5, 1, float("-inf"), 3, 2]),
    )


test_standardization(standardize_basic)
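
As a cross-check, the same transformation can be sketched with the standard library's `statistics` module. The function name `standardize_reference` is hypothetical and not part of the notebook, and the sketch assumes the population convention (`statistics.pstdev`), with statistics taken over the finite values only.

```python
import math
import statistics


def standardize_reference(values):
    # Hypothetical helper: statistics are computed from finite values only.
    finite = [v for v in values if math.isfinite(v)]
    mean = statistics.fmean(finite)
    # pstdev is the population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(finite)

    # Every original value is transformed, so NaNs and infinities
    # propagate to the output instead of being dropped.
    return [(v - mean) / std for v in values]


sample = [-1.0, 5.0, 1.0, float("-inf"), 3.0, 2.0]
print(standardize_reference(sample))
```

Like the implementation above, this sketch leaves empty inputs and zero variance to the caller.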

Complications: the variance will be calculated with a rearranged formula, then Newton's method will be used to compute its square root, the standard deviation.
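
The two complications can be sketched in isolation. Expanding (x - mean)^2 = x^2 - 2 * mean * x + mean^2 and summing over all values gives sum(x^2 + mean^2) - 2 * mean * sum(x), and Newton's method applied to s^2 - a = 0 gives the update s = (s^2 + a) / (2 * s). The helper names `variance_rearranged` and `newton_sqrt` are hypothetical and exist only for this illustration. The starting guess a / 2 and the ten iterations are assumed to be sufficient for inputs of moderate magnitude, not for arbitrary ones.

```python
import math


def variance_rearranged(values):
    # Hypothetical helper: population variance via the expanded square.
    n = len(values)
    total = sum(values)
    mean = total / n
    # Accumulate x^2 + mean^2 per element ...
    sum_sqr = sum(v * v + mean * mean for v in values)
    # ... then subtract the cross term 2 * mean * sum(x).
    return (sum_sqr - 2.0 * mean * total) / n


def newton_sqrt(a, iterations=10):
    # Hypothetical helper: Newton's iteration for the root of s^2 - a.
    # Assumes a > 0; a == 0 would divide by zero.
    s = a / 2.0
    for _ in range(iterations):
        s = (s * s + a) / (s + s)
    return s


data = [-1.0, 5.0, 1.0, 3.0, 2.0]
var = variance_rearranged(data)
print(var, newton_sqrt(var), math.sqrt(var))
```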

In [26]:
def standardize(x):
    x_finite = [value for value in x if math.isfinite(value)]
    x_sum = 0.0

    for value in x_finite:
        x_sum += value

    mean = x_sum / len(x_finite)
    mean_sqr = mean * mean
    sum_sqr = 0.0

    for value in x_finite:
        sum_sqr += value * value + mean_sqr

    var = (sum_sqr - 2.0 * mean * x_sum) / len(x_finite)
    std = var / 2.0

    for i in range(10):
        std = (std * std + var) / (std + std)

    return [(value - mean) / std for value in x]


def f(a):
    b = [c for c in a if math.isfinite(c)]
    d = 0.0

    for c in b:
        d += c

    e = d / len(b)
    g = e * e
    h = 0.0

    for c in b:
        h += c * c + g

    i = (h - 2.0 * e * d) / len(b)
    j = i / 2.0

    for k in range(10):
        j = (j * j + i) / (j + j)

    return [(c - e) / j for c in a]


test_standardization(standardize)
test_standardization(f)

std_valid_re = re.compile(
    r"(z.score)|(standardiz[ea])|(normaliz[ea])",
    re.IGNORECASE,
)

std_invalid_ptrns = {
    "suspects_bug": re.compile(
        (
            r"(contains a( critical)? bug)"
            r"|((error|bug)s* (that|which) prevents*)"
            r"|((mathematical|calculation|implementation) (error|bug)s*)"
            r"|(bug in (the|its)( variance)? calculations*)"
            r"|(appears to be incorrect)"
            r"|((flawed|incorrect)(| implementation of| method| measure of| calculation).*(variance|standard deviation|std|algorithm|formulas*|normalization))"
            r"|(non-finite values inconsistent)"
            r"|(mishandles non-finite values)"
            r"|(incorrectly calculated)"
            r"|(incorrect.*method)"
            r"|(critical errors in calculating)"
            r"|((incorrect|unreliable|inaccurate) results*)"
            r"|(results will.*be inaccurate)"
            r"|(somewhat resembling square root refinement)"
            r"|(non-standard formula loosely related to variance)"
            r"|(variance-like statistic)"
            r"|(z-score-like outputs*)"
            r"|(error-prone implementation)"
            r"|(potentially inaccurate method)"
            r"|(mathematically invalid approach)"
            r"|(inaccurate standardized values)"
            r"|(contains critical flaws)"
            r"|(includes a bug)"
            r"|(partially incorrect implementation)"
        ),
        re.IGNORECASE,
    ),
}

std_meaningful_names = "import math\n\n" + inspect.getsource(standardize)
std_meaningless_names = "import math\n\n" + inspect.getsource(f)
Meaningful Names¶
In [27]:
std_meaningful_names_results = explain_code(
    "std-meaningful-names",
    std_meaningful_names,
    std_valid_re,
    std_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(120, 138), match='incorrect variance'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by a value derived from an incorrect variance calculation and a mysterious iterative process, resulting in improperly standardized data.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers by computing the mean and standard deviation from finite values only, then applying the transformation (x - mean) / std to all original values including non-finite ones.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function attempts to standardize a dataset (subtract mean and divide by standard deviation) using a non-standard and mathematically questionable approach to handle finite values while preserving non-finite ones in the output.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers by calculating their z-scores, first filtering out non-finite values to compute the mean and population standard deviation, where the standard deviation itself is calculated using an iterative numerical method instead of a direct square root.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(161, 215), match='incorrect formulas for variance and standard devi>}
  summary='This function attempts to standardize a list of numbers by removing non-finite values and transforming the rest to have zero mean and unit variance, but it uses incorrect formulas for variance and standard deviation, making its results unreliable compared to standard scaling methods.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the mean and population standard deviation of finite values in the input list, then standardizes all original values (including non-finite ones) using Z-score normalization.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(142, 161), match='mathematical errors'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by the standard deviation, but contains critical mathematical errors in the variance and standard deviation calculations that prevent it from working correctly.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by subtracting the mean and dividing by the standard deviation (computed using an iterative square root algorithm), while handling non-finite values by excluding them from the statistics calculation but including them (as NaN) in the output.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(149, 179), match='non-finite values inconsistent'>}
  summary='This function attempts to standardize a dataset by subtracting the mean and dividing by an iteratively-calculated standard deviation, while handling non-finite values inconsistently.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers by converting them to Z-scores, handling non-finite values correctly, but it notably uses complex and inefficient numerical algorithms to compute the variance and standard deviation instead of direct, standard Python functions.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function removes non-finite values, computes the mean and standard deviation (using a custom iterative square root), and returns a standardized version (z-score) of the input list.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes Z-scores for all input values using the mean and standard deviation derived exclusively from finite values, but may produce undefined results for non-finite inputs or empty finite sets.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(120, 138), match='incorrect variance'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by a value derived from an incorrect variance calculation and a mysterious iterative process, resulting in improperly standardized data.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers by calculating the mean and standard deviation of finite values, then transforming all values (including non-finite ones) using the formula (x - mean) / std, with an unusual iterative implementation for computing the square root.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(146, 176), match='non-finite values inconsistent'>}
  summary='The function attempts to standardize a dataset by subtracting the mean and dividing by a questionable standard deviation estimate, while handling non-finite values inconsistently.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers by calculating the z-score for each element, effectively transforming the dataset to have a mean of 0 and a standard deviation of 1, while using unconventional and complex methods for the intermediate statistical calculations.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(182, 237), match='incorrect method for calculating the standard dev>}
  summary='The function attempts to standardize a list of numbers by removing finite non-numeric values and computing what appears to be a Z-score, but it uses a highly non-standard and likely incorrect method for calculating the standard deviation.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes Z-scores for input data by filtering non-finite values, calculating mean and variance, iteratively approximating standard deviation, and standardizing all original values.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(138, 157), match='mathematical errors'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by standard deviation, but contains multiple mathematical errors in computing variance and standard deviation, making it produce incorrect results.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(157, 176), match='implementation bugs'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by the standard deviation (z-score normalization), but contains implementation bugs in the variance calculation that will produce incorrect results.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function attempts to standardize a dataset by subtracting the mean and dividing by standard deviation, using an unconventional calculation method that ultimately converges to correct values, while preserving non-finite inputs in the output.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function standardizes a list of numbers by calculating their Z-scores, notably using a convoluted formula for variance and an iterative numerical method (Newton's method) to find the standard deviation instead of using standard library functions."

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(101, 192), match='incorrectly calculated standard deviation, potent>}
  summary='This function attempts to standardize a list by subtracting the mean and dividing by a non-standard, incorrectly calculated standard deviation, potentially resulting in incorrect normalization values.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes z-scores for all input values using the mean and standard deviation derived exclusively from finite values, with non-finite values preserved in the output.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(133, 163), match='critical errors in calculating'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by the standard deviation, but contains critical errors in calculating both the variance and standard deviation, making its output unreliable for statistical normalization.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, computing mean and standard deviation from finite values only, but applies the transformation to all values including non-finite ones.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function attempts to standardize (z-score normalize) input values by removing non-finite numbers for calculations, then applying questionable variance and standard deviation computations before returning standardized values for all original inputs.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function standardizes a list of numbers by calculating their z-scores, notably filtering out non-finite values for its statistical calculations and using an iterative numerical method (Newton's method) to find the standard deviation instead of a direct square root function."

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(103, 142), match='incorrect measure of standard deviation'>}
  summary='This function attempts to standardize a list of numbers by computing their mean and an unconventional, incorrect measure of standard deviation, then returns each value transformed accordingly.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function computes Z-scores for all values in a list using the mean and standard deviation derived exclusively from finite elements, leveraging Newton's method for standard deviation approximation."

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(133, 163), match='critical errors in calculating'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by the standard deviation, but contains critical errors in calculating both the variance and standard deviation, making it produce incorrect results.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, centering the data around zero and scaling by the standard deviation, while handling non-finite values by excluding them from mean/std calculations but including them (as NaN) in the output.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function attempts to standardize input data by converting it to z-scores (subtracting mean and dividing by standard deviation), using an unconventional implementation for variance calculation and including non-finite values in the output.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers by calculating their z-scores based on the population mean and standard deviation, notably using inefficient and complex algorithms to compute the variance and its square root.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the mean and standard deviation of all finite values in a list, then returns a new list where each value in the original list (including infinities and NaNs) is standardized (z-score), with non-finite values resulting in non-finite outputs.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes z-scores for all input values using the mean and standard deviation derived from finite elements, with iterative square-root approximation and edge-case vulnerabilities.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(133, 152), match='mathematical errors'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by the standard deviation, but contains mathematical errors in calculating both the variance and standard deviation, making its output unreliable for proper statistical standardization.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score standardization (subtract mean, divide by standard deviation) on a list of numbers, using only finite values to compute statistics but applying the transformation to all input values, with an unusual implementation that uses Newton's method to compute the square root."

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(110, 129), match='mathematical errors'>}
  summary='This function attempts to standardize a dataset (subtract mean and divide by standard deviation) but contains mathematical errors in its variance and standard deviation calculations.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by calculating the z-score for each element, using only the finite values to compute the necessary mean and standard deviation via unconventional algorithms.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(121, 173), match='incorrect calculation of mean and standard deviat>}
  summary='This function attempts to standardize a numeric list by removing non-finite values and applying a nonstandard and likely incorrect calculation of mean and standard deviation before returning the Z-scored values.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes Z-scores for input values using the mean and iteratively approximated standard deviation of finite values, retaining non-finite entries in the output.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(142, 161), match='mathematical errors'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by the standard deviation, but contains critical mathematical errors in computing both the variance and standard deviation, resulting in incorrect standardization.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function standardizes a list of numbers by computing z-scores (subtracting mean and dividing by standard deviation), but uses an unnecessarily complex implementation with Newton's method for square root calculation and produces NaN/infinity for non-finite input values."

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(110, 129), match='mathematical errors'>}
  summary='This function attempts to standardize a dataset (subtract mean and divide by standard deviation) but contains mathematical errors in its standard deviation calculation and handles non-finite values inconsistently.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the z-score for each element in a list by first computing the mean and population standard deviation of its finite values, notably using an iterative Newton-Raphson method to find the square root for the standard deviation instead of a direct function call.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(128, 168), match='incorrect formula for standard deviation'>}
  summary='This function attempts to standardize a list of numbers while ignoring non-finite values, but it uses a non-standard and likely incorrect formula for standard deviation, making its results unreliable for proper standardization.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes Z-scores for all input values using the mean and standard deviation derived from finite entries, with non-finite inputs propagating to the output.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(134, 153), match='mathematical errors'>}
  summary='This function attempts to standardize a list of numbers by removing the mean and scaling by standard deviation, but contains critical mathematical errors in computing both the variance and standard deviation, resulting in incorrect standardization.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score standardization (mean centering and scaling by standard deviation) on a list of numbers, using a custom Newton's method implementation for calculating the square root and gracefully handling non-finite values."

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates z-scores for input data by filtering out non-finite values, computing mean and variance, approximating standard deviation, and standardizing all original values (including non-finite ones).'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers by calculating their z-scores, robustly ignoring non-finite values for its statistical calculations and using unconventional numerical algorithms to compute the variance and standard deviation.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function removes non-finite values from the input, computes the mean and standard deviation (using a custom square root loop), and returns the Z-scores (standardized values) for all original values in the list.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes z-scores for input data using population statistics derived from finite values, with Newton-Raphson approximation for standard deviation.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(141, 160), match='mathematical errors'>}
  summary='This function attempts to standardize a list of numbers by subtracting the mean and dividing by the standard deviation, but contains several mathematical errors in computing the variance and standard deviation.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, using finite values to calculate mean and standard deviation (via an iterative square root method), but applies the transformation to all values including non-finite ones.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function attempts to standardize a dataset by subtracting the mean and dividing by a standard deviation calculated through an unconventional method, while preserving non-finite values in the output.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function standardizes a list of numbers by calculating their z-scores based on the mean and population standard deviation of the finite values, notably using Newton's method to compute the square root for the standard deviation."

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(131, 185), match='incorrect formulas for variance and standard devi>}
  summary='The function intends to "standardize" a list (z-score), ignoring infinite and NaN values for mean and standard deviation, but uses incorrect formulas for variance and standard deviation, so the results will generally be inaccurate.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(111, 139), match='mishandles non-finite values'>}
  summary='This function standardizes input data using finite-value-derived statistics but crashes on constant inputs and mishandles non-finite values.'
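
Several of the summaries flagged above claim that the standard deviation formula is incorrect, while others describe it as a population standard deviation whose square root is obtained with Newton's method. The following is a minimal sketch of a function matching the latter description, not the notebook's actual `std_meaningful_names` source. The name `standardize_sketch`, the 10 iterations, and the starting guess of half the variance are assumptions taken from wording in the summaries.

```python
import math


def standardize_sketch(values, iterations=10):
    """Hypothetical reconstruction: z-scores with a Newton's method square root."""
    # Statistics are computed from the finite values only.
    finite = [v for v in values if math.isfinite(v)]
    mean = sum(finite) / len(finite)
    variance = sum((v - mean) ** 2 for v in finite) / len(finite)

    # Newton's method for sqrt(variance): x <- (x + variance / x) / 2.
    # The starting guess (half the variance) is an assumption.
    estimate = variance / 2.0
    for _ in range(iterations):
        estimate = (estimate + variance / estimate) / 2.0

    # The transformation is applied to every original value,
    # so NaN and infinite inputs stay non-finite in the output.
    return [(v - mean) / estimate for v in values]
```

For small inputs such as `[1.0, 2.0, 3.0, 4.0, 5.0]`, the result can be compared against `(v - mean) / math.sqrt(variance)` to check that the iterative square root agrees with a direct call. In this sketch a constant input gives a variance of zero and raises `ZeroDivisionError`, which is consistent with the remark in one summary about crashing on constant inputs.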

Meaningless Names¶
In [28]:
std_meaningless_names_results = explain_code(
    "std-meaningless-names",
    std_meaningless_names,
    std_valid_re,
    std_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, transforming them to have mean 0 and standard deviation 1, while handling non-finite values by excluding them from the statistics calculation.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score normalization (standardization) on a list of numbers, transforming each value to its number of standard deviations from the mean while handling non-finite values appropriately.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs a robust standardization (z-score normalization) of input data, handling non-finite values while computing statistics only from finite values, and includes an iterative square root approximation for the standard deviation.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes the input list `a` by calculating the z-score for each element, using the mean and standard deviation computed only from the finite numbers within the list.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the z-score (standard score) of each element in the input list, centering by the mean and scaling by the standard deviation of the finite values, using an iterative method to approximate the standard deviation.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes elements of the input list using the mean and standard deviation of its finite values, computed via iterative approximation and population statistics.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score normalization (standardization) on a list of numbers, handling non-finite values by computing statistics only from finite values, but applying the transformation to all original values.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization (normalization) on a list of numbers, transforming them to have a mean of 0 and standard deviation of 1, while using an iterative method to compute the square root for the standard deviation calculation.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on an input list, filtering out non-finite values for calculations but preserving them in the output, using an iterative approximation for standard deviation.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes an input list of numbers by calculating the z-score for each element, based on the mean and standard deviation of only the finite values in the list.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function takes a list, computes the mean and standard deviation of its finite values, and returns a list of z-scores (standardized values) for each original element.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the z-scores of input values using the population standard deviation of finite elements, with non-finite values unchanged in the output.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, calculating the mean and standard deviation from only the finite values but applying the transformation to all values including inf/-inf/NaN.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score normalization (standardization) on a list of numbers, transforming each value to have zero mean and unit variance while handling non-finite values appropriately.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function standardizes (z-score normalizes) an input list by subtracting the mean and dividing by the standard deviation, while handling non-finite values and using Newton's method for square root approximation."

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by calculating their z-scores, using the mean and population standard deviation computed from only the finite values in the list.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes a z-score normalization (subtracting the mean and dividing by the standard deviation, ignoring non-finite values) for each element in the input list, using a Newton-Raphson-like method to approximate the standard deviation.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates z-scores for finite elements in a list using the population standard deviation, propagating non-finite values unchanged.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, transforming them to have zero mean and unit standard deviation while handling non-finite values appropriately.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by converting each value to its z-score (number of standard deviations from the mean), while filtering out non-finite values for the statistical calculations.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function standardizes the input list by subtracting the mean and dividing by the standard deviation (computed only from finite values), using 10 iterations of Newton's method for the square root calculation."

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes the input list by calculating the Z-score for each element, using a mean and standard deviation derived from only the finite numbers in the list.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function standardizes a list by subtracting its mean and dividing by an unusually computed 'standard deviation'-like value, using only finite values for statistics but applying the transformation to all inputs."

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates z-scores for input values using the population standard deviation of finite elements, with non-finite inputs preserved in the output.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by subtracting the mean and dividing by the standard deviation (computed only from finite values), returning z-scores for all input values.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score normalization (standardization) on an input array, calculating mean and standard deviation from finite values only, but applying the transformation to all values including non-finite ones.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function calculates the z-scores (standardization) of input values while ignoring non-finite numbers, using 10 iterations of Newton's method to approximate standard deviation."

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by calculating the z-score of each element, using a mean and standard deviation computed from only the finite values in the input list.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(170, 212), match='somewhat resembling square root refinement'>}
  summary='The function normalizes each element in the input by subtracting the mean and dividing by a value derived from half the population variance, iteratively updated in a way somewhat resembling square root refinement, rather than using the standard deviation directly.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates z-scores using the population mean and standard deviation of finite values in the input list, while preserving non-finite elements in the output.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score normalization (standardization) on a list of numbers, handling non-finite values by computing statistics only from finite values, but applying the transformation to all original values.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates z-scores (standardized values) for an input array by computing the mean and standard deviation from finite values only, then applying the standardization formula (value - mean) / std_dev to all original values.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function calculates the z-scores of input values by robustly computing mean and standard deviation while handling non-finite values, using Newton's method for precise standard deviation calculation."

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes the input list of numbers by calculating their Z-scores using the mean and standard deviation derived only from the finite values.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function normalizes a list of numbers by subtracting the mean and dividing by the standard deviation (computed only from finite values), returning the normalized scores for all original elements.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates z-scores for all input values using the population mean and standard deviation derived exclusively from finite elements, preserving non-finite values in the output.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by converting them to z-scores (subtracting the mean and dividing by the standard deviation), using an iterative method to calculate the standard deviation and handling non-finite values by excluding them from the statistics calculation.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score normalization (standardization) on the input list, transforming each value to its standard score by subtracting the mean and dividing by the standard deviation of the finite values.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes input data by subtracting the mean and dividing by an iteratively refined standard deviation estimate, while properly handling non-finite values.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers (calculates their z-scores) using a mean and standard deviation derived exclusively from the finite values within the list.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function returns a z-score normalization (mean 0, variance 1) of all input values (including non-finite ones, which will yield non-finite results), after ignoring non-finite values in its calculations, and uses an iterative approximation to standard deviation.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the z-scores of all elements in a list using the mean and population standard deviation derived from its finite elements.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by computing z-scores (subtracting mean and dividing by standard deviation), handling non-finite values by excluding them from statistics calculation but including them in the output.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score normalization (standardization) on a list of numbers, transforming each value to have zero mean and unit variance, while handling non-finite values by computing statistics only from finite values but applying the transformation to all input values.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes (z-score normalizes) an input list by removing non-finite values, computing the mean and approximate standard deviation, and transforming all original values (including non-finite ones) to their z-scores.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers by calculating the z-score for each element, using the mean and standard deviation derived only from the finite values in the input.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates the z-score for each element in an input list (accounting only for finite values in the mean and standard deviation calculation), using an iterative approximation for the standard deviation.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes z-scores for all elements in the input list using the mean and standard deviation derived from its finite values.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score normalization (standardization) on the input list, transforming each value to have zero mean and unit variance, while preserving non-finite values in their original positions.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, calculating the mean and standard deviation from finite values only, then returning standardized scores for all input values.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs finite-only mean and variance calculations, then returns a z-score normalized version of the original input (including non-finite values).'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers by calculating the z-score for each element, using a mean and standard deviation computed exclusively from the finite values in the list.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(141, 189), match='non-standard formula loosely related to variance'>}
  summary='This function normalizes a list of numbers by subtracting the mean (of all finite values) and dividing by a custom scale `j` computed from a non-standard formula loosely related to variance, and applies the normalization to all original input values, including any non-finite ones.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes z-scores for all input values using the mean and population standard deviation derived from finite elements, preserving non-finite values in the output.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function standardizes an input array by computing z-scores (subtracting the mean and dividing by the standard deviation), using Newton's method to calculate the standard deviation and handling non-finite values appropriately."

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score normalization (standardization) on an input array, computing mean and standard deviation from finite values only, where the standard deviation is calculated using Newton's method for square root approximation."

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates the z-scores of input values by first filtering out non-finite numbers, computing mean and standard deviation, then normalizing all original values (including non-finite ones).'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by calculating the z-score for each element, after first filtering out non-finite values to compute the necessary mean and standard deviation.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(115, 138), match='variance-like statistic'>}
  summary='This function standardizes the input list by centering its finite values to mean zero and scaling by a customized, variance-like statistic, resulting in z-score-like outputs but using a nonstandard formula for dispersion.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes z-scores for all elements in a list using the population standard deviation of finite values.'
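
To compare runs like the ones above, the `is_valid` verdicts can be counted per model and reasoning budget. The sketch below parses the printed log format shown in this section. The helper name `tally_validity` is hypothetical, and the notebook may already aggregate the objects returned by `explain_code` in a different way.

```python
import re
from collections import Counter

# Patterns for the two comment lines printed for every run.
RUN_RE = re.compile(
    r"model_name='(?P<model>[^']+)', reasoning_budget=(?P<budget>\d+)"
)
VALID_RE = re.compile(r"is_valid=(?P<valid>True|False)")


def tally_validity(log_text):
    """Return {(model_name, reasoning_budget): (valid_runs, total_runs)}."""
    valid, total = Counter(), Counter()
    current = None
    for line in log_text.splitlines():
        run = RUN_RE.search(line)
        if run:
            # Remember which run the next verdict line belongs to.
            current = (run["model"], int(run["budget"]))
            continue
        verdict = VALID_RE.search(line)
        if verdict and current is not None:
            total[current] += 1
            valid[current] += verdict["valid"] == "True"
            current = None
    return {key: (valid[key], total[key]) for key in total}
```

Passing the captured output of one `explain_code` call to `tally_validity` yields one `(valid_runs, total_runs)` pair per model configuration, which makes the meaningful-names and meaningless-names runs directly comparable.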

Ignore Names¶
In [29]:
std_ignore_names_results = explain_code(
    "std-ignore-names",
    std_meaningful_names,
    std_valid_re,
    std_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
    extra_instructions=ignore_names,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function computes z-scores (standardized values) for a list of numbers by subtracting the mean and dividing by the standard deviation, using Newton's method to compute the square root of the variance, while preserving non-finite values in their original positions."

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score standardization on the input data, using Newton's method to compute the standard deviation and handling non-finite values by excluding them from statistics calculation but preserving them in the transformed output."

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the z-scores of input values by computing mean and standard deviation (using an iterative square root method) while ignoring non-finite values in calculations but preserving them in the output.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes every element in an input list by first calculating the population mean and standard deviation of only the finite numerical values, notably computing the standard deviation via an iterative Newton-Raphson approximation for the square root, and then applies the standardization formula to all original elements.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers using the mean and population standard deviation (computed only from finite values) and applies this transformation to all original input values.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes z-scores for input data using population statistics, handles non-finite values gracefully, and approximates standard deviation via iterative methods.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, handling non-finite values by excluding them from mean and standard deviation calculations, but including them (as NaN) in the output.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score normalization (standardization) on a list of numbers, computing mean and standard deviation from finite values only but applying the transformation to all input values, using Newton's method to calculate the square root for the standard deviation."

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the mean and standard deviation of finite values in a list, then returns a standardized version of all original values (including non-finite ones) by subtracting the mean and dividing by the iteratively-calculated standard deviation.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by calculating the mean and population standard deviation from its finite values—using an iterative algorithm for the standard deviation—and then applies this transformation to every element in the original list.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function takes a list of numbers, computes the mean and standard deviation ignoring non-finite values, and returns a new list where each original value is expressed as its z-score relative to the mean and standard deviation of the finite values.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes input data by computing the mean and population standard deviation of finite values, then applies z-score normalization to all elements of the original list, but may fail for edge cases like constant data or empty inputs.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(145, 162), match='incorrect results'>}
  summary='This function attempts to standardize data by centering and scaling, but contains critical bugs in the variance calculation that make it produce incorrect results, while also preserving non-finite values in the output.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, calculating mean and standard deviation from finite values only, but applying the transformation to all values including non-finite ones.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(110, 129), match='mathematical errors'>}
  summary='This function attempts to standardize a dataset (subtract mean and divide by standard deviation) but contains mathematical errors in its variance calculation and uses an iterative approximation for square roots.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers by first calculating the population mean and standard deviation of only the finite values—using convoluted and iterative methods—and then applies these statistics to compute a z-score for every element in the original list.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function filters out all non-finite values from an input list, computes the *population*-standard deviation via a Newton-Raphson (Babylonian) iterative method, and then returns a list in which every original input value (including non-finite ones) is normalized by subtracting the mean and dividing by the estimated standard deviation.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes z-scores for input data using population statistics derived only from finite values, while preserving non-finite values in the output.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score standardization on a list of numbers, computing mean and standard deviation only from finite values but applying the transformation to all input values, using Newton's method to calculate the square root for the standard deviation."

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score standardization on a list of numbers, computing mean and standard deviation from finite values only but applying the transformation to all values, using Newton's method to calculate the square root for the standard deviation."

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates z-scores for input values by removing non-finite values to compute mean and standard deviation, then applies standardization to all original values (including non-finite ones).'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes an input list of numbers (calculates their z-scores) by first computing the mean and an iterative approximation of the standard deviation from only the finite values.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by computing the mean and variance of its finite values, approximating the standard deviation using an iterative square root method, and returning the z-score for each input value (including infinities and NaNs).'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(101, 124), match='contains critical flaws'>}
  summary='This function attempts to standardize finite data values to z-scores using population statistics but contains critical flaws in edge-case handling, efficiency, and numerical stability[1][3][4].'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on the input data, computing mean and standard deviation from finite values only, but applying the transformation to all values including non-finite ones.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, using only finite values to calculate statistics but applying the transformation to all input values, including non-finite ones.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the z-scores of input values by first calculating the mean and standard deviation (using finite values only), with the latter computed through Newton-Raphson iteration, then standardizing all input values including non-finite ones.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes an iterable of numbers by calculating the z-score for each element, where the mean and variance are computed from only finite values and the standard deviation is uniquely found via a 10-iteration numerical approximation rather than a direct square root function.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(119, 190), match='incorrectly calculates the standard deviation via>}
  summary='This function attempts to standardize a list by converting each entry to its Z-score using only the finite values, but incorrectly calculates the standard deviation via a nonstandard formula and iterative refinement, so the results are only rough approximations of proper Z-scores.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes z-scores for input values using the mean and standard deviation of finite entries, with non-finite values preserved in the output.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function standardizes a list of numbers by subtracting the mean and dividing by the standard deviation (computed only from finite values), but uses an iterative Newton's method implementation to calculate the square root instead of using the built-in sqrt function."

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score normalization (standardization) on a list of numbers, computing statistics only from finite values but applying the transformation to all input values, using Newton's method to calculate the standard deviation."

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates z-scores for input data by computing mean and standard deviation (using finite values only) while preserving non-finite values in the output.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by calculating their z-scores, using only the finite values to compute the mean and standard deviation via convoluted and iterative methods rather than direct calculation.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function standardizes a numeric list by filtering out non-finite values for statistics, manually computing the mean and standard deviation, and then normalizing each original value (using \\((value - mean) / std\\)), where the standard deviation is calculated iteratively via Newton's method."

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes z-scores for all input values using the population mean and standard deviation derived from finite elements.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes a list of numbers by subtracting the mean and dividing by the standard deviation (computed only from finite values), but applies this transformation to all values including non-finite ones.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function computes z-score standardization of input data, calculating statistics only from finite values but applying the transformation to all values, using Newton's method to compute the standard deviation."

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates the z-scores of input values (standardization) by computing the mean and standard deviation while ignoring non-finite values, using an iterative method to calculate the square root for the standard deviation.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function standardizes a list of numbers by calculating their z-scores, but after filtering out non-finite values to compute the mean and population variance, it uniquely uses 10 iterations of Newton's method to find the standard deviation instead of a direct square root function."

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(180, 197), match='incorrect) method'>}
  summary='This function attempts to "standardize" a list of numbers by subtracting the mean and dividing by a value loosely related to the standard deviation, but it uses a nonstandard (and incorrect) method for this divisor, leading to output values that do not actually represent standardized (z-scored) data.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes z-scores for input values using the population mean and standard deviation of finite elements, retaining non-finite values in the output.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score normalization (standardization) on a list of numbers, computing mean and standard deviation from finite values only, but applying the transformation to all values including non-finite ones.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score normalization on the input data, computing mean and standard deviation from finite values only, but uses Newton's method instead of `math.sqrt()` to calculate the standard deviation."

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes z-scores for all input values (including non-finite ones) by calculating mean and standard deviation from only the finite values, using an iterative square root approximation.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the mean and standard deviation of only the finite values in an input list and then uses these statistics to standardize every value in the original list, returning their Z-scores.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the mean and (approximately) the standard deviation (using an iterative method) of only the finite values in a list, then returns a new list where every original value is centered and scaled by these statistics, effectively performing a type of standardization that ignores non-finite values when estimating parameters.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function standardizes numerical data by centering it on the mean and scaling by the iteratively computed population standard deviation, while preserving non-finite values unchanged.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(132, 158), match='error-prone implementation'>}
  summary="This function attempts to standardize an input array by subtracting the mean and dividing by the standard deviation, but it uses an error-prone implementation with Newton's method for square root calculation and includes a bug where non-finite values are transformed using statistics computed only from finite values."

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function performs z-score standardization on a list of numbers, computing mean and standard deviation from finite values only but applying the transformation to all values, using Newton's method to calculate the square root."

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(141, 175), match='partially incorrect implementation'>}
  summary='The function attempts to standardize input values by subtracting the mean and dividing by standard deviation, but uses an unconventional and partially incorrect implementation for calculating variance.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function standardizes a list of numbers by first calculating the population mean and variance of its finite elements, then uniquely computes the standard deviation using 10 iterations of Newton's method for finding a square root, and finally applies the z-score transformation to all of the original elements."

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the mean and standard deviation of the finite values in a list, then returns a standardized (z-score) version of the entire list based on those statistics, using a manual iterative method to estimate the standard deviation.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes Z-scores for an input list by deriving mean and standard deviation from its finite values, using iterative approximation for standard deviation and returning standardized results for all original elements.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, computing mean and standard deviation from finite values only, but applying the transformation to all input values including non-finite ones.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function performs z-score standardization on a list of numbers, computing mean and standard deviation from finite values only, but applying the transformation to all values including NaN and infinity.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(123, 152), match='potentially inaccurate method'>}
  summary='This function attempts to standardize input data (subtract mean, divide by standard deviation) but uses an inefficient and potentially inaccurate method for calculating standard deviation while preserving non-finite values in the output.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function standardizes a list of numbers by first calculating the population mean and standard deviation using only the finite values, and then applies the z-score formula to all elements of the original list.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(86, 126), match='incorrectly computed "standard deviation'>}
  summary='This function attempts to z-score a vector by subtracting its mean and dividing by an incorrectly computed "standard deviation", but uses a mathematically invalid approach for scaling that will yield inaccurate standardized values.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes z-scores for all input values using the population mean and standard deviation of finite elements, preserving non-finite values in the output.'

Results¶
In [30]:
plot_results(
    "Standardization",
    std_meaningful_names_results,
    std_meaningless_names_results,
    std_ignore_names_results,
    include_lengths=True,
)

HighLife¶

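HighLife is a variation of the cellular automaton called Conway's Game of Life, with B36/S23 rules: a dead cell is born if it has exactly 3 or 6 live neighbors, and a live cell survives if it has 2 or 3 live neighbors.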

One way to implement it:

In [31]:
def highlife_basic(alive_cells):
    def should_survive(cell):
        return count_alive_neighbors(cell) in {2, 3}

    def should_be_born(cell):
        return (not is_alive(cell)) and count_alive_neighbors(cell) in {3, 6}

    def count_alive_neighbors(cell):
        return sum(1 for neighbor in neighbors(cell) if is_alive(neighbor))

    def is_alive(cell):
        return cell in alive_cells

    def neighbors(cell):
        return [(cell[0] + dx, dy + cell[1]) for dx, dy in neighbor_offsets()]

    def neighbor_offsets():
        return [
            (dx, dy)
            for dx in range(-1, 2)
                for dy in range(-1, 2)
                    if dx != 0 or dy != 0
        ]

    return set(
        sum(
            [
                ([cell] if should_survive(cell) else [])
                + [
                    neighbor
                    for neighbor in neighbors(cell)
                        if should_be_born(neighbor)
                ]
                for cell in alive_cells
            ],
            start=[],
        )
    )


def test_highlife(func):
    # One generation of a glider: step_1 must evolve into step_2.
    step_1 = {
                (1, 0),
                        (2, 1),
        (0, 2), (1, 2), (2, 2),
    }
    step_2 = {
        (0, 1),         (2, 1),
                (1, 2), (2, 2),
                (1, 3),
    }
    assert_eq(step_2, func(step_1))


test_highlife(highlife_basic)

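Complication: various small inconsistencies in the conditionals and in the neighbor coordinate offset calculations. The survival check uses integer division (`// 2 == 1`) instead of a membership test, the neighbor count sums booleans directly, and the vertical offsets are generated from `range(3)` and shifted by 1 afterwards.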
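The complications are meant to change only the appearance of the code, not its behavior. The following is a minimal sketch that checks the two least obvious rewrites in isolation. The helper names `offsets_basic` and `offsets_complicated` are hypothetical and exist only for this illustration; their bodies are copied from the two `neighbor_offsets` variants.

```python
def offsets_basic():
    # Straightforward Moore neighborhood: the 8 cells around the origin.
    return [
        (dx, dy)
        for dx in range(-1, 2)
        for dy in range(-1, 2)
        if dx != 0 or dy != 0
    ]


def offsets_complicated():
    # dy runs over 0..2 and is shifted by -1 on output, so the excluded
    # center cell corresponds to dx == 0 and dy == 1.
    return [
        (dx, dy - 1)
        for dx in range(-1, 2)
        for dy in range(3)
        if dx or dy != 1
    ]


# Both variants yield the same 8 offsets, in the same order.
assert offsets_basic() == offsets_complicated()
assert len(offsets_complicated()) == 8

# For every possible neighbor count (0..8), integer division by 2 gives 1
# exactly when the count is 2 or 3, so the survival rule is unchanged.
for n in range(9):
    assert (n // 2 == 1) == (n in {2, 3})
```

A summary that reports a bug in the neighbor offsets or in the survival rule is therefore a false alarm, which is what the `suspects_bug` patterns are meant to catch.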

In [32]:
def highlife(alive_cells):
    def should_survive(cell):
        return count_alive_neighbors(cell) // 2 == 1

    def should_be_born(cell):
        return (not is_alive(cell)) and count_alive_neighbors(cell) in {6, 3}

    def count_alive_neighbors(cell):
        return sum(is_alive(neighbor) for neighbor in neighbors(cell))

    def is_alive(cell):
        return cell in alive_cells

    def neighbors(cell):
        return [(cell[0] + dx, dy + cell[1]) for dx, dy in neighbor_offsets()]

    def neighbor_offsets():
        return [
            (dx, dy - 1)
            for dx in range(-1, 2)
                for dy in range(3)
                    if dx or dy != 1
        ]

    return set(
        sum(
            [
                ([cell] if should_survive(cell) else [])
                + [
                    neighbor
                    for neighbor in neighbors(cell)
                        if should_be_born(neighbor)
                ]
                for cell in alive_cells
            ],
            start=[],
        )
    )


def f(a):
    def b(c):
        return d(c) // 2 == 1

    def e(c):
        return (not g(c)) and d(c) in {6, 3}

    def d(c):
        return sum(g(h) for h in i(c))

    def g(c):
        return c in a

    def i(c):
        return [(c[0] + j, k + c[1]) for j, k in l()]

    def l():
        return [
            (j, k - 1)
            for j in range(-1, 2)
                for k in range(3)
                    if j or k != 1
        ]

    return set(
        sum(
            [
                ([c] if b(c) else [])
                + [
                    h
                    for h in i(c)
                        if e(h)
                ]
                for c in a
            ],
            start=[],
        )
    )


test_highlife(highlife)
test_highlife(f)

highlife_valid_re = re.compile(
    (
        r"(highlife)"
        r"|(game.of.life)"
        r"|(cellular automaton)"
        r"|(b36.*s23|b63.*s23|s23.*b36|s23.*b63|b36.*s32|b63.*s32|s32.*b36|s32.*b63)"
        r"|(returns a new set.*coordinates.*neighbor.*count condition.*neighbors.*meet another neighbor.*count condition)"
    ),
    re.IGNORECASE,
)

highlife_invalid_ptrns = {
    "suspects_bug": re.compile(
        (
            r"(asymmetric.*neighborhood)"
            r"|(contains a( critical)? bug)"
            r"|(contains( critical)? bugs)"
            r"|(but may not exactly replicate)"
            r"|(retaining some coordinates)"
            r"|(custom neighbor definition)"
            r"|([^8]-cell neighborhood)"
            r"|([^8]-neighbor topology)"
            r"|(coordinate calculation bug)"
            r"|(hexagonal neighborhood)"
            r"|(bug in the neighbor calculation)"
            r"|(modified Moore neighborhood)"
            r"|(not exactly.*8 adjacent cells)"
        ),
        re.IGNORECASE,
    )
}

highlife_meaningful_names = inspect.getsource(highlife)
highlife_meaningless_names = inspect.getsource(f)
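
The logged `is_valid` and `invalid` fields suggest how a summary is scored with these patterns. The sketch below is an assumption inferred only from those log lines, not the notebook's actual `explain_code` logic, which may differ. The function `classify` is hypothetical, and the patterns are reduced versions of the ones above.

```python
import re

# Reduced, illustrative versions of highlife_valid_re / highlife_invalid_ptrns.
valid_re = re.compile(
    r"(highlife)|(game.of.life)|(cellular automaton)",
    re.IGNORECASE,
)
invalid_ptrns = {
    "suspects_bug": re.compile(
        r"(contains a( critical)? bug)|(hexagonal neighborhood)",
        re.IGNORECASE,
    ),
}


def classify(summary):
    # Assumed rule: a summary is valid if it names the algorithm and
    # none of the "invalid" patterns match anywhere in it.
    invalid = {
        name: ptrn.search(summary) for name, ptrn in invalid_ptrns.items()
    }
    is_valid = valid_re.search(summary) is not None and not any(invalid.values())
    return is_valid, invalid


ok, details = classify("This function implements the HighLife cellular automaton.")
print(ok, details)

ok, details = classify(
    "This function implements the HighLife cellular automaton, "
    "though it contains a bug in the neighbor offset calculation."
)
print(ok, details["suspects_bug"].group(0))
```

Under this assumed rule, a summary that correctly names HighLife but also claims a bug is counted as invalid, which is consistent with the `is_valid=False` entries in the logs.
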
Meaningful Names¶
In [33]:
highlife_meaningful_names_results = explain_code(
    "hl-meaningful-names",
    highlife_meaningful_names,
    highlife_valid_re,
    highlife_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements the HighLife cellular automaton, a variant of Conway's Game of Life where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, returning the next generation of alive cells."

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation according to HighLife rules (survive with 2-3 neighbors, born with 3 or 6 neighbors).'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of cells in the HighLife automaton, applying survival and birth rules to the input set of alive cells, using a custom neighborhood definition.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function calculates the next generation of a cellular automaton known as 'HighLife' by applying its specific birth (3 or 6 neighbors) and survival (2 or 3 neighbors) rules to a given set of alive cells."

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The `highlife` function updates a set of living cells to the next generation according to the "Highlife" cellular automaton rules (survival with 2 or 3 neighbors; birth with 3 or 6 neighbors), returning the new set of living cells.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of cells for the HighLife cellular automaton by applying survival and birth rules based on neighbor counts, using an efficient neighbor-checking approach focused on active regions.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(167, 181), match='contains a bug'>}
  summary='This function implements the HighLife cellular automaton (a Game of Life variant where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors), though it contains a bug in the neighbor offset calculation.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation based on the B36/S23 rule.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of cells in the HighLife automaton by applying survival/birth rules to the current alive cells and their neighbors.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates the next state of the "HighLife" cellular automaton by applying its survival (a live cell survives with 2 or 3 neighbors) and birth (a dead cell is born with 3 or 6 neighbors) rules to a given set of living cells.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of living cells for the "HighLife" cellular automaton (B36/S23), where live cells survive with 2 or 3 neighbors, and new cells are born with 3 or 6 neighbors.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next state of the HighLife cellular automaton, preserving live cells with 2–3 neighbors and reviving dead cells with exactly 3 or 6 neighbors.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(178, 192), match='contains a bug'>}
  summary="This function implements the HighLife cellular automaton, a variant of Conway's Game of Life where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, though it contains a bug in the neighbor offset calculation."

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements one generation step of the HighLife cellular automaton (a variant of Conway's Game of Life), taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation based on survival (2-3 neighbors) and birth (3 or 6 neighbors) rules."

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of cells in the HighLife automaton by applying survival/birth rules to the current alive cells and their neighbors.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function simulates one generation of a cellular automaton known as 'HighLife' by calculating the set of cells that will be alive in the next step according to its unique birth/survival rules (B36/S23)."

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function simulates one step of the "HighLife" cellular automaton, generating the next generation of live cells from the current state based on custom birth and survival rules.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function evolves a cellular automaton state using HighLife rules, processing survival of live cells and birth in adjacent dead cells to compute the next generation.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(166, 180), match='contains a bug'>}
  summary='This function implements one generation step of the HighLife cellular automaton, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, though it contains a bug in the neighbor offset calculation.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation based on survival (2-3 neighbors) and birth (3 or 6 neighbors) rules.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of cells in the HighLife automaton by applying survival and birth rules to the current alive cells and their neighbors.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function calculates the next generation of a cellular automaton called 'HighLife' by applying its specific survival (2 or 3 neighbors) and birth (3 or 6 neighbors) rules to a given set of currently alive cells."

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of alive cells in the "Highlife" cellular automaton, where cells survive with 2 or 3 living neighbors and are born with exactly 3 or 6 living neighbors.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function evolves a cellular automaton state using HighLife rules (birth at 3/6 neighbors, survival at 2/3 neighbors).'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(240, 254), match='contains a bug'>}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the next generation based on survival rules (2-3 neighbors) and birth rules (3 or 6 neighbors), though it contains a bug in the neighbor calculation logic.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation based on survival (2-3 neighbors) and birth (3 or 6 neighbors) rules.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of alive cells in the HighLife cellular automaton by applying survival and birth rules to the input set of alive cells.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function calculates the next generation of a cellular automaton called 'HighLife' (rule B36/S23) by determining which cells survive and which new cells are born based on the number of their live neighbors."

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function takes a set of currently alive cells and returns the set of alive cells for the next generation according to the HighLife cellular automaton rules, where cells survive with 2 or 3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of cells for the HighLife cellular automaton by applying survival and birth rules to the current alive cells and their neighbors.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(185, 199), match='contains a bug'>}
  summary="This function implements the HighLife cellular automaton rules (a variant of Conway's Game of Life where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors), though it contains a bug in the neighbor calculation logic."

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of a HighLife cellular automaton by applying survival rules (2-3 neighbors) and birth rules (3 or 6 neighbors) to a set of currently alive cells on a 2D grid.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function simulates one iteration of the HighLife cellular automaton, evolving the grid by applying survival and birth rules to the current live cells and their neighbors.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the next generation of cells in a cellular automaton known as "HighLife", where a cell is born if it has 3 or 6 neighbors and an existing cell survives if it has 2 or 3 neighbors.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(190, 219), match='but may not exactly replicate'>}
  summary='The `highlife` function computes the next generation of a HighLife cellular automaton from a set of alive cell coordinates, using custom rules for cell survival and birth that closely match—but may not exactly replicate—the standard HighLife automaton due to nuances in neighbor calculation.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of cells in the HighLife automaton by applying survival rules to live cells and birth rules to their neighbors, returning a set of alive cells for the subsequent state.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(185, 198), match='contains bugs'>}
  summary="This function implements the HighLife cellular automaton rules (a variant of Conway's Game of Life where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors), though it contains bugs in the neighbor calculation logic."

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements one generation step of the HighLife cellular automaton (a variant of Conway's Game of Life with birth rule B36/S23), taking a set of alive cell coordinates and returning the next generation's alive cells."

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of cells in the HighLife automaton by applying survival/birth rules based on neighbor counts (2/3 for survival, 3/6 for birth).'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the next generation of a cellular automaton known as "HighLife" by applying its specific survival (2 or 3 neighbors) and birth (3 or 6 neighbors) rules to a given set of living cells.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of live cells in the HighLife cellular automaton, applying the rules for cell survival and birth based on neighbor counts.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='Computes the next state of a HighLife cellular automaton by applying survival/birth rules to living/dead cells and their neighbors.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements the HighLife cellular automaton rules, taking a set of alive cell coordinates and returning the next generation where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation based on survival rules (2-3 neighbors) and birth rules (3 or 6 neighbors).'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of cells in the HighLife automaton by applying survival/birth rules to the current alive cells and their neighbors.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates the next generation of a cellular automaton called "HighLife" by iterating through currently living cells to determine which will survive (with 2 or 3 neighbors) and which of their dead neighbors will be born (with 3 or 6 neighbors).'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of alive cells in the HighLife cellular automaton, based on the standard survival and birth rules.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function `highlife` evolves a set of alive cells to the next generation under HighLife rules, where cells survive with 2–3 neighbors and dead cells revive with exactly 3 or 6 neighbors.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(145, 158), match='contains bugs'>}
  summary="This function implements the HighLife cellular automaton (a variant of Conway's Game of Life where cells can also be born with 6 neighbors), but contains bugs in the neighbor calculation that will cause incorrect behavior."

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation based on rules where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of cells in the HighLife cellular automaton by applying survival rules (live cells with 2 or 3 neighbors survive) and birth rules (dead cells with 3 or 6 neighbors become alive).'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function calculates the next generation of a cellular automaton known as 'HighLife' by applying its specific birth (B36) and survival (S23) rules to a given set of currently alive cells."

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The `highlife` function simulates one step of the HighLife cellular automaton, returning the next set of live cells based on the classic "B36/S23" rules.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function evolves cellular automaton states using HighLife rules, where live cells survive with 2–3 neighbors and dead cells are born with exactly 3 or 6 neighbors.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements the HighLife cellular automaton, a variant of Conway's Game of Life where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, returning the next generation of alive cells."

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, returning the set of alive cells in the next generation based on survival rules (2-3 neighbors) and birth rules (3 or 6 neighbors).'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next state of a HighLife cellular automaton by applying survival and birth rules to the current alive cells and their neighbors.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function simulates one generational step of the 'HighLife' cellular automaton, a variant of Conway's Game of Life, by applying its unique birth rule (B36) and standard survival rule (S23) to a given set of alive cells."

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function takes a set of live cells on a grid and returns the set representing the next generation according to the "Highlife" cellular automaton rule, where cells survive with 2 or 3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of cells in the HighLife cellular automaton by applying survival rules (2–3 neighbors) and birth rules (3 or 6 neighbors) to the current live cells and their neighbors.'
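For reference when reading the summaries above, the rule they describe (HighLife, B36/S23) can be sketched in its plain textbook form. This is only an illustrative baseline, not the deliberately overcomplicated snippet that the models were shown; the name `highlife_step` and the use of `collections.Counter` are assumptions made for this sketch.

```python
from collections import Counter


def highlife_step(alive):
    """Return the next generation of HighLife (B36/S23) on an unbounded grid.

    `alive` is a set of (x, y) tuples. Hypothetical reference version,
    not the implementation evaluated in the experiments.
    """
    # Count, for every cell adjacent to a live cell, how many live
    # neighbors it has in the 8-cell Moore neighborhood.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in alive
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, n in counts.items()
        if (cell in alive and n in (2, 3))        # survival: S23
        or (cell not in alive and n in (3, 6))    # birth: B36
    }
```

A summary counts as accurate if it describes this behavior (survival with 2 or 3 neighbors, birth with 3 or 6) without claiming that the neighbor calculation is buggy.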

Meaningless Names¶
In [34]:
highlife_meaningless_names_results = explain_code(
    "hl-meaningless-names",
    highlife_meaningless_names,
    highlife_valid_re,
    highlife_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(179, 209), match='asymmetric 8-cell neighborhood'>}
  summary='This function implements a cellular automaton that evolves a set of coordinates by keeping cells with 2-3 neighbors and adding empty cells with exactly 3 or 6 neighbors, using an asymmetric 8-cell neighborhood pattern.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a modified cellular automaton that evolves a set of 2D coordinates to the next generation, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function implements a cellular automaton rule that transforms an input set of coordinates by keeping points with 2-3 active neighbors and adding inactive points with exactly 3 or 6 active neighbors, returning the result as a unique set.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the next generation of a cellular automaton by applying custom rules to an input set of live cell coordinates: a live cell survives if it has 2 or 3 neighbors, and a dead cell is born if it has exactly 3 or 6 neighbors.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function computes the next generation of alive cells for a two-dimensional cellular automaton similar to Conway's Game of Life, except new cells are born with either 3 or 6 neighbors."

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function models a cellular automaton where live cells survive with 2–3 neighbors, and dead cells birth with exactly 3 or 6 neighbors, returning coordinates of the next generation.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements one iteration of a modified Conway's Game of Life cellular automaton where live cells survive with 2-3 neighbors and dead cells become alive with exactly 3 or 6 neighbors."

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a cellular automaton that evolves a set of coordinates by keeping cells with 2-3 neighbors alive and birthing new cells at empty positions with exactly 3 or 6 neighbors.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function `f(a)` implements a cellular automaton rule that returns a set of coordinates from the input `a` and their neighbors, filtered by conditions on the number of adjacent live cells.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function simulates a single generation of the "HighLife" (B36/S23) variant of Conway\'s Game of Life, taking a set of live cell coordinates and returning the set of live cells for the next generation.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the "Highlife" (B36/S23) cellular automaton, returning the set of live cells for the next generation.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of a cellular automaton where live cells survive with 2–3 neighbors and dead cells birth with 3 or 6 neighbors, returning the resulting set of live cells.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements one generation step of a modified cellular automaton (similar to Conway's Game of Life) where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, taking a set of coordinates as input and returning the next generation as a set of coordinates."

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements a variant of Conway's Game of Life cellular automaton where live cells with 2-3 neighbors survive and dead cells with 3 or 6 neighbors become alive, returning the next generation as a set of coordinates."

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a cellular automaton rule that evolves a set of 2D coordinates by applying custom survival and birth conditions based on neighbor counts.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function simulates a single generation of a custom version of Conway's Game of Life, where a living cell survives if it has 2 or 3 neighbors, and a dead cell is born if it has exactly 3 or 6 neighbors."

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one step of a 2D cellular automaton similar to Conway’s Game of Life, where a cell survives with 2 or 3 neighbors, and a new cell is born with exactly 3 or 6 neighbors.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(190, 213), match='asymmetric neighborhood'>}
  summary='The function computes the next generation of a cellular automaton where cells survive with 2–3 alive neighbors and dead cells become alive with exactly 3 or 6 alive neighbors, using a fixed asymmetric neighborhood pattern.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements a modified version of Conway's Game of Life that computes the next generation of cells, where live cells with 2-3 neighbors survive and dead cells with 3 or 6 neighbors become alive."

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements a variant of Conway's Game of Life that takes a set of alive cell coordinates and returns the next generation, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors."

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function `f(a)` takes a set of coordinates and returns a new set containing coordinates that either meet a neighbor-count condition or are inactive neighbors of active coordinates that meet another neighbor-count condition.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the next generation of a cellular automaton based on a set of live cells, where existing cells with 2 or 3 neighbors survive, and dead cells with exactly 3 or 6 neighbors are born.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of live cells for a 2D cellular automaton similar to Conway’s Game of Life, but with a custom birth rule that allows dead cells with exactly 3 or 6 live neighbors to become alive.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function updates a set of grid coordinates by applying a custom cellular automaton rule where cells survive with 2–3 neighbors and dead cells are born with 3 or 6 neighbors.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements one generation step of a cellular automaton (likely a variant of Conway's Game of Life) on a 2D grid, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors."

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a modified cellular automaton that evolves a set of alive cells to the next generation, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(128, 154), match='retaining some coordinates'>}
  summary='The function `f(a)` implements a cellular automaton-like rule that transforms an input set of coordinates `a` into a new set by retaining some coordinates and adding neighbors based on specific conditions involving their neighbor counts.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function calculates the next generation of a cellular automaton based on the B36/S23 ruleset (a cell is born if it's empty and has 3 or 6 neighbors; it survives if it's alive and has 2 or 3 neighbors) by processing an input set of live cell coordinates."

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(186, 212), match='custom neighbor definition'>}
  summary='This function takes a set of (2D) cell coordinates and returns the next generation according to a Game-of-Life-like rule, but with additional birth for cells with 6 live neighbors and a custom neighbor definition.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function evolves a cellular automaton state by preserving live cells with 2–3 neighbors and activating dead cells with exactly 3 or 6 neighbors.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements a modified version of Conway's Game of Life that takes a set of live cell coordinates and returns the next generation where live cells with 2-3 neighbors survive and dead cells with exactly 3 or 6 neighbors become alive."

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a cellular automaton that takes a set of "alive" cell coordinates and returns the next generation where cells survive with 2-3 neighbors and dead cells come alive with 3 or 6 neighbors.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=None invalid={'suspects_bug': None}
  summary='The function `f(a)` takes a set of coordinates `a` and returns a new set of coordinates consisting of (1) coordinates in `a` where roughly half their neighbors are in `a`, and (2) neighboring coordinates not in `a` that have exactly 6 or 3 neighbors in `a`.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates the next generation of a "HighLife" (B36/S23) cellular automaton, a specific variant of Conway\'s Game of Life.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function takes a set of coordinates representing "active" cells and returns the next generation of active cells according to custom cellular automata rules that resemble, but do not perfectly match, Conway\'s Game of Life.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of a cellular automaton where cells survive with 2–3 neighbors and dead cells become alive with exactly 3 or 6 neighbors, using Moore neighborhood rules.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a cellular automaton that evolves a set of coordinates by keeping cells with 2-3 neighbors and birthing new cells in empty spaces that have exactly 3 or 6 neighbors.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements a modified version of Conway's Game of Life that computes the next generation of cells, where alive cells survive with 2-3 neighbors and dead cells become alive with either 3 or 6 neighbors."

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function implements a cellular automaton rule that transforms a set of grid coordinates by selectively keeping some original cells and adding neighboring cells based on their neighbor counts.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function computes the next generation of cells in a variant of Conway's Game of Life where a cell is born if it has 3 or 6 neighbors and survives if it has 2 or 3 neighbors."

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function takes a set of points, applies custom neighbor-counting rules akin to a two-dimensional cellular automaton, and returns the next generation of points according to those rules.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next state of a custom cellular automaton where cells survive with 2–3 neighbors, and dead cells revive with exactly 3 or 6 neighbors.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(205, 235), match='asymmetric 8-cell neighborhood'>}
  summary='This function implements a cellular automaton that evolves a set of 2D coordinates by keeping cells with 2-3 neighbors alive and birthing new cells that have exactly 3 or 6 neighbors, using a non-standard asymmetric 8-cell neighborhood pattern.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a modified cellular automaton where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, taking a set of 2D coordinates as input and returning the next generation as output.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function implements a cellular automaton that takes a set of coordinates and returns a new set where each coordinate is either preserved or spawns neighboring coordinates based on specific neighbor-count rules.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function takes a set of 'live' cell coordinates and calculates the next generation in a cellular automaton system (similar to Conway's Game of Life) where a cell survives if it has 2 or 3 neighbors and a new cell is born if it is empty and has 3 or 6 neighbors."

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of a 2D cellular automaton with custom birth and survival rules, analogous to Conway’s Game of Life but using 8-neighbor cells, birth with 3 or 6 neighbors, and survival with 2 or 3 neighbors.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next state in a grid-based cellular automaton where cells survive with 2–3 neighbors and new cells are born with exactly 3 or 6 neighbors.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a cellular automaton rule that evolves a set of 2D coordinates by keeping cells with 2-3 neighbors and creating new cells in empty positions that have exactly 3 or 6 neighbors.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements a cellular automaton that evolves a set of 2D coordinates according to modified Conway's Game of Life rules, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors."

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function `f(a)` simulates a custom cellular automaton rule on a set of 2D coordinates `a`, returning a new set of coordinates where cells survive or are born based on the number of their neighbors in `a`.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function simulates one generation of a cellular automaton where a cell survives if it has 2 or 3 live neighbors, and a new cell is born in an empty space if it has exactly 3 or 6 live neighbors.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function takes a set of (x, y) coordinates (live cells), and computes the next generation in a variant of Conway's Game of Life using the standard 8 neighbors, where live cells survive with 2 or 3 neighbors and new cells are born with 3 or 6 neighbors."

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of a cellular automaton where cells survive with 2–3 neighbors and dead cells become active with 3 or 6 neighbors.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a cellular automaton rule on a 2D grid where cells with exactly 2 neighbors survive and empty cells with exactly 3 or 6 neighbors become alive in the next generation.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements a modified Conway's Game of Life that takes a set of alive cell coordinates and returns the next generation where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors."

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function implements a cellular automaton that evolves an input set of grid coordinates by keeping cells with 2-3 neighbors and adding new cells in empty spaces with exactly 3 or 6 neighbors.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function simulates one generation of a cellular automaton, a variant of Conway's Game of Life, where existing cells survive with 2 or 3 neighbors, and new cells are born in empty spaces with 3 or 6 neighbors."

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function `f(a)` takes a set of coordinate tuples representing "live" cells and returns the next generation according to a variant of the Game of Life rules, where cells survive with 2 or 3 neighbors and dead cells are born with 3 or 6 neighbors.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function applies a modified Game of Life rule where cells survive with 2–3 neighbors and dead cells revive with 3 or 6 neighbors.'

Ignore Names¶
In [35]:
highlife_ignore_names_results = explain_code(
    "hl-ignore-names",
    highlife_meaningful_names,
    highlife_valid_re,
    highlife_invalid_ptrns,
    repeats=REPEATS,
    temperature=TEMPERATURE,
    extra_instructions=ignore_names,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(220, 234), match='contains a bug'>}
  summary='This function implements a cellular automaton (similar to Conway\'s Game of Life variant "HighLife") that evolves a set of alive cells based on survival rules (2-3 neighbors) and birth rules (3 or 6 neighbors), though it contains a bug in the neighbor coordinate calculation.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements one generation step of the HighLife cellular automaton (a variant of Conway's Game of Life), where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors."

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(77, 96), match='7-cell neighborhood'>}
  summary='This function implements a variant of the HighLife cellular automaton with a 7-cell neighborhood, where cells survive with 2-3 neighbors and new cells are born with exactly 3 or 6 neighbors.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function calculates the next generation of a cellular automaton known as 'HighLife' by applying its specific survival (2 or 3 neighbors) and birth (3 or 6 neighbors) rules to a given set of living cells."

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function takes a set of live cell positions and returns the next generation according to the "HighLife" cellular automaton rules, where a cell survives with 2 or 3 neighbors and is born with 3 or 6 neighbors.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of cells in HighLife, a Game of Life variant where dead cells become alive with exactly 3 or 6 neighbors, while live cells survive with 2 or 3 neighbors.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(256, 282), match='coordinate calculation bug'>}
  summary="This function implements a cellular automaton (specifically the HighLife variant of Conway's Game of Life) that evolves a set of alive cells by one generation, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, though it contains a coordinate calculation bug."

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, transforming a set of alive cell coordinates into the next generation based on survival rules (2-3 neighbors) and birth rules (3 or 6 neighbors).'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(64, 86), match='hexagonal neighborhood'>}
  summary='This implements the HighLife cellular automaton variant using a hexagonal neighborhood, where cells survive with 2-3 neighbors and are born with exactly 3 or 6 neighbors.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function computes the next generation for a cellular automaton known as 'HighLife' by applying its specific survival (2 or 3 neighbors) and birth (3 or 6 neighbors) rules to a given set of alive cells."

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements a single update step of the "HighLife" cellular automaton, returning the next generation of live cell coordinates by applying specific birth and survival rules to a set of current live cell positions.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of cells for the HighLife automaton by applying survival rules to live cells and birth rules to their dead neighbors, returning a deduplicated set of results.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(207, 238), match='bug in the neighbor calculation'>}
  summary='This function implements the HighLife cellular automaton, computing the next generation of alive cells based on survival rules (2-3 neighbors) and birth rules (3 or 6 neighbors), though it contains a likely bug in the neighbor calculation.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation based on rules where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements the HighLife cellular automaton rules, computing the next generation of cells where cells survive with 2-3 neighbors and new cells are born with exactly 3 or 6 neighbors.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next state of a 2D cellular automaton by taking a set of currently alive cell coordinates and returning a new set where cells survive with 2 or 3 neighbors and new cells are born in empty locations with 3 or 6 neighbors.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of living cells in the "HighLife" cellular automaton (a variant of Conway\'s Game of Life, where dead cells with 3 or 6 live neighbors are born, and live cells with 2 or 3 survive), by evaluating a grid of cell coordinates and applying these rules.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function evolves the input set of alive cells to the next generation using HighLife rules, where cells survive with 2–3 neighbors and dead cells are born with 3 or 6 neighbors, considering only cells adjacent to live ones for efficiency.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements the HighLife cellular automaton, a variant of Conway's Game of Life where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, returning the next generation of alive cells."

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, where living cells with 2-3 neighbors survive and dead cells with 3 or 6 neighbors become alive.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(97, 116), match='7-cell neighborhood'>}
  summary='This function implements a variant of the HighLife cellular automaton with B36/S23 rules using a 7-cell neighborhood pattern, processing alive cells to generate the next generation.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the next generation of a 2D cellular automaton where a living cell survives if it has 2 or 3 neighbors, and a dead cell becomes alive if it has exactly 3 or 6 neighbors.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one step of the "HighLife" cellular automaton (rule B36/S23), producing the next generation of live cells by applying custom survival and birth rules based on the number of live neighbors for each cell.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function `highlife` computes the next generation of cells for a HighLife automaton by applying survival/birth rules to live cells and their neighbors, using efficient neighbor counting and set operations.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements one generation step of the HighLife cellular automaton (a variant of Conway's Game of Life), where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors."

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(146, 172), match='coordinate calculation bug'>}
  summary='This function implements one generation step of the HighLife cellular automaton (a Game of Life variant with B36/S23 rules), though it contains a coordinate calculation bug in the neighbors function.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(147, 174), match='modified Moore neighborhood'>}
  summary='This function implements the HighLife cellular automaton rules, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, using a modified Moore neighborhood.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function calculates the next generation of a cellular automaton known as HighLife (rules B36/S23) by applying its specific survival and birth rules to a given set of living cell coordinates.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function takes a set of 2D coordinates representing live cells and returns the set of live cells in the next generation according to the "HighLife" cellular automaton rules (alive survives with 2 or 3 neighbors, new cells are born with 3 or 6 neighbors).'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next state of a HighLife automaton by applying survival rules (2–3 neighbors) and birth rules (3 or 6 neighbors) to the current live cells and their immediate neighbors.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements the HighLife cellular automaton (a variant of Conway's Game of Life with birth rule B36/survival rule S23), taking a set of alive cell coordinates and returning the next generation's alive cells."

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(169, 188), match='7-neighbor topology'>}
  summary='This implements a variant of the HighLife cellular automaton with custom neighbor rules, where cells survive with 2-3 neighbors, are born with 3 or 6 neighbors, using a 7-neighbor topology that excludes the top neighbor.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary="The function calculates the next generation of a cellular automaton known as 'HighLife' by applying survival rules (an active cell survives with 2 or 3 neighbors) and birth rules (a new cell is born into an empty space with 3 or 6 neighbors) to a given set of active cell coordinates."

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of a 2D grid following the "HighLife" cellular automaton rules (like Conway’s Game of Life, but dead cells with 6 neighbors are also born) given a set of live cell positions.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function evolves a cellular grid by applying Highlife rules (survival with 2–3 neighbors, birth with 3 or 6 neighbors) to generate the next state from the current live cells.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(174, 200), match='coordinate calculation bug'>}
  summary='This function implements a cellular automaton (HighLife variant) that evolves a set of alive cells to the next generation based on neighbor count rules, though it contains a coordinate calculation bug that would cause incorrect behavior.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(91, 114), match='contains a critical bug'>}
  summary='This function attempts to compute the next generation of a HighLife cellular automaton but contains a critical bug in neighbor coordinate calculation that prevents it from functioning correctly.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='This implements the HighLife cellular automaton rules, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, using an 8-cell neighborhood.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the next generation of a "HighLife" cellular automaton by applying its specific survival (2 or 3 neighbors) and birth (3 or 6 neighbors) rules to a given set of active cells.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function takes a set of living cells in a 2D grid and computes the next generation according to the "HighLife" cellular automaton rule, where cells survive with 2 or 3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next state of cells in HighLife automaton, where cells survive with 2–3 neighbors and dead cells revive with exactly 3 or 6 neighbors.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation based on B36/S23 rules.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary="This function implements the HighLife cellular automaton, taking a set of alive cell coordinates and returning the next generation's alive cells based on survival (2-3 neighbors) and birth (3 or 6 neighbors) rules."

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(62, 84), match='hexagonal neighborhood'>}
  summary='This function implements the HighLife cellular automaton with hexagonal neighborhood rules, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the next generation of cells in a 2D cellular automaton known as HighLife by applying its specific rules for cell survival (2 or 3 neighbors) and birth (3 or 6 neighbors) to a given set of living cells.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes the next generation of live cells in the "HighLife" cellular automaton, where cells survive with 2 or 3 neighbors and new cells are born with 3 or 6 neighbors.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of cells under HighLife rules, where survival requires 2–3 neighbors and birth requires 3 or 6 neighbors, using neighbor-counting and state transition logic.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation of the HighLife cellular automaton, where living cells with 2-3 neighbors survive and dead cells with 3 or 6 neighbors are born.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements the HighLife cellular automaton, computing the next generation of alive cells based on survival rules (2-3 neighbors) and birth rules (3 or 6 neighbors).'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(147, 166), match='6-cell neighborhood'>}
  summary='This implements the HighLife cellular automaton rules, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, using an unusual 6-cell neighborhood pattern instead of the standard 8-cell Moore neighborhood.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the next generation for the "HighLife" cellular automaton, a variant of Conway\'s Game of Life where a cell is born if it has 3 or 6 neighbors and survives if it has 2 or 3.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function computes one evolution step of the "HighLife" 2D cellular automaton, where living cells survive with 2 or 3 neighbors and dead cells are born with 3 or 6 neighbors, returning the set of alive cells for the next generation.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function `highlife` evolves a cellular automaton state using HighLife rules (B36/S23) to compute the next generation of live cells.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, returning the new set of alive cells.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='This function implements one generation step of the HighLife cellular automaton, taking a set of alive cell coordinates and returning the set of cells that will be alive in the next generation based on neighbor count rules.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary="This implements the HighLife cellular automaton (a variant of Conway's Game of Life) where cells survive with 2-3 neighbors and are born with 3 or 6 neighbors, using an 8-cell neighborhood."

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function calculates the next generation of a "HighLife" cellular automaton by applying its specific survival (2 or 3 neighbors) and birth (3 or 6 neighbors) rules to a given set of living cells.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'suspects_bug': <re.Match object; span=(241, 286), match='not exactly, the full set of 8 adjacent cells'>}
  summary='This function takes a set of 2D cell coordinates and simulates one generation of a Game-of-Life-like cellular automaton where cells survive with 2 or 3 neighbors, new cells are born with 3 or 6 neighbors, and the neighborhood is nearly, but not exactly, the full set of 8 adjacent cells in a grid.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'suspects_bug': None}
  summary='The function computes the next generation of cells in HighLife, where survival requires 2–3 neighbors and birth requires 3 or 6 neighbors, using neighbor checks and set operations.'

Results¶
In [36]:
plot_results(
    "Highlife",
    highlife_meaningful_names_results,
    highlife_meaningless_names_results,
    highlife_ignore_names_results,
    include_lengths=True,
)

$\pi$ Approximation¶

The following algorithm checks evenly spaced grid points within the unit square between the origin and the $(x, y) = (1, 1)$ point, and counts how many of them fall within the quarter of the unit circle which overlaps this unit square.

Using the point counts as estimators for the area of the unit square (which is $1$) and for the area of the quarter of the unit circle (which is $\frac{1}{4}r^2\pi = \frac{1}{4}\pi$), the ratio between the two areas, and therefore $\pi$ itself, can be approximated. The resolution of the grid is doubled in each iteration until the difference between successive approximations becomes smaller than the specified threshold.
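
Written out as a formula, the estimator reads as follows. The symbols $N_{\text{inside}}$ (grid points inside the quarter circle) and $N_{\text{total}}$ (all grid points in the unit square) are notation introduced here for illustration only:

\begin{align*} \frac{N_{\text{inside}}}{N_{\text{total}}} \approx \frac{\frac{1}{4}\pi}{1} \quad \Longrightarrow \quad \pi \approx 4 \cdot \frac{N_{\text{inside}}}{N_{\text{total}}} \end{align*}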

In [37]:
def approximate_pi_basic(target_precision, previous_pi_estimate=0.0, grid_resolution=10):
    grid_spacing = 1.0 / (grid_resolution - 1)
    points_within_unit_circle = 0

    for col_idx in range(grid_resolution):
        x = grid_spacing * col_idx
        y_squared_limit = 1.0 - x * x

        for row_idx in range(grid_resolution):
            y = grid_spacing * row_idx

            if y * y < y_squared_limit:
                points_within_unit_circle += 1

    current_pi_estimate = 4.0 * points_within_unit_circle / (grid_resolution * grid_resolution)

    if abs(current_pi_estimate - previous_pi_estimate) < target_precision:
        return current_pi_estimate

    return approximate_pi_basic(target_precision, current_pi_estimate, 2 * grid_resolution)


def test_approximate_pi(func):
    approx_pi = func(0.003)
    assert abs(approx_pi - math.pi) < 0.003, f"Approximation of pi too far from reality; {approx_pi=}"


test_approximate_pi(approximate_pi_basic)

Complications:

  1. Instead of exploiting the symmetries of the unit circle and the 4 unit squares around the origin that overlap with it, the algorithm will stretch out the grid across the entire circle and all 4 squares.

  2. Furthermore, the scaling by $4$ will be obscured by exploiting the following limit:

    \begin{align*} \lim_{x \to \infty} \frac{x^2}{(2x-1)^2} = \lim_{x \to \infty} \frac{x^2}{4x^2 - 4x + 1} = \frac{1}{4} \end{align*}
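
The limit can be checked numerically with a minimal sketch. The helper name `hidden_scale` is hypothetical and is not part of the experiments. With a resolution of $n$, the full grid has $(2n-1)^2$ points, of which roughly a $\frac{\pi}{4}$ fraction falls inside the circle. Dividing the hit count by $n^2$ instead of $(2n-1)^2$ therefore multiplies the estimate by a factor that tends to $4$.

```python
def hidden_scale(n):
    """Ratio of n^2 to the point count of the (2n - 1) x (2n - 1) grid."""
    return (n * n) / ((2 * n - 1) ** 2)


# The ratio approaches 1/4 as the resolution n grows.
for n in (10, 100, 1_000, 10_000):
    print(n, hidden_scale(n))
```

At low resolutions the ratio is noticeably above $\frac{1}{4}$, which is one reason why the estimate needs several doublings of the resolution before it settles.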

In [38]:
def approximate_pi(target_precision, previous_pi_estimate=0.0, grid_resolution=10):
    grid_size = 2 * grid_resolution - 1
    grid_spacing = 1.0 / (grid_resolution - 1)
    points_within_unit_circle = 0

    for col_idx in range(grid_size):
        x = grid_spacing * col_idx - 1.0
        y_squared_limit = 1.0 - x * x

        for row_idx in range(grid_size):
            y = grid_spacing * row_idx - 1.0

            if y * y < y_squared_limit:
                points_within_unit_circle += 1

    current_pi_estimate = points_within_unit_circle / (grid_resolution * grid_resolution)

    if abs(current_pi_estimate - previous_pi_estimate) < target_precision:
        return current_pi_estimate

    return approximate_pi(target_precision, current_pi_estimate, 2 * grid_resolution)


def f(a, b=0.0, c=10):
    d = 2 * c - 1
    e = 1.0 / (c - 1)
    g = 0

    for h in range(d):
        i = e * h - 1.0
        j = 1.0 - i * i

        for k in range(d):
            l = e * k - 1.0

            if l * l < j:
                g += 1

    m = g / (c * c)

    if abs(m - b) < a:
        return m

    return f(a, m, 2 * c)


def check_pi_approximation_explanation(response, eval_experiment_filename):
    system_prompt = "Please act as an unbiased, impartial judge and an expert in mathematics and programming."
    user_prompt = """\
Several AI assistants were given a task of deciphering and explaining a piece of obfuscated, \
highly mathematical Python code, and providing a brief summary of their findings. The \
explanations the assistants gave can be classified into the following three categories:

1. **Approximates pi**: the core function in the code is supposed to approximate the \
mathematical constant $\\pi$, and it does exactly that. (However, it may not be as efficient \
and fast to converge as conventional approximation methods.)

2. **Approximates pi/4**: the core function in the code is supposed to approximate $\\frac{\\pi}{4}$, \
and it does exactly that. (However, it may not be as efficient and fast to converge as conventional \
approximation methods, and its name may be unclear about whether it is approximating $\\pi$ or just \
a quarter of it.)

3. **Buggy code**: the core function in the code appears to contain a critical flaw in its \
calculation. For example, the code is supposed to calculate $\\pi$ but it actually approximates \
$\\frac{\\pi}{4}$, or vice versa, or generally speaking, it's not doing what it's supposed to do \
due to any kind of programming mistakes.

Notice the subtle nuances here:
- An assistant may think initially that the code is approximating pi/4, but then come to think \
that there's actually a scaling factor hidden implicitly in the final calculation which will \
ultimately yield an approximation of pi; in this case, the answer should be classified as class 1.
- However, if an assistant goes with the pi/4 assumption the whole time without explicitly \
mentioning that the code may ultimately be compensating for the scaling, then that answer should \
be classified as class 2, even if the final summary does not mention the 1/4 scaling explicitly.
- Class 3 is for any answer where the assistant suspects that the lack or presence of any kind of \
scaling, hidden or explicit, is probably a mistake by the programmer. Also, if an answer claims \
the presence of any kind of unintentional behaviors or bugs (e.g. wrong formulas, numerical \
stability problems, or anything which would prevent the code from approximating either pi or pi/4, \
regardless of the speed of convergence), then it should be classified as 3.
- However, pointing out slow convergence or the unconventional nature of the algorithm should not \
be considered as reasons for a class 3 classification, because these aspects were not important \
in the experiments. Therefore, even if an assistant frames these problems as critical flaws, if \
there are no other unintentional behaviors mentioned, then class 1 or class 2 may be a better fit.

In short, what matters is the expected output and the assumed intentionality of it according to \
the assistants. (It doesn't even matter if an assistant itself makes any mathematical mistakes or \
if it produces any hallucinations. It is also irrelevant whether the code in question really \
works the way the assistants describe it, therefore it will not be provided here.)

One of the answers from the AI assistants will be provided below. Please read it carefully, and \
briefly explain how well you think it fits each category and why. After that, please give \
your final classification as a single number in the following format, without any additional \
formatting (N can be either 1, 2, or 3):

Class: N

Here comes the assistant's answer:
"""
    user_prompt += f"\n\n--- BEGIN ANSWER ---\n{response}\n--- END ANSWER ---"

    judgement, thoughts = query_gpt4(eval_experiment_filename, system_prompt, user_prompt, temperature=0.0)

    eval_dir = os.path.join("data", os.path.dirname(eval_experiment_filename))
    eval_txt_filename = os.path.join("data", eval_experiment_filename) + ".txt"
    os.makedirs(eval_dir, exist_ok=True)

    with open(eval_txt_filename, "w") as f:
        print("System prompt:", file=f)
        print(system_prompt, file=f)
        print("", file=f)
        print("User prompt:", file=f)
        print(user_prompt, file=f)
        print("", file=f)
        print("Evaluation:", file=f)
        print(judgement, file=f)

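    # Remember the last line that matches, so that blank lines or remarks
    # after the final "Class: N" verdict do not discard it.
    match = None

    for line in judgement.replace("\r", "\n").split("\n"):
        line_match = re.search(r"^[*#: ]*Class[*: ]*([123])[* ]*$", line, re.IGNORECASE)

        if line_match:
            match = line_match

    if not match:
        raise Exception(f"{eval_experiment_filename} FAIL")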

    cls = match[1]

    if cls == "1":
        return True, {"pi_quarter": False, "suspects_bug": False}

    if cls == "2":
        return False, {"pi_quarter": True, "suspects_bug": False}

    return False, {"pi_quarter": False, "suspects_bug": True}


test_approximate_pi(approximate_pi)
test_approximate_pi(f)

pi_valid_re = re.compile(
    r"π|(\\pi)|([^a-z0-9]pi[^a-z0-9])|(the area of the unit circle)",
    re.IGNORECASE,
)

pi_invalid_ptrns = {
    "pi_quarter": re.compile(
        (
            r"((π|pi) */ *[4⁴])"
            r"|(\\frac\{ *\\*pi *\}\{ *4 *\})"
            r"|(actually, ((one)|1|a) quarter of it)"
            r"|(one-fourth of (π|pi))"
            r"|(miss(es a|ing) factor of 4)"
            r"|(should be multiplied by 4)"
            r"|(omits*( the (critical|typical))? (4x )?scaling)"
            r"|(omits the necessary scaling factor of 4)"
        ),
        re.IGNORECASE,
    ),
    "suspects_bug": re.compile(
        (
            r"(incorrect denominator)"
            r"|((denominator|contains a) bug)"
            r"|((error|bug)s* (that|which) prevents*)"
            r"|((mathematical|calculation|implementation) (error|bug)s*)"
            r"|(bug in (the|its) (calculation|approximation|estimation))"
            r"|(flawed normalization)"
            r"|(missing scaling step)"
            r"|(estimation formula appears to be incorrect)"
        ),
        re.IGNORECASE,
    ),
}

pi_meaningful_names = inspect.getsource(approximate_pi)
pi_meaningless_names = inspect.getsource(f)
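
To make the keyword patterns above more concrete, the following minimal sketch applies a subset of them to three sample summaries. The helper `keyword_flags` and the sample sentences are hypothetical and exist only for illustration; how `explain_code` actually combines these patterns with the judge's classification is defined elsewhere in the notebook and is not reproduced here.

```python
import re

# A subset of the alternatives from the patterns above, copied for brevity.
pi_valid_re = re.compile(
    r"π|(\\pi)|([^a-z0-9]pi[^a-z0-9])|(the area of the unit circle)",
    re.IGNORECASE,
)
pi_quarter_re = re.compile(
    r"((π|pi) */ *[4⁴])|(one-fourth of (π|pi))|(should be multiplied by 4)",
    re.IGNORECASE,
)
suspects_bug_re = re.compile(
    r"((mathematical|calculation|implementation) (error|bug)s*)|(missing scaling step)",
    re.IGNORECASE,
)


def keyword_flags(summary):
    """Hypothetical helper: report which pattern groups occur in a summary."""
    return {
        "mentions_pi": bool(pi_valid_re.search(summary)),
        "pi_quarter": bool(pi_quarter_re.search(summary)),
        "suspects_bug": bool(suspects_bug_re.search(summary)),
    }


# Made-up sentences in the style of the summaries printed below.
samples = [
    "This function recursively estimates π/4 by counting grid points.",
    "It approximates π, though it contains a calculation error in the formula.",
    "This function approximates pi on progressively finer grids.",
]

for sample in samples:
    print(keyword_flags(sample))
```

The first sample should trip only the `pi_quarter` group, the second only `suspects_bug`, and the third neither, while all three mention pi in some spelling.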
Meaningful Names¶
In [39]:
pi_meaningful_names_results = explain_code(
    "pi-meaningful-names",
    pi_meaningful_names,
    check_pi_approximation_explanation,
    {"pi_quarter": None, "suspects_bug": None},
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points inside a unit circle, doubling the grid resolution until the estimate converges within the specified precision threshold.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points within a unit circle and recursively refining the grid resolution until a target precision is achieved, though it contains a calculation error in the pi estimation formula.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting points within a unit circle on progressively finer grids until the estimate reaches the target precision.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of π by iteratively creating a finer grid of points and calculating the ratio of those points that fall within a unit circle until the result converges to a specified precision.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively estimates the value of pi by counting the fraction of evenly spaced grid points that fall inside the unit circle, refining the estimate until the result changes by less than the requested precision.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='The function recursively refines a grid to estimate π by counting points inside the unit circle, but its implementation contains mathematical errors that prevent accurate results.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points inside a unit circle, doubling the grid resolution until consecutive estimates converge within the specified target precision.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points inside a unit circle, doubling the grid resolution until reaching a target precision, though it contains calculation errors that prevent it from computing the correct value of π.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting points within a unit circle on a grid, refining the estimate by doubling the grid resolution until the desired precision is achieved.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates the value of π by overlaying an increasingly dense grid of points on a square and calculating the ratio of points that fall within its inscribed unit circle until the estimate converges to a target precision.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates the value of π/4 (not π itself) by counting grid points inside the unit circle, increasing grid resolution until the change in estimates is below a desired precision.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='The function attempts to approximate π via recursive grid refinement over a unit circle, but contains mathematical errors in its estimation formula that prevent accurate results.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points inside a unit circle and refining the grid resolution until reaching a target precision, though it contains a mathematical error in its π calculation formula.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points inside a unit circle and recursively refining the grid resolution until reaching a target precision, but contains calculation errors that prevent it from correctly computing π.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting points within a unit circle on a grid and refining the grid until the estimate reaches the desired precision.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of Pi by counting points within a unit circle on a grid that doubles in resolution with each step until the estimate converges to a target precision.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='The function recursively approximates π/4 by counting grid points inside the unit circle over a square grid, refining the estimate until the change is below a desired precision threshold.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='**The function estimates π by counting grid points inside a unit circle and recursively refining the grid. However, it omits the critical 4x scaling and uses an inconsistent point total, resulting in mathematically incorrect estimates.**'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points inside a unit circle with increasing resolution until the desired precision is achieved, though it contains a calculation error in the pi estimation formula.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points inside a unit circle and recursively refining the grid resolution until reaching a target precision, though it contains a mathematical error in the estimation formula.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points within a unit circle and refining the estimate until the desired precision is achieved.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates Pi by repeatedly doubling the resolution of a grid, calculating the ratio of points that fall inside a unit circle to the number of points in a 1x1 quadrant, and stopping once the result converges to a specified precision.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively refines a grid-based estimate of π by counting points within a unit circle until the difference between consecutive estimates falls below the specified precision, though its final calculation may omit the typical scaling factor.'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='The function attempts to approximate π via recursive grid refinement but contains mathematical errors (missing factor of 4 and incorrect point counting), leading to inaccurate results. Its deterministic approach fundamentally differs from Monte Carlo methods that use random sampling for estimation.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points inside a unit circle versus total points in a square, doubling the grid resolution until the estimate converges within the specified precision, though it contains a mathematical error in the π calculation formula.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π using a grid-based sampling method that counts points inside a unit circle, recursively doubling the grid resolution until reaching a target precision, though it contains a bug in the pi estimation formula.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting points within a unit circle on a grid and refining the grid resolution until the desired precision is achieved, though it currently estimates π/4 due to a missing multiplication by 4.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of pi by calculating the ratio of points that fall within a unit circle on an increasingly dense grid until the result stabilizes to a desired precision.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function estimates π by recursively refining a grid, counting the proportion of points inside the unit circle, but it actually returns an estimate of π/4 due to missing a multiplying factor of 4.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting fixed grid points within a unit circle and recursively refining the grid, but contains mathematical errors that prevent correct π estimation.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points within a unit circle and refining the grid resolution until a target precision is reached, though it contains a calculation error in the pi estimation formula.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points inside a unit circle and refining the grid resolution until the estimate converges within a specified precision threshold.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points within a unit circle and refining the grid until the estimate reaches the desired precision.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary="The function recursively approximates π by counting points on an ever-finer grid that fall inside a unit circle, stopping when the estimate's improvement is smaller than a target precision."

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π using a grid-based method by counting points within a unit circle, refining the grid until the estimate stabilizes within a given precision—but omits the necessary scaling factor of 4, so it actually estimates π\u2009⁄\u20094, not π itself.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function estimates π by iteratively refining a fixed grid over the unit circle until successive approximations converge within a specified precision tolerance.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points inside a unit circle and doubling the grid resolution until the desired precision is achieved, though it contains mathematical errors in the π calculation formula.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points inside a unit circle and using recursive refinement, but contains a bug in its calculation formula that prevents it from producing accurate π values.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively estimates π by counting points within a unit circle on a grid, refining the grid until the estimate meets a target precision, though it contains a bug in the estimation formula.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates the value of pi by calculating the area of a unit circle, using an increasingly fine grid of points, until the result converges to a desired precision.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function estimates π by recursively counting how many points in an increasingly fine grid covering a square fall inside the unit circle, refining the estimate until the change is smaller than the target precision.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='The function attempts to approximate π via a recursive grid-based approach but contains mathematical errors (incorrect point scaling and missing factor of 4), rendering its output inaccurate.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points inside a unit circle with increasing resolution until consecutive estimates converge within a specified precision threshold.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points inside a unit circle and recursively refining the grid resolution until reaching a target precision, though the estimation formula appears to be incorrect for standard π approximation.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting points within a unit circle on progressively finer grids until the estimate reaches the desired precision.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of Pi by calculating the ratio of points that fall inside a unit circle to the total points on an increasingly dense grid, stopping once the estimate stabilizes to a desired precision.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively computes an increasingly accurate approximation of π by measuring how many points fall within a unit circle on a symmetric grid, stopping when the estimate stabilizes to within the desired precision.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='The function recursively estimates π by counting grid points inside a unit circle and scaling by the inverse square of the grid resolution, halting when sequential estimates stabilize within a target precision.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points inside a unit circle and refining the grid resolution until reaching a target precision, though it contains a mathematical error in the π calculation formula.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by recursively counting grid points inside a unit circle with increasing resolution until reaching a target precision, though it contains a calculation error in the pi estimation formula.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points within a unit circle and refining the grid until the estimate reaches the desired precision.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of Pi by counting the number of points that fall inside a unit circle on an increasingly dense grid, stopping only when the estimate stabilizes to a user-defined precision.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively approximates the value of pi using a grid-based area estimation of the unit circle, refining its estimate until the desired precision is achieved.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π using a recursive grid-based method with a termination condition based on precision, but contains mathematical errors that prevent accurate estimation.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting grid points inside a unit circle, doubling the grid resolution until the estimate converges within a specified precision threshold.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate pi by counting grid points inside a unit circle and recursively refining the grid resolution, but contains a mathematical error in the pi calculation formula that prevents it from producing correct results.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary="This function approximates π by recursively refining a grid-based estimation of the unit circle's area until the desired precision is achieved."

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively refines an estimate of Pi by counting points inside a unit circle on an increasingly fine grid until the result stabilizes to a target precision.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by deterministically counting grid points inside the unit circle, refining the grid until the estimate converges within a specified precision.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function estimates π by recursively refining a symmetric grid count of unit circle points, using doubling resolution until successive approximations fall below a precision threshold.'
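
Reading sixty log entries by eye is error-prone, so a small tally can help when comparing models. The sketch below is only an illustration, assuming the log format printed above; the names `tally`, `model_re`, and `verdict_re` are hypothetical, and the notebook's own aggregation presumably works on the returned results rather than on the printed text. The three entries in `log_text` are copied from the log above with their `len(backlog)` and `summary` lines omitted.

```python
import ast
import re
from collections import Counter

# Three entries copied from the log above (backlog and summary lines omitted).
log_text = """\
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
"""

model_re = re.compile(r"model_name='([^']+)'")
verdict_re = re.compile(r"is_valid=(True|False) invalid=(\{.*\})")


def tally(text):
    """Count valid answers and raised flags per model from printed log lines."""
    counts = Counter()
    model = None

    for line in text.splitlines():
        model_match = model_re.search(line)

        if model_match:
            model = model_match.group(1)
            continue

        verdict_match = verdict_re.search(line)

        if verdict_match and model is not None:
            if verdict_match.group(1) == "True":
                counts[(model, "valid")] += 1

            # The flags are printed as a Python dict literal.
            flags = ast.literal_eval(verdict_match.group(2))

            for name, is_set in flags.items():
                if is_set:
                    counts[(model, name)] += 1

    return counts


for key, count in sorted(tally(log_text).items()):
    print(key, count)
```

Keying the counter on `(model, flag)` keeps the sketch short; splitting by `reasoning_budget` as well would only require capturing that field in the same way.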

Meaningless Names¶
In [40]:
pi_meaningless_names_results = explain_code(
    "pi-meaningless-names",
    pi_meaningless_names,
    check_pi_approximation_explanation,
    {"pi_quarter": None, "suspects_bug": None},
    repeats=REPEATS,
    temperature=TEMPERATURE,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively estimates π/4 using a grid-based Monte Carlo method, doubling the grid resolution until the estimate converges within a specified tolerance.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 by counting grid points inside a unit circle with progressively finer resolution until the estimate converges within a specified tolerance.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This is a recursive function that estimates π/4 by counting points within a unit circle using increasingly dense grids until the estimate stabilizes within a given threshold.'

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of Pi using a Monte Carlo method, refining its estimate with an increasingly dense grid of points until the result converges to a specified precision.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function numerically estimates π/4 by counting how many points on a grid fall inside the unit circle within a square, recursively refining the grid until the estimate stabilizes within a given tolerance.'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π via unit circle point counting, refining grid density until successive estimates stabilize within tolerance `a`.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π using a Monte Carlo-style grid sampling method, doubling the resolution until the approximation converges within a specified tolerance.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively computes π to a specified precision using a grid-based approximation that counts points inside a unit circle.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This is a recursive function that approximates π/4 using a deterministic grid method, refining the estimate by doubling grid resolution until the change between iterations falls below a given threshold.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates the value of Pi (π) using a Monte Carlo method, successively increasing the number of sample points until the result converges to a user-specified precision.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 by counting the fraction of square grid points within a unit circle and continues until the estimate converges within a specified tolerance.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively estimates π via grid-based point counting, refining precision by doubling grid density until successive approximations converge within tolerance `a`.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively approximates π/4 using a grid-based Monte Carlo method, doubling the grid resolution until consecutive approximations differ by less than the specified tolerance.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function computes π to a specified tolerance using a grid-based Monte Carlo-like method with recursive refinement that doubles the resolution until convergence is achieved.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively approximates π/4 (the area of a quarter-unit circle) using a grid-based Monte Carlo method until the result stabilizes within a given tolerance.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates the value of Pi (π) by using a Monte Carlo method, increasing the sample size in each step until the result converges to a user-specified precision.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function computes a grid-based numerical estimate of π/4 (the area fraction of the unit circle within its bounding square) by iteratively increasing resolution until a specified precision is reached.'

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points inside a unit circle and refining the grid until successive estimates differ by less than the specified tolerance `a`.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 using a grid-based Monte Carlo method, doubling the resolution until the estimate converges within a specified tolerance.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function estimates π/4 using a grid-based Monte Carlo method that counts points inside a unit circle and recursively refines the calculation until it converges within a specified tolerance.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 (area of a unit circle quadrant) using a grid approximation method, refining the estimate until the change between iterations falls below a given threshold.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates the value of Pi (π) using a Monte Carlo method, stopping when the difference between successive approximations is less than a specified tolerance.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary="This function recursively estimates the ratio of the unit circle's area to its bounding square using a grid and refines its approximation until a given accuracy is reached, effectively approximating \\(\\pi/4\\)."

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points inside a unit circle and refining the grid until successive estimates differ by less than a given tolerance.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function implements an adaptive grid-based Monte Carlo method to estimate π/4 by counting points inside a unit circle within a square, recursively increasing grid resolution until the estimate converges within a specified tolerance.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates the value of π using a grid-based Monte Carlo method with recursive refinement until the desired accuracy (parameter `a`) is achieved.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary="This function recursively approximates π/4 (the area of a unit circle's quarter) using a grid-based Monte Carlo method, refining the grid until the result stabilizes within a given tolerance."

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates the value of Pi using a Monte Carlo method, refining its grid-based simulation until the result converges to within a user-specified precision.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 using a discrete grid-based area approximation of the unit circle, refining the result until it converges within a specified tolerance.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function computes an approximation of π by counting grid points inside the unit circle and recursively refines the estimate until achieving the desired precision.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 using a grid-based Monte Carlo method, doubling the grid resolution until the estimate converges within a specified tolerance.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function computes π/4 to a specified precision using a grid-based sampling method that counts points inside a unit circle and recursively refines the approximation until convergence.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively approximates the area of a unit circle (π/4) using a grid-based sampling method until the result stabilizes within a given tolerance.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of Pi using a Monte Carlo simulation, progressively doubling the sample points until the estimate stabilizes within a given precision.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function numerically approximates π/4 by counting the fraction of grid points inside the unit circle in the square [-1, 1] × [-1, 1], doubling the grid resolution until the estimate converges within a specified tolerance.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function adaptively refines a grid to approximate π by counting points inside the unit circle, terminating when successive estimates differ by less than the input tolerance `a`.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively approximates π/4 using a grid-based Monte Carlo method, doubling the grid resolution until consecutive estimates differ by less than the specified tolerance.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function estimates π by counting grid points inside a unit circle and recursively refining the grid resolution until the estimate converges within a specified tolerance.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary="This function recursively approximates π/4 (a quarter of the unit circle's area) using a grid-based Monte Carlo method until the result stabilizes within a given tolerance."

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates the value of Pi (π) by calculating the ratio of points that fall within a unit circle versus the total points in a progressively finer grid, stopping when the result stabilizes to a user-defined precision.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively approximates the fraction of points in a square grid (from -1 to 1) that fall within the unit circle, effectively estimating π/4 via a deterministic grid-based method to a specified precision.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points inside the unit circle and refining the grid until successive estimates converge within a tolerance `a`.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function implements a recursive grid-based Monte Carlo approximation of π/4, doubling the sampling resolution on each iteration until the estimate converges within a specified tolerance.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 by counting grid points inside a unit circle, doubling the resolution until the estimate converges within the specified tolerance.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This is a recursive function that estimates π/4 by counting points within a unit circle using progressively finer grids until the result stabilizes within a given threshold.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of Pi (π) by repeatedly sampling points on an increasingly fine grid and calculating the ratio of points that fall within a unit circle until the result converges to a specified tolerance.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function iteratively estimates the area of the unit circle by counting grid points inside the circle and returns the result with a specified tolerance.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively estimates π by counting grid points inside a unit circle and refining the grid until successive approximations converge within tolerance `a`.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 using a Monte Carlo-like grid sampling method, doubling the resolution until the difference between consecutive estimates is less than the specified tolerance `a`.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively computes π by counting grid points inside a unit circle with increasing resolution until the estimate converges within a specified tolerance.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This is a recursive function that estimates the value of π/4 by counting points inside a unit circle and progressively refining the approximation until it converges within a given threshold.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively calculates an approximation of Pi using a Monte Carlo method, refining the estimate with more sample points in each step until it converges to a specified precision.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 (the area of the unit circle within a square) using grid point counting, refining the approximation until the desired precision is reached.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='The function recursively estimates π/4 by counting grid points inside a unit circle and stops when successive approximations converge within tolerance `a`.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively approximates π/4 using a grid-based method that counts points inside a quarter unit circle, doubling the grid resolution until the desired accuracy is achieved.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates π/4 using a Monte Carlo method by counting points inside a unit circle on increasingly fine grids until the estimate converges within a specified tolerance.'

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This is a recursive function that estimates π/4 by counting points inside a unit circle quadrant using progressively finer grids until the result stabilizes within a given tolerance.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary="This function recursively approximates the value of Pi using a Monte Carlo method, increasing the simulation's resolution with each step until the result stabilizes to a user-defined precision."

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='The function estimates \\(\\pi/4\\) by counting grid points inside the unit circle on a square mesh and recursively increases accuracy until the result converges within a given tolerance.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points inside the unit circle and refining the grid until successive estimates agree within tolerance `a`.'
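With this many records, per-model pass rates are easier to compare as a tally than by reading the log. The following is a minimal sketch that assumes only the line format visible in the output above (`model_name='…', reasoning_budget=…` followed by `is_valid=…`). `tally_validity` is a hypothetical helper, not part of the notebook's own code, and the sample text is two records copied from the log.

```python
import re
from collections import defaultdict

# Hypothetical helper: the notebook's own aggregation code is not shown here.
RECORD_RE = re.compile(
    r"model_name='(?P<model>[^']+)', reasoning_budget=(?P<budget>\d+)"
)
VALID_RE = re.compile(r"is_valid=(?P<valid>True|False)")


def tally_validity(log_text):
    """Return {(model, budget): [valid_count, total_count]} for a log dump."""
    counts = defaultdict(lambda: [0, 0])
    current = None  # (model, budget) of the record being read
    for line in log_text.splitlines():
        record = RECORD_RE.search(line)
        if record:
            current = (record["model"], int(record["budget"]))
            continue
        verdict = VALID_RE.search(line)
        if verdict and current is not None:
            counts[current][1] += 1
            if verdict["valid"] == "True":
                counts[current][0] += 1
            current = None  # wait for the next model_name line
    return dict(counts)


# Two records copied from the output above.
sample = """\
# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
"""
result = tally_validity(sample)
for (model, budget), (valid, total) in result.items():
    print(f"{model} (budget={budget}): {valid}/{total} valid")
```

Keying on the pair (model, reasoning budget) keeps the two `claude-opus-4-20250514` configurations separate, since they behave differently in the log.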

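The summaries disagree on whether the function converges to π or to π/4, and the checker marks the π/4 reading as invalid. The sketch below illustrates why a grid count over [-1, 1] × [-1, 1] divided by the squared resolution tends to π. It is an independent reconstruction inferred from the summaries, not the notebook's actual snippet; the names `count_inside` and `estimate_pi` and the starting resolution of 8 are assumptions.

```python
import math


def count_inside(n):
    """Count grid points (i/n, j/n) in [-1, 1]^2 that lie in the unit disk."""
    return sum(
        1
        for i in range(-n, n + 1)
        for j in range(-n, n + 1)
        if i * i + j * j <= n * n
    )


def estimate_pi(tolerance, n=8, previous=None):
    """Double the grid resolution until two successive estimates agree."""
    # The grid spacing is 1/n, so each point stands for an area of 1/n**2.
    # Dividing the count by n**2 therefore approximates the disk's area (pi),
    # whereas dividing by the total number of grid points, about 4*n**2,
    # would approximate pi/4.
    current = count_inside(n) / (n * n)
    if previous is not None and abs(current - previous) < tolerance:
        return current
    return estimate_pi(tolerance, 2 * n, current)


approx = estimate_pi(0.01)
print(approx, math.pi)
```

Under these assumptions, a model that matches the code to the textbook "points inside divided by all points" pattern would report π/4, while one that checks the denominator would report π.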
Ignore Names¶
In [41]:
# Uses the meaningful-name snippet, with the `ignore_names` text passed to the
# model as extra prompt instructions.
pi_ignore_names_results = explain_code(
    "pi-ignore-names",
    pi_meaningful_names,
    check_pi_approximation_explanation,
    {"pi_quarter": None, "suspects_bug": None},
    repeats=REPEATS,
    temperature=TEMPERATURE,
    extra_instructions=ignore_names,
)
# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π using a grid-based sampling method that counts points inside a unit circle, but contains a mathematical error in the final calculation that prevents it from actually computing π correctly.'

# len(backlog)=9
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by counting grid points inside a unit circle within a square [-1,1]×[-1,1], using a scaling trick and recursive refinement until reaching the desired precision.'

# len(backlog)=9
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=0
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary="This function approximates π by progressively refining a grid-based estimation of a unit circle's area until reaching the desired precision."

# len(backlog)=9
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively estimates the value of Pi by counting the number of points on an ever-finer grid that fall within a unit circle, stopping when the estimate stabilizes to a desired precision.'

# len(backlog)=9
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=0
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively refines the estimate of the area proportion inside a unit circle on a grid—effectively converging to \\(\\frac{\\pi}{4}\\)—by counting interior grid points, but due to an unscaled return value, it approximates \\(\\frac{\\pi}{4}\\), not \\(\\pi\\).'

# len(backlog)=9
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=0
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively refines a grid to count points inside the unit circle, estimates π via geometric symmetry, and terminates when successive estimates meet a precision target.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points inside a unit circle and recursively refining the grid resolution, but contains a bug in the ratio calculation that prevents it from correctly estimating π.'

# len(backlog)=8
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates the value of π using a Monte Carlo-like method by counting grid points inside a unit circle and recursively refining the grid resolution until the estimate converges within a specified precision.'

# len(backlog)=8
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points within a unit circle and refining the grid until successive estimates differ by less than the target precision.'

# len(backlog)=8
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates the value of pi by counting the number of points that fall inside a unit circle on an increasingly dense grid, stopping when the result converges to a specified precision.'

# len(backlog)=8
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=1
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively increases the resolution of a square grid, counting the fraction of grid points inside the unit circle, and returns this fraction (with a denominator that leads to overestimation), stabilizing when consecutive estimates differ by less than the target precision.'

# len(backlog)=8
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=1
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function estimates π by recursively refining a grid over a unit circle domain until successive approximations meet a specified precision threshold.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by counting grid points inside the unit circle on increasingly fine grids, recursively doubling the resolution until the estimate converges within a specified precision.'

# len(backlog)=7
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates the value of π by counting grid points inside a unit circle, using a recursive grid refinement approach that doubles the resolution until consecutive estimates differ by less than the target precision.'

# len(backlog)=7
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by progressively refining a grid of points within a unit circle until successive estimates differ by less than the target precision.'

# len(backlog)=7
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function approximates the value of Pi by iteratively counting the number of points on an increasingly fine grid that fall within a unit circle until the result stabilizes to a desired precision.'

# len(backlog)=7
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=2
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary="This function estimates a value (though not exactly π) by repeatedly doubling a grid's resolution, counting the fraction of grid points within the unit circle over a fixed normalization, and recursing until the estimate converges within a target precision."

# len(backlog)=7
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=2
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by counting grid points inside the unit circle, scaling the count by the square of the grid resolution, and recursively doubling the grid density until successive estimates converge within a specified precision.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points inside a unit circle and using recursive refinement, but contains a critical bug in its calculation that causes it to converge to an incorrect value around 2.8-2.9 instead of π.'

# len(backlog)=6
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function estimates the value of π using a grid-based Monte Carlo method that counts points inside a unit circle, recursively refining the grid resolution until the estimate converges within the specified precision.'

# len(backlog)=6
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=3
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by recursively refining a grid-based area estimation of the unit circle until successive estimates agree within the specified precision.'

# len(backlog)=6
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates Pi by counting the points of an increasingly fine grid that fall inside a unit circle, stopping when the estimate stabilizes to a given precision.'

# len(backlog)=6
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=3
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function incrementally refines an estimate of the area inside a unit circle by counting grid points within the circle on ever-denser square grids, and returns the result (which approximates π/4, though with an unusual and possibly mistaken normalization).'

# len(backlog)=6
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=3
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function approximates π by counting grid points inside the unit circle and refining the grid recursively until estimates stabilize within a specified precision.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π using a grid-based sampling method that counts points inside a unit circle, but contains a mathematical error in dividing by grid_resolution² instead of the actual number of sampled points, resulting in incorrect estimates.'

# len(backlog)=5
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function iteratively refines a numerical estimate by counting grid points inside a unit circle on progressively finer grids, stopping when the change between iterations falls below a specified precision threshold.'

# len(backlog)=5
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=4
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by counting points in a progressively finer grid that fall within a unit circle, recursively doubling the grid resolution until successive estimates differ by less than the target precision.'

# len(backlog)=5
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively refines a grid of points laid over a square to count how many fall within an inscribed circle, using the ratio of these counts to calculate an increasingly accurate approximation of Pi until a target precision is met.'

# len(backlog)=5
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=4
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively refines a grid-based estimate for the area of a unit circle by counting points whose coordinates fall inside the circle, aiming to approximate π (or a multiple thereof), but the precise result may be off due to an inconsistency in how it divides the count of points inside the circle by the nominal number of total grid points.'

# len(backlog)=5
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=4
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function estimates π by counting grid points inside the unit circle relative to a scaled denominator, recursively doubling grid resolution until successive estimates stabilize within a specified precision threshold.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points inside a unit circle and comparing to the total area, but contains a calculation error in the denominator that prevents it from actually computing π correctly.'

# len(backlog)=4
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function estimates the value of π by counting grid points inside a unit circle and using a normalization that makes the estimate converge to π through recursive grid refinement.'

# len(backlog)=4
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively approximates π by counting points within a unit circle on increasingly finer grids until reaching the desired precision, though it currently contains mathematical and implementation errors.'

# len(backlog)=4
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of Pi by creating an increasingly dense grid of points over a square, calculating the ratio of points that fall within an inscribed unit circle to the points within a unit area, and stopping when the result converges to a specified precision.'

# len(backlog)=4
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=5
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function recursively estimates the area ratio of the unit circle to its circumscribing square using grid sampling, refining the estimate until the change falls below a target precision—but as written, its return value is **not π or π/4**, but an unusual normalization likely leading to an incorrect approximation without further adjustment.'

# len(backlog)=4
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=5
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function iteratively refines a grid to count points inside the unit circle, scaling the count by grid density to estimate π, and recurses until successive estimates differ by less than the target precision.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π using a grid-based Monte Carlo-like method by counting points inside a unit circle, but contains a calculation error that causes it to converge to an incorrect value rather than π.'

# len(backlog)=3
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by counting grid points inside a unit circle and using a clever normalization factor, recursively refining the grid until reaching the desired precision.'

# len(backlog)=3
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=6
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates π by counting grid points within a unit circle, progressively refining the grid until successive estimates differ by less than the target precision.'

# len(backlog)=3
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=6
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of Pi by overlaying a progressively denser grid of points on a square containing a unit circle and using the count of points inside the circle to refine its estimate until a target precision is achieved.'

# len(backlog)=3
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function repeatedly estimates the area of a unit circle via a grid-point counting approach, refining the estimate recursively until changes are smaller than a given threshold, but due to a missing scaling factor, it substantially underestimates the true value of π.'

# len(backlog)=3
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=6
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='The function attempts to approximate π using a recursive grid-based approach, but its flawed scaling (denominator and missing factor of 4) prevents it from converging to the correct value of π.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points inside a unit circle and recursively refining the grid resolution until convergence, but contains a mathematical error in the estimation formula that prevents it from actually calculating π correctly.'

# len(backlog)=2
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by counting grid points inside a unit circle within a square from -1 to 1, using a clever normalization factor and recursive grid refinement to achieve a specified precision.'

# len(backlog)=2
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='The function attempts to approximate π via Monte Carlo integration using grid points in a unit square, but contains a scaling error that makes it estimate π/4 instead of π.'

# len(backlog)=2
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function recursively approximates π by calculating the ratio of the number of points falling inside a unit circle to the number of points within one quadrant of its bounding square, using an increasingly finer grid until the result converges to a target precision.'

# len(backlog)=2
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=7
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function recursively estimates the fraction of evenly-distributed points inside a unit circle inscribed in a square, but due to incorrect normalization, its returned value approximates π/4—not π itself.'

# len(backlog)=2
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=7
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function computes π by counting grid points inside the unit circle, scaling the count by `1/(grid_resolution²)`, and recursively doubling resolution until estimates stabilize within a target precision.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=8
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary='This function attempts to approximate π by counting grid points inside a unit circle and recursively refining the grid resolution, but contains a critical bug in the area calculation that prevents it from converging to the correct value of π.'

# len(backlog)=1
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates the value of π by counting grid points inside a unit circle and recursively refining the grid resolution until reaching a target precision.'

# len(backlog)=1
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively estimates π by counting points within a unit circle on progressively finer grids until reaching the desired precision, using an unconventional but mathematically valid normalization approach.'

# len(backlog)=1
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of π by using a numerical integration technique that counts points on an ever-finer grid within a unit circle, stopping once the estimate stabilizes to a desired precision.'

# len(backlog)=1
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=8
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates a value close to π by counting how many points lie inside a unit circle traced over a square grid and tracking convergence of the ratio, but it uses a nonstandard normalization, so its result only trends toward π as grid density increases.'

# len(backlog)=1
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=8
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='The function approximates π via recursive grid refinement over \\([-1, 1]^2\\), counting points inside the unit circle normalized by \\(\\text{grid\\_resolution}^2\\), and terminates when estimates stabilize within a target precision.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively estimates π by counting grid points inside a unit circle and dividing by a scaled grid area, refining the estimate by doubling the resolution until the change between iterations is below the target precision.'

# len(backlog)=0
# model_name='claude-opus-4-20250514', reasoning_budget=16000, tries=0, i=9
#   is_valid=False invalid={'pi_quarter': False, 'suspects_bug': True}
  summary="This function uses a grid-based sampling method with recursive refinement to compute a value related to the area of a unit circle, but due to an incorrect divisor in the calculation, it doesn't actually approximate π correctly."

# len(backlog)=0
# model_name='deepseek-chat', reasoning_budget=0, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by progressively refining a grid-based estimation of the area of a unit circle until reaching the desired precision.'

# len(backlog)=0
# model_name='gemini-2.5-pro-preview-06-05', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function recursively approximates the value of Pi by counting the points on an increasingly fine grid that fall within a unit circle, stopping when the resulting estimate no longer changes significantly.'

# len(backlog)=0
# model_name='gpt-4.1-2025-04-14', reasoning_budget=0, tries=0, i=9
#   is_valid=False invalid={'pi_quarter': True, 'suspects_bug': False}
  summary='This function numerically estimates the value of π divided by 4 (π/4) by counting how many regularly-spaced points in a squared grid fall inside the unit circle, making the estimate increasingly accurate by recursively refining the grid until the change is smaller than the given precision target.'

# len(backlog)=0
# model_name='sonar-reasoning-pro', reasoning_budget=16000, tries=0, i=9
#   is_valid=True invalid={'pi_quarter': False, 'suspects_bug': False}
  summary='This function approximates π by counting grid points inside a unit circle across recursively refined grids, terminating when successive estimates stabilize within a target precision.'
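
The summaries above disagree mainly on normalization: does dividing the in-circle point count by the squared grid resolution give π, or π/4? The function under test is not reproduced in this part of the log, so the following is only a minimal sketch of the idea those summaries describe, not the original code. The name `grid_circle_estimates`, the parameter `n` (grid points per unit length), and the cell-midpoint sampling are all assumptions made for illustration.

```python
import math


def grid_circle_estimates(n):
    """Count grid points of [-1, 1]^2 inside the unit circle (hypothetical sketch).

    The square is covered by a 2n x 2n grid of cell midpoints, so each unit
    quadrant holds n * n points. Returns the count normalized two ways.
    """
    step = 1.0 / n
    inside = 0
    for i in range(2 * n):
        x = -1.0 + (i + 0.5) * step
        for j in range(2 * n):
            y = -1.0 + (j + 0.5) * step
            if x * x + y * y <= 1.0:
                inside += 1

    # Dividing by the points in ONE quadrant (n^2) approximates the circle's
    # area, which is pi for a unit circle.
    per_quadrant = inside / (n * n)
    # Dividing by ALL points in the square ((2n)^2) approximates the area
    # ratio circle/square, which is pi / 4.
    per_square = inside / (2 * n) ** 2
    return per_quadrant, per_square


if __name__ == "__main__":
    est_pi, est_quarter = grid_circle_estimates(200)
    print(f"count / n^2     = {est_pi:.4f}  (math.pi     = {math.pi:.4f})")
    print(f"count / (2n)^2  = {est_quarter:.4f}  (math.pi / 4 = {math.pi / 4:.4f})")
```

Under these assumptions, the "unusual" divisor is simply the number of points per quadrant, so the estimate tends toward π without an extra factor of 4. This matches the responses marked `is_valid=True`.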

Results¶
In [42]:
plot_results(
    r"$\pi$ Approximation",
    pi_meaningful_names_results,
    pi_meaningless_names_results,
    pi_ignore_names_results,
    include_lengths=True,
)
[Figure: "π Approximation" results plot comparing the meaningful-names, meaningless-names, and ignore-names runs]