Summary
A contained patch-and-retry loop where an agent observes its own failures, proposes changes, and keeps iterating until the test passes or the budget runs out.
Patch-and-Retry Sandbox
A recursive agent experiment that writes code, reads the traceback, patches itself, and retries inside a sandbox.
Project Brief
Problem
One-shot code generation produces brittle results, especially when the real failure only appears after execution.
Hypothesis
If an agent can fail safely, read the failure clearly, and patch iteratively, its reliability improves more than it would through one-shot generation alone.
Outcome
Built and ran the loop against 7 real problems and solved all 7. Every problem failed on attempt 1 and passed on attempt 2, including FizzBuzz with two interacting bugs and a merge sort with a subtle index error.
Goals
Technologies Used
The point of this project is not that the agent writes perfect code. The point is that the system gives the agent a contained place to fail, a readable signal about why it failed, and permission to try again. That shifts the measure of reliability from the first draft to the final, tested result.
By wrapping generation, execution, traceback parsing, and patching into one controlled loop, the project turns runtime errors into fuel for the next attempt. The success condition is not “looks plausible.” It is “the test actually passed.”
The main lesson was that reliability does not come from asking the agent to be smarter in a single shot. It comes from building a system around the agent that makes failure cheap, observable, and correctable.
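The generate, execute, read-the-failure, patch cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual implementation: `run_candidate`, `patch_and_retry`, and the `Patcher` signature are hypothetical names, and a child process in a temporary directory is only a stand-in for a real sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Hypothetical patcher signature (an assumption): (source, failure_text) -> new source.
Patcher = Callable[[str, str], str]


def run_candidate(source: str, test_code: str, timeout: float = 10.0) -> Optional[str]:
    """Run source plus its test in a child process. Return failure text, or None on pass."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source + "\n\n" + test_code + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return f"Timeout: no result after {timeout} seconds"
    if proc.returncode == 0:
        return None
    # Prefer the traceback on stderr; fall back to stdout if stderr is empty.
    return proc.stderr.strip() or proc.stdout.strip() or "non-zero exit with no output"


def patch_and_retry(source: str, test_code: str, patcher: Patcher, budget: int = 3):
    """Iterate until the test passes or the budget runs out.

    Returns (final_source, attempts_used, passed).
    """
    for attempt in range(1, budget + 1):
        failure = run_candidate(source, test_code)
        if failure is None:
            return source, attempt, True
        if attempt < budget:
            # The failure text is the only feedback the patcher receives.
            source = patcher(source, failure)
    return source, budget, False
```

The success condition lives in `run_candidate`: a candidate counts as solved only when the test process exits cleanly. The patcher could be any callable, for example a wrapper around a model call, as long as it accepts the current source and the failure text.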
01
None of the 7 problems passed on the first attempt. Every problem required the loop to fire, so one-shot generation alone would have failed the full batch.
02
All 7 solved on attempt 2. The traceback-as-feedback pattern worked across every error class tested, including wrong-output cases where there was no exception to parse.
03
FizzBuzz had two interacting bugs — a wrong range and a wrong check order — and the patcher resolved both in one patch cycle after reading the structured assertion failure.
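To make the wrong-output case concrete, here is an illustrative reconstruction of a FizzBuzz with the two kinds of bug described above, plus a checker that turns a silent mismatch into a structured report. The project's actual buggy code and report format are not shown in this write-up, so `fizzbuzz_buggy`, `fizzbuzz_fixed`, `check_output`, and the report fields are all assumptions.

```python
def fizzbuzz_buggy(n):
    out = []
    for i in range(1, n):            # bug 1: wrong range, stops at n - 1
        if i % 3 == 0:               # bug 2: wrong check order, so multiples
            out.append("Fizz")       #        of 15 never reach the last branch
        elif i % 5 == 0:
            out.append("Buzz")
        elif i % 15 == 0:
            out.append("FizzBuzz")
        else:
            out.append(str(i))
    return out


def fizzbuzz_fixed(n):
    out = []
    for i in range(1, n + 1):        # include n itself
        if i % 15 == 0:              # most specific case first
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out


def check_output(func, n, expected):
    """Return None on pass, else a structured description of the mismatch."""
    actual = func(n)
    if actual == expected:
        return None
    report = {
        "kind": "wrong_output",
        "expected_len": len(expected),
        "actual_len": len(actual),
    }
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            report["first_mismatch"] = {"index": index, "expected": want, "actual": got}
            break
    return report


report = check_output(fizzbuzz_buggy, 30, fizzbuzz_fixed(30))
```

With `n = 30`, the report carries both symptoms at once: the length mismatch points at the range bug, and the first mismatched element points at the check order. A report of this shape gives a patcher enough to address both in one pass, even though no exception was raised.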
Analysis
[Chart: Attempt Timeline, Failure to Pass per Problem]
The 7 problems span 5 error classes: SyntaxError, NameError, IndexError, TypeError, and wrong output. Every problem failed on attempt 1 and passed on attempt 2. The loop was driven by real Codex CLI calls, and each failed test output was passed directly to the patcher as a structured signal. FizzBuzz carried two interacting bugs (wrong range and wrong check order); the patcher identified and fixed both in a single pass.
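One way to turn raw test output into a structured signal is to parse the final line of the Python traceback and map it onto the five classes listed above. The sketch below is an assumption about how that could work, not the project's parser: `classify_failure` and its returned fields are hypothetical, and treating `AssertionError` as "wrong output" is a choice made for this example.

```python
import re

KNOWN_CLASSES = ("SyntaxError", "NameError", "IndexError", "TypeError")

# Matches a traceback's final line, e.g. "NameError: name 'x' is not defined"
# or a bare "AssertionError".
_LAST_LINE = re.compile(r"^([A-Za-z_][\w.]*)(?::\s*(.*))?$")
_FRAME = re.compile(r'File "[^"]+", line (\d+)')


def classify_failure(stderr: str) -> dict:
    """Map stderr from a failed run to an error class plus a few details."""
    lines = [line for line in stderr.strip().splitlines() if line.strip()]
    last = lines[-1].strip() if lines else ""
    match = _LAST_LINE.match(last)
    exception = match.group(1) if match else None
    message = (match.group(2) or "") if match else last

    if exception in KNOWN_CLASSES:
        error_class = exception
    elif exception == "AssertionError":
        # Assumption: a failed assertion means the code ran but returned the wrong value.
        error_class = "wrong output"
    else:
        error_class = "other"

    # The last frame mentioned in the traceback is usually closest to the fault.
    frame_lines = _FRAME.findall(stderr)
    return {
        "error_class": error_class,
        "exception": exception,
        "message": message,
        "line": int(frame_lines[-1]) if frame_lines else None,
    }
```

A patcher prompt built from this dictionary can lead with the error class and line number instead of asking the model to read the whole traceback.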
[ Connect ]
If you are exploring agent loops, sandboxes, or self-correcting automation, I am especially interested in those conversations.
John Meyer
Security Engineer → AI