I Built a Pokémon Crystal ROM Hack With Codex

I have built software without writing a line of code before: TypeScript apps, Python scripts, and little internal tools where the whole experience is basically telling an AI agent what I want, waiting a bit, then inspecting the result like Anton Ego taking a bite of ratatouille.

This time I built a Pokémon Crystal ROM hack with Codex. I’m used to web apps with React components, JSON APIs, and well-structured stack traces that look like everything else on the internet. I haven’t written C or assembly since college CS classes. Good thing for me, Codex has :) It has no problem with raw assembly, banked memory, jump instructions, WRAM addresses, sprite palettes, Game Boy sound effects, and emulator debugger scripts. A blue screen of death strikes fear into my heart, but Codex has no qualms whatsoever.

The absolutely bananas part of this whole thing is that it worked.

Codex prompt and implementation notes for Safari Gauntlet mode — My initial prompt to kick off the first version.

The project is called Safari Gauntlet. It turns Pokémon Crystal, through Polished Crystal, into a small roguelite-style challenge mode where you start in a compact hub, bring or receive one Pokémon, draft a team in the Safari Zone, climb a battle ladder, earn BP (battle points/in-game currency), buy upgrades and TMs, and keep one winner for future runs.

It is not some legit AAA game, which is what makes it really fun. It molds my favorite parts about indie games into one of my favorite franchises of all time (which also happens to be playable on my ModRetro Chromatic).

Reminds me of this tweet.

Safari Gauntlet title screen running in mGBA next to the Lua scripting console — The ROM running in mGBA beside the Lua scripting console, which eventually became the main way Codex could inspect and drive the game.

This Was Not the Easy Version of AI Coding

AI coding is very good at writing another CRUD endpoint, a button component, a SQL query, or a script that renames files in a folder. All the kinds of things that are well represented in the training data.

Assembly, unfortunately (or fortunately, depending on how much you enjoy pain), is not as easy.

The Pokémon Crystal disassembly is a cool project that lets you hack on the source code, but that source code is still Game Boy software. You are not writing in a cozy managed runtime, there is no DOM, and there is no npm install please-fix-my-banked-memory. You are pushing bytes around a tiny machine with all sorts of sharp edges.

Safari Gauntlet touched code like:

engine/events/safari_gauntlet.asm for the run state, rewards, settings, and flow
maps/BattleFactory1F.asm for the new main hub the game starts you in
ram/wramx.asm for run persistence
engine/overworld/wildmons.asm for draft encounters
engine/battle/read_trainer_party.asm for battle ladder tuning
Lua scripts that read addresses like 0xdcac, 0xdcae, and 0xffe5 out of emulator memory (😬)
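To give a sense of what reading those addresses involves, here is a minimal sketch in Python rather than the project's actual Lua. It assumes a hypothetical 64 KiB snapshot of the Game Boy address space; the helper names and the stored values are made up for illustration, since the post does not say what those addresses hold.

```python
# Hypothetical helpers for illustration; the real scripts run as Lua inside mGBA.
ADDRESS_SPACE_SIZE = 0x10000  # the Game Boy CPU sees a 16-bit address space


def read8(snapshot, addr):
    """Return the byte at a CPU address in a full address-space snapshot."""
    if not 0 <= addr < ADDRESS_SPACE_SIZE:
        raise ValueError(f"address out of range: {addr:#06x}")
    return snapshot[addr]


def read16(snapshot, addr):
    """Return a little-endian 16-bit value (low byte stored first)."""
    return read8(snapshot, addr) | (read8(snapshot, addr + 1) << 8)


def region(addr):
    """Name the memory region an address falls in (Game Boy Color layout)."""
    if 0xC000 <= addr <= 0xCFFF:
        return "WRAM bank 0"
    if 0xD000 <= addr <= 0xDFFF:
        return "WRAM switchable bank"
    if 0xFF80 <= addr <= 0xFFFE:
        return "HRAM"
    return "other"


# Fake snapshot with made-up values, just to exercise the helpers.
snapshot = bytearray(ADDRESS_SPACE_SIZE)
snapshot[0xDCAC] = 0x34
snapshot[0xDCAD] = 0x12
snapshot[0xFFE5] = 0x07

for addr in (0xDCAC, 0xDCAE, 0xFFE5):
    print(f"{addr:#06x} lives in {region(addr)}")
```

Two of the three addresses sit in the switchable WRAM bank, so a value read there only means something if the expected bank is mapped in at that moment. That is part of why raw address reads deserve the grimace.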

I got to form-fit the code inside a 1998 handheld game! I would say something like:

Add a TM merchant that has a rotating inventory of TMs, bought by BP.

Then I would go to sleep.

In the morning I would wake up to a ROM that built, a set of screenshots from mGBA, a new Lua verifier proving the menu opened without freezing, and a short report explaining what changed (on the good days).

Other days it would not work at all. I would try to talk to Nurse Joy and the game wouldn’t let me, or the game would freeze on startup or turn all the text into mangled sprites from the world map. A few years ago, I would’ve given up at something like that. Nowadays I just say “hey Codex, when I talk to Nurse Joy the game freezes” and several thousand tokens later it’s working again.

The Harness Was the Unlock

In the face of Nurse Joy adversity and to keep progress marching on, Codex built itself a bunch of tools.

OpenAI has a great post on harness engineering that describes the shift well. The short version is: if agents are going to do real work, humans have to design the environment around them. The job moves from typing code to specifying intent, exposing the right tools, and building feedback loops that make the work verifiable.

That’s what helped a ton here.

At first, “make a ROM hack” is a terrible prompt. There are too many invisible states. Did the ROM compile? Did it boot? Did the player land in the correct map? Did a menu open? Did the script freeze? Did the game crash after the fade-out? Did a setting persist after save and reload? You cannot answer those questions by vibes. You need the agent to see the game.

So Codex started making the game legible to itself.

It created repo-local skills:

.codex/skills/mgba-lua-debugging/SKILL.md
.codex/skills/safari-gauntlet-verification/SKILL.md

Those skills taught future Codex runs how to work in the repo: launch mGBA muted, run at fast-forward speed, prefer Lua scripts over manual clicking, write logs to /tmp, save screenshots, fail on crash codes, and inspect visible emulator state before claiming success.

Then it wrote tools/run_safari_gauntlet_mgba.sh, a helper that copies the rebuilt ROM into a hash-specific temp directory. It wrote Lua scripts to press buttons, walk around maps, talk to NPCs, open shops, check party state, seed battle state, inspect WRAM, and detect crash codes.
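The hash-specific directory idea can be sketched in a few lines. This is a Python illustration under assumptions, not the contents of the actual shell helper: the function name, the 16-character digest prefix, and the file names are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def stage_rom(rom_path, staging_root):
    """Copy the ROM into a directory named after its SHA-256 digest."""
    digest = hashlib.sha256(rom_path.read_bytes()).hexdigest()
    run_dir = staging_root / digest[:16]  # prefix length is an arbitrary choice
    run_dir.mkdir(parents=True, exist_ok=True)
    staged = run_dir / rom_path.name
    shutil.copy2(rom_path, staged)
    return staged


# Demo with placeholder bytes standing in for a real build.
work = Path(tempfile.mkdtemp())
rom = work / "build.gbc"
rom.write_bytes(b"\x00" * 32)
first = stage_rom(rom, work / "staging")

rom.write_bytes(b"\x01" * 32)  # simulate a rebuild with different contents
second = stage_rom(rom, work / "staging")

print(first.parent.name, second.parent.name)
```

One plausible benefit of keying the directory on the ROM's hash: every distinct build gets its own folder, so a verifier can't quietly run against a stale copy, and files written next to the ROM don't leak between builds.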

By the end of the first version, the Safari Gauntlet commit range had added about 12,800 lines of source and docs text. About 8,600 of those lines were harness scaffolding: Lua verifiers, repo-local skills, and the mGBA launcher. There are around 40 focused Lua verifier files at the time of writing.

I think AI has played the game about 1,000x more than me now.

Codex verification evidence with mGBA screenshots for the Safari Gauntlet TM shop, PC, and hub — This became the normal workflow: code changed, Lua verifiers ran, and Codex came back with logs plus screenshots from the actual ROM.

Codex Wrote the Weird Tests Too

The testing setup is my favorite part of the whole project.

In a normal web app, an agent can use Playwright, read the DOM, click a button, and assert that text appears. Here, the app is a Game Boy ROM so the “browser” or client I used was mGBA. The “DOM” is whatever pixels are currently on screen and whatever bytes happen to be in memory.

Codex figured out that mGBA has Lua scripting support and started using it as the interface layer: not as a cute demo, but as the regression harness. I didn’t know this thing existed until one day, while Codex was building some features in codex --yolo mode, it opened up on my screen.

The verifiers could run the game in fast-forward mode at 10x the frame rate, drive input frame by frame, and check exact behavior:

boot reaches the Safari Gauntlet hub
the hub does not immediately auto-start a run
the receptionist starts the expected flow
the nurse is reachable
the PC opens
the settings NPC opens and persists values
the TM vendor opens the correct rotating inventory
the move tutor reaches the move-selection flow
the Safari draft uses the expected level and encounter rules
a run can progress through draft and battle phases
save and reload preserve BP, stats, settings, and kept Pokémon
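The general shape of a check like "boot reaches the hub" can be sketched as scripted input plus a condition with a frame budget. The Python below is a toy model under assumptions, not the project's Lua: FakeGame stands in for the emulator, and the 60-frame wait, 30-frame load, and 600-frame timeout are invented numbers.

```python
class FakeGame:
    """Toy stand-in for the emulator: shows a title screen until START."""

    def __init__(self):
        self.frame = 0
        self.screen = "title"
        self._load_frames_left = None

    def step(self, button=None):
        """Advance one frame, optionally with a button held."""
        if self.screen == "title" and button == "START":
            self._load_frames_left = 30  # pretend the hub takes 30 frames to load
        elif self._load_frames_left is not None:
            self._load_frames_left -= 1
            if self._load_frames_left == 0:
                self.screen = "hub"
                self._load_frames_left = None
        self.frame += 1


def run_until(game, schedule, condition, timeout_frames):
    """Step frame by frame, applying scheduled input, until condition holds.

    Returns the frame at which the condition was first seen to hold and
    raises TimeoutError if the frame budget runs out first.
    """
    while game.frame < timeout_frames:
        if condition(game):
            return game.frame
        game.step(schedule.get(game.frame))
    if condition(game):
        return game.frame
    raise TimeoutError(f"condition not met within {timeout_frames} frames")


# Press START on frame 60, then wait for the hub with a 600-frame budget.
game = FakeGame()
reached_at = run_until(game, {60: "START"}, lambda g: g.screen == "hub", 600)
print(f"hub reached at frame {reached_at}")
```

The timeout is the important part. A frozen script or a crashed game shows up as a failed check instead of a verifier that waits forever.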

This is where the “AI code is slop” discourse starts to feel too small and gets a little more nuanced for me.

Was every first attempt perfect? Of course not. Sometimes Codex made a mess. Sometimes it overbuilt. Sometimes it put a value in the wrong place or trusted something that would appear totally dumb to a human. Sometimes the script reached a title screen and thought it had reached the hub because WRAM still had a value from a prior run.
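That stale-WRAM false positive has a fairly standard fix, sketched here in Python as one possible hardening rather than what Codex actually did. The idea is to overwrite the watched address with a sentinel before the run and only accept the expected value if it shows up after the sentinel. The sentinel and the sample values are hypothetical.

```python
SENTINEL = 0xFF  # assumed to be a value the game never writes to this address


def reached_state(samples, expected):
    """True only if the value was seen as SENTINEL and later became expected.

    `samples` is the sequence of values read from one address over time.
    """
    saw_sentinel = False
    for value in samples:
        if value == SENTINEL:
            saw_sentinel = True
        elif value == expected and saw_sentinel:
            return True
    return False


HUB_MAP_ID = 5  # made-up identifier for illustration

# Leftover value from a prior run: the address already reads as "hub"
# while the game is still sitting on the title screen.
stale_run = [5, 5, 5, 5]

# Poisoned run: the verifier wrote the sentinel first, then the game
# wrote the real value once the hub actually loaded.
fresh_run = [0xFF, 0xFF, 0xFF, 5]

print(reached_state(stale_run, HUB_MAP_ID))
print(reached_state(fresh_run, HUB_MAP_ID))
```

A naive check that only looks at the latest value accepts both runs. Requiring the transition rejects the stale one.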

A Safari Gauntlet mGBA screenshot with corrupted menu text after a ROM hack bug — Sometimes “the menu opened” was technically true, while the actual menu looked like this (which is why screenshots beat vibes).

An mGBA crash screen from Pokémon Polished Crystal reporting error 005 stack overflow — Other times the game was less subtle about it.

You can solemnly declare the entire approach doomed, or you can say:

That is slop. Fix the slop.

With a prompt as silly as that, Codex would tighten a verifier or add a screenshot check. That, and about a dozen other things it was smart enough to figure out on its own.

That is the part I think many critiques of AI coding miss. AI coding is a game of token economics, and it is cheap. You are not stuck with the first blob it gives you (thank goodness); you can keep pressing the system toward the shape you want, and if you are serious about the harness, the corrections compound.

Return to monke and embrace the fact that the massive magical compute engine that is the LLM is a lot smarter than you.

Human Taste Still Mattered

You still need a human to call some shots. Maybe that won’t last forever, but for now that’s the case.

I decided what I wanted the game to feel like. I chose to make the playable loop shorter, and I decided that a rotating TM shop is more interesting than a giant static list. Little decisions that improve the UX for the player.

In another essay, AI Gave Everyone an Army, but Not Everyone is a General, I argued that AI made judgment more important and code less important. This project felt like a pure example of that.

Instead of soldiering to connect each tiny wire I got to decide what we were building, what counted as done, and when the output did not taste right yet.

Codex could write the BP merchant, and honestly, I probably couldn’t have without a lot of reading and frustration. I still had to decide whether the merchant should exist though.

That is not a smaller job so much as a different one.

“But Is It Maintainable?”

Probably the most common objection to AI coding is that it creates unmaintainable slop.

My answer after this project is: yes, it absolutely can (so can humans, and have you seen humans?).

The difference is that the marginal cost of another cleanup pass is tiny. If Codex writes something ugly, I can ask it to make it smaller. If it duplicates a pattern, I can ask it to consolidate. If it makes a brittle verifier, I can ask it to harden the verifier. If a rule matters, I can ask it to encode that rule in the repo so the next run inherits the lesson.

Where taste gets encoded shifts. Instead of being expressed only through the code you personally type, taste gets expressed through prompts, review comments, repo-local docs, skills, verifiers, and the willingness to reject output until it matches the thing in your head.

The OpenAI harness post calls this kind of work “entropy and garbage collection.” I like that framing because AI generates, humans direct, the harness remembers, and the cleanup loop keeps the project from becoming what my gen-Z colleagues call “dogwater”.

Why This Made Me More Optimistic

I came away from this project more bullish on AI, not less. Codex ain’t perfect, but it’s better than me!

There is a kind of project that used to die at the first wall of unfamiliarity. I could imagine Safari Gauntlet, describe the game loop, and tell whether the result was fun. But realistically, was I going to become fluent enough in Game Boy assembly, Polished Crystal internals, mGBA debugging, and emulator automation to build the whole thing myself by hand?

No, I have a kid, a job, and a wife who really wants me to go hang up those hooks in the garage that I said I’d do a few days ago.

The machines give us a lot of room to experiment in domains we’d otherwise never touch.

A small clip of the thing actually running. The point of the harness was to make this visible state testable, not just to produce assembly that technically compiled.

And you can actually play it!

The software game is changing fast, and I want to build more like this.