There’s a technique from the early days of game programming that I keep coming back to. It sounds wrong the first time you hear it. The idea is that the fastest way to draw a sprite isn’t to read its pixels and copy them to the screen. It’s to compile the sprite into machine code and then run it.
The sprite doesn’t get drawn. The sprite does the drawing.
That’s compiled sprites. And understanding why they worked requires understanding exactly what makes regular sprite drawing slow.
The Cost of Drawing a Sprite the Normal Way
When a game draws a software sprite onto the screen, the straightforward approach is a per-pixel loop. For each pixel in the sprite, check whether it’s transparent (using a mask), and if it isn’t, copy it to the corresponding location in video memory.
In Z80 assembly on a ZX Spectrum, a simple masked sprite routine looks roughly like this:
draw_sprite:
ld b, SPRITE_HEIGHT
row_loop:
push bc
ld b, SPRITE_WIDTH ; width in bytes
col_loop:
ld a, (de) ; read screen byte
and (hl) ; AND with mask byte (clears the sprite's footprint)
inc hl
or (hl) ; OR in the sprite data byte
inc hl
ld (de), a ; write back to screen
inc de
djnz col_loop
; advance DE to the next screen row...
pop bc
djnz row_loop
ret
(HL walks the sprite data, with each mask byte immediately followed by its pixel byte; DE points into screen memory. The row-advance step depends on the platform's screen layout and is elided here. The cycle counts below reflect loops like this one.)
Every byte costs you: three memory reads (screen, mask, and pixel data), an AND, an OR, a write back to video memory, pointer increments, and loop overhead. On a Z80 running at 3.5 MHz, that adds up. A 16x16 color sprite (4bpp) is 128 bytes of work, each byte covering two pixels. Each byte runs to about 60-75 T-states, so a full sprite is around 8,000-10,000 T-states. At 50 frames per second, the total frame budget is about 70,000 T-states, all game logic included. Three or four sprites on screen and you're already spending around half of your frame budget just on drawing.
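As a sanity check on those figures, a quick sketch in Python, using the rough per-byte costs quoted above:

```python
# Back-of-envelope frame-budget math for a 3.5 MHz Z80 at 50 fps.
CLOCK_HZ = 3_500_000
FPS = 50
frame_budget = CLOCK_HZ // FPS          # T-states available per frame

SPRITE_BYTES = 128                      # 16x16 pixels at 4 bpp
T_PER_BYTE = (60, 75)                   # approx. cost of one masked-blit byte

lo = SPRITE_BYTES * T_PER_BYTE[0]
hi = SPRITE_BYTES * T_PER_BYTE[1]
print(frame_budget)                     # 70000
print(lo, hi)                           # 7680 9600
print(f"{4 * hi / frame_budget:.0%}")   # four sprites: ~55% of the frame
```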
The mask check is unavoidable if you want transparency. You can’t just blit the raw bitmap, because most sprites aren’t rectangular; they have holes where the background should show through. So every pixel has to be tested.
Or does it?
Sprites as Programs
Here’s the key insight: the mask is static. The pixel data is static. For a given sprite frame, you already know, at load time, exactly which pixels are opaque and which are transparent. You don’t need to check at runtime. You can bake that knowledge into the code itself.
A compiled sprite takes the pixel data and converts it into machine code instructions that write each opaque pixel directly, and do nothing for transparent pixels. There’s no loop. No mask reads. No branches. Just a linear sequence of store instructions with pixel values burned directly into the code as immediate operands.
For a fragment of that same 16x16 sprite, the compiled version looks like this:
; These instructions are generated once at load time
ld a, 0xE7
ld (0xC040), a ; write pixel at row 0, col 2
ld a, 0xE7
ld (0xC041), a ; write pixel at row 0, col 3
ld a, 0xDB
ld (0xC042), a ; write pixel at row 0, col 4
; ... transparent pixels: nothing here at all
ld a, 0xFF
ld (0xC060), a ; write pixel at row 1, col 0
; ... and so on for every opaque pixel
ret
(The Z80 has no single instruction for storing an immediate value to an absolute address. The compiled code uses a two-instruction sequence: LD A, n loads the pixel value into the accumulator in 7 T-states, then LD (nn), A stores it to the screen address in 13 T-states. Each opaque byte costs 20 T-states.)
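The generator that produces this code is itself small. Here is a sketch in Python of a hypothetical load-time compiler, emitting raw Z80 opcode bytes (LD A,n encodes as 0x3E n; LD (nn),A as 0x32 followed by the address, low byte first):

```python
# Sketch of a load-time sprite compiler: turns pixel data into a
# Z80 routine of immediate stores, skipping transparent bytes entirely.
LD_A_N  = 0x3E   # LD A, n       (7 T-states)
LD_NN_A = 0x32   # LD (nn), A    (13 T-states)
RET     = 0xC9

TRANSPARENT = 0x00   # assumption: 0 marks a fully transparent byte

def compile_sprite(rows, screen_base, row_pitch):
    """rows: list of per-row byte lists. Returns Z80 machine code as bytes."""
    code = bytearray()
    for y, row in enumerate(rows):
        for x, value in enumerate(row):
            if value == TRANSPARENT:
                continue                         # costs nothing at runtime
            addr = screen_base + y * row_pitch + x
            code += bytes([LD_A_N, value])
            code += bytes([LD_NN_A, addr & 0xFF, addr >> 8])
    code.append(RET)
    return bytes(code)

sprite = [[0x00, 0xE7, 0xE7],
          [0xFF, 0x00, 0xDB]]
routine = compile_sprite(sprite, 0xC000, 0x20)
print(routine.hex(" "))
```

A refinement not shown here: when consecutive opaque bytes share a value, the generator can skip the redundant LD A,n and emit only the store, saving another 7 T-states per repeat.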
No reads. No branches. For a 16x16 sprite that’s 60% opaque (roughly 77 opaque bytes), that’s around 1,600 T-states. Compare that to the 8,000-10,000 T-states for the masked blit. You’ve cut drawing time by a factor of five or six, with the same visual result.
The transparent pixels cost literally nothing. They don’t exist in the compiled code.
Positioning the Sprite
There’s a catch. Those addresses are hardcoded into the instructions. If the sprite moves, every address in the compiled code needs to change.
The solution is self-modifying code. Before calling the compiled sprite routine, you patch the address operands inside the instructions to reflect the new screen position. On most 8-bit CPUs, the address is stored as a 16-bit value inline in the instruction encoding, so patching it is just two memory writes. You update the addresses once per frame, then execute the routine. The overhead is small compared to the savings.
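One way to organize the patching, sketched in Python over the generated byte stream (the fixup-list scheme here is illustrative, not any particular game's): the compiler records the offset of every address operand, and moving the sprite rewrites exactly those bytes.

```python
# Sketch: self-modifying repositioning of a compiled sprite.
LD_A_N, LD_NN_A, RET = 0x3E, 0x32, 0xC9

def compile_sprite(opaque, base, pitch):
    """opaque: list of (x, y, value). Returns (code, fixups), where
    fixups are the offsets of the 16-bit address operands inside code."""
    code, fixups = bytearray(), []
    for x, y, value in opaque:
        addr = base + y * pitch + x
        code += bytes([LD_A_N, value, LD_NN_A])
        fixups.append(len(code))                 # operand starts here
        code += bytes([addr & 0xFF, addr >> 8])
    code.append(RET)
    return code, fixups

def move_sprite(code, fixups, delta_bytes):
    """Shift every store address by a byte delta: two writes per store."""
    for off in fixups:
        addr = code[off] | (code[off + 1] << 8)
        addr = (addr + delta_bytes) & 0xFFFF
        code[off], code[off + 1] = addr & 0xFF, addr >> 8

code, fixups = compile_sprite([(1, 0, 0xE7), (0, 1, 0xFF)], 0xC000, 0x20)
move_sprite(code, fixups, 2)      # shift the sprite two bytes to the right
```

A flat delta covers horizontal byte moves on a linear layout; vertical movement on real hardware is the same idea, but the new addresses come from the platform's screen mapping rather than simple addition.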
Self-modifying code has a reputation for being fragile and hard to reason about. On the 6502 and Z80 it was a completely standard technique in performance-critical code, a practical tool rather than an ugly hack: code and data lived in the same flat memory, and there were no caches or pipelines to invalidate, so nothing penalized a program for rewriting its own instructions.
Where Compiled Sprites Made the Most Difference
The technique landed hardest on platforms where three things were true simultaneously: the CPU was slow, there was no hardware sprite support, and there was enough RAM to store the generated code.
ZX Spectrum and Amstrad CPC
The ZX Spectrum had no hardware sprites at all. Every character, bullet, and explosion had to be drawn in software by the Z80 at 3.5 MHz. The Amstrad CPC was in the same position. Developers on both platforms pushed compiled sprites extensively, because the alternative (masked blitting in a loop) simply wasn’t fast enough to get smooth animation of multiple characters.
The ZX Spectrum’s screen memory layout added another complication: the address mapping from pixel coordinates to memory addresses isn’t linear, so the compiled code also had to account for the Spectrum’s idiosyncratic screen layout when computing write addresses. That complexity was computed once at compile time and baked into the instruction stream, where it had no runtime cost.
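The Spectrum mapping is compact once written down. The byte holding pixel (x, y) lives at an address whose bits interleave the Y coordinate; this well-documented layout can be expressed directly (a Python sketch):

```python
# ZX Spectrum pixel-to-address mapping. The address bits are, in binary:
#   010 Y7 Y6 Y2 Y1 Y0 Y5 Y4 Y3 X4 X3 X2 X1 X0
def spectrum_addr(x, y):
    """Screen address of the byte holding pixel (x, y); 0<=x<256, 0<=y<192."""
    return (0x4000
            | (y & 0xC0) << 5     # Y7, Y6: which third of the screen
            | (y & 0x07) << 8     # Y2..Y0: scanline within a character row
            | (y & 0x38) << 2     # Y5..Y3: character row within the third
            | x >> 3)             # 8 pixels per byte

print(hex(spectrum_addr(0, 0)))   # 0x4000
print(hex(spectrum_addr(0, 1)))   # 0x4100 (next scanline is +256, not +32)
print(hex(spectrum_addr(0, 8)))   # 0x4020 (next character row)
```

A compiled sprite evaluates this function once per opaque byte at generation time; the masked-blit loop has to deal with its consequences every frame.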
Spectrum games that needed fast, smooth sprite movement often used pre-shifted sprites alongside compiled routines: pre-computing shifted copies of the sprite data for each of the eight possible pixel-column alignments, so the compiled code for each alignment could be generated up front. The result was eight compiled variants per sprite frame, one for each horizontal sub-byte position, selected at draw time. Memory-hungry, but fast.
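Pre-shifting is mechanical. A Python sketch of generating the eight alignments of one sprite row, each widened by a byte to catch the spill-over (each variant would then be fed to the sprite compiler):

```python
def preshift(row_bytes):
    """Return the 8 pixel-aligned variants of one sprite row.
    Each variant is one byte wider to hold bits shifted across the boundary."""
    variants = []
    for shift in range(8):
        # Treat the row as one big integer, shift right, re-split into bytes.
        value = int.from_bytes(bytes(row_bytes) + b"\x00", "big") >> shift
        variants.append(list(value.to_bytes(len(row_bytes) + 1, "big")))
    return variants

# One 8-pixel row, 11110000: by shift 5 the pattern straddles two bytes.
print(preshift([0xF0]))
```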
Apple II and Atari 8-bit
On 6502-based systems, the same principle applied. The Apple II’s hi-res graphics mode had its own addressing quirks (interleaved scan lines mapped non-linearly in memory), and masked blitting through that layout was particularly expensive. Tools like AsmGen, a 6502 code generator for sprites, fonts, and images, automated the generation of compiled sprite routines for Apple II and Atari 8-bit hardware. You fed it a bitmap; it generated 6502 assembly that drew the sprite at a given position with no interpretation overhead.
The 6502’s limited instruction set made self-modifying code especially attractive here. The chip’s only general-purpose way to store through a computed pointer, indirect indexed addressing via a zero-page pointer, added cycles to every access and tied up scarce zero-page locations. Patching the address operands directly in the instruction stream was faster than the alternatives.
Commodore 64
The C64 had hardware sprites (called MOBs, for Movable Object Blocks), which gave it eight independently positioned 24x21 hardware sprites with hardware collision detection. For a game that stayed within those eight sprites, the hardware was faster and simpler than any software technique. But eight sprites wasn’t always enough. Sprite multiplexing (switching which sprite data was active at different scanline positions via raster interrupts) could extend the count, but was complex to implement and still CPU-intensive. For larger or more complex characters that didn’t fit the hardware sprite constraints, compiled software sprites remained relevant.
The Tradeoff: Speed for Memory
Compiled sprites are fast precisely because they’re wasteful. A normal sprite is compact: one copy of the pixel data, one mask, interpreted at runtime. A compiled sprite is a custom program, one per sprite frame, per alignment, potentially per scale. The memory cost grows quickly.
On the ZX Spectrum, a 16x16 sprite with eight pre-shifted alignments meant eight compiled routines. Each routine might be several hundred bytes of Z80 code. For a game with a dozen sprite types and multiple animation frames each, the compiled code could easily exceed the sprite data it replaced by an order of magnitude.
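The arithmetic behind that claim, using the figures from earlier in the text (the same-size mask is an assumption):

```python
# Rough memory cost of compiling one 16x16 sprite frame.
OPAQUE_BYTES = 77            # ~60% of 128 data bytes
BYTES_PER_STORE = 5          # LD A,n (2 bytes) + LD (nn),A (3 bytes)
SHIFTS = 8                   # one compiled variant per pixel alignment

per_routine = OPAQUE_BYTES * BYTES_PER_STORE + 1   # +1 for the final RET
total = per_routine * SHIFTS
raw = 128 + 128              # pixel data plus a same-size mask, interpreted
print(per_routine, total, total // raw)            # 386 3088 12
```

Roughly 3 KB of generated code against 256 bytes of interpreted data: a bit over an order of magnitude, per sprite frame.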
The Allegro graphics library, which provided a compiled sprite implementation for DOS-era PC games, documented the tradeoff explicitly: compiled sprites used via get_compiled_sprite() and draw_compiled_sprite() were up to five times faster than regular draw_sprite() calls on 486-class hardware, but consumed substantially more memory and had no clipping support. Drawing outside the bitmap boundary didn’t clip gracefully; it corrupted adjacent memory. The recommendation was to use compiled sprites for sprites you knew would stay fully on-screen, and fall back to regular blitting for anything near the edges.
This is a familiar structure in constrained systems programming. You find a bottleneck, you trade one scarce resource for another, and you win as long as the resource you’re spending is less scarce than the one you’re saving. In the late 1980s, on machines with 48K or 64K of RAM running at 1-4 MHz, CPU cycles were the binding constraint and memory was slightly less scarce. Compiled sprites made exactly that trade.
The technique faded as hardware got faster. Once CPUs could blit a sprite in a handful of cycles, the overhead of a masked loop became negligible and the memory cost of compiled code stopped being worth paying. Hardware sprite engines and eventually GPU pipelines replaced both approaches entirely. But for a decade or so, on the machines where it mattered, compiling your sprites was one of the most effective things you could do.
The Hardware Answer
Compiled sprites were a software solution. Arcade hardware arrived at the same problem from a different direction.
Defender (1981) by Williams Electronics ran its 6809 CPU at 1 MHz with no dedicated sprite hardware. Every enemy, bullet, and explosion had to be drawn by the CPU directly to the frame buffer. The game managed it, but only just. Its successor, Robotron: 2084 (1982), added two dedicated blitter chips specifically to offload pixel movement from the CPU, one of the earliest instances of dedicated hardware acceleration for sprite blitting in an arcade machine. The blitter provided roughly 1 MB/second of pixel throughput. Even so, Robotron couldn’t move all its enemies in a single frame and had to distribute updates across multiple frames.
The blitter and the compiled sprite were answers to the same question: how do you move pixels fast enough when the CPU can’t do it alone? One moved the interpretation cost into silicon. The other moved it into load-time code generation. Both worked because someone traced the bottleneck to exactly where pixels were being pushed, and refused to pay that cost at runtime.
What the Constraint Required
The Spectrum had no hardware sprites. No blitter. No abstraction layer between the developer and a Z80 running at 3.5 MHz. Every moving object on screen was the programmer’s problem.
Compiled sprites are the record of someone taking that problem seriously enough to solve it exactly. Not approximately, not with an acceptable slowdown, but precisely: trace the bottleneck, identify what’s static, eliminate the redundant work.
That’s how you get to the thing I described at the start: the sprite doesn’t get drawn, it does the drawing. It’s only possible because someone asked what the masked blit was actually computing on every frame, and found that most of it was already known at load time.
That the resulting technique looks like a compiler is not a coincidence. It is a compiler, built by hand, for a single purpose, out of necessity. The platforms it ran on are obsolete. The question that produced it isn’t: what are you interpreting at runtime that you already know?