Years ago I was working at an embedded software company, profiling a 2D graphics renderer running on an ARM processor. We were trying to squeeze more frames per second out of a UI that frankly didn’t look that complex: a few icons, some solid-color panels, the occasional gradient. Nothing that should have been slow.
The profiler told a different story.
The renderer was spending most of its time not computing. Just reading. Fetching pixels out of external DRAM, one scanline at a time, pushing them through compositing, writing them back. The CPU sat there waiting. We weren’t compute-bound. We were memory-bound. Almost every optimization instinct I had from years of server-side work was pointing me in the wrong direction.
That was my introduction to a class of rendering techniques I hadn’t needed before: strategies built around the reality that on memory-constrained embedded devices, moving bytes is expensive. Spending CPU cycles to move fewer bytes is often the right trade.
This post walks through several of those techniques (dirty rectangles, tile-based rendering, and run-length encoded textures) and what I learned about when and why they work.
The Real Bottleneck: Memory Bandwidth
Modern server CPUs have enormous caches and high-bandwidth memory subsystems. Embedded processors (think ARM Cortex-M or Cortex-A running at modest clock speeds with external DRAM) often don’t. A typical setup might have a small amount of fast on-chip SRAM (kilobytes, sometimes a few hundred KB) and a much larger, much slower external flash or DRAM for storing assets and framebuffers.
The bandwidth gap is stark. On-chip SRAM might deliver data at close to CPU speed. External DRAM, accessed over a narrow bus, might saturate at 50–100 MB/s for devices with a parallel bus interface, and far slower for SPI-connected displays. A 320×240 display at 16bpp requires moving about 150KB per frame. At 30fps, that’s 4.5 MB/s just for a single full-screen blit. Add compositing layers and asset reads on top of that, and you’re fighting the bus constantly.
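Those numbers are worth recomputing for your own panel. A minimal sketch of the arithmetic (the 320×240, 16bpp, 30fps values are just the example above; the function names are mine, not from any renderer):

```c
#include <stdint.h>

// Bytes that must cross the bus for one full-screen update.
static uint32_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

// Sustained bandwidth (bytes/s) needed to redraw every pixel at a given
// frame rate -- before any compositing or asset reads are counted.
static uint32_t full_redraw_bandwidth(uint32_t width, uint32_t height,
                                      uint32_t bytes_per_pixel, uint32_t fps) {
    return frame_bytes(width, height, bytes_per_pixel) * fps;
}
```

frame_bytes(320, 240, 2) comes out to 153,600 bytes, the roughly 150KB above; full_redraw_bandwidth(320, 240, 2, 30) is about 4.6 MB/s, matching the ~4.5 MB/s figure depending on rounding.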
The CPU is not your bottleneck. The memory bus is.
Here’s what a basic blit loop looks like:
#include <stdint.h>
#include <string.h>

// Strides are in pixels, not bytes (dst and src are uint16_t pointers).
void blit(uint16_t *dst, int dst_stride,
          const uint16_t *src, int src_stride,
          int width, int height)
{
    for (int y = 0; y < height; y++) {
        memcpy(dst, src, width * sizeof(uint16_t));
        dst += dst_stride;
        src += src_stride;
    }
}
This is correct and simple. On a desktop machine it’s basically free. On an embedded device reading src from external DRAM, every memcpy stalls waiting for the bus. The techniques below are all, at their core, strategies for making this loop touch external memory less.
Technique 1: Dirty Rectangles
The first question to ask before blitting anything is: does this region actually need to be redrawn?
Most UI screens are mostly static most of the time. A clock widget updates once per second. A progress bar moves incrementally. The background doesn’t move at all. If you redraw the entire framebuffer every frame regardless, you’re spending memory bandwidth on pixels that haven’t changed.
Dirty rectangle tracking solves this. You maintain a list of regions that have been marked as needing a redraw: a widget changed, an animation stepped forward, or a touch event triggered a visual update. At render time, you only composite and blit those regions.
#define MAX_DIRTY_RECTS 32

typedef struct { int x, y, w, h; } Rect;

typedef struct {
    Rect rects[MAX_DIRTY_RECTS];
    int count;
} DirtyList;

void mark_dirty(DirtyList *list, int x, int y, int w, int h) {
    if (list->count < MAX_DIRTY_RECTS) {
        list->rects[list->count++] = (Rect){x, y, w, h};
    }
}

void flush_dirty(DirtyList *list, Surface *screen) {
    for (int i = 0; i < list->count; i++) {
        Rect *r = &list->rects[i];
        composite_and_blit(screen, r->x, r->y, r->w, r->h);
    }
    list->count = 0;
}
When a widget redraws itself, it calls mark_dirty. At the end of the frame, flush_dirty composites and blits only those regions to the display. Note that mark_dirty silently drops rects once the list is full; a production implementation should merge rects or fall back to a full-screen blit on overflow rather than miss the redraw.
The savings are substantial in typical UI workloads. If only 10% of the screen changes each frame (a common case for static menus and dashboards), dirty rectangles reduce your per-frame memory bandwidth by roughly 90%. The CPU does a little bookkeeping; the bus does a lot less work.
There are edge cases. Overlapping dirty rects need to be merged or you’ll composite some regions twice. Rapidly animating content that covers most of the screen defeats the optimization entirely; at that point you’re back to full-frame rendering with overhead. The standard approach is to merge overlapping rects into a minimal covering set, and fall back to a full-screen blit when the dirty region covers more than some threshold (say, 70–80% of the display area, where the merge and tracking overhead isn’t worth the bandwidth saved).
// Merge two rects into their bounding box
Rect merge_rects(Rect a, Rect b) {
    int x1 = a.x < b.x ? a.x : b.x;
    int y1 = a.y < b.y ? a.y : b.y;
    int x2 = (a.x + a.w) > (b.x + b.w) ? (a.x + a.w) : (b.x + b.w);
    int y2 = (a.y + a.h) > (b.y + b.h) ? (a.y + a.h) : (b.y + b.h);
    return (Rect){x1, y1, x2 - x1, y2 - y1};
}
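One way to wire this into the flush path is to coalesce the whole list into a single bounding box and check how much of the display it covers. A sketch under the same Rect definition, repeated here so it stands alone; the 75% threshold and the name should_full_redraw are illustrative, not from the original renderer:

```c
typedef struct { int x, y, w, h; } Rect; // same Rect as above

// Decide whether to abandon per-rect blits for this frame: merge all dirty
// rects into one bounding box and compare its area to the display area.
// Past the threshold, a single full-screen blit is cheaper than tracking.
int should_full_redraw(const Rect *rects, int count,
                       int screen_w, int screen_h) {
    if (count == 0) return 0;
    int x1 = rects[0].x, y1 = rects[0].y;
    int x2 = rects[0].x + rects[0].w, y2 = rects[0].y + rects[0].h;
    for (int i = 1; i < count; i++) {
        if (rects[i].x < x1) x1 = rects[i].x;
        if (rects[i].y < y1) y1 = rects[i].y;
        if (rects[i].x + rects[i].w > x2) x2 = rects[i].x + rects[i].w;
        if (rects[i].y + rects[i].h > y2) y2 = rects[i].y + rects[i].h;
    }
    long covered = (long)(x2 - x1) * (y2 - y1);
    long total = (long)screen_w * screen_h;
    return covered * 4 >= total * 3; // covered/total >= 75%
}
```

Using the bounding box rather than the exact union overestimates coverage for scattered rects, which biases toward the full-screen fallback; that's usually the safe direction on a slow bus.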
Simple, but effective. Dirty rectangles are one of those techniques that feel obvious in retrospect but require deliberate design up front: every component in your UI stack needs to know how to report its own damage region.
Technique 2: Tile-Based Rendering
Dirty rectangles reduce how much of the screen you touch. Tile-based rendering is about changing how you touch it.
The idea is to divide the screen into small, fixed-size tiles, typically 16×16 or 32×32 pixels. Instead of rendering the entire scene layer by layer, you render one tile at a time, fully compositing it in on-chip SRAM before writing the result to the framebuffer.
Why does this help? Cache locality. A 16×16 tile of RGB565 pixels is 512 bytes, small enough to fit entirely in L1 cache or on-chip SRAM on most embedded processors. When you composite multiple layers within a tile, you’re reading and writing fast local memory. Only when the tile is complete do you write it out to the slow external framebuffer, in one sequential burst.
#define TILE_SIZE 16

void render_scene_tiled(Surface *screen, Layer *layers, int layer_count) {
    int tiles_x = (screen->width + TILE_SIZE - 1) / TILE_SIZE;
    int tiles_y = (screen->height + TILE_SIZE - 1) / TILE_SIZE;
    uint16_t tile[TILE_SIZE * TILE_SIZE]; // lives in on-chip SRAM

    for (int ty = 0; ty < tiles_y; ty++) {
        for (int tx = 0; tx < tiles_x; tx++) {
            int px = tx * TILE_SIZE;
            int py = ty * TILE_SIZE;

            // Clear tile
            memset(tile, 0, sizeof(tile));

            // Composite all layers into this tile (fast: SRAM only)
            for (int l = 0; l < layer_count; l++) {
                composite_layer_into_tile(tile, &layers[l], px, py, TILE_SIZE);
            }

            // Write completed tile to external framebuffer (one burst write)
            blit_tile_to_framebuffer(screen, tile, px, py, TILE_SIZE);
        }
    }
}
The contrast with a naive layered approach is worth spelling out. Without tiling, compositing N layers means N separate passes over the full framebuffer; each pass reads and writes every pixel through slow external memory. With tiling, you make N passes over 512 bytes of SRAM, then one final write to external memory. Same pixel math; radically different memory access pattern.
This is not a new idea. Tile-based rendering architectures (PowerVR’s TBDR, ARM Mali’s TBIMR, Qualcomm’s Adreno variants) are the dominant design in mobile GPUs, all built on exactly this principle at the hardware level. The GPU defers shading until it knows which fragments are visible, processes them tile by tile in fast on-chip memory, and writes the completed tiles out once. What we’re doing in software on a CPU is the same concept at a smaller scale.
You can combine dirty rectangles and tile-based rendering: maintain a dirty list, but instead of blitting arbitrary rectangles, quantize them to tile boundaries and only process dirty tiles. This gives you both the “skip unchanged regions” benefit of dirty rects and the “keep working set in fast memory” benefit of tiling.
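A sketch of that combination: dirty rects get quantized to the tile grid and flag entries in a per-tile dirty map, and the render loop skips any tile whose flag is clear. The TileMap type and fixed display dimensions here are hypothetical additions for illustration, not part of the earlier listings:

```c
#define TILE_SIZE 16   // same tile size as above
#define MAX_TILES_X 20 // 320 / 16
#define MAX_TILES_Y 15 // 240 / 16

// One flag per tile, set when any dirty rect touches that tile.
typedef struct {
    unsigned char dirty[MAX_TILES_Y][MAX_TILES_X];
} TileMap;

// Quantize a dirty rectangle to tile boundaries and flag the tiles it covers.
void mark_dirty_tiles(TileMap *map, int x, int y, int w, int h) {
    int tx0 = x / TILE_SIZE;
    int ty0 = y / TILE_SIZE;
    int tx1 = (x + w - 1) / TILE_SIZE; // inclusive index of the last tile
    int ty1 = (y + h - 1) / TILE_SIZE;

    // Clamp to the tile grid so oversized rects stay in bounds.
    if (tx0 < 0) tx0 = 0;
    if (ty0 < 0) ty0 = 0;
    if (tx1 >= MAX_TILES_X) tx1 = MAX_TILES_X - 1;
    if (ty1 >= MAX_TILES_Y) ty1 = MAX_TILES_Y - 1;

    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            map->dirty[ty][tx] = 1;
}
```

The tiled render loop then tests map->dirty[ty][tx] before compositing each tile and zeroes the map once the frame is flushed.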
Technique 3: Run-Length Encoded Textures
Here’s the part that surprised me most when I first measured the results.
Conventional wisdom says compression adds overhead. You spend CPU cycles decoding before you can use the data. On a constrained device with limited CPU headroom, that sounds like exactly the wrong trade-off.
It turns out conventional wisdom is wrong, at least in specific, common cases. And understanding why teaches you something important about where the real cost in rendering actually lives.
Run-length encoding is about as simple as compression gets. Instead of storing every pixel value individually, you store runs: a count followed by a color value. A scanline of 320 pixels that’s mostly a solid background color might encode as a handful of runs rather than 320 individual values.
typedef struct {
    uint16_t count;
    uint16_t color;
} RLERun;
A more realistic encoding handles both solid runs and raw uncompressed spans, needed for image regions with high pixel variation like photos or icons:
typedef enum { RUN_SOLID, RUN_RAW } RunType;

typedef struct {
    RunType type;
    uint16_t count;
    union {
        uint16_t color;         // for RUN_SOLID
        const uint16_t *pixels; // for RUN_RAW (pointer into asset data)
    };
} RLERun;
void decode_rle_scanline(uint16_t *dst, const RLERun *runs, int run_count) {
    for (int i = 0; i < run_count; i++) {
        if (runs[i].type == RUN_SOLID) {
            uint16_t color = runs[i].color;
            uint16_t count = runs[i].count;
            for (int j = 0; j < count; j++) *dst++ = color;
        } else {
            memcpy(dst, runs[i].pixels, runs[i].count * sizeof(uint16_t));
            dst += runs[i].count;
        }
    }
}
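Only the decode side runs on the device; encoding happens offline in the asset pipeline. Still, a matching encoder makes the format concrete. A hedged sketch, repeating the RLERun definition so it compiles standalone; MIN_SOLID_RUN is a tuning choice of mine, since very short repeats don't pay for their descriptor:

```c
#include <stdint.h>

typedef enum { RUN_SOLID, RUN_RAW } RunType;

typedef struct {
    RunType type;
    uint16_t count;
    union {
        uint16_t color;         // for RUN_SOLID
        const uint16_t *pixels; // for RUN_RAW
    };
} RLERun;

#define MIN_SOLID_RUN 4 // repeats shorter than this stay in raw spans

// Encode one scanline; returns the number of runs written to out.
// Raw runs point back into src, so src must outlive the encoded form.
int encode_rle_scanline(const uint16_t *src, int width, RLERun *out) {
    int n = 0, i = 0;
    while (i < width) {
        // Measure the repeat length starting at i.
        int run = 1;
        while (i + run < width && src[i + run] == src[i]) run++;
        if (run >= MIN_SOLID_RUN) {
            out[n].type = RUN_SOLID;
            out[n].count = (uint16_t)run;
            out[n].color = src[i];
            n++;
            i += run;
        } else {
            // Accumulate a raw span until the next long repeat (or the end).
            int start = i;
            while (i < width) {
                int r = 1;
                while (i + r < width && src[i + r] == src[i]) r++;
                if (r >= MIN_SOLID_RUN) break;
                i += r;
            }
            out[n].type = RUN_RAW;
            out[n].count = (uint16_t)(i - start);
            out[n].pixels = &src[start];
            n++;
        }
    }
    return n;
}
```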
Now compare the memory access patterns for a typical UI scanline (say, 320 pixels at 16bpp, where 75% is solid background color). For this illustration, using the simple solid-run descriptor (4 bytes: count + color):
Uncompressed blit:
Read from external DRAM: 320 pixels × 2 bytes = 640 bytes
Write to framebuffer: 320 pixels × 2 bytes = 640 bytes
Total external memory: 1,280 bytes
RLE decoded blit:
Read run descriptors: ~8 runs × 4 bytes = 32 bytes (from DRAM)
Read raw pixel spans: 80 pixels × 2 bytes = 160 bytes (from DRAM)
Write to framebuffer: 320 pixels × 2 bytes = 640 bytes
Total external memory: 832 bytes
On a screen where most content is solid fills (backgrounds, buttons, panels), the read side shrinks dramatically. You’re spending more CPU instructions on the decode loop, but those instructions run on fast on-chip registers and caches. The expensive operation, the external memory read, gets much smaller.
The crossover point depends on your bus speed relative to your CPU speed. The slower your memory bus compared to your CPU, the more aggressively you want to compress. On a device where the CPU can execute hundreds of instructions in the time it takes to fetch a cache line from external DRAM, the math strongly favors compression.
Profile First. Then Optimize Exactly One Thing.
There’s a principle running underneath all of these techniques that matters more than any of them individually.
Michael Abrash wrote about it in Zen of Assembly Language in 1990, and it’s aged remarkably well. The core argument: the most important optimization is finding the right thing to optimize. Not the thing you think is slow. The thing that’s actually slow, as revealed by measurement. Abrash called it “knowing where you are.” Before you touch anything, understand exactly what the CPU is doing and why.
The concrete implication for graphics work: don’t optimize your compositing math until you’ve proven that compositing math is your bottleneck. Don’t unroll your blit loop until you’ve confirmed the loop is what’s stalling. In my case, I initially suspected the alpha blending arithmetic, which seemed like a reasonable guess. The profiler said otherwise. The blending math was fine. The memory reads were the problem.
This sounds obvious. It isn’t, in practice. There’s a strong pull toward optimizing the code you understand best, or the code that looks expensive, or the code you just wrote. Abrash’s discipline is to resist that pull entirely until you have data.
The practical workflow:
Measure first. Get a profiler attached, or instrument your render loop manually with cycle counters if you have to. You need to know where time is actually going.
Identify the single biggest bottleneck. Not the second biggest. Not a list of things to improve. The one thing that, if fixed, would move the number most.
Fix only that thing. Keep everything else identical. This is harder than it sounds. Fixing one thing has a way of suggesting adjacent improvements. Resist.
Measure again. Confirm the change had the expected effect. If it didn’t, you misidentified the bottleneck and need to go back to step one.
Repeat. After each fix, the bottleneck shifts. What was second-worst is now worst. The optimization target changes.
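For step one, when no profiler is available, the manual cycle-counter instrumentation can be as simple as per-stage accumulators. A sketch of the bookkeeping only; the timestamps come from whatever your platform offers (the DWT cycle counter on a Cortex-M, a free-running hardware timer elsewhere), and the stage names are illustrative:

```c
#include <stdint.h>

// Per-stage cycle accounting for the render loop. Caller reads the cycle
// counter around each stage and hands the timestamps in.
enum { STAGE_COMPOSITE, STAGE_BLIT, STAGE_DECODE, STAGE_COUNT };

typedef struct {
    uint32_t total[STAGE_COUNT]; // accumulated cycles per stage
    uint32_t frames;             // number of frames measured
} Profiler;

// Record one timed interval. Unsigned subtraction tolerates counter
// wraparound as long as the interval is shorter than one counter period.
static void prof_add(Profiler *p, int stage, uint32_t start, uint32_t end) {
    p->total[stage] += end - start;
}

// Average cycles per frame for a stage -- the number that identifies the
// single biggest bottleneck.
static uint32_t prof_avg(const Profiler *p, int stage) {
    return p->frames ? p->total[stage] / p->frames : 0;
}
```

Dump the averages over a serial port every few hundred frames and you have a crude but honest answer to "where is the time actually going."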
On the embedded renderer I was working on, this loop played out across several iterations. First pass: eliminate redundant full-screen blits with dirty rectangles. Second pass: tile the compositor to improve cache behavior. Third pass: profile again, find that certain asset types (large solid-color UI elements) were still hammering the read bus. That’s when RLE encoding those specific assets made sense.
None of these were obvious in advance. Each one revealed the next. That’s the point.
The Skia source code is a useful reference for what this looks like at scale. Its blitter infrastructure (SkRasterPipeline, the platform-specific SkBlitter implementations) separates the concern of what pixels to produce from how to move them efficiently. It’s a large, mature codebase, but the performance-critical paths are not uniformly complex; they’re complex in specific, deliberate places. Scan through src/core/SkBlitter.cpp or the SIMD-optimized blitters in src/opts/ and you’ll find code that looks almost mundane next to the intrinsic-heavy hotspots. The complexity is concentrated exactly where profiling said it had to be, and nowhere else.
That’s what Abrash was pointing at. Not “write fast code.” Write simple code, measure it, then make it fast where the measurements tell you to.
The Constraint Is the Optimization
Back to that profiler output. The renderer wasn’t slow because the code was poorly written. It was slow because it was written for the wrong machine, implicitly a machine with fast memory and cheap reads. Transplanted onto an ARM with a narrow bus and slow external DRAM, perfectly reasonable code became the bottleneck.
What I came away with was a different way of thinking about where performance lives. On memory-constrained devices, the question isn’t “how do I make this computation faster?” It’s “how do I make this computation touch slow memory less?” Sometimes that means skipping work entirely (dirty rectangles). Sometimes it means reordering work to stay in fast memory (tile-based rendering). Sometimes it means spending more CPU to read fewer bytes (RLE). The right answer depends on your specific hardware, and you won’t know which answer is right until you measure.
These techniques aren’t exotic. Dirty rectangles are in every mature UI toolkit. Tile-based rendering is the dominant architecture in mobile GPU hardware. RLE and its more sophisticated descendants (the block compression formats used in game engines like BC1, ASTC, and ETC2) are standard practice in texture pipelines. They’re common because they address a constraint that’s common: memory bandwidth is finite, and it’s often the first thing you run out of.
If you’re working on rendering performance and you haven’t profiled yet, start there. The bottleneck is probably not where you think it is. And once you find it, the fix might look less like clever code and more like a different way of counting bytes.