memory-management - freeCodeCamp.org

ITCM vs DTCM vs DDR: Embedded Memory Types Explained [Full Handbook]

Nikheel Vishwas Savant — Wed, 06 May 2026 18:43:08 +0000

Most embedded engineers hit this problem early on: the same code on the same processor runs fast in one scenario and surprisingly slow in another. The culprit is almost always where the code and data are stored in memory.

Desktop and server processors hide memory latency behind multi-level caches. Many embedded processors, especially ARM Cortex-M and Cortex-R based chips, take a different approach. They give you direct control over multiple memory regions, each with very different performance characteristics.

This handbook covers what ITCM, DTCM, and DDR memory are, how they differ, how to place code and data in the right region, and how to profile and monitor firmware memory usage over time.

Prerequisites
Why Embedded Memory Architecture Matters
What is ITCM (Instruction Tightly-Coupled Memory)?
What is DTCM (Data Tightly-Coupled Memory)?
What is DDR (Double Data Rate) Memory?
How They Compare: A Side-by-Side Overview
How to Decide Where to Place Code and Data
How the Linker Script Controls Memory Placement
Common Mistakes to Avoid
Performance Comparison With Real Numbers
How TCM Affects Power Consumption
How to Profile Memory Usage
Summary

Prerequisites

To get the most from this guide, you should have a basic understanding of C programming, including pointers, structs, and the difference between static and local variables.

Some familiarity with embedded development concepts like compiling, linking, and flashing firmware to a target board will also help.

Finally, a general sense of how a CPU fetches and executes instructions will make the performance discussions easier to follow.

You don't need to be an expert in any of these. The article explains each concept as it comes up.

Why Embedded Memory Architecture Matters

A modern embedded processor might be clocked at 400 MHz or higher. It can execute an instruction every few nanoseconds.

But when it needs to fetch that instruction from memory, or read a variable, the memory might not keep up. The processor ends up stalling, waiting for the memory subsystem to deliver the data it asked for. Those stall cycles add up fast.

On a desktop computer, hardware caches (L1, L2, L3) sit between the CPU and main memory, automatically keeping recently-used data nearby. The cache hardware decides what to keep and what to evict, and it does this transparently. The programmer rarely needs to think about it, and performance is generally good enough without manual intervention.

On many embedded processors, the situation is different. Instead of relying solely on hardware caches (some of these chips have caches as well, some don't), you typically get three distinct memory regions, each attached to the CPU in a different way.

Memory Type	What It Stores	Access Speed	Typical Size
ITCM	Instructions (executable code)	Single-cycle (deterministic)	512 KB to 2 MB
DTCM	Data (variables, stacks, buffers)	Single-cycle (deterministic)	512 KB to 1.5 MB
DDR	Everything else	Multi-cycle (variable)	4 MB to several GB

The table above shows the three memory types you'll encounter on a typical ARM Cortex-M or Cortex-R-based embedded system. ITCM and DTCM are fast but small. DDR is slow but large.

The "deterministic" label on TCM means that the access time is always the same, every single time, regardless of what accessed that memory before or what else is happening on the chip. The "variable" label on DDR means the access time can change depending on the internal state of the DDR chip and its controller.

You, the developer, control which region each piece of your firmware lives in. The compiler and linker don't make these decisions automatically. You specify them through section attributes in your source code and placement rules in your linker script. Getting this right is often the difference between firmware that meets its real-time deadlines and firmware that misses them.

What is ITCM (Instruction Tightly-Coupled Memory)?

ITCM stands for Instruction Tightly-Coupled Memory.

The "Instruction" part means this memory is used for storing executable machine code, the compiled instructions your CPU fetches and runs.

The "Tightly-Coupled" part means the memory is physically located on the same silicon die as the CPU core, connected through a dedicated bus with no arbitration or contention. There's no shared bus to compete with. There's no cache hierarchy to traverse. The CPU asks for an instruction, and ITCM delivers it directly, through a private path that nothing else on the chip can interfere with.

The CPU can fetch an instruction from ITCM in a single clock cycle, every time. This access time is both fast and deterministic. It doesn't vary based on access patterns, recent history, or what else is happening on the bus.

This determinism is just as important as the raw speed, because it makes worst-case execution time analysis possible. In safety-critical systems, you need to be able to prove that a function will always complete within a certain number of cycles. ITCM makes that proof much simpler.

Why Single-Cycle Fetch Matters

Every line of C code compiles down to one or more machine instructions. Each of those instructions must be fetched from memory before the CPU can decode and execute it. This fetch step happens for every single instruction, so even small per-instruction delays compound rapidly in loops and frequently-called functions.

Consider a loop that runs 1,000,000 iterations, where each iteration involves 10 instruction fetches. That's 10 million fetches total.

ITCM:  10,000,000 fetches x 1 cycle  = 10,000,000 cycles
DDR:   10,000,000 fetches x 8 cycles = 80,000,000 cycles

Difference: 70,000,000 cycles
At 400 MHz: 70,000,000 / 400,000,000 = 0.175 seconds = 175 ms

This calculation compares the total cycle count when the same loop runs from ITCM versus DDR. With ITCM, each fetch takes 1 cycle, so 10 million fetches cost 10 million cycles.

With DDR, each fetch takes 8 cycles (a conservative average), so the same 10 million fetches cost 80 million cycles. The difference is 70 million cycles, which at 400 MHz translates to 175 milliseconds.
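The arithmetic above generalizes to any fetch count and clock rate. Here is a minimal sketch of it as a helper function. The name extra_fetch_time_ms and its parameters are illustrative, not part of any library, and the cycle counts you pass in are your own estimates (the 8-cycle DDR figure is the same assumed average used above).

```c
#include <stdint.h>

/* Extra wall-clock time, in milliseconds, spent on instruction fetches when
 * code runs from a slower memory instead of a faster one.
 *   fetches      number of instruction fetches
 *   fast_cycles  cycles per fetch from the fast memory (1 for ITCM)
 *   slow_cycles  assumed average cycles per fetch from the slow memory
 *   cpu_hz       core clock in Hz */
double extra_fetch_time_ms(uint64_t fetches, uint32_t fast_cycles,
                           uint32_t slow_cycles, uint32_t cpu_hz)
{
    uint64_t extra_cycles = fetches * (uint64_t)(slow_cycles - fast_cycles);
    return (double)extra_cycles * 1000.0 / (double)cpu_hz;
}
```

Calling extra_fetch_time_ms(10000000, 1, 8, 400000000) reproduces the 175 ms figure from the calculation above.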

In a real-time system running a control loop at 1 kHz (one iteration every 1 ms), 175 ms of extra latency spread across your processing isn't a minor inconvenience. It can cause the system to miss deadlines, drop sensor readings, or produce incorrect outputs. In motor control applications, a missed deadline can mean physical damage to the hardware. In audio processing, it means audible glitches. The cost of slow instruction fetch isn't abstract.

What Should Go in ITCM?

Because ITCM is small (typically 512 KB to 2 MB), you can't fit your entire firmware in it. You need to be selective about what earns a spot.

Interrupt Service Routines (ISRs) are the highest-priority candidates. ISRs run in response to hardware events like a timer tick, an ADC conversion completing, or a communication peripheral receiving data. They need to execute and return as quickly as possible.

A slow ISR delays all lower-priority interrupts and can cause missed events. If your ISR fetches its instructions from DDR, each fetch takes multiple cycles, and the total ISR execution time increases by a factor that could push it past its deadline.

Placing ISRs in ITCM ensures they run at maximum speed with completely predictable timing.

Real-time processing functions are the next priority. These include signal processing routines, motor control loops, audio processing pipelines, and any function that runs at a fixed rate and must complete within a strict time budget.

If your audio codec callback needs to process a buffer of samples every 5 ms, every instruction fetch cycle counts. Placing these functions in ITCM gives you the maximum amount of CPU time for actual computation rather than waiting on memory.

Inner loops of your main processing pipeline also benefit significantly from ITCM placement. If your firmware spends 80% of its time in a handful of functions, those functions should be in ITCM. Profiling tools and the linker map file (covered later in this article) can help you identify which functions are the hottest.

Functions that require deterministic timing belong in ITCM even if they aren't the fastest path. ITCM access time doesn't vary, which makes timing analysis predictable. This matters for safety-critical systems (automotive, medical, aerospace) where you need to prove worst-case execution times to a certification authority.

How to Place a Function in ITCM

You use a GCC section attribute to tell the compiler that a function belongs in a specific memory section. Then, in your linker script, you map that section to the ITCM memory region.

#include <stdint.h>

void process_sample(uint32_t reading);  /* defined elsewhere in the firmware */

__attribute__((section(".itcm_text")))
void my_critical_isr(void) {
    volatile uint32_t *sensor_reg = (volatile uint32_t *)0x40001000;
    uint32_t reading = *sensor_reg;
    process_sample(reading);
}

In this code, the __attribute__((section(".itcm_text"))) directive tells the compiler to emit this function's compiled machine code into a section called .itcm_text instead of the default .text section. The function itself reads a sensor register at the memory-mapped address 0x40001000, stores the result in a local variable, and passes it to process_sample() for further processing. The volatile keyword tells the compiler that this memory address can change at any time (because it is a hardware register), so the compiler must not optimize away the read.

On its own, the section attribute doesn't determine where the function ends up in physical memory. It just tells the compiler to label the function's code with a specific section name.

The actual memory placement is the linker script's job, which maps .itcm_text to the ITCM address range. We'll cover the linker script in detail in a later section.
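In a larger codebase, the attribute is often wrapped in a macro so the section name lives in one place. The sketch below shows one way to do that. ITCM_CODE and scale_sample are hypothetical names, and the noinline attribute is an extra precaution not shown above: if the compiler inlines an ITCM function into a caller that lives in DDR, the inlined copy is fetched from DDR along with the rest of the caller.

```c
#include <stdint.h>

/* Hypothetical convenience macro. The section attribute is applied only for
 * GCC-compatible compilers targeting ELF. Elsewhere it expands to nothing,
 * so the same source still builds (for example, in host-side unit tests). */
#if defined(__GNUC__) && defined(__ELF__)
#define ITCM_CODE __attribute__((section(".itcm_text"), noinline))
#else
#define ITCM_CODE
#endif

/* Illustrative hot-path function: applies a fixed-point gain in Q8 format,
 * where gain_q8 = 256 means a gain of 1.0. */
ITCM_CODE int32_t scale_sample(int32_t sample, int32_t gain_q8)
{
    return (sample * gain_q8) / 256;
}
```

As with the raw attribute, the macro only labels the function. It still relies on the linker script mapping .itcm_text to the ITCM address range.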

How Much ITCM is Typical?

Here is a real-world memory profile from an embedded project, to give you a sense of scale:

Memory region         Used Size  Region Size  %age Used
            ITCM:      570936 B         2 MB     27.22%
            DTCM:      727240 B    1572608 B     46.24%
             DDR:      622915 B         4 MB     14.85%

This is the format of the summary that GNU ld prints at the end of a link when you pass --print-memory-usage (for example, -Wl,--print-memory-usage through GCC). It shows three memory regions and how much of each one is used by the compiled firmware.

ITCM has 2 MB available and the firmware is using about 557 KB (27.22%), which leaves good headroom for growth. DTCM has about 1.5 MB available and is using about 710 KB (46.24%). DDR has 4 MB available and is using about 608 KB (14.85%).

In practice, you want to keep ITCM utilization below 80-85% to leave room for future features and library updates. If utilization climbs above 90%, you're one feature addition away from a build failure, and you should proactively move less-critical code to DDR.
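These thresholds are easy to encode as a check, for example in a host-side tool that reads the linker's usage numbers. The sketch below is one possible version. The names budget_status_t and region_budget are made up for illustration, and the 80% and 90% cut-offs are simply the rule of thumb from the paragraph above.

```c
#include <stdint.h>

typedef enum { BUDGET_OK, BUDGET_WARN, BUDGET_CRITICAL } budget_status_t;

/* Classify a memory region's utilization: warn above 80%, and treat
 * anything above 90% (or an invalid input) as critical. */
budget_status_t region_budget(uint32_t used_bytes, uint32_t region_bytes)
{
    if (region_bytes == 0 || used_bytes > region_bytes) {
        return BUDGET_CRITICAL;   /* invalid size or already overflowing */
    }
    /* Percentage scaled by 100 (2722 means 27.22%), integer math only. */
    uint64_t pct_x100 = (uint64_t)used_bytes * 10000u / region_bytes;
    if (pct_x100 > 9000u) return BUDGET_CRITICAL;
    if (pct_x100 > 8000u) return BUDGET_WARN;
    return BUDGET_OK;
}
```

With the ITCM numbers from the profile above, region_budget(570936, 2097152) returns BUDGET_OK.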

What is DTCM (Data Tightly-Coupled Memory)?

DTCM stands for Data Tightly-Coupled Memory. It works on the same principle as ITCM (physically close to the CPU core, connected via a dedicated bus, single-cycle access) but it stores data instead of instructions.

If ITCM is where your code lives, DTCM is where your code works. It's the fast scratch space that the CPU reads from and writes to while executing your performance-critical functions. Every variable read, every array access, every stack push and pop in your hot code paths goes through data memory. Making that data memory as fast as possible eliminates one of the biggest sources of stall cycles.

What Kind of Data Belongs in DTCM?

Stack frames are the most important thing in DTCM. Every function call pushes a stack frame containing local variables, the return address, and saved registers. Every function return pops that frame.

If your stack is in DTCM, the memory-access portion of function calls and returns happens in a single cycle. If your stack were in DDR, every function call and return would incur multiple cycles of memory latency just for the stack operations alone, before the function even begins doing useful work.

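On many Cortex-M and Cortex-R projects, the vendor-supplied linker script and startup code already place the stack in DTCM, so you get this benefit without any extra configuration. It's worth confirming in your own linker script, because stack placement is a project setting rather than something the hardware guarantees.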

Frequently accessed global variables are another strong candidate. State machine variables, control flags, sensor readings that are updated and read in every loop iteration, counters that are incremented in ISRs and read in the main loop: all of these benefit from single-cycle access.

If a variable is read or written thousands of times per second, the cumulative latency difference between DTCM and DDR adds up.

Small lookup tables used in hot paths belong in DTCM when they're small enough to fit. Sine/cosine tables for motor control, filter coefficients for audio processing, and CRC tables for communication protocols are common examples.

These tables are typically a few hundred bytes to a few kilobytes, and they get accessed on every iteration of a processing loop. The key word is "small." A 512-byte sine table is a good fit for DTCM. A 64 KB calibration table is not, and should go in DDR instead.
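As a sketch of what a small hot-path table looks like in practice, here is a standard table-driven CRC-32 (the reflected 0xEDB88320 polynomial) whose 1 KB table is tagged for DTCM using the same section attribute mechanism shown earlier for ITCM. DTCM_DATA, crc32_init, and crc32_compute are illustrative names, and the .dtcm_data section name assumes your linker script maps it to the DTCM address range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro: tag data for DTCM on GCC-compatible ELF toolchains,
 * expand to nothing elsewhere so the code also builds on a host. */
#if defined(__GNUC__) && defined(__ELF__)
#define DTCM_DATA __attribute__((section(".dtcm_data")))
#else
#define DTCM_DATA
#endif

/* 256 entries x 4 bytes = 1 KB, read once per input byte. */
DTCM_DATA static uint32_t crc32_table[256];

/* Build the table once at startup. */
void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int bit = 0; bit < 8; bit++) {
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        }
        crc32_table[i] = c;
    }
}

/* One table lookup per byte: this is the access pattern that benefits
 * from single-cycle data memory. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc = crc32_table[(crc ^ data[i]) & 0xFFu] ^ (crc >> 8);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The table is built at runtime here to keep the example short. A const table generated at build time works just as well, as long as it ends up in a section that the linker script places in DTCM.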

DMA buffers can sometimes go in DTCM, but this depends on your chip's bus architecture. On some chips, the DMA controller has a direct path to DTCM through the bus matrix. On others, the DMA controller can only reach DDR and possibly other SRAM regions. If you place a DMA buffer in DTCM on a chip where the DMA controller can't reach it, the transfer will silently fail or write to a completely wrong address.

Always check your chip's bus matrix diagram in the reference manual before putting DMA buffers in DTCM.

How to Place Data in DTCM

Placing data in DTCM uses the same section attribute mechanism as ITCM, but with a section name that your linker script maps to the DTCM address range.

#include <stdint.h>

__attribute__((section(".dtcm_data")))
static int16_t audio_buffer[256];

__attribute__((section(".dtcm_data")))
static volatile uint32_t sensor_state = 0;

In this code, audio_buffer is an array of 256 signed 16-bit integers (512 bytes total) that will be placed in DTCM. This could be a buffer for audio samples that gets filled by a DMA transfer (on chips where the DMA controller can reach DTCM) and processed by an ISR. The static keyword gives the buffer internal linkage, so it's visible only within this source file. Like any variable declared at file scope, it persists for the lifetime of the program (it's not allocated on the stack).

The sensor_state variable is a 32-bit unsigned integer marked as volatile, meaning the compiler must read it from memory every time it's accessed rather than caching it in a register.

This is important for variables that are written in an ISR and read in the main loop, since the compiler needs to know the value can change at any time. Placing it in DTCM ensures that both the ISR write and the main loop read happen in a single cycle.
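The ISR-writes, main-loop-reads pattern can be sketched as follows. The names adc_isr, poll_reading, latest_reading, and reading_ready are hypothetical, and the ISR is written as an ordinary function so the logic can be exercised on a host. The sketch assumes a single-core target where aligned 32-bit loads and stores are not torn. Note that volatile only forces the memory accesses to happen. It does not make multi-step updates atomic.

```c
#include <stdbool.h>
#include <stdint.h>

/* On a target, these would also carry the .dtcm_data section attribute. */
static volatile uint32_t latest_reading = 0;
static volatile bool     reading_ready  = false;

/* Hypothetical ISR body: publish the value first, then raise the flag. */
void adc_isr(uint32_t raw)
{
    latest_reading = raw;
    reading_ready  = true;
}

/* Main-loop side: returns true and stores the value if a new reading
 * has arrived since the last call. */
bool poll_reading(uint32_t *out)
{
    if (!reading_ready) {
        return false;
    }
    /* Clear the flag before reading, so an ISR that fires in between
     * raises it again instead of having its update silently dropped. */
    reading_ready = false;
    *out = latest_reading;
    return true;
}
```

Without volatile, the compiler would be free to read reading_ready once and keep it in a register, so a polling loop in main might never observe the ISR's write.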

DTCM Fills Up Faster Than ITCM

Looking at the memory profile again:

            DTCM:      727240 B    1572608 B     46.24%

This single line from the linker's memory usage summary shows that DTCM has 1,572,608 bytes (about 1.5 MB) available, and the firmware is using 727,240 bytes (about 710 KB), which is 46.24% of the total capacity.

DTCM fills up faster than ITCM because many things compete for it: your stack, your heap (if you have one), your global variables, and data sections from every library you link against. Every C library function that uses static data, every RTOS data structure, every middleware component brings its own data footprint. This creates a constant sizing exercise.

For every data structure, you need to ask: does this really need single-cycle access, or can it work from DDR?

A Concrete Example of the Performance Impact

Say your processor runs at 400 MHz. DTCM gives you 1-cycle access. DDR gives you 8-cycle access. You have a lookup table that gets accessed 100,000 times per second.

DTCM: 100,000 accesses x 1 cycle  = 100,000 cycles/sec
DDR:  100,000 accesses x 8 cycles = 800,000 cycles/sec

Difference: 700,000 cycles/sec
At 400 MHz: 700,000 / 400,000,000 = 0.00175 seconds = 1.75 ms

This calculation shows the cycle cost of 100,000 memory accesses per second in both memory types. In DTCM, each access is 1 cycle, totaling 100,000 cycles. In DDR, each access is 8 cycles, totaling 800,000 cycles. The difference of 700,000 cycles per second, at a 400 MHz clock rate, translates to 1.75 milliseconds of additional CPU time spent waiting on memory.

If you're running a real-time control loop at 1 kHz (1 ms period) and the accesses are spread evenly, that works out to about 100 accesses and 1.75 microseconds of extra latency per iteration. For one small table, that's a minor cost. But the same 8x penalty applies to every variable, buffer, and stack access in the hot path, and accesses that cluster in a few iterations hit those iterations much harder. Whether this causes actual deadline misses depends on how the accesses are distributed across iterations and how much slack you have in your time budget, but it shows why memory placement decisions have real consequences in embedded systems.

What is DDR (Double Data Rate) Memory?

DDR is external memory. It sits on the circuit board outside the processor die, connected through a memory controller. It's much larger than TCM (typically 4 MB to several GB), but significantly slower to access.

The name "Double Data Rate" refers to how data is transferred between the DDR chip and the memory controller: data is sent on both the rising edge and the falling edge of the clock signal, effectively doubling the transfer rate compared to a single-data-rate design. But this doesn't eliminate the latency of activating rows and columns inside the DDR chip, which is where the slowness comes from.

How DDR Access Works

When your CPU reads from DDR, a multi-step process occurs inside the memory controller and DDR chip.

First, the CPU sends an address request to the memory controller. The memory controller is a hardware block inside the processor that translates CPU addresses into the specific row and column addresses that the DDR chip understands.

Second, the memory controller activates the correct row inside the DDR chip. This step is called the RAS (Row Address Strobe) phase. The DDR chip is organized as a grid of tiny capacitors, and "activating a row" means reading all the capacitors in that row into a row buffer inside the DDR chip. This takes several clock cycles.

Third, the memory controller selects the correct column within the activated row. This is called the CAS (Column Address Strobe) phase. The DDR chip uses the column address to pick the right bits out of the row buffer. This also takes several clock cycles.

Fourth, the data is transferred back to the memory controller, and from there to the CPU. The data transfer happens on both clock edges (the "double data rate" part), which helps with throughput but doesn't reduce the initial latency of the RAS and CAS phases.

The total latency depends on what state the memory is in when the request arrives. If the correct row is already activated from a previous access (a "row hit"), the RAS phase can be skipped, and the access is faster. If a different row is active and needs to be closed (precharged) before the new row can be opened (a "row miss"), the access takes longer. If the DDR chip happens to be performing a refresh cycle at that moment, the access is delayed further.

In practice, DDR access latency ranges from about 5 to 20+ CPU clock cycles, depending on the access pattern and timing.
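To make the row hit and row miss cases concrete, here is a toy model of a single DDR bank. It is only a sketch: the row size and the three cycle counts are made-up values chosen to fall inside the 5 to 20 cycle range mentioned above, refresh is ignored, and the names ddr_bank_t and ddr_access_cycles are hypothetical. Real timings come from your DDR part's datasheet and the memory controller configuration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DDR_ROW_BYTES      2048u  /* assumed row size */
#define CYCLES_ROW_HIT        5u  /* column access only (CAS) */
#define CYCLES_ROW_CLOSED    11u  /* activate row (RAS), then CAS */
#define CYCLES_ROW_MISS      17u  /* precharge old row, then RAS, then CAS */

typedef struct {
    bool     row_open;   /* is any row currently held in the row buffer? */
    uint32_t open_row;   /* which row, if row_open is true */
} ddr_bank_t;

/* Return the modeled latency of one access and update the bank state. */
unsigned ddr_access_cycles(ddr_bank_t *bank, uint32_t addr)
{
    uint32_t row = addr / DDR_ROW_BYTES;
    unsigned cycles;

    if (bank->row_open && bank->open_row == row) {
        cycles = CYCLES_ROW_HIT;
    } else if (!bank->row_open) {
        cycles = CYCLES_ROW_CLOSED;
    } else {
        cycles = CYCLES_ROW_MISS;
    }
    bank->row_open = true;
    bank->open_row = row;
    return cycles;
}
```

Even in this simplified model, the cost of an access depends on the access that came before it. That history dependence is what "variable" latency means, and it is exactly what TCM avoids.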

Why DDR is Necessary

DDR is necessary because firmware often doesn't fit in TCM alone. Real embedded projects include protocol stacks, connectivity libraries, file system drivers, debug interfaces, and more. Going by the size ranges above, TCM tops out at around 3.5 MB total (ITCM + DTCM combined), and a full-featured firmware image can easily exceed that.

Here is a real example showing memory usage before and after adding a wireless connectivity stack:

Without connectivity stack:
    ITCM:      506,996 B     (24.18%)
    DTCM:      628,408 B     (39.96%)
    DDR:       558,779 B     (13.32%)

With connectivity stack:
    ITCM:      570,936 B     (27.22%)
    DTCM:      727,240 B     (46.24%)
    DDR:       622,915 B     (14.85%)

Delta:
    ITCM: +63,940 B   (~62 KB of additional code)
    DTCM: +98,832 B   (~96 KB of additional data)
    DDR:  +64,136 B   (~62 KB of additional data/code)

This comparison shows memory usage from the same project built with and without a wireless connectivity stack.

The "Without" rows show the baseline. The "With" rows show the usage after adding the connectivity feature. The "Delta" rows show the difference.

Adding this single feature consumed an extra ~220 KB across all three memory regions. The time-critical parts of the stack (interrupt handlers, buffer management) went into ITCM and DTCM. The rest (packet parsers, connection management, configuration logic) went into DDR where it doesn't need single-cycle performance.
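The delta arithmetic is simple enough to automate, for instance as a check that flags unexpected growth between two builds. The sketch below shows the calculation only. The mem_usage_t type and both function names are illustrative, and the byte counts would come from your own build output.

```c
#include <stdint.h>

/* Bytes used per region. Signed, so a shrinking region gives a negative delta. */
typedef struct {
    int64_t itcm;
    int64_t dtcm;
    int64_t ddr;
} mem_usage_t;

/* Per-region growth from one build to the next. */
mem_usage_t usage_delta(mem_usage_t before, mem_usage_t after)
{
    mem_usage_t d = {
        after.itcm - before.itcm,
        after.dtcm - before.dtcm,
        after.ddr  - before.ddr,
    };
    return d;
}

/* Sum across all three regions. */
int64_t usage_total(mem_usage_t u)
{
    return u.itcm + u.dtcm + u.ddr;
}
```

Feeding in the two builds above gives deltas of 63,940, 98,832, and 64,136 bytes, for a total of 226,908 bytes, which is the roughly 220 KB quoted in the text.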

What Belongs in DDR?

Initialization and configuration code is the easiest category. Functions that run once at boot, like parsing a configuration file, initializing peripherals, or setting up data structures, don't need fast execution. They run once, take a few extra milliseconds because of DDR latency, and then never run again. Nobody notices. Put them in DDR and save TCM space for the code that runs a million times per second.

Large buffers must go in DDR because they simply can't fit in TCM. An image framebuffer for a 320x240 display at 16 bits per pixel is 150 KB. A network packet pool might be 32 KB or more. A file system cache might be 64 KB. These buffers would consume a significant fraction of DTCM's total capacity, leaving no room for the stack and variables that actually need single-cycle access.

Infrequently accessed data belongs in DDR as well. Calibration tables that are loaded once at boot and then read occasionally during operation, string tables for debug messages that are only printed during development or error conditions, and error description tables are all fine in DDR. The extra latency per access is irrelevant when the access count is low.

Non-time-critical code rounds out the DDR category. Protocol stacks (Bluetooth, Wi-Fi, TCP/IP), file system drivers, OTA update handlers, and shell/debug command interpreters all do important work, but none of them need to execute in a single clock cycle per instruction. They can tolerate the higher latency of DDR without affecting system behavior.

How to Place Code and Data in DDR

#include <stdint.h>

__attribute__((section(".ddr_text")))
void parse_config_file(const char *path) {
    // Runs from DDR, slower instruction fetch,
    // but config parsing happens once at boot,
    // so the latency does not affect runtime performance.
}

__attribute__((section(".ddr_bss")))
static uint8_t network_packet_pool[32768];

__attribute__((section(".ddr_bss")))
static uint8_t framebuffer[320 * 240 * 2];  // 150 KB, far too large for TCM

In this code, parse_config_file is placed in the .ddr_text section, which the linker script maps to DDR. Every instruction in this function will be fetched from DDR at multi-cycle latency, but since config parsing happens once at boot, the extra time is negligible.

The network_packet_pool is a 32 KB buffer placed in .ddr_bss. The bss part of the name is a convention indicating zero-initialized data, but the name alone doesn't make it so: the startup code has to zero the region before main() runs, and whether 32 KB of zeros ends up stored in the firmware image depends on how the linker script defines the output section. This buffer is used for network packet storage, which is not time-critical enough to justify DTCM space.

The framebuffer is a 150 KB buffer (320 pixels wide, 240 pixels tall, 2 bytes per pixel) also placed in .ddr_bss. At 150 KB, this single buffer would consume about 10% of DTCM's total capacity, which is far too expensive when the display update isn't a hard real-time operation.

How They Compare: A Side-by-Side Overview

Property	ITCM	DTCM	DDR
Purpose	Instruction storage	Data storage	General-purpose storage
Location	On-die, dedicated bus	On-die, dedicated bus	Off-chip, through memory controller
Access latency	1 cycle (deterministic)	1 cycle (deterministic)	5 to 20+ cycles (variable)
Typical size	512 KB to 2 MB	512 KB to 1.5 MB	4 MB to several GB
Technology	SRAM	SRAM	DRAM (requires refresh)
Power	Low (no refresh needed)	Low (no refresh needed)	Higher (constant refresh)
Best for	ISRs, real-time loops, DSP	Stack, hot variables, lookup tables	Large buffers, init code, protocol stacks

This table summarizes the key differences between the three memory types. The most important columns are "Access latency" and "Typical size," because they represent the fundamental tradeoff: TCM is fast but small, DDR is slow but large.

The "Technology" column explains why: TCM uses SRAM (static RAM), which stores each bit using a flip-flop circuit that holds its state as long as power is applied. DDR uses DRAM (dynamic RAM), which stores each bit as charge in a tiny capacitor. Because capacitors leak charge, DRAM must be periodically refreshed, which adds power consumption and introduces occasional access delays when a refresh cycle coincides with a read request.

The Memory Map

Address Space:
  +------------------------------+  0x00000000
  |                              |
  |         ITCM (2 MB)          |  Single-cycle Inst Fetch
  |    ISRs, real-time loops,    |
  |    DSP, critical code        |
  |                              |
  +------------------------------+  0x00200000
  |       (reserved/gap)         |
  +------------------------------+  0x20000000
  |                              |
  |       DTCM (~1.5 MB)         |  Single-cycle Data Access
  |    Stack, hot variables,     |
  |    lookup tables, DMA bufs   |
  |                              |
  +------------------------------+  0x20180000
  |       (reserved/gap)         |
  +------------------------------+  0x80000000
  |                              |
  |         DDR (4 MB)           |  Multi-cycle Access
  |    Large buffers, init code, |
  |    protocol stacks, config   |
  |                              |
  +------------------------------+  0x80400000

This diagram shows the CPU's address space laid out from low addresses at the top to high addresses at the bottom. ITCM occupies the lowest 2 MB starting at address 0x00000000. After a gap of reserved/unused address space, DTCM sits at 0x20000000 and spans about 1.5 MB. Another gap of reserved space follows, and then DDR starts at 0x80000000 with 4 MB of space.

The gaps between regions are important. They're reserved address ranges that don't map to any physical memory. If your code accidentally reads from or writes to an address in one of these gaps, the result depends on the chip's bus fault configuration: it might trigger a HardFault exception, or it might silently return garbage data.

These addresses are illustrative. Every chip has its own memory map, documented in its Technical Reference Manual (TRM). Always consult your chip's TRM for the exact addresses and sizes.
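A memory map like this can also be written down as a small lookup table, which is handy for sanity checks such as asserting that a pointer really lands in the region you expect. The sketch below uses the illustrative addresses from the diagram, not those of any real chip, and region_t, region_desc_t, and region_of are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { REGION_NONE, REGION_ITCM, REGION_DTCM, REGION_DDR } region_t;

typedef struct {
    region_t id;
    uint32_t base;
    uint32_t size;
} region_desc_t;

/* Illustrative map from the diagram above. Replace with your chip's values. */
static const region_desc_t memory_map[] = {
    { REGION_ITCM, 0x00000000u, 0x00200000u },  /* 2 MB   */
    { REGION_DTCM, 0x20000000u, 0x00180000u },  /* 1.5 MB */
    { REGION_DDR,  0x80000000u, 0x00400000u },  /* 4 MB   */
};

/* Return the region containing addr, or REGION_NONE for a gap. */
region_t region_of(uint32_t addr)
{
    for (size_t i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++) {
        /* Unsigned subtraction wraps for addr < base, so this single
         * comparison covers both bounds without overflowing base + size. */
        if (addr - memory_map[i].base < memory_map[i].size) {
            return memory_map[i].id;
        }
    }
    return REGION_NONE;
}
```

Addresses that fall in the reserved gaps, such as 0x00200000 or 0x20180000 in this map, come back as REGION_NONE.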

How to Decide Where to Place Code and Data

Is it code or data?
|
+-- CODE (instructions):
|   +-- Called from an ISR or runs in a real-time loop?
|   |   +-- YES -> ITCM (deterministic timing is critical)
|   +-- Called frequently in the main processing pipeline?
|   |   +-- YES -> ITCM (if space is available)
|   +-- Called rarely (init, config, debug)?
|       +-- DDR (save ITCM space for critical code)
|
+-- DATA (variables, buffers, tables):
    +-- Accessed in an ISR or real-time context?
    |   +-- YES -> DTCM (single-cycle, deterministic)
    +-- Small and frequently accessed?
    |   +-- YES -> DTCM (if space is available)
    +-- Large buffer (>16 KB)?
    |   +-- Probably DDR (DTCM cannot afford the space)
    +-- Accessed only once at boot or very rarely?
        +-- DDR (do not use DTCM for this)

This decision tree captures the thought process for placing each piece of firmware into the right memory region.

Start by asking whether you're placing code (instructions) or data (variables, buffers, tables). For code, the primary question is how often it runs and whether it has timing constraints. ISR code and real-time loop code go in ITCM, as does hot pipeline code when space allows. Rarely-called code goes in DDR. For data, the primary question is how often it's accessed and how large it is. Small, frequently accessed data goes in DTCM. Large buffers and rarely-accessed data go in DDR.

The general principle: put the hottest code and data in TCM, and everything else in DDR. "Hot" means frequently accessed, latency-sensitive, or requiring deterministic timing. When in doubt, start with DDR placement and move things to TCM only when profiling shows it's necessary. It's much easier to promote a function from DDR to ITCM after discovering it's a bottleneck than to cram everything into ITCM from the start and run out of space.
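The decision tree can be written out as a function, which is a useful way to check that the rules are unambiguous. This is a sketch of the tree above and nothing more: item_t, its fields, and choose_placement are hypothetical, the 16 KB threshold is the one from the tree, and real placement decisions still need profiling data behind them.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { PLACE_ITCM, PLACE_DTCM, PLACE_DDR } placement_t;

typedef struct {
    bool   is_code;       /* true for functions, false for data */
    bool   realtime;      /* used from an ISR or a real-time loop */
    bool   hot;           /* frequently executed or accessed */
    size_t size_bytes;    /* only consulted for data */
    bool   tcm_has_room;  /* headroom left in the relevant TCM? */
} item_t;

#define LARGE_BUFFER_BYTES (16u * 1024u)

placement_t choose_placement(const item_t *it)
{
    if (it->is_code) {
        if (it->realtime)                 return PLACE_ITCM;
        if (it->hot && it->tcm_has_room)  return PLACE_ITCM;
        return PLACE_DDR;                 /* init, config, debug code */
    }
    if (it->realtime)                            return PLACE_DTCM;
    if (it->size_bytes > LARGE_BUFFER_BYTES)     return PLACE_DDR;
    if (it->hot && it->tcm_has_room)             return PLACE_DTCM;
    return PLACE_DDR;                     /* boot-time or rarely used data */
}
```

For example, a 512-byte sine table that is hot and has room available maps to PLACE_DTCM, while the 150 KB framebuffer maps to PLACE_DDR no matter how often it's touched.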

How the Linker Script Controls Memory Placement

Everything we've discussed so far (section attributes, memory placement, address assignments) comes together in the linker script. This is a file (usually with a .ld extension) that tells the linker exactly which sections go into which memory regions. The linker script is the single source of truth for your firmware's memory layout.

MEMORY
{
    ITCM    (rx)  : ORIGIN = 0x00000000, LENGTH = 2M
    DTCM    (rw)  : ORIGIN = 0x20000000, LENGTH = 1536K
    DDR     (rwx) : ORIGIN = 0x80000000, LENGTH = 4M
}

SECTIONS
{
    /* === ITCM: Critical code === */
    .itcm_text :
    {
        KEEP(*(.isr_vector))          /* Interrupt vector table */
        *(.itcm_text)                 /* Functions with __attribute__((section(".itcm_text"))) */
        *audio_processing.o(.text)    /* All code from audio_processing.c */
        *motor_control.o(.text)       /* All code from motor_control.c */
    } > ITCM

    /* === DDR: Non-critical code === */
    .ddr_text :
    {
        *(.ddr_text)                  /* Functions with __attribute__((section(".ddr_text"))) */
        *(.text)                      /* Default catch-all for remaining code */
        *(.text*)
        *(.rodata)                    /* Read-only data (string literals, constants) */
        *(.rodata*)
    } > DDR

    /* === DTCM: Critical data === */
    .dtcm_data :
    {
        *(.dtcm_data)                 /* Data with __attribute__((section(".dtcm_data"))) */
        *audio_processing.o(.data)    /* All initialized data from audio_processing.c */
        *audio_processing.o(.bss)     /* All zero-initialized data from audio_processing.c */
    } > DTCM

    /* === DTCM: Stack === */
    .stack (NOLOAD) :
    {
        . = ALIGN(8);
        __stack_start = .;
        . = . + 8K;                  /* 8 KB stack */
        __stack_end = .;
    } > DTCM

    /* === DDR: Everything else === */
    .ddr_data :
    {
        *(.data)                      /* Default catch-all for remaining initialized data */
        *(.bss)                       /* Default catch-all for remaining zero-initialized data */
        *(COMMON)
        *(.ddr_bss)                   /* Data with __attribute__((section(".ddr_bss"))) */
    } > DDR
}

This linker script has two main blocks: MEMORY and SECTIONS.

The MEMORY block defines the physical memory regions available on the chip. Each line declares a region name, its permissions (rx for read-execute, rw for read-write, rwx for read-write-execute), its starting address (ORIGIN), and its size (LENGTH). These values must match your chip's actual memory map as documented in its reference manual. In GNU ld, the permission letters only guide where the linker puts sections that no rule places explicitly. They don't configure hardware access permissions.

The SECTIONS block defines how the linker should distribute compiled code and data across those memory regions. Each section rule consists of a section name (like .itcm_text), a list of input patterns that specify which object file sections to include, and a > REGION directive that tells the linker which memory region to place the output section in.

The .itcm_text section collects the interrupt vector table (KEEP(*(.isr_vector))), any functions explicitly marked with __attribute__((section(".itcm_text"))), and all code from audio_processing.o and motor_control.o. The KEEP directive prevents the linker from discarding the interrupt vector table during garbage collection, even if no code appears to reference it directly. All of this goes into ITCM.

The .ddr_text section uses catch-all patterns *(.text) and *(.text*) to collect all remaining code that wasn't claimed by the ITCM section above. It also collects read-only data (.rodata), which includes string literals and const variables. All of this goes into DDR.

The .dtcm_data section collects explicitly-placed data and all data from audio_processing.o. The .stack section reserves 8 KB for the stack with 8-byte alignment, and exports the __stack_start and __stack_end symbols that your startup code and stack profiling code can reference. Both go into DTCM.

The .ddr_data section collects all remaining data with catch-all patterns, and goes into DDR.

How Section Matching Works

The linker processes sections from top to bottom. When it encounters a wildcard pattern like *(.text), it matches all .text sections that haven't already been claimed by a more specific rule earlier in the script.

So in the example above, *audio_processing.o(.text) in the ITCM section claims all code from audio_processing.c first. Then, when the linker reaches *(.text) in the DDR section, audio_processing.o's .text section has already been placed, so it's skipped. Only unclaimed .text sections from other object files match the DDR catch-all.

This means the order of sections in your linker script matters. Place your specific rules (individual object files, named sections) before the generic catch-all rules. If you put the *(.text) catch-all before the *audio_processing.o(.text) rule, the catch-all would claim everything first, and the specific rule would match nothing.
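
The first-match behavior can be modelled in a few lines. The following is a simplified, host-side sketch in Python, not a description of how GNU ld is implemented: the RULES table and the place helper are hypothetical names, and the patterns only mirror the example script above.

```python
from fnmatch import fnmatch

# (output section, object-file pattern, input-section pattern), in script order.
# Hypothetical table that mirrors the example linker script.
RULES = [
    (".itcm_text", "*audio_processing.o", ".text"),
    (".itcm_text", "*motor_control.o", ".text"),
    (".ddr_text", "*", ".text"),
    (".ddr_text", "*", ".text*"),
]

def place(obj, section, rules=RULES):
    """Return the output section of the first rule that matches, or None."""
    for output, file_pat, sect_pat in rules:
        if fnmatch(obj, file_pat) and fnmatch(section, sect_pat):
            return output
    return None

print(place("build/audio_processing.o", ".text"))
print(place("build/bluetooth.o", ".text"))

# Moving the catch-all to the top makes the specific rule unreachable.
reordered = [RULES[2]] + RULES[:2]
print(place("build/audio_processing.o", ".text", reordered))
```

With the original order, the audio code lands in .itcm_text and the Bluetooth code in .ddr_text. With the catch-all first, the audio code falls through to .ddr_text as well, which is the failure described above.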

Common Mistakes to Avoid

1. Stack Overflow in DTCM

Your stack lives in DTCM. DTCM is small. If you declare a large local array inside a function, it goes on the stack:

void problematic_function(void) {
    uint8_t huge_local_buffer[65536];  // 64 KB allocated on the stack
    // This consumes 64 KB of DTCM immediately
}

This code declares a 64 KB local array. Because it's a local variable (not static), it is allocated on the stack when the function is called. If your total stack size is 8 KB (as in the linker script example above), this single declaration overflows the stack by 56 KB, writing into whatever memory is adjacent to the stack in DTCM.

On a desktop OS, a stack overflow triggers a segmentation fault because the OS uses virtual memory and guard pages to detect it.

In an embedded system without memory protection, the stack silently grows into adjacent memory regions, corrupting whatever data is stored there. The resulting bugs are extremely difficult to diagnose because the symptoms (corrupted variables, erratic behavior, intermittent crashes) appear unrelated to the actual cause. You might spend days debugging a seemingly random data corruption issue before realizing the root cause is a stack overflow from a function three call levels deep.

The fix: Use static allocation or heap allocation for large buffers, and place them in DDR:

void fixed_function(void) {
    __attribute__((section(".ddr_bss")))
    static uint8_t huge_buffer[65536];  // In DDR, not on the stack

    // Stack is safe, DTCM is not wasted
}

By making the buffer static, it's no longer allocated on the stack. Instead, the linker allocates it once in the .ddr_bss section, which maps to DDR. The buffer persists for the entire lifetime of the program (like a global variable), but its name is scoped to this function. The function reaches the buffer through its fixed, link-time address, so the call costs at most a few bytes of stack instead of 64 KB. The trade-off is that every call shares the same buffer, so the function is no longer reentrant.
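
One way to catch oversized locals before they reach hardware is GCC's -fstack-usage option, which writes a .su file per translation unit with each function's frame size. The sketch below is a hypothetical helper, not part of the workflow described in this handbook. It assumes each .su line has three fields (location and function name, size in bytes, qualifier), and the sample sizes are illustrative rather than measured.

```python
def large_frames(su_text, stack_size=8192, fraction=0.25):
    """Return (function, bytes) pairs whose frame exceeds a fraction of the stack."""
    limit = stack_size * fraction
    flagged = []
    for line in su_text.splitlines():
        parts = line.rsplit(None, 2)          # name, size, qualifier
        if len(parts) != 3 or not parts[1].isdigit():
            continue                          # skip anything that isn't a record
        name, size = parts[0], int(parts[1])
        if size > limit:
            flagged.append((name.split(":")[-1], size))
    return flagged

# Illustrative input in the shape of a .su file (values are made up).
sample = (
    "main.c:10:6:problematic_function\t65544\tstatic\n"
    "main.c:20:6:fixed_function\t8\tstatic\n"
)
print(large_frames(sample))
```

Against an 8 KB stack, a 25% per-function limit flags problematic_function and leaves fixed_function alone. The threshold is a judgment call; pick one that fits your deepest expected call chain.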

2. Overfilling ITCM

If you exceed ITCM's capacity, the linker will produce an error along the lines of "region ITCM overflowed by N bytes." But if you're close to the limit, you're one library update or feature addition away from a build failure. A minor version bump of your RTOS or connectivity stack could add enough code to push ITCM over the edge.

Keep headroom. The 27% utilization shown earlier is healthy. If you're above 85%, you should actively work on moving less-critical code to DDR. If you're above 95%, you have no room for growth and need to make immediate changes. Setting up automated memory budget checks in your CI pipeline (covered later in this article) prevents surprises.

3. Ignoring Alignment Requirements

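Alignment faults come from the CPU core rather than from TCM itself, but they tend to surface when you start hand-placing structures. Cortex-M0 and M0+ cores fault on every unaligned access. Cortex-M3, M4, and M7 cores handle unaligned single loads and stores in hardware by default, but still fault on unaligned multi-word accesses (LDM, STM, LDRD, STRD), and on all unaligned accesses when the UNALIGN_TRP bit in the Configuration and Control Register is set. The fault is raised as a UsageFault, which escalates to a HardFault if UsageFault is not enabled.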

/* Problematic: packed struct can create unaligned fields */
struct __attribute__((packed)) badly_aligned {
    uint8_t  flag;
    uint32_t counter;  // At byte offset 1, unaligned
};

/* Correct: natural alignment, with minor padding */
struct properly_aligned {
    uint32_t counter;  // At offset 0, 4-byte aligned
    uint8_t  flag;     // At offset 4
    // 3 bytes of padding follow, a small cost for correctness
};

/* The section attribute applies to variables, not to struct types */
__attribute__((section(".dtcm_data")))
struct properly_aligned motor_state;

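In the first struct, the packed attribute tells the compiler to use no padding between fields. This means counter starts at byte offset 1 (right after the 1-byte flag), which isn't a multiple of 4. When you access counter through the struct, the compiler knows it may be unaligned and emits an access sequence that is safe for the target. The danger is code that takes a pointer to counter and dereferences it as an ordinary uint32_t *. On cores or configurations that trap unaligned accesses (always on Cortex-M0/M0+, and on Cortex-M3/M4/M7 for multi-word instructions or when UNALIGN_TRP is set), that access faults.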

In the second struct, the fields are ordered so that counter (4 bytes) comes first at offset 0, which is naturally 4-byte aligned. The flag (1 byte) follows at offset 4. The compiler inserts 3 bytes of padding after flag to bring the struct size to 8 bytes (a multiple of 4), but this is a small price for correct, crash-free operation.
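
The offsets described above can be checked with a small layout calculator. This is a sketch under simple assumptions: each scalar field is aligned to its own size unless the struct is packed, and the struct is padded to its largest field alignment. The layout function is a hypothetical helper, not a compiler API.

```python
def layout(fields, packed=False):
    """Compute field offsets and total size for a C-style struct.

    fields is a list of (name, size_in_bytes). Each scalar is assumed to be
    aligned to its own size unless packed is True.
    """
    offsets, offset, max_align = {}, 0, 1
    for name, size in fields:
        align = 1 if packed else size
        offset = (offset + align - 1) // align * align   # round up to alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align  # tail padding
    return offsets, total

print(layout([("flag", 1), ("counter", 4)], packed=True))
print(layout([("counter", 4), ("flag", 1)]))
print(layout([("flag", 1), ("counter", 4)]))
```

The packed layout puts counter at offset 1 with a total size of 5 bytes. The reordered natural layout puts counter at offset 0 and flag at offset 4 with a total size of 8 bytes. Keeping the original field order without packed also gives 8 bytes, with the padding placed between flag and counter instead of at the end.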

4. DMA Transfers to TCM on Incompatible Bus Architectures

Some DMA controllers can't access TCM memory. Whether DMA can reach TCM depends entirely on your chip's internal bus architecture (the bus matrix).

If you set up a DMA transfer from a peripheral to a DTCM buffer, but the DMA controller doesn't have a bus path to DTCM, the transfer will either silently fail or write to an incorrect address.

Neither produces an obvious error. The DMA controller thinks it completed successfully, your code reads the buffer expecting fresh data, and you get stale or garbage values instead. This is one of the most confusing bugs in embedded development because everything looks correct in the code.

Always check your chip's bus matrix diagram in the reference manual before using DMA with TCM buffers. The bus matrix diagram shows which masters (CPU, DMA, USB, and so on) can access which slaves (ITCM, DTCM, SRAM, DDR, peripherals). Look for whether the DMA controller's master port has a connection line to the TCM slave port. If it doesn't, your DMA transfers to TCM will not work.

Performance Comparison With Real Numbers

The following table compares access latencies across memory types, assuming a Cortex-R class processor at 400 MHz:

+---------------------+----------+----------+----------+
| Operation           | ITCM/    |   DDR    | Slowdown |
|                     | DTCM     |          | Factor   |
+---------------------+----------+----------+----------+
| Instruction fetch   | 1 cycle  | 5-20 cyc |   5-20x  |
| Data read (32-bit)  | 1 cycle  | 5-20 cyc |   5-20x  |
| Data write (32-bit) | 1 cycle  | 5-20 cyc |   5-20x  |
| Sequential burst    | 1 cyc/wd | 2-4 cy/wd|    2-4x  |
| Random access       | 1 cycle  | 10-20 cyc|  10-20x  |
+---------------------+----------+----------+----------+

This table shows the latency for five different types of memory operations. The first three rows (instruction fetch, data read, data write) show that individual accesses to TCM are always 1 cycle, while individual accesses to DDR range from 5 to 20 cycles depending on the memory's internal state. The slowdown factor is the ratio between the two.

The "Sequential burst" row shows what happens when you read or write consecutive addresses. DDR performs much better in burst mode (2-4 cycles per word instead of 5-20) because once a row is activated, subsequent reads from the same row skip the RAS phase. TCM is still 1 cycle per word because it doesn't have the row/column structure of DDR.

The "Random access" row shows the worst case for DDR. When each access hits a different row, the memory controller must precharge the old row and activate the new one every time. This is the 10-20 cycle range, and it's common in workloads that jump around in memory (traversing linked lists, hash table lookups, and indirect function calls through function pointer arrays).

The practical takeaway: if your code accesses DDR data, try to access it sequentially. Iterating through an array in order is much faster than jumping to random positions. Your memory controller and the DDR chip's internal prefetch logic work in your favor during sequential access patterns.
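
To put the cycle counts into wall-clock terms, the short calculation below converts them to time at 400 MHz for a 4 KB buffer. It is a back-of-the-envelope sketch that uses only the ranges from the table above; the buffer size and the read_time_us helper are illustrative choices.

```python
CPU_HZ = 400_000_000            # 400 MHz, as assumed by the table
NS_PER_CYCLE = 1e9 / CPU_HZ     # 2.5 ns per cycle

def read_time_us(words, cycles_per_word):
    """Time in microseconds to read `words` 32-bit words."""
    return words * cycles_per_word * NS_PER_CYCLE / 1000

WORDS = 1024  # a 4 KB buffer of 32-bit words
cases = [
    ("TCM", 1),
    ("DDR burst, best case", 2),
    ("DDR burst, worst case", 4),
    ("DDR random, best case", 10),
    ("DDR random, worst case", 20),
]
for label, cycles in cases:
    print(f"{label:<24} {read_time_us(WORDS, cycles):7.2f} us")
```

Reading the buffer takes 2.56 microseconds from TCM, between 5.12 and 10.24 microseconds from DDR in burst mode, and between 25.6 and 51.2 microseconds from DDR with random access. In a control loop that runs every 100 microseconds, that last figure would consume half the budget on memory stalls alone.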

How TCM Affects Power Consumption

Memory placement has a direct impact on power consumption, something that becomes critical for battery-powered products.

DDR requires constant refresh cycles. DRAM stores each bit as charge in a tiny capacitor, and that charge leaks over time.

To prevent data loss, the memory controller must read and rewrite every row in the DDR chip approximately every 64 ms. This refresh process consumes power even when the processor is sleeping and no code is running. On some systems, DDR refresh can account for a significant portion of the total sleep-mode power budget.

TCM is SRAM-based and doesn't require refresh. SRAM stores data using flip-flop circuits that hold their state as long as power is applied. There is some leakage current (no transistor is perfect), but retaining data in a small on-chip SRAM typically costs far less power than keeping an external DDR chip refreshed.

For battery-powered devices (wearables, IoT sensors, medical devices), this means you should keep data that must survive sleep modes in DTCM when possible.

If your hardware supports it, power-gate the DDR chip during deep sleep to eliminate its refresh power entirely. The less DDR your firmware uses at runtime, the more aggressively you can manage DDR power states, which directly extends battery life.

How to Profile Memory Usage

After placing code and data into ITCM, DTCM, and DDR, you need to verify that everything fits, monitor usage over time, and catch regressions before they become build failures. There are several techniques for this, ranging from simple command-line tools to automated CI checks.

Method 1: The Linker Map File

Every time you build your firmware, the linker can produce a map file, a detailed text file that records where every symbol (function, variable, constant) ended up and how large it is. This is the most useful single artifact in embedded development for understanding memory usage.

To generate one, add -Wl,-Map=firmware.map to your linker flags:

arm-none-eabi-gcc \
    -T linker_script.ld \
    -Wl,-Map=firmware.map \
    -o firmware.elf \
    main.o audio.o bluetooth.o

This command invokes the ARM GCC toolchain to link three object files (main.o, audio.o, bluetooth.o) using the linker script linker_script.ld. The -Wl,-Map=firmware.map flag tells GCC to pass the -Map=firmware.map option to the linker, which causes it to write a detailed map file alongside the output ELF binary. The map file can be thousands of lines long, so start with the big picture before digging into individual symbols.

For overall utilization per memory region, GNU ld prints a summary at link time when you also pass -Wl,--print-memory-usage. (The map file's own Memory Configuration table lists each region's origin and length, but not how much of it is used.) The summary looks like this:

Memory region         Used Size  Region Size  %age Used
            ITCM:      570936 B         2 MB     27.22%
            DTCM:      727240 B    1572608 B     46.24%
             DDR:      622915 B         4 MB     14.85%

This summary shows three columns: how many bytes are used, the total size of the region, and the percentage used. It gives you the health of your firmware at a glance. As a rule of thumb, below 80% is healthy with room for growth. Between 80% and 90% is getting tight, and you should plan for how you will accommodate the next feature. Above 90% requires action: start moving things to a cheaper memory region or optimizing existing placement.
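
The rule of thumb can be applied mechanically to the summary text. The sketch below parses lines in the format shown above and labels each region. The parse_summary and classify names are hypothetical, and the unit table assumes the linker reports sizes in B, KB, MB, or GB with 1 KB equal to 1024 bytes.

```python
import re

UNITS = {"B": 1, "KB": 1024, "MB": 1024 * 1024, "GB": 1024 ** 3}
LINE_RE = re.compile(r"^\s*(\w+):\s+(\d+)\s+(\w+)\s+(\d+)\s+(\w+)\s+([\d.]+)%")

def classify(percent):
    """Apply the rule of thumb: below 80 healthy, 80 to 90 tight, above 90 act."""
    if percent < 80:
        return "healthy"
    if percent <= 90:
        return "tight"
    return "action required"

def parse_summary(text):
    """Return {region: (used_bytes, total_bytes, percent, status)}."""
    report = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # header or unrelated line
        name, used, used_unit, total, total_unit = m.group(1, 2, 3, 4, 5)
        used_b = int(used) * UNITS[used_unit]
        total_b = int(total) * UNITS[total_unit]
        percent = 100 * used_b / total_b
        report[name] = (used_b, total_b, round(percent, 2), classify(percent))
    return report

summary = """\
Memory region         Used Size  Region Size  %age Used
            ITCM:      570936 B         2 MB     27.22%
            DTCM:      727240 B    1572608 B     46.24%
             DDR:      622915 B         4 MB     14.85%
"""
for region, row in parse_summary(summary).items():
    print(region, row)
```

Recomputing the percentage from the byte counts, rather than trusting the printed column, also serves as a sanity check that the units were parsed correctly.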

Method 2: Parsing the Map File for Per-Module Breakdown

The summary tells you how much memory is used, but not who is using it. The map file contains per-symbol details, but they're difficult to read manually because the file can be thousands of lines long with a format that isn't designed for human consumption.

The following Python script parses the map file and produces a per-module report showing which object files are consuming memory in which regions.

#!/usr/bin/env python3
"""Parse a linker map file and report memory usage per object file."""

import re
import sys
from collections import defaultdict

def parse_map_file(map_path):
    """Extract symbol placements from a GCC linker map file."""
    usage = defaultdict(lambda: defaultdict(int))

    regions = {
        'ITCM': (0x00000000, 0x00200000),
        'DTCM': (0x20000000, 0x20180000),
        'DDR':  (0x80000000, 0x80400000),
    }

    def addr_to_region(addr):
        for name, (start, end) in regions.items():
            if start <= addr < end:
                return name
        return 'UNKNOWN'

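    # Matches "<section> <address> <size> <object file>" entries, including
    # archive members such as libc.a(memcpy.o).
    symbol_re = re.compile(
        r'^\s+\S+\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(\S+\.o\)?)'
    )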

    with open(map_path) as f:
        for line in f:
            m = symbol_re.match(line)
            if m:
                addr = int(m.group(1), 16)
                size = int(m.group(2), 16)
                obj = m.group(3).split('/')[-1]
                region = addr_to_region(addr)
                usage[obj][region] += size

    return usage

def print_report(usage):
    """Print a sorted memory usage report."""
    print(f"{'Object File':<35} {'ITCM':>10} {'DTCM':>10} {'DDR':>10} {'Total':>10}")
    print("-" * 80)

    totals = defaultdict(int)
    rows = []

    for obj, regions in usage.items():
        total = sum(regions.values())
        rows.append((obj, regions, total))
        for r, s in regions.items():
            totals[r] += s

    rows.sort(key=lambda x: x[2], reverse=True)

    for obj, regions, total in rows[:20]:
        print(f"{obj:<35} "
              f"{regions.get('ITCM', 0):>10,} "
              f"{regions.get('DTCM', 0):>10,} "
              f"{regions.get('DDR', 0):>10,} "
              f"{total:>10,}")

    print("-" * 80)
    grand = sum(totals.values())
    print(f"{'TOTAL':<35} "
          f"{totals.get('ITCM', 0):>10,} "
          f"{totals.get('DTCM', 0):>10,} "
          f"{totals.get('DDR', 0):>10,} "
          f"{grand:>10,}")

if __name__ == '__main__':
    usage = parse_map_file(sys.argv[1])
    print_report(usage)

This script does three things. First, parse_map_file reads the map file line by line, looking for lines that match the format of a symbol placement entry (a section name, an address, a size, and an object file name). For each match, it converts the hex address to an integer, determines which memory region it falls in using the addr_to_region helper, and accumulates the size into a nested dictionary keyed by object file and region.

Second, print_report sorts the object files by total memory usage (largest first), prints the top 20, and shows how much each one uses in each region.

Third, the if __name__ == '__main__' block makes the script runnable from the command line.

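You'll need to adjust the address ranges in the regions dictionary to match your chip's memory map. One caveat: non-loaded sections such as .debug_* and .comment are listed in the map file at address 0, which falls inside the ITCM range used here. If your build includes debug information, filter those sections out by name, or they will inflate the ITCM column.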

Run it with:

python3 parse_map.py firmware.map

Sample output:

Object File                              ITCM       DTCM        DDR      Total
--------------------------------------------------------------------------------
bluetooth_stack.o                      42,380     65,200     38,400    146,080
audio_processing.o                     89,200     32,000          0    121,200
wifi_driver.o                          21,560     33,632     25,736     80,928
sensor_hub.o                           45,000     18,400          0     63,400
libc.a(memcpy.o)                       12,340          0          0     12,340
...
--------------------------------------------------------------------------------
TOTAL                                 570,936    727,240    622,915  1,921,091

This output shows the top memory consumers in the firmware, sorted by total usage. Each row shows an object file and how many bytes it contributes to each memory region.

The bluetooth_stack.o file is the largest consumer at 146,080 bytes (about 143 KB), spread across all three regions. The audio_processing.o file uses 121,200 bytes (about 118 KB), all in ITCM and DTCM (0 bytes in DDR), which makes sense because audio processing is time-critical and was placed entirely in TCM. The libc.a(memcpy.o) entry shows a C library function that was placed in ITCM, likely because it is called from performance-critical code paths.

Method 3: The `size` Command

For a quick check without parsing the map file, use arm-none-eabi-size:

arm-none-eabi-size -A firmware.elf

Output:

firmware.elf  :
section               size        addr
.itcm_text          570936           0
.dtcm_data          530240   536870912
.dtcm_bss           196000   537401152
.stack                8192   537600000
.ddr_text           422915  2147483648
.ddr_data           120000  2147906563
.ddr_bss             80000  2148026563
Total              1928283

This output lists every section in the ELF binary, its size in bytes, and its starting address (shown in decimal).

You can map sections to memory regions by looking at the address: addresses near 0 are ITCM, addresses near 536 million (0x20000000) are DTCM, and addresses near 2.1 billion (0x80000000) are DDR.

Alternatively, the section names themselves indicate the region (.itcm_text is in ITCM, .dtcm_data and .dtcm_bss are in DTCM, .ddr_text and .ddr_data and .ddr_bss are in DDR).

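The -A flag gives per-section sizes instead of the default BSD-format output. Adding -x prints sizes and addresses in hexadecimal, which makes the region boundaries easier to spot. This approach is less detailed than the map file, but it runs instantly and gives you the big picture.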
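
If you prefer to map the decimal addresses to regions programmatically, the sketch below does so for the sample output. The region ranges are the same example values used in the map-file script earlier, and region_of and sum_by_region are hypothetical helper names.

```python
REGIONS = {                       # start, end (exclusive); adjust to your chip
    "ITCM": (0x00000000, 0x00200000),
    "DTCM": (0x20000000, 0x20180000),
    "DDR":  (0x80000000, 0x80400000),
}

def region_of(addr):
    """Return the name of the region that contains addr."""
    for name, (start, end) in REGIONS.items():
        if start <= addr < end:
            return name
    return "UNKNOWN"

def sum_by_region(size_output):
    """Sum section sizes per region from `size -A` text (decimal columns)."""
    totals = {}
    for line in size_output.splitlines():
        parts = line.split()
        if (len(parts) == 3 and parts[0].startswith(".")
                and parts[1].isdigit() and parts[2].isdigit()):
            region = region_of(int(parts[2]))
            totals[region] = totals.get(region, 0) + int(parts[1])
    return totals

sample = """\
.itcm_text          570936           0
.dtcm_data          530240   536870912
.dtcm_bss           196000   537401152
.stack                8192   537600000
.ddr_text           422915  2147483648
.ddr_data           120000  2147906563
.ddr_bss             80000  2148026563
"""
print(hex(536870912), hex(2147483648))
print(sum_by_region(sample))
```

The two hex values confirm that 536870912 is 0x20000000 and 2147483648 is 0x80000000. As with the map-file script, sections that are not loaded onto the target (such as debug sections at address 0) should be excluded before summing.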

Method 4: Runtime Stack Profiling

Static analysis (map files, size output) tells you about compile-time placement. But some memory usage is dynamic, particularly the stack, which grows and shrinks at runtime based on call depth and local variable sizes. A function that allocates a 2 KB local buffer only uses that stack space while it is executing, so static analysis can't tell you the peak stack usage.

A common technique is stack watermarking: fill the entire stack region with a known pattern at boot, then periodically check how much of the pattern has been overwritten.

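#include <stdint.h>

#define STACK_FILL_PATTERN 0xDEADBEEF

void stack_watermark_init(void) {
    extern uint32_t __stack_start;   /* lowest stack address (linker symbol) */
    extern uint32_t __stack_end;     /* highest stack address (linker symbol) */
    uint32_t *p = &__stack_start;

    /* The stack grows downward from __stack_end, so fill upward from the
       bottom and stop 64 bytes short of the current stack pointer. */
    register uint32_t sp asm("sp");
    while (p < (uint32_t *)(sp - 64)) {
        *p++ = STACK_FILL_PATTERN;
    }
}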

uint32_t stack_usage_bytes(void) {
    extern uint32_t __stack_start;
    extern uint32_t __stack_end;
    uint32_t *p = &__stack_start;

    while (p < &__stack_end && *p == STACK_FILL_PATTERN) {
        p++;
    }

    return (uint32_t)(&__stack_end) - (uint32_t)p;
}

void check_stack_health(void) {
    uint32_t used = stack_usage_bytes();
    uint32_t total = 8192;
    uint32_t percent = (used * 100) / total;

    if (percent > 80) {
        log_warning("Stack usage: %lu / %lu bytes (%lu%%)",
                    used, total, percent);
    }
}

The stack_watermark_init function fills the stack memory (from __stack_start to just below the current stack pointer) with the pattern 0xDEADBEEF. The extern declarations reference the linker symbols defined in the linker script's .stack section. The register uint32_t sp asm("sp") line reads the current stack pointer value so the function knows where to stop filling (you do not want to overwrite your own stack frame). The 64-byte safety margin ensures the fill loop doesn't get too close to the active stack.

The stack_usage_bytes function scans from the bottom of the stack upward, counting how many words still contain the fill pattern. The first word that does not match the pattern indicates the deepest point the stack has reached (the high-water mark). The function returns the number of bytes from that point to the top of the stack.

The check_stack_health function computes the percentage of stack used and logs a warning if it exceeds 80%.

Call stack_watermark_init() as early as possible in your startup code (before main() if you can), then call check_stack_health() periodically during normal operation. This tells you the high-water mark, the maximum stack depth your firmware has reached so far. It only reflects code paths that have actually run, so exercise your worst-case scenarios (nested interrupts, deepest call chains) before trusting the number.
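
The watermarking idea is easy to try on a host before running it on a target. The following Python sketch models the stack as a list of 32-bit words, with index 0 as the lowest address. It is a simulation of the technique under that assumption, not firmware code, and simulate_use is a hypothetical stand-in for real function calls.

```python
FILL = 0xDEADBEEF
WORD = 4  # bytes per stack word

def watermark_init(stack):
    """Fill every word with the pattern (index 0 is the lowest address)."""
    for i in range(len(stack)):
        stack[i] = FILL

def simulate_use(stack, depth_bytes):
    """Overwrite the top depth_bytes, as a descending stack would."""
    for i in range(len(stack) - depth_bytes // WORD, len(stack)):
        stack[i] = 0x12345678

def stack_usage_bytes(stack):
    """Scan up from the lowest address until the pattern stops."""
    untouched = 0
    while untouched < len(stack) and stack[untouched] == FILL:
        untouched += 1
    return (len(stack) - untouched) * WORD

stack = [0] * (8192 // WORD)     # 8 KB stack, as in the linker script
watermark_init(stack)
simulate_use(stack, 1024)        # a shallow call chain
simulate_use(stack, 6800)        # a deep one; its frames are popped afterwards
simulate_use(stack, 512)
used = stack_usage_bytes(stack)
print(used, used * 100 // 8192)
```

The result is 6800 bytes, or 83% of the stack, even though the most recent call chain only used 512 bytes. The pattern records the deepest point ever reached, which is what makes it a high-water mark.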

Method 5: Tracking Memory Across Builds

Every time you add a feature or merge a change, run the memory profile before and after:

arm-none-eabi-size -A firmware_before.elf > mem_before.txt
arm-none-eabi-size -A firmware_after.elf > mem_after.txt
diff mem_before.txt mem_after.txt

These three commands capture the section sizes of two firmware builds (before and after a change) into text files, then diff them to see what changed. This is useful, but the raw diff output can be hard to read. The following script provides a cleaner view by computing the delta per memory region:

#!/bin/bash
# memory_diff.sh - Compare memory usage between two builds

echo "Memory Impact of Change:"
echo "========================"

parse_size() {
    arm-none-eabi-size -A "$1" | awk '
    /\.itcm/  { itcm += $2 }
    /\.dtcm/  { dtcm += $2 }
    /\.ddr/   { ddr += $2 }
    /\.stack/ { dtcm += $2 }
    END { printf "%d %d %d", itcm, dtcm, ddr }
    '
}

read itcm_before dtcm_before ddr_before <<< "$(parse_size "$1")"
read itcm_after  dtcm_after  ddr_after  <<< "$(parse_size "$2")"

printf "ITCM: %+d bytes (%d -> %d)\n" \
    $((itcm_after - itcm_before)) $itcm_before $itcm_after
printf "DTCM: %+d bytes (%d -> %d)\n" \
    $((dtcm_after - dtcm_before)) $dtcm_before $dtcm_after
printf "DDR:  %+d bytes (%d -> %d)\n" \
    $((ddr_after - ddr_before)) $ddr_before $ddr_after

This script takes two ELF files as arguments (the "before" and "after" builds). The parse_size function runs arm-none-eabi-size -A on the given ELF file and uses awk to sum up section sizes by memory region. Sections whose names contain .itcm are counted toward ITCM, sections containing .dtcm or .stack toward DTCM, and sections containing .ddr toward DDR. The main body reads the before and after values, then prints the delta for each region with a + or - sign.

Usage and output:

$ ./memory_diff.sh firmware_without_bt.elf firmware_with_bt.elf

Memory Impact of Change:
========================
ITCM: +63940 bytes (506996 -> 570936)
DTCM: +98832 bytes (628408 -> 727240)
DDR:  +64136 bytes (558779 -> 622915)

This output shows that adding the Bluetooth feature increased ITCM by about 62 KB, DTCM by about 97 KB, and DDR by about 63 KB. You can put this in your CI/CD pipeline so that every pull request shows exactly how much memory it costs.

Method 6: Automated Memory Budget Checks in CI

You can integrate memory profiling into your CI/CD pipeline to catch overflows before they land in your main branch.

#!/bin/bash
# memory_check.sh - Fail CI if memory usage exceeds thresholds

ITCM_LIMIT=85   # percent
DTCM_LIMIT=80
DDR_LIMIT=90

check_region() {
    local name=$1 used=$2 total=$3 limit=$4
    local percent=$((used * 100 / total))

    if [ $percent -ge $limit ]; then
        echo "FAIL: $name usage is ${percent}% (limit: ${limit}%)"
        echo "      Used: $used / $total bytes"
        return 1
    else
        echo "OK:   $name usage is ${percent}% (limit: ${limit}%)"
        return 0
    fi
}

ITCM_USED=$(grep "ITCM:" firmware.map | awk '{print $2}')
ITCM_TOTAL=$((2 * 1024 * 1024))

DTCM_USED=$(grep "DTCM:" firmware.map | awk '{print $2}')
DTCM_TOTAL=1572608

DDR_USED=$(grep "DDR:" firmware.map | awk '{print $2}')
DDR_TOTAL=$((4 * 1024 * 1024))

FAILED=0
check_region "ITCM" $ITCM_USED $ITCM_TOTAL $ITCM_LIMIT || FAILED=1
check_region "DTCM" $DTCM_USED $DTCM_TOTAL $DTCM_LIMIT || FAILED=1
check_region "DDR"  $DDR_USED  $DDR_TOTAL  $DDR_LIMIT  || FAILED=1

exit $FAILED

This script reads memory usage numbers from firmware.map and compares them against configurable percentage thresholds. It assumes the file contains the per-region summary lines shown in Method 1. With GNU ld, those lines come from the --print-memory-usage option, so append that output to the file if your map file lacks them. The check_region function takes a region name, the number of bytes used, the total bytes available, and the percentage limit. It computes the actual percentage and prints either "OK" or "FAIL" along with the numbers. If any region reaches or exceeds its limit, the script exits with a non-zero status, which causes the CI build to fail.

The thresholds at the top (85% for ITCM, 80% for DTCM, 90% for DDR) should be adjusted based on your project's growth rate and how much headroom you want to maintain. DTCM has a lower limit because it fills up faster and is harder to free up.

Add this script to your build pipeline so every pull request shows its memory cost. If a change pushes any region past its threshold, the build fails and the developer knows immediately.

Method 7: Heap Tracking at Runtime

If your embedded project uses dynamic memory allocation (malloc/free), you can wrap the allocator to track usage.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

static size_t heap_used = 0;
static size_t heap_peak = 0;

void *tracked_malloc(size_t size) {
    size_t *block = (size_t *)malloc(size + sizeof(size_t));
    if (!block) return NULL;

    *block = size;
    heap_used += size;
    if (heap_used > heap_peak) {
        heap_peak = heap_used;
    }

    return (void *)(block + 1);
}

void tracked_free(void *ptr) {
    if (!ptr) return;
    size_t *block = ((size_t *)ptr) - 1;
    heap_used -= *block;
    free(block);
}

void print_heap_stats(void) {
    printf("Heap: current=%zu bytes, peak=%zu bytes\n",
           heap_used, heap_peak);
}

This code wraps malloc and free with tracking logic. The tracked_malloc function allocates slightly more memory than requested (an extra sizeof(size_t) bytes) and stores the requested size in the first word of the allocation. It then updates the heap_used counter and, if the new total exceeds the previous peak, updates heap_peak. It returns a pointer that's offset past the size header, so the caller sees a normal pointer to their data.

The tracked_free function reverses the process: it subtracts one size_t from the pointer to find the hidden size header, subtracts that size from heap_used, and calls the real free on the original block.

The print_heap_stats function prints the current and peak heap usage. Call it periodically or on demand through a debug interface (UART console, debug CLI) to monitor how much heap your firmware is using.

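This approach has a small overhead (one extra word per allocation), but it gives you visibility into dynamic memory usage that's otherwise completely invisible. It's especially useful for tracking down memory leaks: if heap_used keeps growing over time without ever decreasing, something is allocating without freeing. Two caveats apply. On a 32-bit target the header is 4 bytes, so the pointer returned to the caller is only 4-byte aligned. If callers store 64-bit types in these blocks, pad the header to 8 bytes. The counters are also not protected against concurrent updates, so guard them if you allocate from multiple tasks or from interrupt context.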
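
If you log heap_used at a fixed point in your main loop, a simple trend check can flag suspicious growth offline. The sketch below is a crude heuristic rather than a leak detector: looks_like_leak is a hypothetical helper, the sample values are invented for illustration, and a real leak can hide behind usage that also fluctuates.

```python
def looks_like_leak(samples, min_samples=5):
    """True if heap_used never decreases and ends higher than it started."""
    if len(samples) < min_samples:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]

# Invented samples of heap_used, taken once per main-loop iteration.
steady = [4096, 6144, 4096, 6144, 4096, 6144]     # allocations are freed again
leaking = [4096, 4224, 4352, 4480, 4608, 4736]    # 128 bytes lost per sample
print(looks_like_leak(steady), looks_like_leak(leaking))
```

The steady series returns to its baseline and is not flagged, while the series that climbs by 128 bytes per sample is.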

Summary

Embedded processors based on ARM Cortex-M and Cortex-R architectures give you direct control over three memory regions with very different performance characteristics.

ITCM (Instruction Tightly-Coupled Memory) stores your most performance-critical code. It provides single-cycle, deterministic instruction fetch. It's small (typically 512 KB to 2 MB), so reserve it for ISRs, real-time processing functions, and hot loops.

DTCM (Data Tightly-Coupled Memory) stores your most performance-critical data. It also provides single-cycle, deterministic access. Your stack lives here by default. It's even smaller than ITCM and fills up quickly, so be deliberate about what you place in it.

DDR (Double Data Rate) memory stores everything else. It's much larger but slower (5 to 20+ cycles per access, with variable latency). Use it for initialization code, large buffers, protocol stacks, and anything that doesn't need deterministic timing.

You control placement through __attribute__((section(...))) in your C code and section-to-region mappings in your linker script. You verify placement through map files, the size command, and runtime profiling techniques like stack watermarking. The core skill is knowing which region each piece of your firmware belongs in, and having the tooling to catch mistakes early.

Understanding Escape Analysis in Go – Explained with Example Code

Eti Ijeoma — Thu, 12 Feb 2026 18:31:12 +0000

In most languages, the stack and heap are two ways a program stores data in memory, managed by the language runtime. Each is optimized for different use cases, such as fast access or flexible lifetimes.

Go follows the same model, but you usually don’t decide between the stack and the heap directly. Instead, the Go compiler decides where values live. If the compiler can prove a value is only needed within the current function call, it can keep it on the stack. If it cannot prove that, the value “escapes” and is placed on the heap. This technique is called escape analysis.

This matters because heap allocations increase garbage collector work. In code that runs often, that extra work can show up as more CPU spent in GC, more allocations, and less predictable performance.

In this article, I’ll explain what escape analysis is, the common patterns that trigger heap allocation, and how to confirm and reduce avoidable allocations.

Prerequisites
Do You Really Need to Care About Escape Analysis?
Memory Layout and Lifecycle
Sharing Down and Sharing Up
Escape Analysis in Practice
How to Use Escape Analysis to Guide Performance
Conclusion
Further Reading

Prerequisites

Familiarity with Go fundamentals (functions, variables, structs, slices, maps)
Basic understanding of pointers in Go (& and *)
A general idea of how goroutines work

Do You Really Need to Care About Escape Analysis?

Before we go deeper, I want to call this out clearly. For the correctness of your program, it doesn’t matter whether a variable lives on the stack or on the heap, or whether you know that detail. The Go compiler is smart enough to place values where they need to be so that your program behaves correctly.

Most of the time, you don’t need to think about this at all. It only starts to matter when performance becomes a problem. If your program is already fast enough, you’re done, and there’s no point trying to squeeze out extra speed.

You should only start caring about stack vs heap when you have benchmarks that show your program is too slow, and those same benchmarks point to heavy heap allocation and garbage collection as part of the problem.

Memory Layout and Lifecycle

To get a better understanding of what escape analysis is, you first need a simple picture of how Go lays out memory while your program runs. At this level, it comes down to the stack each goroutine uses, how stack frames are carved out of that stack, and when values move to the heap where the garbage collector can see them.

Goroutine Stacks and Stack Frames

When a Go program starts, the runtime creates the main goroutine, and every go statement creates a new goroutine, each with its own stack.

There’s not a single global stack for the whole process. As of writing this article, with Go v1.25.7, each goroutine gets an initial contiguous block of 2,048 bytes of memory, which acts as its stack. The stack is where Go stores data that belongs to function calls. When a goroutine calls a function, Go reserves a chunk of that goroutine’s stack for the function’s local data. That chunk is called a stack frame.

It holds the function’s local variables and the call state needed to return and continue execution. If that function calls another function, a new frame is added on top. When the inner function returns, its frame becomes invalid, and the goroutine continues in the caller’s frame.

A stack frame only lives for as long as the function is active. Once the function returns, anything stored in its frame is considered invalid, even if the raw bytes are still in memory and will be reused later. Code must not rely on those values after the return.

Go stacks can grow. A goroutine starts with a small stack and the runtime grows it when needed, but the lifetime rule stays the same. A value is safe in a stack frame only if nothing can still reference it after the function returns. If it might be referenced later, it can’t stay in that frame and must be placed somewhere safer.

Pointers and Lifetime

In Go, taking an address like p := &x means you now have a pointer in one stack frame that refers to a value which may have been created in another frame. When you pass that pointer into a function, Go still passes by value. The callee gets its own pointer variable on its own stack frame, but the address inside still points to the same underlying value. So pointers are how you share access to one value across several frames without copying the value itself.
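To make the "pointer is passed by value" point concrete, here is a minimal sketch. The names bump and bumped are hypothetical helpers introduced only for this illustration. The callee writes through its copy of the pointer, which changes the caller's value, and then reassigns its own copy, which the caller never sees.

```go
package main

import "fmt"

// bump receives a copy of the pointer. Writing through it changes the
// caller's value; reassigning it only changes the local copy.
func bump(p *int) {
	*p++    // modifies the value both pointers refer to
	p = nil // affects only bump's own pointer variable
}

// bumped is a hypothetical helper that reports what the caller observes.
func bumped(start int) (value int, stillSet bool) {
	x := start
	p := &x
	bump(p)
	return x, p != nil
}

func main() {
	v, ok := bumped(1)
	fmt.Println(v, ok) // prints: 2 true
}
```

The caller's pointer is still set after the call, which shows that only the address was copied, not the value it points to.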

Lifetime becomes important when a pointer can outlive the frame where the pointed value was created. As long as both the pointer and the value live inside frames that are still active in the current call stack, everything is safe.

Once a pointer might still exist after the original frame has returned, the value can no longer stay in that frame, because that frame will become invalid. At that point, the value has to be placed in a safer location so that no pointer ever points into dead stack memory.

Now that you have a picture of stacks, frames, and pointers, we can look at two common ways pointers move through your code. I’ll call them sharing down and sharing up. The names aren’t special Go terms. They’re just a simple way to describe how a pointer moves along the call stack.

Sharing down means a function passes a pointer or reference to functions it calls. The pointer moves deeper into the call stack, but the value it points to still belongs to a frame that is active.

Example code:

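package main

import "fmt"

func main() {
    n := 10
    multiply(&n)   // share the address of n down the call stack
    fmt.Println(n) // prints 20
}

// multiply doubles the value that v points to.
func multiply(v *int) {
    *v = *v * 2
}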

In main, you take the address of n and pass it into multiply. While multiply runs, both the main frame and the multiply frame are active. The pointer in multiply points to a value that still lives in an active frame, so this situation is safe from a lifetime point of view.

After multiply runs and returns, its frame becomes invalid. Nothing else has to happen, because the stack pointer simply moves back to the caller's frame. That single step reclaims all the memory the function used, so the garbage collector is not involved in cleaning up stack memory.

Sharing up means a function returns a pointer, or stores it somewhere that will still be around after the function returns. The pointer moves back up the call stack or into some longer-lived state while the frame that created the value is about to end, so that value can no longer be tied to that one frame.

The same idea shows up when you share a value with another goroutine, because Go doesn’t let one goroutine hold pointers into another goroutine’s stack, so shared data needs a lifetime that is not tied to a single stack.
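As a rough sketch of sharing a value with another goroutine, the example below hands a pointer to a new goroutine and waits for it to finish. The function name doubleInGoroutine is hypothetical. Because the pointer leaves the calling goroutine, you would expect the compiler to give n a lifetime that is not tied to one stack, though the exact decision is up to the compiler.

```go
package main

import (
	"fmt"
	"sync"
)

// doubleInGoroutine is a hypothetical helper: it shares n with a new
// goroutine through a pointer and waits for that goroutine to finish.
func doubleInGoroutine(n int) int {
	var wg sync.WaitGroup
	wg.Add(1)
	go func(p *int) {
		defer wg.Done()
		*p *= 2 // the other goroutine writes through the shared pointer
	}(&n)
	wg.Wait() // wait so the read below does not race with the write
	return n
}

func main() {
	fmt.Println(doubleInGoroutine(21)) // prints: 42
}
```

The WaitGroup is only there to make the read after the goroutine safe. It does not change where the value lives.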

Heap, Garbage Collection, and Lifetime

Values that might outlive a single stack frame can’t stay in that frame. The compiler places them on the heap instead. The heap is a separate region of memory that isn’t tied to one function call. Any goroutine can hold pointers to heap values, and those values stay valid as long as something in the program can still reach them. You can think of the heap as storage for “might live longer than this call”.

The garbage collector is what keeps this safe. Periodically, the runtime starts from a set of roots (global variables, active stack frames, some internal state) and follows all the pointers it can see. Any heap value that is still reachable is kept. Any heap value that is no longer reachable is treated as garbage and its memory is reclaimed.

This means a pointer in main will never legally point into dead stack memory. Either the value stayed in an active frame, or it was placed on the heap where the GC can track its lifetime. The tradeoff is that more heap allocations and longer-lived objects require the GC to do more work.

Here’s an example:

package main

import "fmt"

type Car struct {
    Brand string
    Model string
}

func main() {
    // main receives a pointer from a function it called and this is sharing up
    carPtr := makeCar("Volkswagen", "Golf") 

    fmt.Printf("I received a car: %s %s\n", carPtr.Brand, carPtr.Model)
}

func makeCar(b, m string) *Car {
    myCar := Car{
        Brand: b,
        Model: m,
    }
    return &myCar
}

In the above code:

In makeCar (the callee frame), Go creates a local variable myCar. Because you return &myCar, the compiler allocates the Car value on the heap, and lets myCar refer to the heap address 0xc00029fa0.
When makeCar returns, that address is copied into carPtr in main (the top frame). carPtr is just another stack variable, but its value is still 0xc00029fa0, so now main also points to the same heap Car.
The heap holds the actual Car value at 0xc00029fa0. Both myCar (while makeCar is running) and carPtr (after it returns) reach that same value through their pointers.
Once makeCar is done, its frame drops into the “invalid memory” region, but the Car stays alive on the heap because main still holds carPtr. That’s the escape: the value stops being tied to the callee frame and gets heap lifetime instead.
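For contrast, here is a sketch of the same constructor returning a value instead of a pointer. The name makeCarValue is hypothetical and not part of the original example. The caller receives its own copy of the struct, so no pointer to a local is shared up. Whether anything still reaches the heap (for example, because of the fmt call) is for the compiler to decide.

```go
package main

import "fmt"

type Car struct {
	Brand string
	Model string
}

// makeCarValue is a hypothetical variant of makeCar that returns the
// struct itself, so the caller gets a copy and no pointer is shared up.
func makeCarValue(b, m string) Car {
	return Car{Brand: b, Model: m}
}

func main() {
	car := makeCarValue("Volkswagen", "Golf")
	fmt.Printf("I received a car: %s %s\n", car.Brand, car.Model)
}
```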

Escape Analysis in Practice

Escape analysis is how the Go compiler decides whether a value lives on the stack or on the heap. It’s not only about returning pointers – it follows how addresses move through your code. If a value might outlive the current function, the compiler can’t keep it in that stack frame and moves it to the heap. Since only the compiler sees the full picture, the useful thing is to ask it to show these decisions and then link them back to your code.
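Returning a pointer is only one route. The sketch below shows two others: storing an address in package-level state, and capturing a local in a closure that outlives the call. The names latest, remember, and counter are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// latest is package-level state that outlives any single call.
var latest *int

// remember stores the address of a local in a package-level variable.
func remember(v int) {
	x := v
	latest = &x // x must outlive remember, so it cannot stay in this frame
}

// counter returns a closure that keeps using the local variable count.
func counter() func() int {
	count := 0
	return func() int {
		count++ // count is captured and used after counter has returned
		return count
	}
}

func main() {
	remember(7)
	next := counter()
	next()
	fmt.Println(*latest, next()) // prints: 7 2
}
```

In both cases the value is still needed after the function that created it has returned, which is the same lifetime problem as returning a pointer.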

To do that, we can pass compiler flags using -gcflags when running go build or go run. If you want to see the available options, you can check go tool compile -h. In that list, -m prints the compiler’s optimisation decisions, including escape analysis output. If you want more details, you can use -m=2 or -m=3 for a more verbose output. The -l flag disables inlining, so the report is easier to read because the compiler is not merging small functions into their callers.

So, the command will look like this:

go run -gcflags='all=-m -l' .

Or for a build:

go build -gcflags='all=-m -l' .
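If you want a small file to try these commands on, the sketch below has one function whose data should stay local and one that returns the address of a local. The function names are hypothetical. The exact wording and line positions of the compiler report vary between Go versions, so the output is not reproduced here. You would expect the report to flag c in newCounter and not nums in sumLocal.

```go
package main

import "fmt"

// sumLocal keeps its array inside the frame; no address leaves the function.
func sumLocal() int {
	nums := [4]int{1, 2, 3, 4}
	total := 0
	for _, n := range nums {
		total += n
	}
	return total
}

// newCounter returns the address of a local, so the value is shared up.
func newCounter() *int {
	c := 42
	return &c
}

func main() {
	fmt.Println(sumLocal(), *newCounter()) // prints: 10 42
}
```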

How to Use Escape Analysis to Guide Performance

You can think of escape analysis as the thing that turns your code choices into GC work. When a value escapes, it gets heap lifetime, and the garbage collector has to visit it. In hot paths, lots of small escaping values show up as extra GC time and latency jitter. When a value stays in a stack frame, it dies with that frame when the function returns, and the GC never has to track it.

Here are a few simple practices that help performance:

Prefer values for small data: If the function doesn’t need to mutate the caller’s data, use value types for small structs and basic types when passing arguments and returning results. It’s cheap to copy an int or a small struct, and it often keeps lifetimes local to a single call.
Use pointers when sharing or mutation is part of the design: Opt for pointers when you genuinely need shared mutable state or want to avoid copying large structs.
Avoid creating long-lived references by accident: Be careful when returning pointers to locals, capturing variables in closures, or storing addresses in long-lived structs, maps, or interfaces. These patterns are the ones most likely to push values out of a stack frame.
Pass in reusable buffers on hot paths: On code paths that run very often, the problem is usually not one big allocation, but many small ones happening in a loop. A common cause is functions that always create a new buffer inside, even when the caller could have passed one in.

A simple way to cut those extra allocations is to let the caller own the buffer. The caller allocates a []byte once, then passes it into the function each time. The function only fills the buffer instead of creating a new one.

Here’s an example of how a bad function allocates a new buffer every call:
```
 package main

 // Bad: helper allocates every call.
 func fillBad() []byte {
     buf := make([]byte, 4096)
     // pretend we read into it
     buf[0] = 1
     return buf
 }

 func hotPathBad() {
     for i := 0; i < 1_000_000; i++ {
         b := fillBad() // allocates 1,000,000 times
         _ = b
     }
 }

 func main() {
     hotPathBad()
 }
```
When we run escape analysis with this:
```
 go run -gcflags='-m -l' .
```
We see the following:
```
 ./main.go:5:13: make([]byte, 4096) escapes to heap
```
If we were only allocating a few times, we could choose not to worry – but the real problem is how this looks inside the loop. hotPathBad calls fillBad on every iteration, so each call allocates a new 4 KB slice on the heap. If this loop runs many times, you end up creating a lot of short-lived heap objects. The garbage collector then has to find and clean up all those buffers, which adds extra work that you could have avoided by reusing a single buffer.

Here’s an example of a better version where the caller allocates once and reuses:
```
 package main

 func fill(buf []byte) int {
     // pretend we read into it
     buf[0] = 1
     return 1
 }

 func hotPath() {
     buf := make([]byte, 4096) 

     for i := 0; i < 1_000_000; i++ {
         n := fill(buf) 
         _ = buf[:n]
     }
 }

 func main() {
     hotPath()
 }
```
In this version, hotPath controls the buffer. It allocates buf once, then passes it into fill on every loop. You still read the same data, but you avoid creating a new slice on each call. That reduces avoidable allocations in the hot path.
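One way to check the difference yourself is to count allocations per call. The sketch below uses testing.AllocsPerRun from the standard library. The helper allocsPerCall and the sink variable are hypothetical, and the //go:noinline directives are an assumption added so that inlining does not hide the allocation in such a tiny program.

```go
package main

import (
	"fmt"
	"testing"
)

// fillBad allocates a fresh buffer on every call.
//
//go:noinline
func fillBad() []byte {
	buf := make([]byte, 4096)
	buf[0] = 1
	return buf
}

// fill writes into a buffer owned by the caller.
//
//go:noinline
func fill(buf []byte) int {
	buf[0] = 1
	return 1
}

// sink keeps the result of fillBad reachable so the call is not optimised away.
var sink []byte

// allocsPerCall is a hypothetical helper that reports the average number
// of heap allocations per call for both versions.
func allocsPerCall() (bad, good float64) {
	bad = testing.AllocsPerRun(1000, func() { sink = fillBad() })
	buf := make([]byte, 4096) // allocated once, outside the measured function
	good = testing.AllocsPerRun(1000, func() { _ = fill(buf) })
	return bad, good
}

func main() {
	bad, good := allocsPerCall()
	fmt.Printf("fillBad: %.0f allocs/call, fill: %.0f allocs/call\n", bad, good)
}
```

Treat the numbers as a quick sanity check rather than a benchmark. For real decisions, profiles and benchmarks of your actual hot path matter more.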

Conclusion

In Go, where a value ends up is not decided by how you create it. It’s decided by how long that value must remain valid and how it is referenced as your code runs.

The practical takeaway is not to avoid pointers. It’s to be deliberate about lifetime. Value semantics can keep lifetimes tight and reduce GC work, while pointers can be the right choice when you need shared state or in-place updates. The balance is to write the clear version first, then look at your benchmarks and profiles to see if anything actually needs to change.

Embedded Swift: A Modern Approach to Low-Level Programming

Soham Banerjee — Sat, 02 Aug 2025 00:45:59 +0000

Embedded programming has long been dominated by C and C++, powering everything from microcontrollers to real-time systems. While these languages offer unmatched low-level control, they also introduce persistent challenges: manual memory management, unsafe pointer operations, and subtle logic bugs stemming from weak type systems and undefined behavior.

With the release of Swift 6 and its new Embedded Swift compilation mode, developers now have access to a modern, memory-safe, and performant alternative that’s tailored specifically for resource-constrained systems.

While languages like Rust have also emerged to address these issues, Embedded Swift brings the clarity and safety of Swift to microcontroller environments, without giving up on determinism, binary size, or hardware access.

This article introduces Embedded Swift and explores how it compares to traditional C/C++ development. We’ll cover its key features, programming and memory models, how to set up the toolchain for STM32 microcontrollers, and how to link Swift with existing C drivers.

Along the way, we’ll examine performance trade-offs, growing ecosystem support, and the broader industry movement toward memory-safe languages. As I hope you’ll see, Swift is a serious contender in the future of embedded development.

Prerequisites

To get the most out of this article, you should have a basic understanding of programming in Swift and C. Familiarity with embedded hardware platforms and firmware development concepts will also be helpful.

If you're new to embedded systems, consider reviewing this introductory guide to embedded firmware to build foundational knowledge before diving into Embedded Swift.

Scope

This article is intended as a practical introduction to Embedded Swift. It covers:

An overview of Embedded Swift and its key language features
Swift’s programming and memory model in an embedded context
Setting up the Embedded Swift toolchain on macOS for STM32 microcontrollers
Interoperability with C code and linking to existing low-level drivers
A look at memory and instruction-level performance
Future directions and use cases for Embedded Swift

Note that this article does not provide a full tutorial on the Swift language itself. While the primary focus is on STM32, similar principles apply to other supported platforms such as ESP32, Raspberry Pi Pico, and nRF52.

What is Swift? What is Embedded Swift?

Swift is a modern programming language developed by Apple that combines the performance of compiled languages with the expressiveness and safety of modern language design. While Swift was originally created for iOS and macOS development, it has evolved into a powerful general-purpose language used in server-side development, systems programming, and increasingly, embedded systems.

Embedded Swift is a special compilation mode introduced in Swift 6 that brings the benefits of Swift to resource-constrained platforms like microcontrollers. It lets developers use a safe, high-level language while still producing compact, deterministic, and performant binaries suitable for embedded applications.

Key Features of Swift

Embedded Swift retains many of the powerful language features that make Swift an attractive alternative to C/C++ in embedded development:

Type Safety: Swift uses a strong static type system, which prevents many programming errors at compile time. Unlike C, where type mismatches can result in undefined behavior, Swift ensures all types are used correctly before code even runs.

Strict Type Checking: Swift doesn't allow implicit type conversions that could lose data or cause unexpected behavior. For example:

// This won't compile in Swift
let integer: Int = 42
let decimal: Double = 3.14
let result = integer + decimal  // Error: binary operator '+' cannot be applied to operands of type 'Int' and 'Double'

// You must be explicit about conversions
let sum = Double(integer) + decimal  // Correct

Non-nullable Types by Default: In C, pointers can be null by default, which introduces risk. In Swift, variables cannot be nil unless explicitly marked as optionals:

var name: String = "John"
name = nil  // Compile error - String cannot be nil

var optionalName: String? = "John"
optionalName = nil  // This is allowed

Memory Safety via ARC (Covered in detail later):

Swift manages memory automatically using Automatic Reference Counting (ARC). Unlike manual memory management in C/C++, ARC handles object lifecycles efficiently without unpredictable garbage collection pauses. We'll cover ARC and its impact in embedded contexts in a dedicated section later.

Modern Syntax:
Swift's syntax is clean, consistent, and designed for readability. It supports modern paradigms including:

Functional programming (map, filter, reduce)
Generics (type-safe abstractions)
Protocol-Oriented Programming (discussed in the next section)

These features allow you to write more expressive and maintainable code compared to procedural C or inheritance-heavy C++.

Performance:
Swift is designed to perform on par with C++ in many scenarios. Optimizations such as inlining, dead code elimination, and static dispatch help ensure that high-level abstractions don’t compromise performance. In embedded mode, Swift also drops features that depend on runtime metadata, such as runtime reflection, and resolves generic and protocol calls at compile time to further reduce overhead.

To fully leverage Swift for embedded development, it's important to understand its programming model. Unlike C’s procedural approach or C++’s class-heavy design, Swift promotes protocol-oriented programming and composition, which offers both flexibility and safety in embedded system design.

Swift Programming Model

Swift embraces a multi-paradigm programming model that blends object-oriented, functional, and protocol-oriented programming, all underpinned by strong type safety and memory safety.

For embedded developers coming from C or C++, this model may feel different at first. But it provides a more modular and testable way to build complex systems, something especially valuable in embedded applications where hardware abstraction and strict reliability are critical.

Protocol-Oriented Programming (POP)

Swift emphasizes protocols over inheritance, encouraging developers to define behaviors through protocols and implement them using value types like struct and enum, rather than relying heavily on classes.

This philosophy favors composition over inheritance, allowing you to build complex functionality by combining smaller, well-defined components.

Key Concepts:

protocol defines required behavior.
Protocol extensions provide default behavior.
Prefer value semantics using struct.

Example:

protocol Speakable {
    func speak()
}

extension Speakable {
    func speak() {
        print("Default sound")
    }
}

struct Dog: Speakable {
    func speak() {
        print("Woof!")
    }
}

Embedded Swift uses protocols with static dispatch. With static dispatch, the compiler knows the exact memory address of the function to call and can generate a direct jump instruction. There's no runtime lookup, no indirection, and no uncertainty.

Why POP Matters for Embedded Systems

First, you get flexible hardware abstraction. Protocols make it easy to define interfaces for hardware components, allowing for mock implementations during testing or platform-specific variations.

Second, you get low overhead. Embedded Swift uses static dispatch for protocols, meaning there’s no runtime lookup, and calls are resolved at compile time for maximum performance.

Also, struct and enum types avoid heap allocations, making code more efficient and predictable in low-memory environments.

Now that we’ve explored how Swift’s programming model enables safer and more modular embedded code, let’s turn to another critical piece of the puzzle: memory management. Swift’s use of Automatic Reference Counting (ARC) replaces manual memory handling and offers important benefits, and tradeoffs, for embedded systems.

Swift Memory Management

One of Swift’s most impactful features, especially in the context of embedded systems, is its use of Automatic Reference Counting (ARC) for memory management. Unlike C/C++, where memory must be manually allocated and freed using malloc and free, Swift automates this process while maintaining deterministic performance.

This automation significantly reduces the risk of common memory-related bugs like leaks, dangling pointers, or use-after-free errors, all of which are notorious in low-level C code.

How ARC works

Swift supports ARC not only for the Cocoa Touch APIs but for all APIs, providing a streamlined approach to memory management. Unlike garbage-collected systems, which can pause unpredictably, ARC manages memory deterministically: the compiler inserts the bookkeeping at compile time, and it runs at well-defined points at runtime.

ARC automatically tracks and manages the lifetime of objects in memory based on how many references point to them.

Reference Counting: Every object has a counter that tracks how many strong references point to it.
Retain / Release: The compiler inserts retain and release calls automatically during assignment and deinitialization.
Immediate Deallocation: When the reference count reaches zero, the object is deallocated immediately.
Deterministic: Unlike garbage collectors, ARC doesn’t introduce unpredictable pauses or runtime scanning.

Swift offers multiple reference types to give you precise control over memory behavior and prevent cycles:

Strong References (default)

Keeps the referenced object alive.
Used in most cases.

class MotorController {
    var sensor: SensorData?  // Strong reference

    func updateReading(newData: SensorData) {
        self.sensor = newData  // Previous sensor data is released, and deallocated if nothing else references it
    }
}

Weak References

Used to break reference cycles (especially in two-way object relationships).
Automatically becomes nil when the referenced object is deallocated.

class Device {
    var controller: MotorController?

    deinit {
        print("Device deallocated")
    }
}

class MotorController {
    weak var device: Device?  // ← Weak reference breaks the cycle

    deinit {
        print("MotorController deallocated")
    }
}

func breakCycle() {
    let device = Device()
    let controller = MotorController()

    device.controller = controller
    controller.device = device  // ← This is now a weak reference

    // When this function ends, both objects are properly deallocated
}

breakCycle()
// Output:
// Device deallocated
// MotorController deallocated

Unowned References

Non-optional version of weak.
Assumes the object will never be deallocated while still in use.
More lightweight than weak, but unsafe if misused.

class SensorSystem {
    unowned let controller: MotorController  // unowned reference

    init(controller: MotorController) {
        self.controller = controller
    }
}

class MotorController {
    var sensorSystem: SensorSystem?

    func setupSensors() {
        sensorSystem = SensorSystem(controller: self)
    }

    deinit {
        print("MotorController deallocated")
    }
}

func testUnowned() {
    let controller = MotorController()
    controller.setupSensors()
    // controller is released here: its deinit runs, then its sensorSystem is released
}

testUnowned()
// Output: MotorController deallocated

ARC Overhead in Embedded Systems

While ARC provides safety benefits, it does introduce some overhead compared to manual memory management:

Memory Overhead:

ARC-managed class instances in Swift typically include an additional 4 or 8 bytes of reference count metadata, depending on the architecture: 4 bytes on 32-bit systems and 8 bytes on 64-bit systems. This metadata lets the runtime track how many active references exist to a given object and deallocate it when none remain. Weak and unowned references increase the footprint further, because the runtime needs extra bookkeeping, such as side tables, to track object liveness and cleanup. For weak references specifically, Swift zeroes the reference once the referenced object is deallocated, so a weak reference never dangles.

CPU Overhead:

ARC introduces some runtime overhead due to retain and release operations, which are inserted automatically during reference assignments. These operations increment or decrement the reference count and are especially common in code that passes objects between functions or stores them in collections. To ensure thread safety, these updates are typically implemented using atomic operations, which add further instruction cycles. ARC does not detect reference cycles on its own, so in complex object graphs developers must break strong reference cycles with weak or unowned references to prevent memory leaks. While Swift's ARC provides deterministic and efficient memory management, it does so with both memory and CPU costs that developers should consider carefully, especially in performance-critical embedded systems.

Type Safety and Error Prevention

Swift's type system prevents many common errors that plague C/C++ programs:

Buffer Overflows: Swift arrays are bounds-checked, preventing buffer overflow vulnerabilities that are common in C.
Null Pointer Dereferences: Swift's optional types force you to handle the nil case before using a value, so accidental null dereferences are caught at compile time.
Use After Free: Swift's ownership model prevents use-after-free errors that can cause crashes or security vulnerabilities.

Now that we’ve covered Swift's memory model and ARC behavior, let’s explore how it compares to C in terms of memory usage and instruction cycles, a crucial aspect when evaluating Embedded Swift for real-world deployment.

Memory and Instruction Cycle Comparison

Understanding the performance characteristics of Swift versus C is essential for embedded systems, where every instruction cycle and byte of memory matters. While Swift brings advantages like safety and expressiveness, these benefits come with certain trade-offs in terms of memory usage and runtime behavior that embedded developers must evaluate carefully.

Memory Management:

Swift uses Automatic Reference Counting (ARC) to manage memory. ARC tracks the number of references to each object and deallocates it when no references remain. This eliminates the need for explicit free() calls but introduces overhead.

C, in contrast, uses manual memory management. Developers allocate memory using malloc and release it using free, or rely on the stack for most short-lived data.

The table below provides the memory management comparison between Swift and C:

Feature	Swift (ARC)	C (Manual)
Memory strategy	Automatic reference counting	Manual with `malloc`/`free`
Overhead per object	4–8 bytes (for ref count)	None for stack; variable for heap
Deallocation	Deterministic, triggered by ARC	Developer-controlled
Weak reference support	Requires additional metadata	Not built-in
Thread safety	Atomic operations in ARC	Not guaranteed
Layout control	Limited, compiler-managed	Full control (via structs/pointers)

Swift ensures safety through deterministic cleanup and predictable memory usage. But this comes at the cost of added memory and CPU overhead.

C’s approach offers complete control over memory layout and minimal runtime cost, but increases the risk of memory leaks and fragmentation without disciplined practices.

Instruction Cycle Analysis

The safety features in Swift, such as bounds checking, optional unwrapping, and ARC updates, translate into additional CPU instructions. While this can impact performance, the Swift compiler is aggressive about optimization in release builds. For example, inlining and ARC elision can remove much of the overhead in performance-critical paths.

C has no built-in safety checks, allowing it to generate highly efficient, predictable code. Developers can even use inline assembly for tight control over performance.

The table below provides the instruction cycle comparison between Swift and C:

Instruction-Level Feature	Swift	C
Reference count updates	2–4 instructions per assignment	N/A
Bounds checking	1–3 instructions per array access	None
Optional unwrapping	1–2 instructions per check	N/A
Method dispatch	Static for specialized generics; class methods add indirection	Direct calls or function pointers
Optimization potential	ARC elision, inlining, dead code removal	Full manual control, inline assembly
Predictability	High in optimized builds, with some abstraction overhead	Very high, minimal abstraction

Although Swift inserts extra instructions for safety, much of this cost can be mitigated through compiler optimization.

C has no such features by default, making it ideal for applications where performance must be tightly controlled and the developer is willing to take full responsibility for safety.

Instruction Count Comparison: Swift vs C Loop Performance

When evaluating Swift and C for embedded use, it's helpful to analyze instruction-level performance on basic operations, such as a loop that processes an array of floating-point numbers. This gives us a concrete sense of the computational cost of each language's safety and abstraction features.

Let’s consider a simple example: summing an array of Float values and returning the average. In Swift, the code uses a high-level for-in loop over an array:

Simple loop performance:

// Swift loop with safety checks
func processData(_ data: [Float]) -> Float {
    var sum: Float = 0.0
    for value in data {  // Iterator with bounds checking
        sum += value     // Safe arithmetic
    }
    return sum / Float(data.count)  // Safe division
}
// Estimated: ~8-10 instructions per iteration

Although elegant and safe, this loop includes several safety mechanisms:

Bounds checking on every array access
Reference counting if data is passed as a reference type
Integer overflow checks (kept in optimized builds unless you compile with -Ounchecked)
Optional handling or runtime checks if data might be empty

These checks introduce runtime overhead, resulting in an estimated 8–10 instructions per iteration on most platforms (depending on optimization level and target architecture). In release builds, Swift aggressively inlines and strips redundant checks, but some level of abstraction cost remains, especially compared to raw memory access in C.

Now, compare that to its equivalent in C:

// C loop without safety checks
float process_data(float* data, int count) {
    float sum = 0.0f;
    for (int i = 0; i < count; i++) {  // Direct pointer arithmetic
        sum += data[i];                // Direct memory access
    }
    return sum / count;  // Direct division (no safety check)
}
// Estimated: ~4-5 instructions per iteration

This version performs direct memory access with pointer arithmetic, no bounds checks, and no type safety. The C code is lower-level, with fewer runtime checks, and compiles down to just 4–5 instructions per iteration, depending on the target CPU and compiler flags. It is lean and fast, ideal for cycle-critical scenarios.

The table below shows the comparison of single loop performance between Swift and C:

Aspect	Swift	C
Array access	Bounds-checked	Direct pointer access
Loop iteration	High-level iterator abstraction	Raw loop with pointer increment
Instruction count (per loop)	~8–10 (in debug), ~6–8 (in release)	~4–5
Division	Safe (avoids divide-by-zero in dev)	Direct
Overflow behavior	Integer overflow traps in debug and -O builds; unchecked only with -Ounchecked	Unchecked
Readability and safety	High	Low
Performance	Lower (but optimizable)	Higher (manual)

Now that we’ve compared Swift and C in terms of memory and cycle costs, let’s move into the practical side: how to set up Embedded Swift on an STM32 platform and get started with real-world development.

How to Set Up Embedded Swift

In this section, we'll walk through how to configure and use Embedded Swift for development on STM32 microcontrollers. STM32 is a popular family of ARM Cortex-M–based microcontrollers, commonly used in industrial, consumer, and IoT applications.

Prerequisites

Required Software:

Swift Development Snapshot (includes the Embedded Swift toolchain)
Swiftly - The easiest way to install and manage Swift toolchains
swiftc - The Swift compiler command-line tool
Python3 - Required to run scripts to convert Mach-O to binary files
Git - To clone sample repositories such as https://github.com/swiftlang/swift-embedded-examples
A Unix-like development environment (macOS is currently best supported)

Target Hardware: This guide focuses on STM32 microcontrollers, which are widely used in embedded applications and have excellent community support.

This guide walks you through the full setup process, from installing the required Swift toolchain to flashing the final binary onto your board. We’ll begin by installing the Swift Development Snapshot using Swiftly, a simple command-line utility for managing Swift toolchains. From there, we’ll configure the build system, set up the correct board variant, customize the build script, and compile the Swift and C source code into a binary. Finally, we’ll flash the firmware onto the STM32 using standard tools.

Install Swift Development Snapshot

The easiest way to install and manage Embedded Swift toolchains is by using the swiftly tool, which simplifies downloading and using Swift snapshots.

macOS Installation:

The below steps will help install the Swift embedded toolchain:

# Using Swiftly (Recommended)
curl -O https://download.swift.org/swiftly/darwin/swiftly.pkg
installer -pkg swiftly.pkg -target CurrentUserHomeDirectory
~/.swiftly/bin/swiftly init --quiet-shell-followup
source "${SWIFTLY_HOME_DIR:-$HOME/.swiftly}/env.sh"

# Install and use development snapshot
swiftly install main-snapshot
swiftly use main-snapshot

# Verify installation
swift --version

You can clone this GitHub example repository:

git clone https://github.com/swiftlang/swift-embedded-examples.git 
cd swift-embedded-examples/projects/stm32-blink

The stm32-blink project contains:

Swift code that toggles GPIOs
A C startup file with vector table
A build.sh script that uses swiftc, clang, and a custom linker setup

Set Up the STM32 Board

Tell the build script which STM32 board is being used:

export STM_BOARD=STM32F746G_DISCOVERY

You can add your own board variant by defining the appropriate memory map and compiler flags in the script.

Modify build.sh (Optional)

Ensure the script correctly locates the following:

swiftc: should point to the toolchain you installed with Swiftly
clang: can be macOS’s default Clang
libBuiltin.a, crt0.s, and macho2bin.py: used to provide minimal runtime support and convert output to flashable binaries

If needed, update these paths:

SWIFT_EXEC=${SWIFT_EXEC:-$(swiftly which swiftc)}
CLANG_EXEC=${CLANG_EXEC:-$(xcrun -f clang)}
PYTHON_EXEC=${PYTHON_EXEC:-$(which python3)}

Ensure the linker flags match your target’s flash and RAM sizes.

Build and Flash the Project:

Run:

./build.sh

This compiles Swift and C code, links them, and produces a blink.bin file.

If successful, you’ll see:

.build/blink.bin  # ready to flash

Flash the Firmware to STM32

Use ST-Link tools or openocd to flash your board. Example using st-flash:

brew install stlink
st-flash write .build/blink.bin 0x8000000

You should now see an LED blinking.

For a more detailed, step-by-step approach to writing bare-metal code on STM32, as well as installation guides for other platforms (Raspberry Pi Pico, ESP32, nRF52), IDE configuration, troubleshooting, and advanced examples, you can check out the official documentation:

Complete Setup Guide: Install Embedded Swift
Platform Examples: Swift Embedded Examples Repository
Getting Started Tutorial: Embedded Swift on Microcontrollers

Now that we’ve set up Embedded Swift and explored how to build and run an example project, let’s look at a critical real-world scenario: interfacing Swift with low-level C drivers.

C-Swift Linkages

In many embedded projects, low-level hardware drivers are written in C because of its close-to-metal control and widespread ecosystem support. Embedded Swift supports seamless interoperability with C, which lets you reuse existing C libraries and drivers, write hardware control logic in C, and implement higher-level application logic in Swift.

This hybrid model lets you combine Swift’s safety and productivity with C’s hardware-level control, with no runtime overhead or object translation.

Let’s walk through an example where a low-level sensor driver is implemented in C and the application logic is written in Swift.

C Header File (sensor_driver.h):

This C header file defines the public interface for a low-level sensor driver. It includes standard fixed-width integer types and declares four functions:

sensor_init(): Initializes the hardware sensor
sensor_read_temperature() and sensor_read_humidity(): Read raw sensor values
sensor_delay_ms(): Delays execution for a given number of milliseconds

This interface acts as a bridge between Swift and C. Swift will link to these functions by name, so no wrappers or bindings are required.

#ifndef SENSOR_DRIVER_H
#define SENSOR_DRIVER_H

#include <stdint.h>

// Low-level sensor driver functions
void sensor_init(void);
uint32_t sensor_read_temperature(void);
uint32_t sensor_read_humidity(void);
void sensor_delay_ms(uint32_t milliseconds);

#endif

C Implementation (sensor_driver.c):

This implementation assumes the sensor is memory-mapped at a fixed address (0x40001000). Each register (temperature, humidity, and control) is accessed by its offset from that base address.

The sensor_init() function writes 0x01 to the control register, presumably enabling or starting the sensor hardware.

The sensor_read_temperature() and sensor_read_humidity() functions read from memory-mapped registers and return the raw ADC values from the sensor.

The sensor_delay_ms() function performs a simple busy-wait loop using nop (no-operation) instructions to approximate a delay. The loop count is not calibrated to the CPU clock, so the actual duration varies with clock speed and compiler settings. This makes it suitable only for short, coarse-grained delays in bare-metal contexts.

#include "sensor_driver.h"

// Hardware register addresses
#define SENSOR_BASE_ADDR    0x40001000
#define TEMP_REG_OFFSET     0x00
#define HUMIDITY_REG_OFFSET 0x04
#define CONTROL_REG_OFFSET  0x08

void sensor_init(void) {
    // Initialize sensor hardware
    volatile uint32_t* control_reg = (volatile uint32_t*)(SENSOR_BASE_ADDR + CONTROL_REG_OFFSET);
    *control_reg = 0x01; // Enable sensor
}

uint32_t sensor_read_temperature(void) {
    volatile uint32_t* temp_reg = (volatile uint32_t*)(SENSOR_BASE_ADDR + TEMP_REG_OFFSET);
    return *temp_reg;
}

uint32_t sensor_read_humidity(void) {
    volatile uint32_t* humidity_reg = (volatile uint32_t*)(SENSOR_BASE_ADDR + HUMIDITY_REG_OFFSET);
    return *humidity_reg;
}

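void sensor_delay_ms(uint32_t milliseconds) {
    // Simple busy-wait delay. The iterations-per-millisecond factor (1000)
    // is uncalibrated, so the real duration depends on the CPU clock.
    for (uint32_t i = 0; i < milliseconds * 1000; i++) {
        __asm__ volatile("nop"); // volatile keeps the compiler from removing the nop
    }
}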

Swift Code Using C Driver:

To use these C functions from Swift, you declare them using @_silgen_name, which tells the Swift compiler to reference these symbol names directly so that the linker can resolve them against the compiled C code.

The SensorController class encapsulates sensor-related logic. In its init() method, it calls the sensor_init() function defined in C to initialize the sensor hardware.

The readSensors() method reads the raw values from the C driver, converts them into human-readable units using helper functions, stores them internally, and returns the processed values.

The convertTemperature() and convertHumidity() conversion methods apply a basic linear formula to turn raw ADC values into temperature in Celsius and humidity in percentage, respectively. These formulas would be based on the specific sensor’s datasheet.

The checkThresholds() method applies simple threshold logic, a good example of where Swift’s readability and type safety shine. You could easily expand this logic to include error bounds, state machines, or alerts.

// Import C driver functions

/*
These declarations match the C function signatures exactly. 
They allow Swift to invoke the C functions as if they were native Swift functions 
— with zero overhead.
*/
@_silgen_name("sensor_init")
func sensor_init()

@_silgen_name("sensor_read_temperature")
func sensor_read_temperature() -> UInt32

@_silgen_name("sensor_read_humidity")
func sensor_read_humidity() -> UInt32

@_silgen_name("sensor_delay_ms")
func sensor_delay_ms(_ ms: UInt32)

// Swift sensor controller using C driver
class SensorController {
    private var lastTemperature: Float = 0.0
    private var lastHumidity: Float = 0.0

    init() {
        // Initialize the C driver
        sensor_init()
    }

    func readSensors() -> (temperature: Float, humidity: Float) {
        // Read raw values from C driver
        let rawTemp = sensor_read_temperature()
        let rawHumidity = sensor_read_humidity()

        // Convert raw values to meaningful units in Swift
        let temperature = convertTemperature(rawValue: rawTemp)
        let humidity = convertHumidity(rawValue: rawHumidity)

        // Store for comparison
        lastTemperature = temperature
        lastHumidity = humidity

        return (temperature: temperature, humidity: humidity)
    }

    private func convertTemperature(rawValue: UInt32) -> Float {
        // Convert raw ADC value to Celsius
        return (Float(rawValue) * 3.3 / 4095.0 - 0.5) * 100.0
    }

    private func convertHumidity(rawValue: UInt32) -> Float {
        // Convert raw ADC value to percentage
        return Float(rawValue) * 100.0 / 4095.0
    }

    func checkThresholds() -> Bool {
        // Swift logic for threshold checking
        let tempThreshold: Float = 25.0
        let humidityThreshold: Float = 60.0

        return lastTemperature > tempThreshold || lastHumidity > humidityThreshold
    }
}

// Main application loop
func main() -> Never {
    let sensorController = SensorController()

    while true {
        // Read sensors using Swift controller with C driver
        let readings = sensorController.readSensors()

        // Process data with Swift's type safety and expressiveness
        if sensorController.checkThresholds() {
            print("Warning: Temperature: \(readings.temperature)°C, Humidity: \(readings.humidity)%")
        } else {
            print("Normal: Temperature: \(readings.temperature)°C, Humidity: \(readings.humidity)%")
        }

        // Delay using C driver function
        sensor_delay_ms(1000) // 1 second delay
    }
}

The main() function is the main event loop, a standard pattern in embedded systems. It creates the sensor controller, reads sensor data in a loop, checks thresholds, and prints results accordingly. The loop includes a delay (via the C driver) to avoid hammering the sensor continuously.

In an actual embedded context, instead of using print(), you might blink an LED, send UART messages, or log data to memory.
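To sanity-check the conversion formulas used in SensorController, here is a small Python sketch that mirrors the same arithmetic off-target. The raw readings (1000 and 2048) are made-up values for illustration, and reading the temperature formula as "0.5 V offset, 10 mV per degree" is an interpretation of the formula itself, not a statement about any particular sensor.

```python
ADC_MAX = 4095.0  # full-scale count of a 12-bit ADC, as in the Swift code
V_REF = 3.3       # reference voltage in volts, as in the Swift code


def convert_temperature(raw_value: int) -> float:
    """Mirror of convertTemperature(): raw ADC count to degrees Celsius."""
    voltage = raw_value * V_REF / ADC_MAX
    # Subtract a 0.5 V offset, then scale by 100 (10 mV per degree)
    return (voltage - 0.5) * 100.0


def convert_humidity(raw_value: int) -> float:
    """Mirror of convertHumidity(): raw ADC count to percent."""
    return raw_value * 100.0 / ADC_MAX


def check_thresholds(temperature: float, humidity: float) -> bool:
    """Mirror of checkThresholds(): same 25.0 and 60.0 limits."""
    return temperature > 25.0 or humidity > 60.0


# Hypothetical raw readings, chosen only to exercise the formulas
temperature = convert_temperature(1000)
humidity = convert_humidity(2048)
warning = check_thresholds(temperature, humidity)

print(f"Temperature: {temperature:.2f} C, Humidity: {humidity:.2f} %, warning: {warning}")
```

Running the numbers on a desktop like this is a cheap way to confirm the scaling before the code ever touches hardware.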

With Embedded Swift and C now working together, let’s explore what lies ahead. The next section outlines ongoing improvements, emerging use cases, and research directions that are shaping the future of Embedded Swift.

Future Work

Embedded Swift is still a young but rapidly evolving technology. Its modern language features, type safety, and performance make it an attractive option for embedded development, and ongoing work is expanding its capabilities, reach, and ecosystem.

Ongoing Improvements

Compiler Optimizations: The Swift compiler team is actively improving code generation for embedded targets, including:

Reducing binary size
Minimizing ARC overhead
Improving static dispatch performance

Hardware Support: Embedded Swift can target a wide variety of ARM and RISC-V microcontrollers, which are popular for building industrial applications. Support for additional architectures is being developed.

Tooling Enhancements: Tooling support for Embedded Swift is still evolving, but several community-driven and open-source efforts are making development more accessible:

Build Systems: The Swift Embedded Working Group provides example projects that adapt Swift Package Manager (SwiftPM) for cross-compilation. Custom linker scripts and build helpers are available for platforms like STM32 and nRF52.
Debugging Support: Developers can debug Embedded Swift programs using existing tools like GDB or OpenOCD, provided the build includes appropriate debug symbols. While not yet officially streamlined, this approach enables step-through debugging on real hardware.
IDE Integration: There is no official IDE support yet, but some developers use VSCode with Swift syntax highlighting and external build tasks. These setups are still manual but serve as early prototypes for embedded workflows.

Emerging Use Cases

There are a number of emerging use cases for embedded Swift. For example, Swift’s memory safety, type guarantees, and protocol-oriented design make it ideal for secure and scalable IoT devices, especially where firmware bugs could affect user safety or privacy.

The automotive sector is also exploring Swift for infotainment systems, driver assistance features, and safety-critical logic (where deterministic execution and safety matter).

Swift’s expressive syntax and compile-time safety make it suitable for industrial automation – think real-time control loops, sensor fusion systems, and edge devices in smart manufacturing.

It’s also useful for medical devices, as it aligns well with strict medical regulations around memory safety, type guarantees, and predictable resource usage.

Community and Ecosystem

Open Source Projects

The Swift Embedded working group maintains example repositories showcasing how to use Embedded Swift on microcontrollers such as STM32, nRF52, and ESP32. Early-stage libraries for UART, GPIO, and basic peripherals are emerging, though the ecosystem is still young compared to C or Rust.

Learning Resources

While Embedded Swift is not yet widely taught in formal curricula, community tutorials and exploratory projects (for example, Swift for Arduino) are lowering the barrier for hobbyists and independent learners. As tooling matures, educational adoption is likely to follow.

Industry Interest

Embedded Swift is beginning to draw attention from developers and companies looking for safer, more maintainable alternatives to C. Although large-scale adoption remains limited, use cases like rapid prototyping, IoT development, and internal experimentation are gaining traction.

Conclusion

Embedded Swift represents a major step forward in embedded programming. By combining the power and safety of Swift with the low-level control needed for microcontrollers, it offers an exciting alternative to traditional C and C++ development.

While C will remain essential for hardware-level programming and performance-critical paths, Swift brings compelling advantages to many embedded scenarios:

Memory safety: Swift eliminates entire categories of bugs such as buffer overflows, use-after-free, and null pointer dereferencing.
Type safety: Many logic errors are caught at compile time, long before they can cause runtime failures.
Modern language features: Developers can use functional paradigms, generics, and protocol-oriented design even in embedded code.
C interoperability: Swift works seamlessly with existing C libraries, allowing gradual adoption without rewriting low-level drivers.
Developer productivity: Clear syntax, automatic memory management, and strong tooling lead to faster development and easier maintenance.

Government and regulatory bodies are increasingly encouraging or mandating the use of memory-safe programming languages to reduce vulnerabilities in critical software systems. For example:

In 2022, the U.S. National Security Agency (NSA) recommended moving away from unsafe languages like C/C++ for new software projects, promoting memory-safe alternatives.
In June 2025, the NSA and CISA released a joint Cybersecurity Information Sheet titled “Memory Safe Languages: Reducing Vulnerabilities in Modern Software Development”, which emphasized that memory safety flaws remain a persistent risk, and organizations should develop strategies to adopt memory-safe programming languages in new systems.
The U.S. Cybersecurity and Infrastructure Security Agency (CISA) and NIST have echoed similar guidance in the context of national cybersecurity.

While these documents do not mention Swift explicitly, Swift's strong type system, ARC-based memory model, and compile-time safety guarantees align closely with the goals outlined in these recommendations. As such, it offers a practical, developer-friendly path toward safer embedded development.

Swift may not be the right fit for every embedded system. In applications where every byte of memory or instruction cycle is critical, real-time guarantees are hard requirements, or toolchain maturity is essential (for example, RTOS integration, static analyzers), C or Rust may still be preferred.

But in many modern embedded applications, especially those involving rapid prototyping, fast product iteration, safety-critical or maintainable firmware, and interoperability with existing C codebases, Swift offers a highly productive and safe development experience.

Embedded Swift is still maturing, but its momentum is undeniable. With ongoing compiler work, community-driven examples, and growing interest from developers, it’s poised to play a major role in the future of embedded systems.

Whether you're building an IoT device, a piece of industrial equipment, or a proof-of-concept wearable, Swift can help you write safer, more expressive firmware, without giving up performance or control.

Swift can be especially powerful during the prototyping phase, when the primary goal is to validate functionality quickly and safely. And with its increasing support for multiple hardware platforms, it offers a strong foundation for bringing modern software development practices to the embedded world.

How AI Agents Remember Things: The Role of Vector Stores in LLM Memory

Manish Shivanandhan — Thu, 17 Jul 2025 14:08:19 +0000

When you talk to an AI assistant, it can feel like it remembers what you said before.

But large language models (LLMs) don’t actually have memory on their own. They don’t remember conversations unless that information is given to them again.

So, how do they seem to recall things?

The answer lies in something called a vector store – and that’s what you’ll learn about in this article.

What Is a Vector Store?
How Embeddings Work
Why Vector Stores Are Crucial for Memory
Popular Vector Stores
- FAISS (Facebook AI Similarity Search)
- Pinecone
Making AI Seem Smart with Retrieval-Augmented Generation
The Limits of Vector-Based Memory
Conclusion

What Is a Vector Store?

A vector store is a special type of database. Instead of storing text or numbers like a regular database, it stores vectors.

A vector is a list of numbers that represents the meaning of a piece of text. You get these vectors using a process called embedding.

The model takes a sentence and turns it into a high-dimensional point in space. In that space, similar meanings are close together.

For example, if I embed “I love sushi,” it might be close to “Sushi is my favourite food” in vector space. These embeddings help an AI agent find related thoughts even if the exact words differ.

How Embeddings Work

Let’s say a user tells an assistant:

“I live in Austin, Texas.”

The model turns this sentence into a vector:

[0.23, -0.41, 0.77, ..., 0.08]

This vector doesn’t mean much to us, but to the AI, it’s a way to capture the sentence’s meaning. That vector gets stored in a vector database, along with some extra info – maybe a timestamp or a note that it came from this user.

Later, if the user says:

“Book a flight to my hometown.”

The model turns this new sentence into a new vector. It then searches the vector database to find the most similar stored vectors.

The closest match might be “I live in Austin, Texas.” Now the AI knows what you probably meant by “my hometown.”

This ability to look up related past inputs based on meaning – not just matching keywords – is what gives LLMs a form of memory.
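The lookup described above can be sketched with plain Python. The three-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions and come from a model), and cosine similarity is one common choice of similarity measure, assumed here rather than taken from the article.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "memory": text mapped to made-up embedding vectors
memory = {
    "I live in Austin, Texas.": [0.9, 0.1, 0.2],
    "I prefer vegetarian food.": [0.1, 0.9, 0.1],
}

# Made-up vector standing in for the embedding of "Book a flight to my hometown."
query_vector = [0.8, 0.2, 0.3]

# Pick the stored sentence whose vector is most similar to the query
best_text = max(memory, key=lambda text: cosine_similarity(memory[text], query_vector))
print("Closest memory:", best_text)
```

The query shares no keywords with the stored sentence; the match comes entirely from the vectors pointing in a similar direction.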

Why Vector Stores Are Crucial for Memory

LLMs process language using a context window. That’s the amount of text they can “see” at once.

For GPT-4-turbo, the window can handle up to 128,000 tokens, which sounds huge – but even that gets filled fast. You can’t keep the whole conversation there forever.

Instead, you use a vector store as long-term memory. You embed and save useful info.

Then, when needed, you query the vector store, retrieve the top relevant pieces, and feed them back into the LLM. This way, the model remembers just enough to act smart – without holding everything in its short-term memory.

Popular Vector Stores

There are several popular vector databases in use. Each one has its strengths.

FAISS (Facebook AI Similarity Search)

FAISS is an open-source library developed by Meta. It’s fast and works well for local or on-premise applications.

FAISS is great if you want full control and don’t need cloud hosting. It supports millions of vectors and provides tools for indexing and searching with high performance.

Here’s how you can use FAISS:

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load a pre-trained sentence transformer model that converts sentences to numerical vectors (embeddings)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define the input sentence we want to store in memory
sentence = "User lives in Austin, Texas"

# Convert the sentence into a dense vector (embedding)
embedding = model.encode(sentence)

# Get the dimensionality of the embedding vector (needed to create the FAISS index)
dimension = embedding.shape[0]

# Create a FAISS index for L2 (Euclidean) similarity search using the embedding dimension
index = faiss.IndexFlatL2(dimension)

# Add the sentence embedding to the FAISS index (this is our "memory")
index.add(np.array([embedding]))

# Encode a new query sentence that we want to match against the stored memory
query = model.encode("Where is the user from?")

# Search the FAISS index for the top-1 most similar vector to the query
D, I = index.search(np.array([query]), k=1)

# Print the index of the most relevant memory (in this case, only one item in the index)
print("Most relevant memory index:", I[0][0])

This code uses a pre-trained model to turn a sentence like “User lives in Austin, Texas” into an embedding.

It stores this embedding in a FAISS index. When you ask a question like “Where is the user from?”, the code converts that question into another embedding and searches the index to find the stored sentence that’s most similar in meaning.

Finally, it prints the position (index) of the most relevant sentence in the memory.
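Conceptually, a flat L2 index such as faiss.IndexFlatL2 performs an exact, exhaustive search: it compares the query against every stored vector and ranks them by squared Euclidean distance. The following pure-Python sketch shows that idea with made-up three-dimensional vectors. It is only an illustration of the technique, not how FAISS is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(stored_vectors, query_vector, k=1):
    """Score every stored vector and return the k closest as (distance, index) pairs."""
    scored = sorted(
        (squared_l2(vector, query_vector), i)
        for i, vector in enumerate(stored_vectors)
    )
    return scored[:k]


# Toy vectors, invented for illustration
stored = [
    [0.9, 0.1, 0.2],
    [0.1, 0.9, 0.1],
    [0.2, 0.2, 0.9],
]
query = [0.8, 0.2, 0.3]

results = flat_search(stored, query, k=2)
for distance, position in results:
    print(f"index {position}: squared distance {distance:.2f}")
```

Because every vector is visited, the cost grows linearly with the number of stored items, which is why larger deployments usually move to approximate indexes.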

FAISS is efficient, but it’s not hosted. That means you need to manage your own infrastructure.

Pinecone

Pinecone is a cloud-native vector database. It’s managed for you, which makes it great for production systems.

You don’t need to worry about scaling or maintaining servers. Pinecone handles billions of vectors and offers filtering, metadata support, and fast queries. It integrates well with tools like LangChain and OpenAI.

Here’s how a basic Pinecone setup works:

# Written against the Pinecone Python client v3 or later
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

# Initialize the Pinecone client with your API key
pc = Pinecone(api_key="your-api-key")

# Connect to an existing Pinecone index named "memory-store"
# (the index must already exist, with dimension 384 to match the model below)
index = pc.Index("memory-store")

# Load a pre-trained sentence transformer model to convert text into embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert a fact/sentence into a numerical embedding (vector)
embedding = model.encode("User prefers vegetarian food")

# Store (upsert) the embedding into Pinecone with a unique ID
index.upsert(vectors=[("user-pref-001", embedding.tolist())])

# Encode the query sentence into an embedding
query = model.encode("What kind of food does the user like?")

# Search Pinecone to find the most relevant stored embedding for the query
results = index.query(vector=query.tolist(), top_k=1)

# Print the ID of the top matching memory
print("Top match ID:", results['matches'][0]['id'])

Pinecone is ideal if you want scalability and ease of use without managing hardware.

Other popular vector stores include:

Weaviate – Combines vector search with knowledge graphs. Offers strong semantic search with hybrid keyword support.
Chroma – Simple to use and good for prototyping. Often used in personal apps or demos.
Qdrant – Open-source and built for high-performance vector search with filtering.

Each of these has its place depending on whether you need speed, scale, simplicity, or special features.

Making AI Seem Smart with Retrieval-Augmented Generation

This whole system – embedding user inputs, storing them in a vector database, and retrieving them later – is called retrieval-augmented generation (RAG).

The AI still doesn’t have a brain, but it can act like it does. You choose what to remember, when to recall it, and how to feed it back into the conversation.

If the AI helps a user track project updates, you can store each project detail as a vector. When the user later asks, “What’s the status of the design phase?” you search your memory database, pull the most relevant notes, and let the LLM stitch them into a helpful answer.
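The retrieve-then-prompt flow can be sketched end to end without any external services. In this sketch, toy_embed is a hypothetical stand-in for a real embedding model (it just counts words), the project notes are invented, and the final prompt would be passed to whatever LLM you use. None of these names come from a real library.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented project notes acting as the stored "memories"
notes = [
    "Design phase: wireframes approved, visual mockups due Friday.",
    "Backend phase: database schema migration finished.",
    "Budget review moved to next quarter.",
]
store = [(note, toy_embed(note)) for note in notes]


def retrieve(question, top_k=1):
    """Return the top_k stored notes most similar to the question."""
    query = toy_embed(question)
    ranked = sorted(store, key=lambda item: similarity(query, item[1]), reverse=True)
    return [note for note, _ in ranked[:top_k]]


def build_prompt(question, top_k=1):
    """Combine retrieved notes and the question into one prompt for an LLM."""
    context = "\n".join(f"- {note}" for note in retrieve(question, top_k))
    return f"Relevant notes:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("What's the status of the design phase?")
print(prompt)
```

Swapping toy_embed for a real embedding model and the list for a vector store changes the quality of retrieval, but the overall shape of the loop stays the same.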

The Limits of Vector-Based Memory

While vector stores give AI agents a powerful way to simulate memory, this approach comes with some important limitations.

Vector search is based on similarity, not true understanding. That means the most similar stored embedding may not always be the most relevant or helpful in context. For instance, two sentences might be mathematically close in vector space but carry very different meanings. As a result, the AI can sometimes surface confusing or off-topic results, especially when nuance or emotional tone is involved.

Another challenge is that embeddings are static snapshots. Once stored, they don’t evolve or adapt unless explicitly updated. If a user changes their mind or provides new information, the system won’t "learn" unless the original vector is removed or replaced. Unlike human memory, which adapts and refines itself over time, vector-based memory is frozen unless developers actively manage it.

There are a few ways you can mitigate these challenges.

One is to include more context in the retrieval process, such as filtering results by metadata like timestamps, topics, or user intent. This helps narrow down results to what’s truly relevant at the moment.
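As a rough sketch of metadata filtering, the snippet below narrows candidates by topic and age before ranking them. The memory records, the topic labels, the similarity scores, and the 180-day window are all invented for illustration. In practice the scores would come from the vector search itself.

```python
from datetime import datetime, timedelta

# Invented candidate memories with hypothetical similarity scores
memories = [
    {"text": "User likes steak", "topic": "food",
     "saved_at": datetime(2023, 1, 5), "score": 0.91},
    {"text": "User prefers vegetarian food", "topic": "food",
     "saved_at": datetime(2025, 6, 20), "score": 0.88},
    {"text": "User lives in Austin, Texas", "topic": "location",
     "saved_at": datetime(2025, 5, 1), "score": 0.40},
]


def retrieve(candidates, topic, now, max_age_days, top_k=1):
    """Keep only recent memories on the given topic, then rank by similarity."""
    cutoff = now - timedelta(days=max_age_days)
    eligible = [
        m for m in candidates
        if m["topic"] == topic and m["saved_at"] >= cutoff
    ]
    return sorted(eligible, key=lambda m: m["score"], reverse=True)[:top_k]


now = datetime(2025, 7, 1)  # fixed "current time" so the example is repeatable

unfiltered_best = max(memories, key=lambda m: m["score"])
filtered_best = retrieve(memories, topic="food", now=now, max_age_days=180)

print("Without filtering:", unfiltered_best["text"])
print("With filtering:   ", filtered_best[0]["text"])
```

Here the stale preference has the highest raw similarity, and only the timestamp filter keeps it from being surfaced.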

Another approach is to reprocess or re-embed older memories periodically, ensuring that the information reflects the most current understanding of the user’s needs or preferences.

Beyond technical limitations, vector stores also raise privacy and ethical concerns. Key questions are: Who decides what gets saved? How long should that memory persist? And does the user have control over what is remembered or forgotten?

Ideally, these decisions should not be made solely by the developer or system. A more thoughtful approach is to make memory explicit. Let users choose what gets remembered. For example, by marking certain inputs as “important”, it adds a layer of consent and transparency. Similarly, memory retention should be time-bound where appropriate, with expiration policies based on how long the information remains useful.

Equally important is the ability for users to view, manage, or delete their stored data. Whether through a simple interface or a programmatic API, memory management tools are essential for trust. As the use of vector stores expands, so does the expectation that AI systems will respect user agency and privacy.

The broader AI community is still shaping best practices around these issues. But one thing is clear: simulated memory should be designed not just for accuracy and performance, but for accountability. By combining strong defaults with user control, developers can ensure vector-based memory systems are both smart and responsible.

Conclusion

Vector stores give AI agents a way to fake memory – and they do it well. By embedding text into vectors and using tools like FAISS or Pinecone, we give models the power to recall what matters. It’s not real memory. But it makes AI systems feel more personal, more helpful, and more human.

As these tools grow more advanced, so does the illusion. But behind every smart AI is a simple system of vectors and similarity. If you can master that, you can build assistants that remember, learn, and improve with time.

I hope you enjoyed this article. Connect with me on LinkedIn.

How to Debug and Prevent Buffer Overflows in Embedded Systems

Soham Banerjee — Mon, 17 Mar 2025 16:34:42 +0000

Buffer overflows are one of the most serious software bugs, especially in embedded systems, where hardware limitations and real-time execution make them hard to detect and fix.

A buffer overflow happens when a program writes more data into a buffer than the buffer was allocated to hold, leading to memory corruption, crashes, or even security vulnerabilities. Buffer corruption occurs when unintended writes overwrite unread data or modify memory in unexpected ways.

In safety-critical systems like cars, medical devices, and spacecraft, buffer overflows can cause life-threatening failures. Unlike simple software bugs, buffer overflows are unpredictable and depend on the state of the system, making them difficult to diagnose and debug.

To prevent these issues, it's important to understand how buffer overflows and corruptions occur, and how to detect and fix them.

Article Scope

In this article, you will learn:

What buffers, buffer overflows, and corruptions are. I’ll give you a beginner-friendly explanation with real-world examples.
How to debug buffer overflows. You’ll learn how to use tools like GDB, LLDB, and memory maps to find memory corruption.
How to prevent buffer overflows. We’ll cover some best practices like input validation, safe memory handling, and defensive programming.

I’ll also show you some hands-on code examples – simple C programs that demonstrate buffer overflow issues and how to fix them.

What this article doesn’t cover:

Security exploits and hacking techniques. We’ll focus on preventing accidental overflows, not hacking-related buffer overflows.
Operating system-specific issues. This guide is for embedded systems, not general-purpose computers or servers.
Advanced RTOS memory management. While we discuss interrupt-driven overflows, we won’t dive deep into real-time operating system (RTOS) concepts.

Now that you know what this article covers (and what it doesn’t), let’s go over the skills that will help you get the most out of it.

Prerequisites

This article is designed for developers who have some experience with C programming and want to understand how to debug and prevent buffer overflows in embedded systems. Still, beginners can follow along, as I’ll explain key concepts in a clear and structured way.

Before reading, it helps if you know:

Basic C programming.
How memory works – the difference between stack, heap, and global variables.
Basic debugging concepts – if you’ve used a debugger like GDB or LLDB, that’s a plus, but not required.
What embedded systems are – a basic idea of how microcontrollers store and manage memory.

Even if you’re not familiar with these topics, this guide will walk you through them in an easy-to-understand way.

Before you dive into buffer overflows, debugging, and prevention, let’s take a step back and understand what a buffer is and why it’s important in embedded systems. Buffers play a crucial role in managing data flow between hardware and software, but when handled incorrectly, they can lead to serious software failures.

What is a Buffer, and How Does it Work?
What is a Buffer Overflow?
Common Causes of Buffer Overflows and Corruption
Consequences of Buffer Overflows
How to Debug Buffer Overflows
How to Prevent Buffer Overflows
Conclusion

What is a Buffer, and How Does it Work?

A buffer is a contiguous block of memory used to temporarily store data before it is processed. Buffers are commonly used in two scenarios:

Data accumulation: When the system needs to collect a certain amount of data before processing.
Rate matching: When the data producer generates data faster than the data consumer can process it.

Buffers are typically implemented as arrays in C, where elements are indexed from 0 to N-1 (where N is the buffer size).

Let’s look at an example of a buffer in a sensor system.

Consider a system with a sensor task that generates data at 400 Hz (400 samples per second or 1 sample every 2.5 ms). But the data processor (consumer) operates at only 100 Hz (100 samples per second or 1 sample every 10 ms). Since the consumer task is slower than the producer, we need a buffer to store incoming data until it is processed.

To determine the buffer size, we calculate:

Buffer Size = Time to consume 1 sample / Time to generate 1 sample = 10 ms / 2.5 ms = 4 samples

This means the buffer must hold at least 4 samples at a time to avoid data loss.
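The same sizing rule can be written as a small helper. This is a minimal sketch: the function name is made up, and rounding up for rates that do not divide evenly is an added assumption, since a fractional sample still needs a whole slot.

```python
def min_buffer_samples(producer_hz: int, consumer_hz: int) -> int:
    """Smallest buffer (in samples) that covers one consumer period.

    Consumer period / producer period equals producer_hz / consumer_hz,
    so integer ceiling division avoids floating-point rounding issues.
    """
    return -(-producer_hz // consumer_hz)


# The example from the text: 400 Hz sensor, 100 Hz consumer
sensor_buffer = min_buffer_samples(400, 100)
print("Samples needed:", sensor_buffer)

# A case that does not divide evenly (hypothetical rates)
uneven_buffer = min_buffer_samples(1000, 300)
print("Samples needed:", uneven_buffer)
```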

Once the buffer reaches capacity, there are several strategies to decide which data gets passed to the consumer task:

Max/min sampling: Use the maximum or minimum value in the buffer.
Averaging: Compute the average of all values in the buffer.
Random access: Pick a sample from a specific location (for example, the most recent or the first).
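Each of these strategies reduces a full buffer to a single value for the consumer. The sketch below applies them to a four-sample buffer. The readings are invented, and the function name is only for illustration.

```python
def summarize(buffer):
    """Apply each selection strategy to a full buffer of samples."""
    return {
        "max": max(buffer),                     # max sampling
        "min": min(buffer),                     # min sampling
        "average": sum(buffer) / len(buffer),   # averaging
        "first": buffer[0],                     # oldest sample
        "latest": buffer[-1],                   # most recent sample
    }


# Four hypothetical sensor readings, oldest first
samples = [21.5, 22.0, 21.8, 22.3]
summary = summarize(samples)

for strategy, value in summary.items():
    print(f"{strategy}: {value:.2f}")
```

Which strategy fits depends on the signal: averaging smooths noise, while max or min sampling preserves peaks that averaging would hide.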

In real-world applications, it’s beneficial to use circular buffers or double buffering to prevent data corruption.

Circular buffer approach: A circular buffer (also called a ring buffer) continuously wraps around when it reaches the end, ensuring old data is overwritten safely without exceeding memory boundaries. The buffer size should be multiplied by 2 (4 × 2 = 8) to hold 8 samples. This allows the consumer task to process 4 samples while the next 4 samples are being filled, preventing data overwrites.
Double buffer approach: Double buffering is useful when data loss is unacceptable. It allows continuous data capture while the processor is busy handling previous data. A second buffer of the same size is added. When the first buffer is full, the write pointer switches to the second buffer, allowing the consumer task to process data from the first buffer while the second buffer is being filled. This prevents data overwrites and ensures a continuous data flow.
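
The circular buffer idea above can be sketched in a few lines of C. This is a minimal illustration, not a production implementation: the names `ring_buffer_t`, `rb_init`, `rb_push`, and `rb_pop` are hypothetical, the capacity of 8 comes from the 4 × 2 calculation above, and the sketch assumes the producer and consumer do not run concurrently. It also chooses to reject a write when the buffer is full rather than overwrite unread data, which is only one of several possible policies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_CAPACITY 8u  /* 2 x 4 samples, as computed above */

typedef struct {
    uint16_t data[RB_CAPACITY];
    size_t head;   /* next slot to write */
    size_t tail;   /* next slot to read */
    size_t count;  /* number of unread samples */
} ring_buffer_t;

static void rb_init(ring_buffer_t *rb)
{
    assert(rb != NULL);
    rb->head = 0;
    rb->tail = 0;
    rb->count = 0;
}

/* Returns false instead of overwriting unread data when the buffer is full. */
static bool rb_push(ring_buffer_t *rb, uint16_t sample)
{
    if (rb->count == RB_CAPACITY) {
        return false;
    }
    rb->data[rb->head] = sample;
    rb->head = (rb->head + 1u) % RB_CAPACITY;  /* wrap around at the end */
    rb->count++;
    return true;
}

/* Returns false when there is nothing to read. */
static bool rb_pop(ring_buffer_t *rb, uint16_t *out)
{
    if (rb->count == 0) {
        return false;
    }
    *out = rb->data[rb->tail];
    rb->tail = (rb->tail + 1u) % RB_CAPACITY;
    rb->count--;
    return true;
}
```

Because every index is reduced modulo `RB_CAPACITY`, a write can never land outside the array. The shared `count` field is the reason this version is not safe to call from both an ISR and a task without extra protection.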

Buffers help manage data efficiently, but what happens when they are mismanaged? This is where buffer overflows and corruptions come into play.

What is a Buffer Overflow?

A buffer overflow occurs when a program writes more data into a buffer than it was allocated, causing unintended memory corruption. This can lead to unpredictable behavior, ranging from minor bugs to critical system failures.

To understand buffer overflow, let's use a simple analogy. Imagine a jug with a tap near the bottom. The jug represents a buffer, while the tap controls how much liquid (data) is consumed.

The jug is designed to hold a fixed amount of liquid. As long as water flows into the jug at the same rate or slower than it flows out, everything works fine. But if water flows in faster than it flows out, the jug will eventually overflow.

Similarly, in software, if data enters a buffer faster than it is processed, it exceeds the allocated memory space, causing a buffer overflow. In the case of a circular buffer, this can cause the write pointer to wrap around and overwrite unread data, leading to buffer corruption.

Buffer Overflows in Software

Unlike the jug, where water simply spills over, a buffer overflow in software overwrites adjacent memory locations. This can cause a variety of hard-to-diagnose issues, including:

Corrupting other data stored nearby.
Altering program execution, leading to crashes.
Security vulnerabilities, where attackers exploit overflows to inject malicious code.

When a buffer overflow occurs, data can overwrite variables, function pointers, or even return addresses, depending on where the buffer is allocated.

Buffer overflows can occur in different memory regions:

Buffer overflows in global/static memory (.bss / .data sections):
- These occur when writes to a global or static buffer exceed its allocated size.
- The overflow can corrupt adjacent variables, leading to unexpected behavior in other modules.
- Debugging is easier because addresses are fixed at link time. The map file generated by the linker shows where each variable is placed, provided the compiler has not optimized it away.
Stack-based buffer overflow (more predictable, easier to debug):
- Happens when a buffer is allocated in the stack (for example, local variables inside functions).
- Overflowing the stack can affect adjacent local variables or return addresses, potentially crashing the program.
- In embedded systems with small stack sizes, this often leads to a crash or execution of unintended code.
Heap-based buffer overflow (harder to debug):
- Happens when a buffer is dynamically allocated in the heap (for example, using malloc() in C).
- Overflowing a heap buffer can corrupt adjacent dynamically allocated objects or heap management structures.
- Debugging is harder because heap memory is allocated dynamically at runtime, causing memory locations to vary.

Buffer Overflow vs Buffer Corruption

Buffer overflow and buffer corruption are of course related, but refer to different situations.

A buffer overflow happens when data is written beyond the allocated buffer size, leading to memory corruption, unpredictable behavior, or system crashes.

A buffer corruption happens when unintended data modifications result in unexpected software failures, even if the write remains within buffer boundaries.

Both issues typically result from poor write pointer management, lack of boundary checks, and unexpected system behavior.

Now that we've covered what a buffer overflow is and how it can overwrite memory, let’s take a closer look at how these issues affect embedded systems.

In the next section, we’ll explore how buffer overflows and corruption happen in real-world embedded systems and break down common causes, including pointer mismanagement and boundary violations.

Common Causes of Buffer Overflows and Corruption

Embedded systems use buffers to store data from sensors, communication interfaces like UART (Universal Asynchronous Receiver-Transmitter), SPI (Serial Peripheral Interface), and I2C (Inter-Integrated Circuit), and real-time tasks. These buffers are often statically allocated to avoid memory fragmentation, and many implementations use circular (ring) buffers to efficiently handle continuous data streams.

Here are three common scenarios where buffer overflows or corruptions occur in embedded systems:

Writing Data Larger Than the Available Space

Issue: The software writes incoming data to the buffer without checking if there is enough space.

Example: Imagine a 100-byte buffer to store sensor data. The buffer receives variable-sized packets. If an incoming packet is larger than the remaining space, it will overwrite adjacent memory, leading to corruption.

So why does this happen?

Some embedded designs increment the write pointer after copying data, making it too late to prevent overflow.
Many low-level memory functions (memcpy, strcpy, etc.) do not check buffer boundaries, leading to unintended writes.
Without proper bound checking, a large write can exceed the buffer size and corrupt nearby memory.

Here’s a code sample to demonstrate buffer overflow in a .bss / .data section:

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  #define BUFFER_SIZE 300

  static uint16_t sample_count = 0;
  static uint8_t buffer[BUFFER_SIZE] = {0};

  // Function to simulate a buffer overflow scenario
  void updateBufferWithData(uint8_t *data, uint16_t size)
  {
      // Simulating a buffer overflow: No boundary check!
      printf("Attempting to write %d bytes at position %d...\n", size, sample_count);

      // Deliberate buffer overflow for demonstration
      if (sample_count + size > BUFFER_SIZE)
      {
          printf("WARNING: Buffer Overflow Occurred! Writing beyond allocated memory!\n");
      }

      // Copy data (unsafe, can cause overflow)
      memcpy(&buffer[sample_count], data, size);

      // Increment sample count (incorrectly, leading to wraparound issues)
      sample_count += size;
  }

  int main()
  {   
      // Save 1 byte to buffer
      uint8_t data_to_buffer = 10;
      updateBufferWithData(&data_to_buffer, 1);

      // Save an array of 20 bytes to buffer
      uint8_t data_to_buffer_1[20] = {5};
      updateBufferWithData(data_to_buffer_1, sizeof(data_to_buffer_1));

      // Intentional buffer overflow: Save an array of 50 x 8 bytes (400 bytes)
      uint64_t data_to_buffer_2[50] = {7};
      updateBufferWithData((uint8_t*)data_to_buffer_2, sizeof(data_to_buffer_2));

      return 0;
  }

Interrupt-Driven Overflows (Real-time Systems)

Issue: The interrupt service routine (ISR) may write data faster than the main task can process, leading to buffer corruption or buffer overflow if the write pointer is not properly managed.

Example: Imagine a sensor ISR that writes incoming data into a buffer every time a new reading arrives. Meanwhile, a low-priority processing task reads and processes the data.

What can go wrong?

If the ISR triggers too frequently (due to a misbehaving sensor or high interrupt priority), the buffer may fill up faster than the processing task can keep up.
This can result in one of two failures:
1. Buffer Corruption: The ISR overwrites unread data, leading to loss of information.
2. Buffer Overflow: The ISR exceeds buffer boundaries, causing memory corruption or system crashes.

So why does this happen?

In real-time embedded systems, ISR execution preempts lower-priority tasks.
If the processing task does not get enough CPU time, unread data in the buffer may be overwritten, or writes may run past the end of the buffer.
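
One common way to keep an ISR from corrupting or overrunning its buffer is a single-producer, single-consumer queue in which the ISR only ever writes the head index and the task only ever writes the tail index. The sketch below is an illustration under stated assumptions: `sensor_isr_push`, `task_pop`, and `dropped_samples` are hypothetical names, the target is assumed to be a single-core MCU where an 8-bit index is read and written atomically, and some cores may additionally need memory barriers. When the queue is full, the ISR drops the new sample and counts it instead of overwriting unread data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define Q_SIZE 8u  /* holds up to Q_SIZE - 1 samples; one slot stays empty */

static volatile uint16_t q_data[Q_SIZE];
static volatile uint8_t  q_head;     /* written only by the ISR */
static volatile uint8_t  q_tail;     /* written only by the task */
static volatile uint32_t q_dropped;  /* samples discarded because the queue was full */

/* Called from the (hypothetical) sensor ISR with each new reading. */
void sensor_isr_push(uint16_t sample)
{
    uint8_t next = (uint8_t)((q_head + 1u) % Q_SIZE);
    if (next == q_tail) {
        q_dropped++;   /* full: drop the sample, never overwrite or overrun */
        return;
    }
    q_data[q_head] = sample;
    q_head = next;     /* publish the new head only after the data is stored */
}

/* Called from the low-priority processing task. */
bool task_pop(uint16_t *out)
{
    assert(out != NULL);
    if (q_tail == q_head) {
        return false;  /* empty */
    }
    *out = q_data[q_tail];
    q_tail = (uint8_t)((q_tail + 1u) % Q_SIZE);
    return true;
}

/* Lets the application notice that the consumer is falling behind. */
uint32_t dropped_samples(void)
{
    return q_dropped;
}
```

A rising `dropped_samples()` value is a cheap runtime signal that the processing task is starved, which is much easier to diagnose than silent corruption.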

System State Changes & Buffer Corruption

Issue: The system may unexpectedly reset, enter low-power mode, or change operating state, leaving the buffer write pointers in an inconsistent state. This can result in buffer corruption (stale or incorrect data) or buffer overflow (writing past the buffer’s limits).

Example Scenarios:

Low-power wake-up issue (Buffer Overflow risk): Some embedded systems enter deep sleep to conserve energy. Upon waking up, if the buffer write pointer is not correctly reinitialized, it may point outside buffer boundaries, leading to buffer overflow and unintended memory corruption.
Unexpected mode transitions: If a sensor task is writing data and the system suddenly switches modes, the buffer states and pointers may not be cleaned up. The next time the sensor task runs, it may continue writing without clearing previous data. This can cause undefined behavior due to presence of stale data.
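
A defensive way to handle both scenarios is to validate the buffer state whenever the system wakes up or changes mode, and to reset it if anything looks wrong. The following is a minimal sketch: the type `sensor_buffer_t` and the functions `buffer_state_is_valid`, `buffer_reset`, and `buffer_check_after_wake` are hypothetical names, and the 300-byte size is only an assumption carried over from the earlier example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUFFER_SIZE 300u

typedef struct {
    uint8_t  data[BUFFER_SIZE];
    uint16_t write_index;  /* next free position, valid range 0..BUFFER_SIZE */
} sensor_buffer_t;

/* The write index must never point past the end of the array. */
static bool buffer_state_is_valid(const sensor_buffer_t *b)
{
    return b->write_index <= BUFFER_SIZE;
}

/* Discards stale samples and returns the buffer to a known state. */
static void buffer_reset(sensor_buffer_t *b)
{
    assert(b != NULL);
    memset(b->data, 0, sizeof(b->data));
    b->write_index = 0;
}

/* Hypothetical hook for wake-up or mode transitions.
 * Returns true if the state had to be repaired. */
static bool buffer_check_after_wake(sensor_buffer_t *b)
{
    if (buffer_state_is_valid(b)) {
        return false;
    }
    buffer_reset(b);
    return true;
}
```

On a mode transition where stale data is the concern rather than a bad index, calling `buffer_reset` unconditionally is the simpler choice.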

Now that you understand how buffer overflows and corruptions happen, let’s examine their consequences in embedded systems ranging from incorrect sensor readings to complete system failures, making debugging and prevention critical.

Consequences of Buffer Overflows

Buffer overflows can be catastrophic in embedded systems, leading to system crashes, data corruption, and unpredictable behavior. Unlike general-purpose computers, many embedded devices lack memory protection, making them particularly vulnerable to buffer overflows.

A buffer overflow can corrupt two critical types of memory:

1. Data Variables Corruption

A buffer overflow can overwrite data variables, corrupting the inputs for other software modules. This can cause unexpected behavior or even system crashes if critical parameters are modified.

For example, a buffer overflow could accidentally overwrite a sensor calibration value stored in memory. As a result, the system would start using incorrect sensor readings, leading to faulty operation and potentially unsafe conditions.

2. Function Pointer Corruption

In embedded systems, function pointers are often used for interrupt handlers, callback functions, and RTOS task scheduling. If a buffer overflow corrupts a function pointer, the system may execute unintended instructions, leading to a crash or unexpected behavior.

As an example, a function pointer controlling motor speed regulation could be overwritten. Instead of executing the correct function, the system would jump to a random memory address, causing a system fault or erratic motor behavior.

Buffer overflows are among the hardest bugs to identify and fix because their effects depend on which data is corrupted and the values it contains. A buffer overflow can affect memory in different ways:

If a buffer overflow corrupts unused memory, the system may seem fine during testing, making the issue harder to detect.
If a buffer overflow alters critical data variables, it can introduce hidden logic errors that cause unpredictable behavior.
If a buffer overflow corrupts function pointers, it may crash immediately, making the problem easier to identify.

During development, if tests focus only on detecting crashes, they may overlook silent memory corruption caused by a buffer overflow. In real-world deployments, new use cases not covered in testing can trigger previously undetected buffer overflow issues, leading to unpredictable failures.

Buffer overflows can cause a chain reaction, where one overflow leads to another overflow or buffer corruption, resulting in widespread system failures. So how does this happen?

A buffer overflow corrupts a critical variable (for example, a timer interval).
The corrupted variable disrupts another module (for example, it triggers the timer interrupt too frequently, causing it to push more data into a buffer than intended).
This increased interrupt frequency forces a sensor task to write data faster than intended, eventually causing another buffer overflow or corruption by overwriting unread data.

This chain reaction can spread across multiple software modules, making debugging extremely difficult. In real-world applications, buffer overflows in embedded systems can be life-threatening:

In cars: A buffer overflow in an ECU (Electronic Control Unit) could cause brake failure or unintended acceleration.
In a spacecraft: A memory corruption issue could disable navigation systems, leading to mission failure.

Now that we’ve seen how buffer overflows can corrupt memory, disrupt system behavior, and even cause critical failures, the next step is understanding how to detect and fix them before they lead to serious issues.

How to Debug Buffer Overflows

Debugging buffer overflows in embedded systems can be complex, as their effects range from immediate crashes to silent data corruption, making them difficult to trace. A buffer overflow can cause either:

A system crash, which is easier to detect since it halts execution or forces a system reboot.
Unexpected behavior, which is much harder to debug as it requires tracing how corrupted data affects different modules.

This section focuses on embedded system debugging techniques using memory map files, debuggers (GDB/LLDB), and a structured debugging approach. Let’s look into the debuggers and memory map files.

Memory Map File (.map file)

A memory map file is generated during the linking process. It provides a memory layout of Flash and RAM, showing where global/static variables, functions, and the heap and stack are placed, including:

Text section (.text): Stores executable code.
Read-only section (.rodata): Stores constants and string literals.
BSS section (.bss): Stores uninitialized global and static variables.
Data section (.data): Stores initialized global and static variables.
Heap and stack locations, depending on the linker script.

If a buffer overflow corrupts a global variable, the .map file can identify nearby variables that may also be affected, provided the compiler has not optimized the memory allocation. Similarly, if a function pointer is corrupted, the .map file can reveal where it was stored in memory.

Debuggers (GDB & LLDB)

Debugging tools like GDB (GNU Debugger) and LLDB (LLVM Debugger) allow:

Controlling execution (breakpoints, stepping through code).
Inspecting variable values and memory addresses.
Getting backtraces (viewing function calls before a crash).
Extracting core dumps from microcontrollers for post-mortem analysis.

If the system halts on a crash, a backtrace (bt command in GDB) can reveal which function was executing before failure. If the overflow affects a heap-allocated variable, GDB can inspect heap memory usage to detect corruption.

The Debugging Process

Now, let’s go through a step-by-step debugging process to identify and fix buffer overflows. Once a crash or unexpected behavior occurs, follow these techniques to trace the root cause:

Step 1: Identify the misbehaving module

If the system crashes, use GDB or LLDB backtrace (bt command) to locate the last executed function. If the system behaves unexpectedly, determine which software module controls the affected functionality.

Step 2: Analyze inputs and outputs of the module

Every function or module has inputs and outputs. Create a truth table listing expected outputs for all possible inputs. Check if the unexpected behavior matches any undefined input combination, which may indicate corruption.

Step 3: Locate memory corruption using address analysis

If a variable shows incorrect values, determine its physical memory location. Depending on where the variable is stored:

Global/static variables (.bss / .data): Look up the memory map file for nearby buffers.

Heap variables: Snapshot heap allocations using GDB.

Here’s an example of using GDB to find corrupted variables:

 (gdb) print &my_variable  # Get memory address of the variable
 $1 = (int *) 0x20001000
 (gdb) x/10x 0x20001000   # Examine memory near this address, Display 10 memory words in hexadecimal format starting from 0x20001000

Step 4: Identify the overflowing buffer

If a buffer is located just before the corrupted variable, inspect its usage in the code. Review all possible code paths that write to the buffer. Check if any design limitations could cause an overflow under specific use cases.

Step 5: Fix the root cause

If the buffer overflow happened due to missing bounds checks, add proper input validation to prevent it. Buffer design should enforce strict memory limits. The module should implement strict boundary checks for all inputs and maintain a consistent state.

In addition to GDB/LLDB, you can also use techniques like hardware tracing and fault injection to simulate buffer overflows and observe system behavior in real-time.

While debugging helps identify and fix buffer overflows, prevention is always the best approach. Let’s explore techniques that can help avoid buffer overflows altogether.

How to Prevent Buffer Overflows

You can often prevent buffer overflows through good software design, defensive programming, hardware protections, and rigorous testing. Embedded systems, unlike general-purpose computers, often lack memory protection mechanisms, which means that buffer overflow prevention is critical for system reliability and security.

Here are some key techniques to help prevent buffer overflows:

Defensive Programming

Defensive programming helps minimize buffer overflow risks by ensuring all inputs are validated and unexpected conditions are handled safely.

First, it’s crucial to validate input size before writing to a buffer. Before every write, add the size of the incoming data to the current write index and confirm that the result does not exceed the buffer size. Only then copy the data.

Then you’ll want to make sure you have proper error handling and fail-safe mechanisms in place. If an input is invalid, halt execution, log the error, or switch to a safe state. Also, functions should indicate success/failure with helpful error codes to prevent misuse.

Sample Code:

   #include <stdio.h>
   #include <stdint.h>
   #include <stdbool.h>
   #include <string.h>

   #define BUFFER_SIZE 300

   static uint16_t sample_count = 0;
   static uint8_t buffer[BUFFER_SIZE] = {0};

   typedef enum
   {
       SUCCESS = 0,
       NOT_ENOUGH_SPACE = 1,
       DATA_IS_INVALID = 2,
   } buffer_err_code_e;


   buffer_err_code_e updateBufferWithData(uint8_t *data, uint16_t size)
   {
       if (data == NULL || size == 0 || size > BUFFER_SIZE)  
       {
           return DATA_IS_INVALID; // Invalid input size
       }

       uint16_t available_space = BUFFER_SIZE - sample_count;
       bool can_write = (available_space >= size) ? true : false;

       if (!can_write)  
       {
           return NOT_ENOUGH_SPACE;
       }

       // Copy data safely
       memcpy(&buffer[sample_count], data, size);
       sample_count += size;

       return SUCCESS;
   }

   int main()
   {   
       buffer_err_code_e ret;

       // Save 1 byte to buffer
       uint8_t data_to_buffer = 10;
       ret = updateBufferWithData(&data_to_buffer, sizeof(data_to_buffer));
       if (ret)  
       {
           printf("Buffer update didn't succeed, Err:%d\n", ret);
       }

       // Save an array of 20 bytes to buffer
       uint8_t data_to_buffer_1[20] = {5};
       ret = updateBufferWithData(data_to_buffer_1, sizeof(data_to_buffer_1));
       if (ret)  
       {
           printf("Buffer update didn't succeed, Err:%d\n", ret);
       }

       // Save an array of 50 x 8 bytes, Intentional buffer overflow
       uint64_t data_to_buffer_2[50] = {7};
       ret = updateBufferWithData((uint8_t*)data_to_buffer_2, sizeof(data_to_buffer_2));  
       if (ret)  
       {
           printf("Buffer update didn't succeed, Err:%d\n", ret);
       }

       return 0;
   }

Choosing the Right Buffer Design And Size

Some buffer designs handle overflow better than others. Choosing the correct buffer type and size for the application reduces the risk of corruption.

Circular Buffers (Ring Buffers) prevent out-of-bounds writes by wrapping around. They overwrite the oldest data instead of corrupting memory. These are useful for real-time streaming data (for example, UART, sensor readings). This approach suits applications that care most about the latest data and can tolerate losing the oldest samples when the consumer falls behind.
Ping-Pong Buffers (Double Buffers) use two buffers. One buffer fills up with data. Then, once it’s full, the writer switches to the second buffer while the first one is processed. This approach is beneficial for applications that cannot tolerate data loss. Size the buffers based on the speed of the write and read tasks, so that the reader always finishes one buffer before the writer fills the other.
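
The ping-pong scheme can be sketched as follows. This is only an illustration: `pingpong_t` and `pp_write` are hypothetical names, and the size of 4 samples per buffer is an assumption taken from the earlier sensor example. The writer fills the active buffer and, when it becomes full, hands a pointer to the completed buffer to the consumer and switches to the other one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PP_SAMPLES 4u

typedef struct {
    uint16_t buf[2][PP_SAMPLES];
    uint8_t  active;  /* index of the buffer currently being filled */
    size_t   fill;    /* samples written to the active buffer so far */
} pingpong_t;

/* Stores one sample. Returns a pointer to the completed buffer when the
 * active one fills up (and switches to the other), otherwise NULL. */
static const uint16_t *pp_write(pingpong_t *pp, uint16_t sample)
{
    assert(pp != NULL && pp->active < 2u && pp->fill < PP_SAMPLES);

    pp->buf[pp->active][pp->fill++] = sample;
    if (pp->fill < PP_SAMPLES) {
        return NULL;
    }

    const uint16_t *full = pp->buf[pp->active];
    pp->active ^= 1u;  /* swap 0 <-> 1 */
    pp->fill = 0;
    return full;
}
```

The design only holds up if the consumer finishes with the returned buffer before the writer fills the other one. If it cannot, the buffers are too small for the rate mismatch between the two tasks.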

Hardware Protection

Memory Protection Unit (MPU)

An MPU (Memory Protection Unit) helps detect unauthorized memory accesses, including buffer overflows, by restricting which regions of memory can be written to. It stops a buffer overflow from modifying protected memory regions and, on ARM Cortex-M processors, triggers a MemManage Fault if a process attempts to write outside an allowed region.

But keep in mind that an MPU does not prevent the overflow itself. It only detects the illegal access and stops execution when it occurs. Not all microcontrollers have an MPU, and some low-end MCUs lack hardware protection, making software-based safeguards even more critical.

Modern C compilers provide several flags to identify memory errors at compile-time:

-Wall -Wextra: Enables useful warnings
-Warray-bounds: Detects out-of-bounds array access when the array size is known at compile-time
-Wstringop-overflow: Warns about possible overflows in calls to memory and string functions like memcpy and strcpy.

Testing and Validation

Testing helps detect buffer overflows before deployment, reducing the risk of field failures. Unit testing each function independently with valid inputs, boundary cases, and invalid inputs helps detect buffer-related issues early. Automated testing, such as fuzzing, feeds random and invalid inputs into the system to uncover crashes and unexpected behavior. Static analysis tools like Coverity and Clang Static Analyzer help detect buffer overflows before runtime. Finally, run real-world inputs on the embedded hardware to catch issues that only appear on the target.
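
As an illustration of boundary-case unit testing, the sketch below exercises a bounds-checked write function at the edges of its valid range. Both `buffer_write` and `test_buffer_boundaries` are hypothetical names, and `buffer_write` is a compact stand-in for whatever function you are actually testing. The test relies on `assert`, so it assumes the build does not define `NDEBUG`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUFFER_SIZE 300u

static uint8_t  buffer[BUFFER_SIZE];
static uint16_t sample_count;

/* Bounds-checked write: a compact stand-in for the function under test. */
static bool buffer_write(const uint8_t *data, uint16_t size)
{
    if (data == NULL || size == 0u || size > BUFFER_SIZE - sample_count) {
        return false;
    }
    memcpy(&buffer[sample_count], data, size);
    sample_count = (uint16_t)(sample_count + size);
    return true;
}

/* Exercises invalid inputs and the exact edges of the buffer.
 * Must be built without NDEBUG, since the calls live inside assert(). */
static void test_buffer_boundaries(void)
{
    static const uint8_t src[BUFFER_SIZE + 1u] = {0};

    sample_count = 0;
    assert(!buffer_write(NULL, 1));                /* invalid pointer */
    assert(!buffer_write(src, 0));                 /* zero length */
    assert(!buffer_write(src, BUFFER_SIZE + 1u));  /* larger than the buffer */
    assert(buffer_write(src, BUFFER_SIZE - 1u));   /* one byte short of full */
    assert(buffer_write(src, 1));                  /* exactly full */
    assert(!buffer_write(src, 1));                 /* one byte too many */
    assert(sample_count == BUFFER_SIZE);           /* rejected writes changed nothing */
}
```

The cases one byte below, exactly at, and one byte above the limit are the ones most likely to expose an off-by-one error in the bounds check.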

Now that we've explored how to identify, debug, and prevent buffer overflows, it’s clear that these vulnerabilities pose a significant threat to embedded systems. From silent data corruption to catastrophic system failures, the consequences can be severe.

But with the right debugging tools, systematic analysis, and preventive techniques, you can effectively either prevent or mitigate buffer overflows in your systems.

Conclusion

Buffer overflows and corruption are major challenges in embedded systems, leading to crashes, unpredictable behavior, and security risks. Debugging these issues is difficult because their symptoms vary based on system state, requiring systematic analysis using memory map files, GDB/LLDB, and structured debugging approaches.

In this article, we explored:

The causes and consequences of buffer overflows and corruptions
How to debug buffer overflows using memory analysis and debugging tools
Best practices for prevention

Buffer overflow prevention requires a multi-layered approach:

Follow a structured software design process to identify risks early.
Apply defensive programming principles to validate inputs and handle errors gracefully.
Use hardware-based protections like MPUs where available.
Enable compiler flags that help identify memory errors.
Test extensively: unit testing, automated testing, and code reviews help catch vulnerabilities early.

By implementing these best practices, you can minimize the risk of buffer overflows in embedded systems, improving reliability and security.

In embedded systems, where reliability and safety are critical, preventing buffer overflows is not just a best practice, it is a necessity. A single buffer overflow can compromise an entire system. Defensive programming, rigorous testing, and hardware protections are essential for building secure and robust embedded applications.

How to Get a Memory Map of Your System using BIOS Interrupts

Nikolaos Panagopoulos — Mon, 23 Sep 2024 14:14:25 +0000

When you are developing a kernel, one of the most important things is memory. The kernel must know how much memory is available and where it's located to avoid overwriting crucial system resources.

But not all memory is freely available for use. Some memory sections are reserved for system functions and others may be occupied by hardware devices. That’s why it is very important to get the system’s memory map.

What is a Memory Map?

But what is a memory map? A memory map is a representation (think about it like a table) that shows how physical memory is organized in your system. It shows the address of each memory region, its length, and its type.

Type 1 means that the region is available for you to use freely and type 2 means that it is reserved by your system. Type 3 means that the region holds Advanced Configuration and Power Interface (ACPI) tables. A type 3 region is not free at first, but it can be reclaimed later, once those tables are no longer needed.

Using a memory map will allow you to manage memory resources successfully without any issues such as crashes or system instability.

There are some ways you can detect your system’s available memory. One is by using the BIOS and interrupt 15h. Another one is by doing memory probing.

In this article you will learn which tools are available to help you get a memory map of your system, which ones you should use, and which ones you should avoid and why. Then finally, you will see some assembly code that you can use in your own bootloader / kernel.

Prerequisites

If you want to follow along with the code shown in this article, you’ll need:

A Linux operating system
Some knowledge of assembly language
A text editor of your choice
An emulator installed. For this example I use QEMU.
FASM assembler installed
Git to be able to clone the repository (https://github.com/nikolaospanagopoulos/memoryMapBoot)

A Few Words about BIOS int 15h

In Real mode, the BIOS offers many interrupts that interact with the hardware and can give you information.

There are some interrupts that can help with getting a memory map, but the most powerful one is int 15h with function E820h. Keep in mind that both numbers are hexadecimal, so decimal values will not work. This method offers a detailed memory map that you can use to safely determine which areas of memory can be used for vital tasks like setting up paging, memory allocation, and more.

In this article you will see how you can use this interrupt to get a detailed memory map of your system.

Now, before we go deeper, I would like to add a few things about memory probing and why you should avoid it.

Memory probing and why you should avoid it

Memory probing is the process of manually accessing physical memory and determining whether it is available or not. The issue is that not all memory is designed to be accessed directly.

Accessing parts of memory that you shouldn’t can cause unpredictable behavior like:

System Crashes: some memory is reserved for BIOS structures, hardware devices etc. Accessing those areas can lead to system crashes or system instability.
Memory Corruption: accessing reserved memory areas can lead to corruption of those areas. This, in turn, can cause crashes, instability, and malfunctions.

So, you should avoid memory probing because it’s an unnecessary risk to your kernel development process.

The Code

Step 1: Prepare to Call int 15h

In this part, you will set up the environment needed to invoke int 15h. The general purpose registers need to be saved so that no important data in them is lost during the interrupt invocation. Then the registers bp and ebx are cleared so that they start from their initial values.

The “SMAP” signature is stored in the edx register so that the BIOS returns the memory map in the expected format. Finally, we set up the 0xe820 function and request memory map data.

pusha
mov di, 0x0504        ; Set DI register for memory storage
xor ebx, ebx          ; EBX must be 0
xor bp, bp            ; BP must be 0 (to keep an entry count)
mov edx, 0x534D4150   ; Place "SMAP" into edx | The "SMAP" signature ensures that the BIOS provides the correct memory map format
mov eax, 0xe820       ; Function 0xE820 to get memory map
mov dword [es:di + 20], 1 ; force a valid ACPI 3.X entry | allows us to get additional information (extended attributes)
mov ecx, 24           ; Request 24 bytes of data

The pusha instruction pushes all general purpose registers onto the stack to save their values during the interrupt call. They can be restored after the interrupt call so that the calling code sees them unchanged.
The mov di, 0x0504 instruction sets the di register to 0x0504 (where the memory map entries will be stored).
xor ebx, ebx XORs the ebx register with itself to clear it. It must be 0 to start retrieving entries.
xor bp, bp uses the same trick to set bp to 0. This register will keep count of your memory entries.
mov edx, 0x534D4150 this instruction stores 0x534D4150 (the ASCII string “SMAP”) in the edx register. It makes certain that the BIOS will return the correct format for your memory map.
mov eax, 0xe820 this instruction selects function 0xe820, which returns the memory map when int 15h is invoked.
mov dword [es:di + 20], 1 this instruction forces a valid ACPI (Advanced Configuration and Power Interface) 3.x entry. This way the BIOS provides extra information in the form of extended attributes.
mov ecx, 24 this instruction asks the BIOS for 24 bytes of memory data. This is the size that ACPI 3.x entries need to include the extra information.

Step 2: Call int15h

Here, you can finally invoke the interrupt to fetch the memory map. You need to check that the function is supported by the BIOS and that valid data is being fetched. You also need to make sure the correct format is being fetched by storing “SMAP” in the edx register again.

    int 0x15                 ; using interrupt
    jc short .failed         ; carry set on first call means "unsupported function"
    mov edx, 0x534D4150      ; Some BIOSes apparently trash this register? lets set it again
    cmp eax, edx             ; on success, eax must have been reset to "SMAP"
    jne short .failed
    test ebx, ebx            ; ebx = 0 implies list is only 1 entry long (worthless)
    je short .failed

int 0x15 this instruction invokes interrupt 0x15.
jc short .failed checks the carry flag. If it is set after the first call, the function is unsupported and the call has failed, so the code jumps to our error handler.
mov edx, 0x534D4150 sets “SMAP” again because some BIOSes corrupt this register during the call.
cmp eax, edx on success, the BIOS returns the “SMAP” value in eax, so this instruction compares the two registers.
jne short .failed if they don’t match, the call has failed and the code jumps to our error handling label.
test ebx, ebx this instruction checks if ebx is 0 after the first call. This means that the memory map only contains one entry. This entry is probably invalid, so the je that follows jumps to the error handling label.

Step 3: Loop Through Memory Entries

After a successful first invocation, you need to loop through each entry of the memory map.

In the loop, you will invoke int 15h again to get each subsequent memory entry while checking the entry’s length and other attributes. If an entry meets the criteria, you increment the counter and store the entry. This continues until there are no entries left to process.

    jmp short .jmpin
.e820lp:
    mov eax, 0xe820          ; eax, ecx get trashed on every int 0x15 call
    mov dword [es:di + 20], 1 ; force a valid ACPI 3.X entry
    mov ecx, 24              ; ask for 24 bytes again
    int 0x15
    jc short .e820f          ; carry set means "end of list already reached"
    mov edx, 0x534D4150      ; repair potentially trashed register
.jmpin:
    jcxz .skipent            ; skip any 0 length entries (cx = 0 indicates an invalid entry length)
    cmp cl, 20               ; got a 24 byte ACPI 3.X response?
    jbe short .notext
    test byte [es:di + 20], 1 ; if bit 0 is clear, the entry should be ignored
    je short .skipent        ; jump if bit 0 is clear
.notext:
    mov eax, [es:di + 8]     ; get lower uint32_t of memory region length
    or eax, [es:di + 12]     ; "or" it with upper uint32_t to test for zero and form 64 bits (little endian)
    jz .skipent              ; if length uint64_t is 0, skip entry
    inc bp                   ; got a good entry: ++count, move to next storage spot
    add di, 24               ; advance di to the next storage spot in the buffer
.skipent:
    test ebx, ebx            ; if ebx resets to 0, list is complete
    jne short .e820lp

.e820lp is the label at the top of the loop that walks through the memory map entries.

The first lines of the loop reload eax and ecx, then call int 0x15 to get the next memory entry:

jc short .e820f jumps out of the loop if the carry flag is set, because that means we have already reached the end of the list.
jcxz .skipent skips the entry if cx is 0, because that means the length of the returned entry is invalid.
cmp cl, 20 checks how many bytes the BIOS returned. An ACPI 3.x entry is 24 bytes long. If the BIOS returned 20 bytes or fewer, there are no extended attributes to check, so jbe short .notext skips that test.
test byte [es:di + 20], 1 checks bit 0 of the entry's extended attributes. If the bit is clear, the entry should be ignored, so the code skips it.
mov eax, [es:di + 8] loads the lower 32 bits of the memory region length, and or eax, [es:di + 12] combines them with the upper 32 bits. If the result is 0, the 64-bit length is 0 and the entry is skipped.
inc bp increments the entry count.
add di, 24 moves the pointer di forward to the next storage slot. Each entry takes up 24 bytes.
test ebx, ebx checks the continuation value after each entry. If ebx is not 0, there are more entries to fetch, so jne short .e820lp runs the loop again.
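
To make the offsets in the loop easier to picture, here is a minimal C sketch of the 24-byte entry and of the same three checks the assembly performs. It is only an illustration of the logic, not a replacement for the real-mode routine. The type name e820_entry_t, the field names, and the function e820_entry_is_good are all hypothetical, and the cx parameter stands in for the byte count the BIOS leaves in ecx.

```c
#include <stdint.h>

/* One 24-byte entry as the BIOS writes it at es:di.
   Names are illustrative; the offsets match the assembly above. */
typedef struct {
    uint32_t base_low;    /* offset 0:  lower 32 bits of the base address   */
    uint32_t base_high;   /* offset 4:  upper 32 bits of the base address   */
    uint32_t length_low;  /* offset 8:  lower 32 bits of the region length  */
    uint32_t length_high; /* offset 12: upper 32 bits of the region length  */
    uint32_t type;        /* offset 16: region type                         */
    uint32_t acpi_attrs;  /* offset 20: ACPI 3.x extended attributes        */
} e820_entry_t;

_Static_assert(sizeof(e820_entry_t) == 24, "entry must be 24 bytes");

/* Returns 1 if the entry would be counted (inc bp), 0 if it would be
   skipped (.skipent). cx models the returned byte count. */
static int e820_entry_is_good(const e820_entry_t *e, uint16_t cx)
{
    if (cx == 0)                                     /* jcxz .skipent        */
        return 0;
    if ((uint8_t)cx > 20 && (e->acpi_attrs & 1u) == 0) /* cmp cl, 20 / test  */
        return 0;
    if ((e->length_low | e->length_high) == 0)       /* or eax, [...] / jz   */
        return 0;
    return 1;
}
```

Note how a 20-byte response bypasses the attribute check entirely, which mirrors the jbe short .notext branch.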

Step 4: End of Memory Entries Handling

Finally, you store the entry count. The popa instruction then restores all general purpose registers to their previous values. If an error occurs at any point in the process, the code jumps to the .failed label, which is our error handler.

.e820f:
    mov [mmap_ent], bp       ; store the entry count
    clc                      ; there is "jc" on end of list to this point, so the carry must be cleared

    popa
    ret
.failed:
    popa                     ; rebalance the stack (assumes the routine starts with pusha, matching the popa above)
    stc                      ; "function unsupported" error exit
    ret

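mov [mmap_ent], bp stores the entry count.
clc clears the carry flag. The jc at the end of the list jumps here with the carry flag set, so it must be cleared to signal success to the caller.
popa pops all general purpose registers back from the stack.
.failed is the label we use for error handling. It sets the carry flag with stc so that the caller can detect the failure.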
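
Once the routine has returned, later kernel code can walk the stored entries. The sketch below is one possible way to do that in C, under a few assumptions: the buffer and the count saved in mmap_ent are made visible to C code somehow, the names mmap_entry_t and total_usable_bytes are hypothetical, and type 1 is treated as usable RAM, which is how the ACPI specification defines that address range type.

```c
#include <stdint.h>

/* Same 24-byte layout the routine stores, with the two 32-bit halves
   of base and length merged into 64-bit fields. Names are illustrative. */
typedef struct {
    uint64_t base;       /* start address of the region   */
    uint64_t length;     /* size of the region in bytes   */
    uint32_t type;       /* 1 = usable RAM                */
    uint32_t acpi_attrs; /* ACPI 3.x extended attributes  */
} mmap_entry_t;

#define E820_TYPE_USABLE 1u

/* Add up the length of every usable region in the map.
   count is the value the assembly stored in mmap_ent. */
static uint64_t total_usable_bytes(const mmap_entry_t *map, uint16_t count)
{
    uint64_t total = 0;
    for (uint16_t i = 0; i < count; i++) {
        if (map[i].type == E820_TYPE_USABLE)
            total += map[i].length;
    }
    return total;
}
```

A physical memory allocator would typically go one step further and record each usable region's base and length instead of only the total.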


Epilogue

In kernel development, one of the most important tasks is managing memory. The above is a reliable way to detect your system’s memory layout information. This means that you can make safe decisions when allocating resources, implementing paging, and so on.

It might appear complex, and maybe it is, but if you follow the code line by line, you will be able to understand it. These techniques will help you build a robust kernel capable of running on different hardware configurations.

Keep Coding!
