In this post I will explain how `new`/`delete` vs `malloc`/`free` and `std::vector` interact with the CPU's memory hierarchy, and why one is faster or slower, drawing on Ulrich Drepper's memory guide.
operator new vs malloc
- How they work
When you write:

```cpp
int* p = new int;
```

it calls `operator new(sizeof(int))` (typically a thin wrapper around `malloc`), then runs any constructors. In contrast:

```cpp
int* p = (int*)malloc(sizeof(int));
```

goes straight to the C allocator without running constructors.
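The two steps that `new` performs can be made explicit with placement new. A minimal sketch (the `Widget` type and `manual_new_delete` helper are mine, for illustration only):

```cpp
#include <new>  // operator new/delete, placement new

struct Widget {
    int value;
    Widget() : value(42) {}
};

// Reproduce what `new Widget` and `delete w` do, step by step.
int manual_new_delete() {
    void* raw = operator new(sizeof(Widget)); // `new` step 1: allocate raw storage
    Widget* w = new (raw) Widget();           // `new` step 2: run the constructor
    int result = w->value;
    w->~Widget();                             // `delete` step 1: run the destructor
    operator delete(raw);                     // `delete` step 2: release the storage
    return result;
}
```

Separating the steps like this is exactly what containers do internally: they allocate raw storage once and construct elements into it on demand.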
- Allocator metadata & cache misses
Both allocators maintain metadata (freelists, bins, boundary tags) in DRAM. Every allocation or free often misses in L1/L2 and costs hundreds of cycles to reach main memory.
- Locking & system calls
Modern allocators serialize with locks; contention causes cache lines to bounce between cores. If the heap grows, `sbrk` or `mmap` system calls add microseconds of overhead, far more than a few extra instructions.
- Fragmentation & TLB pressure
Many small allocations scatter your working set across pages, increasing TLB misses (~200–300 cycles each). Bulk or stack allocations keep your working set tight and TLB friendly.
- Example benchmark snippet
```cpp
#include <iostream>
#include <chrono>
#include <cstdlib>

constexpr size_t N = 1'000'000;

int main() {
    using namespace std::chrono;

    // new/delete
    auto t0 = high_resolution_clock::now();
    for (size_t i = 0; i < N; ++i) {
        int* p = new int(42);
        delete p;
    }
    auto t1 = high_resolution_clock::now();
    std::cout << "new/delete: "
              << duration_cast<milliseconds>(t1 - t0).count() << " ms\n";

    // malloc/free
    auto t2 = high_resolution_clock::now();
    for (size_t i = 0; i < N; ++i) {
        int* p = (int*)malloc(sizeof(int));
        *p = 42;
        free(p);
    }
    auto t3 = high_resolution_clock::now();
    std::cout << "malloc/free: "
              << duration_cast<milliseconds>(t3 - t2).count() << " ms\n";
}
```
std::vector Growth: push_back vs reserve
- Without reserve
Each time capacity is exhausted, `vector` grows its buffer geometrically (typically doubling):
  - Allocate a new block (calls `malloc` under the hood).
  - Copy all existing elements (cache misses, eviction).
  - Free the old block (another allocator metadata hit).
- With reserve
By calling `reserve(N)` up front:
  - A single up-front allocation.
  - No repeated copies or frees.
  - Excellent spatial locality: neighbors stay in the same cache lines.
- Example benchmark snippet
```cpp
#include <iostream>
#include <vector>
#include <chrono>

constexpr size_t N = 1'000'000;

int main() {
    using namespace std::chrono;

    // no reserve
    auto s0 = high_resolution_clock::now();
    std::vector<size_t> v1;
    for (size_t i = 0; i < N; ++i) v1.push_back(i);
    auto s1 = high_resolution_clock::now();
    std::cout << "push_back (no reserve): "
              << duration_cast<milliseconds>(s1 - s0).count() << " ms\n";

    // with reserve
    auto s2 = high_resolution_clock::now();
    std::vector<size_t> v2;
    v2.reserve(N);
    for (size_t i = 0; i < N; ++i) v2.push_back(i);
    auto s3 = high_resolution_clock::now();
    std::cout << "push_back (with reserve): "
              << duration_cast<milliseconds>(s3 - s2).count() << " ms\n";
}
```
Compute vs Memory Gap
Memory latency dwarfs compute:
[ L1 cache    ]  1–4 cycles
[ L2 cache    ]  ~10 cycles
[ L3 cache    ]  ~30–50 cycles
[ Main memory ]  ~200–300 cycles
Even a perfectly inlined 10‑cycle function can’t compete with a single DRAM miss (~250 cycles).
Reducing allocator metadata churn, lock contention, and realloc‑and‑copy steps saves far more time than micro‑tweaks to instruction streams.
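One way to observe this gap directly is to time a sequential traversal against dependent pointer chasing over the same amount of data. A sketch of the two access patterns (the function names are mine, and absolute timings will vary by machine):

```cpp
#include <vector>
#include <numeric>    // std::accumulate, std::iota
#include <algorithm>  // std::shuffle
#include <random>
#include <cstddef>

// Sequential traversal: the hardware prefetcher streams cache lines ahead,
// hiding most of the ~200-300 cycle DRAM latency.
long long sequential_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// Build a random cyclic permutation: next[i] points to a random successor,
// so a chase through it has no pattern the prefetcher can exploit.
std::vector<size_t> make_cycle(size_t n, unsigned seed = 1) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937(seed));
    std::vector<size_t> next(n);
    for (size_t k = 0; k < n; ++k)
        next[order[k]] = order[(k + 1) % n];
    return next;
}

// Dependent pointer chasing: each load must complete before the next can
// start, so every cache/TLB miss is paid in full.
size_t pointer_chase(const std::vector<size_t>& next, size_t steps) {
    size_t i = 0;
    for (size_t s = 0; s < steps; ++s) i = next[i];
    return i;
}
```

Timed over a buffer much larger than L3, the chase typically runs an order of magnitude slower per element than the sequential sum, even though both touch the same bytes.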
Practical Recommendations
- Use bulk allocations (e.g. `reserve`) instead of many small ones.
- Prefer stack or arena allocators for temporary data.
- Batch small allocations to minimize fragmentation and lock contention.
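The "stack or arena allocator" recommendation can be sketched as a bump allocator: one upfront block, then each allocation is just a pointer increment. This `Arena` class is an illustrative assumption, not a drop-in replacement for a production allocator (no deallocation of individual objects, fixed capacity):

```cpp
#include <cstddef>
#include <vector>

// A bump-pointer arena: one upfront allocation, then each `allocate` is a
// pointer increment -- no per-allocation metadata, locks, or system calls.
class Arena {
    std::vector<std::byte> buf_;
    size_t offset_ = 0;
public:
    explicit Arena(size_t capacity) : buf_(capacity) {}

    void* allocate(size_t size, size_t align = alignof(std::max_align_t)) {
        size_t p = (offset_ + align - 1) & ~(align - 1); // round up (align is a power of two)
        if (p + size > buf_.size()) return nullptr;       // out of space
        offset_ = p + size;
        return buf_.data() + p;
    }

    // Free everything at once at end of scope -- no per-object bookkeeping.
    void reset() { offset_ = 0; }
};
```

Because all objects share one contiguous block and are released together, an arena also keeps the working set tight on a few pages, directly addressing the TLB pressure discussed above.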
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| Mechanism | Overhead | Performance Tips | Best Use Case |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| operator new / delete | • Heap metadata access | • Batch small allocations | • Occasional C++ objects needing ctors |
| | • Constructor/destructor calls | • Consider stack or arena allocators | • Unpredictable allocation points |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| malloc / free | • Heap metadata access | • Use for POD data or C interop | • Raw buffers or C‑style code paths |
| | • Lock contention in multithreaded programs| • Batch via custom arena or pool | • Few, simple allocations |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| vector.push_back (no reserve)| • Repeated realloc + copy | • Call reserve(N) when N is known | • Rapid prototyping or small N |
| | • Cache line & TLB thrash | • Minimize number of growth phases | |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| vector.reserve + push_back | • Single upfront allocation | • Leverage spatial locality | • Large, fixed‑size arrays |
| | • Excellent contiguous layout | • Align to cache‑line (64 B) | • High performance loops with known N |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| Stack / Arena Allocators | • No heap metadata overhead | • Use for temporaries; scope‑based lifetimes | • Short lived, high frequency allocations |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| Batched Small Allocations | • Fragmentation & TLB pressure | • Group related allocations into pools | • High throughput allocation patterns |
| | • Lock bouncing between cores | • Reduce lock contention with per‑thread arenas | |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+