In this post I will explain how `new`/`delete` vs `malloc`/`free` and `std::vector` interact with the CPU's memory hierarchy, and why one is faster or slower, drawing on Ulrich Drepper's memory guide.
operator new vs malloc
- How they work
When you write:

```cpp
int* p = new int;
```

it calls `operator new(sizeof(int))` (typically a thin wrapper around `malloc`), then runs any constructors. In contrast:

```cpp
int* p = (int*)malloc(sizeof(int));
```

goes straight to the C allocator without running constructors.
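The two steps that `new` performs can be made explicit with placement new. A minimal sketch (the `Widget` type and `manual_new_delete` helper are mine, for illustration only):

```cpp
#include <new>  // operator new/delete, placement new

struct Widget {
    int value;
    Widget() : value(42) {}
};

// Reproduce what `new Widget` and `delete w` do, step by step.
int manual_new_delete() {
    void* raw = operator new(sizeof(Widget)); // `new` step 1: allocate raw storage
    Widget* w = new (raw) Widget();           // `new` step 2: run the constructor
    int result = w->value;
    w->~Widget();                             // `delete` step 1: run the destructor
    operator delete(raw);                     // `delete` step 2: release the storage
    return result;
}
```

Separating the steps like this is exactly what containers do internally: they allocate raw storage once and construct elements into it on demand.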
- Allocator metadata & cache misses
Both allocators maintain metadata (freelists, bins, boundary tags) in DRAM. Every allocation or free often misses in L1/L2 and costs hundreds of cycles to reach main memory.
- Locking & system calls
Modern allocators serialize with locks; contention causes cache lines to bounce between cores. If the heap grows, `sbrk` or `mmap` system calls add microseconds of overhead, far more than a few extra instructions.
- Fragmentation & TLB pressure
Many small allocations scatter your working set across pages, increasing TLB misses (~200–300 cycles each). Bulk or stack allocations keep your working set tight and TLB friendly.
- Example benchmark snippet
```cpp
#include <iostream>
#include <chrono>
#include <cstdlib>

constexpr size_t N = 1'000'000;

int main() {
    using namespace std::chrono;

    // new/delete
    auto t0 = high_resolution_clock::now();
    for (size_t i = 0; i < N; ++i) {
        int* p = new int(42);
        delete p;
    }
    auto t1 = high_resolution_clock::now();
    std::cout << "new/delete: "
              << duration_cast<milliseconds>(t1 - t0).count() << " ms\n";

    // malloc/free
    auto t2 = high_resolution_clock::now();
    for (size_t i = 0; i < N; ++i) {
        int* p = (int*)malloc(sizeof(int));
        *p = 42;
        free(p);
    }
    auto t3 = high_resolution_clock::now();
    std::cout << "malloc/free: "
              << duration_cast<milliseconds>(t3 - t2).count() << " ms\n";
}
```
std::vector Growth: push_back vs reserve
- Without reserve
Each time capacity is exhausted, `vector` grows its buffer geometrically (typically doubling):
  - Allocate a new block (calls `malloc` under the hood).
  - Copy all existing elements (cache misses, eviction).
  - Free the old block (another allocator metadata hit).
- With reserve
By calling `reserve(N)` up front:
  - A single up-front allocation.
  - No repeated copies or frees.
  - Excellent spatial locality: neighbors stay in the same cache lines.
- Example benchmark snippet
```cpp
#include <iostream>
#include <vector>
#include <chrono>

constexpr size_t N = 1'000'000;

int main() {
    using namespace std::chrono;

    // no reserve
    auto s0 = high_resolution_clock::now();
    std::vector<size_t> v1;
    for (size_t i = 0; i < N; ++i) v1.push_back(i);
    auto s1 = high_resolution_clock::now();
    std::cout << "push_back (no reserve): "
              << duration_cast<milliseconds>(s1 - s0).count() << " ms\n";

    // with reserve
    auto s2 = high_resolution_clock::now();
    std::vector<size_t> v2;
    v2.reserve(N);
    for (size_t i = 0; i < N; ++i) v2.push_back(i);
    auto s3 = high_resolution_clock::now();
    std::cout << "push_back (with reserve): "
              << duration_cast<milliseconds>(s3 - s2).count() << " ms\n";
}
```
Compute vs Memory Gap
Memory latency dwarfs compute:
[ L1 cache    ]  1–4 cycles
[ L2 cache    ]  ~10 cycles
[ L3 cache    ]  ~30–50 cycles
[ Main memory ]  ~200–300 cycles
Even a perfectly inlined 10‑cycle function can’t compete with a single DRAM miss (~250 cycles).
Reducing allocator metadata churn, lock contention, and realloc‑and‑copy steps saves far more time than micro‑tweaks to instruction streams.
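One way to observe this gap directly is to time a sequential traversal against dependent pointer chasing over the same amount of data. A sketch of the two access patterns (the function names are mine, and absolute timings will vary by machine):

```cpp
#include <vector>
#include <numeric>    // std::accumulate, std::iota
#include <algorithm>  // std::shuffle
#include <random>
#include <cstddef>

// Sequential traversal: the hardware prefetcher streams cache lines ahead,
// hiding most of the ~200-300 cycle DRAM latency.
long long sequential_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// Build a random cyclic permutation: next[i] points to a random successor,
// so a chase through it has no pattern the prefetcher can exploit.
std::vector<size_t> make_cycle(size_t n, unsigned seed = 1) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937(seed));
    std::vector<size_t> next(n);
    for (size_t k = 0; k < n; ++k)
        next[order[k]] = order[(k + 1) % n];
    return next;
}

// Dependent pointer chasing: each load must complete before the next can
// start, so every cache/TLB miss is paid in full.
size_t pointer_chase(const std::vector<size_t>& next, size_t steps) {
    size_t i = 0;
    for (size_t s = 0; s < steps; ++s) i = next[i];
    return i;
}
```

Timed over a buffer much larger than L3, the chase typically runs an order of magnitude slower per element than the sequential sum, even though both touch the same bytes.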
Practical Recommendations
- Use bulk allocations (e.g. `reserve`) instead of many small ones.
- Prefer stack or arena allocators for temporary data.
- Batch small allocations to minimize fragmentation and lock contention.
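The "stack or arena allocator" recommendation can be sketched as a bump allocator: one upfront block, then each allocation is just a pointer increment. This `Arena` class is an illustrative assumption, not a drop-in replacement for a production allocator (no deallocation of individual objects, fixed capacity):

```cpp
#include <cstddef>
#include <vector>

// A bump-pointer arena: one upfront allocation, then each `allocate` is a
// pointer increment -- no per-allocation metadata, locks, or system calls.
class Arena {
    std::vector<std::byte> buf_;
    size_t offset_ = 0;
public:
    explicit Arena(size_t capacity) : buf_(capacity) {}

    void* allocate(size_t size, size_t align = alignof(std::max_align_t)) {
        size_t p = (offset_ + align - 1) & ~(align - 1); // round up (align is a power of two)
        if (p + size > buf_.size()) return nullptr;       // out of space
        offset_ = p + size;
        return buf_.data() + p;
    }

    // Free everything at once at end of scope -- no per-object bookkeeping.
    void reset() { offset_ = 0; }
};
```

Because all objects share one contiguous block and are released together, an arena also keeps the working set tight on a few pages, directly addressing the TLB pressure discussed above.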
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| Mechanism | Overhead | Performance Tips | Best Use Case |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| operator new / delete | • Heap metadata access | • Batch small allocations | • Occasional C++ objects needing ctors |
| | • Constructor/destructor calls | • Consider stack or arena allocators | • Unpredictable allocation points |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| malloc / free | • Heap metadata access | • Use for POD data or C interop | • Raw buffers or C‑style code paths |
| | • Lock contention in multithreaded programs| • Batch via custom arena or pool | • Few, simple allocations |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| vector.push_back (no reserve)| • Repeated realloc + copy | • Call reserve(N) when N is known | • Rapid prototyping or small N |
| | • Cache line & TLB thrash | • Minimize number of growth phases | |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| vector.reserve + push_back | • Single upfront allocation | • Leverage spatial locality | • Large, fixed‑size arrays |
| | • Excellent contiguous layout | • Align to cache‑line (64 B) | • High performance loops with known N |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| Stack / Arena Allocators | • No heap metadata overhead | • Use for temporaries; scope‑based lifetimes | • Short lived, high frequency allocations |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+
| Batched Small Allocations | • Fragmentation & TLB pressure | • Group related allocations into pools | • High throughput allocation patterns |
| | • Lock bouncing between cores | • Reduce lock contention with per‑thread arenas | |
+------------------------------+--------------------------------------------+------------------------------------------------+--------------------------------------------------+