In this post I'll explain how new/delete vs. malloc/free and std::vector growth interact with the CPU's memory hierarchy, and why one approach ends up faster or slower than another, drawing on Ulrich Drepper's memory guide, "What Every Programmer Should Know About Memory".
operator new vs malloc
- How they work
When you write:
```cpp
int* p = new int;
```

it calls `operator new(sizeof(int))` (typically a thin wrapper around `malloc`), and then runs any constructors. In contrast:

```cpp
int* p = (int*)malloc(sizeof(int));
```

goes straight to the C allocator without running any constructor.
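To make the constructor difference concrete, here is a minimal sketch (`Widget` is a made-up type for illustration):

```cpp
#include <cstdio>
#include <cstdlib>
#include <new>  // placement new

struct Widget {
    int value;
    Widget() : value(42) { std::puts("Widget constructed"); }
};

int main() {
    Widget* a = new Widget;          // operator new + constructor
    Widget* b = static_cast<Widget*>(std::malloc(sizeof(Widget)));
    // b points at raw bytes: no constructor ran, b->value is indeterminate.
    new (b) Widget;                  // placement new runs the constructor manually
    b->~Widget();                    // ...and the destructor must be run manually too
    std::free(b);
    delete a;                        // destructor + operator delete
}
```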
- Allocator metadata & cache misses
Both allocators maintain metadata (free lists, bins, boundary tags) in the heap itself. Every allocation or free touches that metadata, which often misses in L1/L2 and costs hundreds of cycles to fetch from main memory.
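One way to glimpse this bookkeeping indirectly: glibc's `malloc_usable_size` reports the real block size, which the allocator rounds up to cover headers and alignment. This is a glibc-specific sketch; other allocators expose different introspection calls:

```cpp
#include <cstdio>
#include <cstdlib>
#include <malloc.h>  // malloc_usable_size (glibc-specific)

int main() {
    for (size_t req : {1UL, 8UL, 24UL, 100UL}) {
        void* p = std::malloc(req);
        // The allocator rounds requests up and keeps bookkeeping
        // (size, free/used bits) adjacent to the returned block.
        std::printf("requested %3zu, usable %3zu bytes\n",
                    req, malloc_usable_size(p));
        std::free(p);
    }
}
```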
- Locking & system calls
Modern allocators serialize access with locks; under contention the lock's cache line bounces between cores. If the heap has to grow, `sbrk` or `mmap` system calls add microseconds of overhead, far more than a few extra instructions.
- Fragmentation & TLB pressure
Many small allocations scatter your working set across pages, increasing TLB misses (~200–300 cycles each). Bulk or stack allocations keep your working set tight and TLB-friendly.
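A rough way to feel this cost (a sketch, not a rigorous benchmark): sum the same number of ints through scattered per-element allocations versus one contiguous block. Note that allocations from a fresh heap may still land nearly contiguously; the gap widens on a long-running, fragmented heap:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

constexpr size_t N = 1'000'000;

int main() {
    using namespace std::chrono;

    // Scattered: one heap allocation per element.
    std::vector<int*> scattered;
    scattered.reserve(N);
    for (size_t i = 0; i < N; ++i) scattered.push_back(new int(1));

    // Contiguous: one bulk allocation.
    std::vector<int> contiguous(N, 1);

    long sum = 0;
    auto t0 = high_resolution_clock::now();
    for (int* p : scattered) sum += *p;   // pointer-chasing across pages
    auto t1 = high_resolution_clock::now();
    for (int v : contiguous) sum += v;    // streaming through one region
    auto t2 = high_resolution_clock::now();

    std::printf("scattered: %lld us, contiguous: %lld us (sum=%ld)\n",
                (long long)duration_cast<microseconds>(t1 - t0).count(),
                (long long)duration_cast<microseconds>(t2 - t1).count(), sum);

    for (int* p : scattered) delete p;
}
```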
- Example benchmark snippet
```cpp
#include <iostream>
#include <chrono>
#include <cstdlib>

constexpr size_t N = 1'000'000;

int main() {
    using namespace std::chrono;

    // new/delete
    auto t0 = high_resolution_clock::now();
    for (size_t i = 0; i < N; ++i) {
        int* p = new int(42);
        delete p;
    }
    auto t1 = high_resolution_clock::now();
    std::cout << "new/delete: "
              << duration_cast<milliseconds>(t1 - t0).count() << " ms\n";

    // malloc/free
    auto t2 = high_resolution_clock::now();
    for (size_t i = 0; i < N; ++i) {
        int* p = (int*)malloc(sizeof(int));
        *p = 42;
        free(p);
    }
    auto t3 = high_resolution_clock::now();
    std::cout << "malloc/free: "
              << duration_cast<milliseconds>(t3 - t2).count() << " ms\n";
}
```
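One caveat when running this: since C++14, compilers are allowed to elide matched new/delete pairs entirely, so with optimizations on (e.g. `-O2`) the first loop can legally collapse to nothing. If the numbers look implausibly small, inspect the generated code or make each pointer observably used before freeing it.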
std::vector Growth: push_back vs reserve
- Without reserve
Each time capacity is exhausted, `vector` doubles its buffer, as the snippet after this list shows:
- Allocate a new block (calls `malloc`).
- Copy all existing elements (cache misses, eviction).
- Free the old block (metadata hit again).
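You can watch this growth directly (a quick sketch; the exact growth factor is implementation-defined, e.g. 2x in libstdc++, 1.5x in MSVC):

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;
    size_t last_cap = 0;
    for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        if (v.capacity() != last_cap) {  // capacity changed => realloc + copy happened
            last_cap = v.capacity();
            std::printf("size %4zu -> capacity %4zu\n", v.size(), last_cap);
        }
    }
}
```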
- With reserve
By calling `reserve(N)` up front:
- One single allocation cost.
- No repeated copies or frees.
- Excellent spatial locality: neighbors stay in the same cache lines.
- Example benchmark snippet
```cpp
#include <iostream>
#include <vector>
#include <chrono>

constexpr size_t N = 1'000'000;

int main() {
    using namespace std::chrono;

    // no reserve
    auto s0 = high_resolution_clock::now();
    std::vector<size_t> v1;
    for (size_t i = 0; i < N; ++i) v1.push_back(i);
    auto s1 = high_resolution_clock::now();
    std::cout << "push_back (no reserve): "
              << duration_cast<milliseconds>(s1 - s0).count() << " ms\n";

    // with reserve
    auto s2 = high_resolution_clock::now();
    std::vector<size_t> v2;
    v2.reserve(N);
    for (size_t i = 0; i < N; ++i) v2.push_back(i);
    auto s3 = high_resolution_clock::now();
    std::cout << "push_back (with reserve): "
              << duration_cast<milliseconds>(s3 - s2).count() << " ms\n";
}
```
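For a sense of scale: with `reserve(N)` the loop performs exactly one allocation and no element copies, while the unreserved version with a doubling growth policy needs about 20 growth steps for a million elements (2^20 ≈ 1,048,576), each one touching allocator metadata and recopying every live element.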
Compute vs Memory Gap
Memory latency dwarfs compute:
```
[ L1 cache    ]    1–4 cycles
[ L2 cache    ]   ~10 cycles
[ L3 cache    ]  ~30–50 cycles
[ Main memory ] ~200–300 cycles
```

Even a perfectly inlined 10-cycle function can't compete with a single DRAM miss (~250 cycles).
Reducing allocator metadata churn, lock contention, and realloc‑and‑copy steps saves far more time than micro‑tweaks to instruction streams.
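To see the gap on your own machine, compare streaming through a large array against chasing dependent pointers in random order (a minimal sketch in the spirit of Drepper's measurements; sizes and timings will vary by hardware):

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

constexpr size_t N = 1 << 24;  // ~16M elements, far larger than L3

int main() {
    using namespace std::chrono;

    // Sequential scan: the hardware prefetcher hides most DRAM latency.
    std::vector<int> data(N, 1);
    auto t0 = high_resolution_clock::now();
    long sum = std::accumulate(data.begin(), data.end(), 0L);
    auto t1 = high_resolution_clock::now();

    // Dependent pointer chase: each load needs the previous result,
    // so every cache/TLB miss is paid in full, back to back.
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), size_t{0});
    std::mt19937_64 rng(42);
    for (size_t i = N - 1; i > 0; --i) {   // Sattolo's algorithm: one big cycle
        std::uniform_int_distribution<size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }
    size_t idx = 0;
    auto t2 = high_resolution_clock::now();
    for (size_t i = 0; i < N; ++i) idx = next[idx];
    auto t3 = high_resolution_clock::now();

    std::printf("sequential: %lld ms, random chase: %lld ms (sum=%ld idx=%zu)\n",
                (long long)duration_cast<milliseconds>(t1 - t0).count(),
                (long long)duration_cast<milliseconds>(t3 - t2).count(),
                sum, idx);
}
```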
Practical Recommendations
- Use bulk allocations (e.g. `reserve`) instead of many small ones.
- Prefer stack or arena allocators for temporary data (see the sketch after this list).
- Batch small allocations to minimize fragmentation and lock contention.
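As a concrete example of the arena idea, here is a minimal bump-pointer sketch; a production arena would also handle growth, over-alignment, and thread safety:

```cpp
#include <cstddef>
#include <cstdlib>

// Minimal bump-pointer arena: one malloc up front, a pointer bump per
// allocation, everything freed at once when the arena is destroyed.
class Arena {
    char*  buf_;
    size_t cap_;
    size_t used_ = 0;
public:
    explicit Arena(size_t cap) : buf_((char*)std::malloc(cap)), cap_(cap) {}
    ~Arena() { std::free(buf_); }  // single free releases every allocation

    void* allocate(size_t n, size_t align = alignof(std::max_align_t)) {
        size_t p = (used_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + n > cap_) return nullptr;               // out of space
        used_ = p + n;
        return buf_ + p;
    }
};

int main() {
    Arena arena(1 << 20);  // 1 MiB up front: one metadata hit, one lock
    // Thousands of "allocations" are now just pointer bumps with
    // tight locality and zero per-allocation frees.
    for (int i = 0; i < 1000; ++i) {
        int* p = (int*)arena.allocate(sizeof(int));
        if (p) *p = i;
    }
}  // everything released here in one shot
```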
| Mechanism | Overhead | Performance Tips | Best Use Case |
|---|---|---|---|
| `operator new` / `delete` | Heap metadata access; constructor/destructor calls | Batch small allocations; consider stack or arena allocators | Occasional C++ objects needing ctors; unpredictable allocation points |
| `malloc` / `free` | Heap metadata access; lock contention in multithreaded programs | Use for POD data or C interop; batch via a custom arena or pool | Raw buffers or C-style code paths; few, simple allocations |
| `vector::push_back` (no `reserve`) | Repeated realloc + copy; cache-line & TLB thrash | Call `reserve(N)` when N is known; minimize the number of growth phases | Rapid prototyping or small N |
| `vector::reserve` + `push_back` | Single upfront allocation | Leverage the spatial locality of the contiguous layout; align to the cache line (64 B, sketch below) | Large, fixed-size arrays; high-performance loops with known N |
| Stack / arena allocators | No heap metadata overhead | Use for temporaries with scope-based lifetimes | Short-lived, high-frequency allocations |
| Batched small allocations | Fragmentation & TLB pressure when unbatched; lock bouncing between cores | Group related allocations into pools; reduce lock contention with per-thread arenas | High-throughput allocation patterns |
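For the cache-line alignment tip in the table, one option is C++17's `std::aligned_alloc` (a sketch; note that MSVC does not provide it, and `_aligned_malloc` is the analogue there):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    constexpr std::size_t kCacheLine = 64;
    // std::aligned_alloc requires size to be a multiple of the alignment,
    // so round the request up.
    std::size_t bytes =
        ((1000 * sizeof(float) + kCacheLine - 1) / kCacheLine) * kCacheLine;
    float* data = static_cast<float*>(std::aligned_alloc(kCacheLine, bytes));
    std::printf("buffer at %p, 64-byte aligned: %s\n", static_cast<void*>(data),
                (reinterpret_cast<std::uintptr_t>(data) % kCacheLine == 0)
                    ? "yes" : "no");
    std::free(data);
}
```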