HPC | IT is fun

When implementing $O(N^3)$ operations like matrix multiplication in C++, computational complexity is only half the picture. The physical architecture of the CPU memory hierarchy dictates execution speed just as much as the algorithm itself. By analyzing how the CPU interacts with the L1 cache, we can optimize memory access patterns to yield massive performance gains without changing the underlying mathematical logic. Understanding the L1 Cache To understand the optimization, you need a baseline understanding of CPU caching....