Let’s talk about CPU speed, practically. By practically, I mean, how fast can your CPU do linear algebra operations. And by linear algebra operations, I mean matrix-matrix multiplies.
First, you need to calculate how many FLOPS your computer can do. The following formula comes in handy:
\[
\text{nFLOPS} = \text{cores} \cdot \frac{\text{clock cycles}}{\text{second}} \cdot \frac{\text{FLOPS}}{\text{cycle}}.
\]
You probably already know the number of cores in your computer, and the number of clock cyles. The interesting thing here is the number of FLOPS per cycle: this depends on the architecture of your CPU and what exactly you take to be the size of a float.
It’s standard to take a float to consist of 32 bits, so the number of FLOPS per cycle depends on how many multiples of 32 bits can fit into your registers. SSE capable CPUs have 128 bit registers, so can do 4 FLOPS per cycle (this is the most common set of CPUs). AVX capable CPUs have 256 bit registers, so can do 8 FLOPS per cycle (e.g. the latest Macbook Pros are AVX capable).
Putting these bits together, I get that my workstation, which has 2 hexa-core SSE-capable CPUS each running at 2 GHz achieves
\[
\text{nFLOPS} = (2*6) * (2*10^9)*4 = 96 \text{GFLOPS}.
\]
The cost of a matrix-matrix multiply of two \(n\)-by-\(n\) matrices is essentially \(n^3\) floating point operations. Thus it should take this workstation about \(\frac{n^3}{96} * 10^{-9}\) seconds to do this multiply.
E.g., in my case, the predicted time of squaring two \(16\text{K} \times 16\text{K}\) matrices is about 42.6 seconds. A quick Matlab check shows it does take about 43 seconds.