Why Compiler Engineers Rarely Use Strassen's Algorithm for Fast Matrix Multiplications
Coding the 1969 Paper 'Gaussian Elimination is not Optimal' to Capture Strassen's Original Insight
Quick Summary
We code Strassen’s matrix multiplication algorithm in C from the original 1969 paper and observe why compiler engineers rarely use the algorithm in practice.
Code is available on GitHub.
1.0 Paper Introduction
Strassen’s algorithm, introduced in the 1969 paper Gaussian Elimination is not Optimal (Strassen, 1969), which was presented in 1968, was the world’s first subcubic matrix multiplication algorithm.
On paper, Strassen’s algorithm improves the complexity of matrix multiplication from O(n^3) to O(n^2.8074).
However, in practice, Strassen’s algorithm doesn’t live up to the hype because:
Floating-point multiplications are numerically more stable than additions and subtractions, which can lose significant digits to cancellation.
Strassen’s algorithm decreases multiplications in favor of adds and subs, so floating-point errors accumulate super fast.
Strassen’s algorithm is recursive, not iterative, so vectorization is difficult.
Only Haskell-lovers (that’s a slur in my books) know how to vectorize Strassen’s algorithm.
The algorithm demands that one fine-tune hyperparameters, such as the matrix size at which recursion stops, for different computers and matrix sizes.
My computer’s version of Strassen may run slower on your computer.
The algorithm works on square matrices whose dimensions are powers of 2. So we need to pad the original matrix (increase its size) up to the next power of 2 to run Strassen’s algorithm.
These drawbacks guide our thesis: for compiler engineers, the naive O(n^3) matmul is faster in practice than Strassen’s O(n^2.8074) algorithm.
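The padding step mentioned in the drawbacks can be sketched as follows. This is a minimal illustration, not the code from the repository: the names next_pow2 and pad_to_pow2 are hypothetical, the element type is assumed to be float, and matrices are assumed to be stored row-major. Zero padding is safe because the top-left n×n block of the padded product equals the original product.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Smallest power of two that is >= n (returns 1 for n == 0). */
size_t next_pow2(size_t n) {
    size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Hypothetical helper: copy the n x n row-major matrix M into the top-left
 * corner of a zero-filled p x p matrix, where p = next_pow2(n).
 * Returns NULL on allocation failure; the caller frees the result. */
float *pad_to_pow2(const float *M, size_t n, size_t *p_out) {
    size_t p = next_pow2(n);
    float *P = calloc(p * p, sizeof *P);   /* zero-filled on IEEE 754 targets */
    if (P == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        memcpy(P + i * p, M + i * n, n * sizeof *P);   /* copy one row */
    }
    *p_out = p;
    return P;
}
```

For a 3×3 input this allocates a 4×4 matrix, so 7 of the 16 stored entries are padding that Strassen still has to process.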
1.1 Naive matrix multiplication and Cubic Complexity
This section covers the exponent x used to describe matmul complexity in O(n^x).
In a nutshell, naive matrix multiplication is considered O(n^3) because it performs on the order of n^3 scalar operations when multiplying two n×n matrices.
Consider a single matrix entry, c_ij. We perform n multiplications and n-1 additions to find this entry. So the work per entry is O(n).
The whole matrix has n rows and n columns. Therefore, there are n^2 entries to compute. Since each entry costs O(n), the total compute cost for the entire matrix is n^2 × O(n) = O(n^3).
If you’re a programmer then it’s best to visualize the cubic complexity as the three nested for loops of a matmul:
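A minimal sketch of those three loops, assuming row-major float matrices (the function name matmul_naive is ours, chosen for illustration):

```c
#include <stddef.h>

/* Naive matmul: C = A * B for row-major n x n matrices. */
void matmul_naive(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++) {             /* n rows of C            */
        for (size_t j = 0; j < n; j++) {         /* n columns of C         */
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) {     /* n products per entry   */
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

Each loop runs n times, so the innermost multiply-add executes n × n × n = n^3 times.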
1.2 Strassen’s Algorithm
*Note: Strassen’s algorithm is designed for square matrices whose dimensions are a power of 2.
In 1969, Strassen observed that one needs only 7 multiplications, as opposed to 8, when multiplying two 2×2 matrices (He & Williams, 2023).
The complexity of Strassen’s algorithm is derived by analyzing the recursion used to multiply large matrices where n is a power of 2.
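As a worked version of that analysis: let T(n) be the cost of multiplying two n×n matrices. Each level of recursion makes 7 half-size multiplications plus a constant number of half-size matrix additions, each of which costs O(n^2). The naive block recursion is shown alongside for comparison.

```latex
\begin{align*}
T_{\text{Strassen}}(n) &= 7\,T_{\text{Strassen}}\!\left(\tfrac{n}{2}\right) + O(n^2)
  \;\Longrightarrow\; T_{\text{Strassen}}(n) = O\!\left(n^{\log_2 7}\right) \approx O\!\left(n^{2.8074}\right) \\
T_{\text{naive}}(n) &= 8\,T_{\text{naive}}\!\left(\tfrac{n}{2}\right) + O(n^2)
  \;\Longrightarrow\; T_{\text{naive}}(n) = O\!\left(n^{\log_2 8}\right) = O\!\left(n^{3}\right)
\end{align*}
```

Saving one multiplication out of eight at every level is what lowers the exponent from log2(8) = 3 to log2(7) ≈ 2.8074.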
In practice, Strassen’s algorithm is implemented via matrix partitions and recursion.
One needs helper functions for element-wise addition and subtraction:
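A minimal sketch of such helpers, assuming row-major float matrices (the names mat_add and mat_sub are illustrative, not necessarily those used in the repository):

```c
#include <stddef.h>

/* out = X + Y, element-wise, for n x n row-major matrices. */
void mat_add(const float *X, const float *Y, float *out, size_t n) {
    for (size_t i = 0; i < n * n; i++) {
        out[i] = X[i] + Y[i];
    }
}

/* out = X - Y, element-wise, for n x n row-major matrices. */
void mat_sub(const float *X, const float *Y, float *out, size_t n) {
    for (size_t i = 0; i < n * n; i++) {
        out[i] = X[i] - Y[i];
    }
}
```

Both helpers touch each entry exactly once, which is the O(n^2) term in Strassen’s recurrence.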
Now we implement Strassen’s recursive algorithm.
First, we write a base case to end recursion, plus we allocate memory for Strassen’s temporary variables:
Next we compute the algorithm’s constants:
We proceed to perform Strassen’s subtractions, additions and recursive calls:
Finally, we combine our results and free memory:
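Putting the four steps together, a compact sketch of the recursion might look like the following. It is an illustration under stated assumptions rather than the repository’s code: the element type is float, matrices are row-major, n is a power of 2, and the names strassen, get_quadrant, set_quadrant and the LEAF_SIZE macro are ours. The seven products follow the formulas in Strassen’s paper.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define LEAF_SIZE 1 /* assumption: recurse all the way down to 1x1 blocks */

static void add(const float *X, const float *Y, float *out, size_t len) {
    for (size_t i = 0; i < len; i++) out[i] = X[i] + Y[i];
}

static void sub(const float *X, const float *Y, float *out, size_t len) {
    for (size_t i = 0; i < len; i++) out[i] = X[i] - Y[i];
}

/* Plain triple-loop product, used as the base case of the recursion. */
static void naive(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Copy the (n/2) x (n/2) quadrant of M with top-left corner (row, col) into Q. */
static void get_quadrant(const float *M, float *Q, size_t n, size_t row, size_t col) {
    size_t h = n / 2;
    for (size_t i = 0; i < h; i++)
        memcpy(Q + i * h, M + (row + i) * n + col, h * sizeof *Q);
}

/* Copy Q back into the quadrant of M with top-left corner (row, col). */
static void set_quadrant(float *M, const float *Q, size_t n, size_t row, size_t col) {
    size_t h = n / 2;
    for (size_t i = 0; i < h; i++)
        memcpy(M + (row + i) * n + col, Q + i * h, h * sizeof *Q);
}

/* C = A * B for n x n matrices, n a power of 2.
 * Returns 0 on success, -1 if an allocation failed. */
int strassen(const float *A, const float *B, float *C, size_t n) {
    if (n <= LEAF_SIZE) {               /* base case ends the recursion */
        naive(A, B, C, n);
        return 0;
    }
    size_t h = n / 2, q = h * h;
    /* One buffer: 8 quadrants, 7 products, 2 scratch matrices. */
    float *buf = malloc(17 * q * sizeof *buf);
    if (buf == NULL) return -1;
    float *A11 = buf,          *A12 = buf + q,      *A21 = buf + 2 * q,  *A22 = buf + 3 * q;
    float *B11 = buf + 4 * q,  *B12 = buf + 5 * q,  *B21 = buf + 6 * q,  *B22 = buf + 7 * q;
    float *M1  = buf + 8 * q,  *M2  = buf + 9 * q,  *M3  = buf + 10 * q, *M4  = buf + 11 * q;
    float *M5  = buf + 12 * q, *M6  = buf + 13 * q, *M7  = buf + 14 * q;
    float *T1  = buf + 15 * q, *T2  = buf + 16 * q;

    get_quadrant(A, A11, n, 0, 0); get_quadrant(A, A12, n, 0, h);
    get_quadrant(A, A21, n, h, 0); get_quadrant(A, A22, n, h, h);
    get_quadrant(B, B11, n, 0, 0); get_quadrant(B, B12, n, 0, h);
    get_quadrant(B, B21, n, h, 0); get_quadrant(B, B22, n, h, h);

    int err = 0;
    /* Seven recursive products instead of eight. */
    add(A11, A22, T1, q); add(B11, B22, T2, q); err |= strassen(T1, T2, M1, h);
    add(A21, A22, T1, q);                       err |= strassen(T1, B11, M2, h);
    sub(B12, B22, T2, q);                       err |= strassen(A11, T2, M3, h);
    sub(B21, B11, T2, q);                       err |= strassen(A22, T2, M4, h);
    add(A11, A12, T1, q);                       err |= strassen(T1, B22, M5, h);
    sub(A21, A11, T1, q); add(B11, B12, T2, q); err |= strassen(T1, T2, M6, h);
    sub(A12, A22, T1, q); add(B21, B22, T2, q); err |= strassen(T1, T2, M7, h);

    /* Combine: C11 = M1+M4-M5+M7, C12 = M3+M5, C21 = M2+M4, C22 = M1-M2+M3+M6. */
    add(M1, M4, T1, q); sub(T1, M5, T1, q); add(T1, M7, T1, q); set_quadrant(C, T1, n, 0, 0);
    add(M3, M5, T1, q);                                         set_quadrant(C, T1, n, 0, h);
    add(M2, M4, T1, q);                                         set_quadrant(C, T1, n, h, 0);
    sub(M1, M2, T1, q); add(T1, M3, T1, q); add(T1, M6, T1, q); set_quadrant(C, T1, n, h, h);

    free(buf);
    return err ? -1 : 0;
}
```

Every call allocates and copies 17 quarter-size matrices, and with LEAF_SIZE set to 1 this overhead is paid all the way down to single scalars. Raising LEAF_SIZE hands small blocks to the naive loop instead.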
2.0 Comparing Naive Matmul and Strassen’s Algorithm
This section compares accuracy and speed. We check results using this function, where epsilon is our error margin:
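A comparison of this kind can be sketched as an entry-by-entry absolute-difference check. The name matrices_equal and the float element type are assumptions made for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when every entry of the n x n matrices X and Y
 * differs by at most epsilon in absolute value. */
bool matrices_equal(const float *X, const float *Y, size_t n, float epsilon) {
    for (size_t i = 0; i < n * n; i++) {
        float diff = X[i] - Y[i];
        if (diff < 0.0f) {
            diff = -diff;          /* absolute value without needing libm */
        }
        if (diff > epsilon) {
            return false;
        }
    }
    return true;
}
```

An absolute tolerance is the simplest choice. A relative tolerance would scale better when entries grow large, but that is a different test from the one described here.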
2.1 Integer vs Floating-Point Accuracy
We set our DataType macro to int or float:
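One plausible way to wire up such a macro is shown below. Only the DataType name comes from the text; the compile-time override and the dot2 helper are hypothetical and serve only to show a routine written against the macro.

```c
/* Assumption: the element type is chosen at compile time.
 * Build with -DDataType=int to switch from the float default. */
#ifndef DataType
#define DataType float
#endif

/* Hypothetical example of a routine written against DataType
 * rather than a fixed type: a two-element dot product. */
DataType dot2(const DataType *x, const DataType *y) {
    return x[0] * y[0] + x[1] * y[1];
}
```

With this layout the same matmul source can be compiled once per element type, which makes the integer and floating-point runs directly comparable.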
For integers, Strassen’s algorithm with epsilon equal to 1e-7 works great!
However, for floating point, larger matrices demand looser error margins.
For instance, for 512×512 matmuls, only 1 decimal place may be correct:

Section conclusion: Strassen’s algorithm experiences extreme floating-point error accumulation compared to naive matrix multiplication.
2.2 Speed Comparison
Strassen’s algorithm should be faster in practice, right? Wrong. For a 512×512 matrix, our implementation takes 10 seconds while naive matmul takes 0.9 seconds.
Strassen’s speed depends on the silliest hyperparameters. For instance, our recursion’s naive base case is pretty awful:
Let’s stop recursion at matrices of dimension 64 instead:

Two rather unintuitive observations occur to us:
Up to dimension ~64, it’s much faster to perform a naive matmul than to use Strassen’s algorithm (which should be faster on paper).
Exiting Strassen’s recursion early improves accuracy and makes it faster than naive matmul.
3.0 Conclusion
These observations guide our conclusion: compiler engineers are better off writing a naive matmul than using Strassen’s algorithm.