Why Compiler Engineers Rarely Use Strassen's Algorithm for Fast Matrix Multiplications

Coding the 1969 Paper 'Gaussian Elimination is not Optimal' to Capture Strassen's Original Insight

Murage Kibicho

Jun 08, 2026

Quick Summary

We code Strassen’s matrix multiplication algorithm in C from the original 1969 paper and observe why compiler engineers rarely use the algorithm in practice.

*Gaussian Elimination is not Optimal* (Strassen, 1969) paper abstract

Code is available on GitHub.

1.0 Paper Introduction

Strassen’s algorithm, introduced in the 1969 paper Gaussian Elimination is not Optimal (presented in 1968) (Strassen, 1969), was the world’s first subcubic matrix multiplication algorithm.

On paper, Strassen’s algorithm improves the complexity of matrix multiplication from O(n³) to O(n^2.8074).

However, in practice, Strassen’s algorithm doesn’t live up to the hype because:

Floating point multiplications are more stable than additions and subtractions.
- Strassen’s algorithm decreases multiplications in favor of adds and subs, so floating-point errors accumulate super fast.
Strassen’s algorithm is recursive, not iterative, so vectorization is difficult.
- Only Haskell-lovers (that’s a slur in my books) know how to vectorize Strassen’s algorithm.
The algorithm demands that one fine-tune some hyperparameters for different computers and matrix sizes.
- My computer’s version of Strassen may run slower on your computer.
- The algorithm works on square matrices whose dimensions are powers of 2. So we need to pad (increase the size of the original matrix) to the closest power of 2 to run Strassen’s algorithm.
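The padding step in the last bullet can be sketched as follows. This is a minimal illustration, assuming row-major float matrices; the names next_pow2 and pad_to_pow2 are ours and not taken from the article's repository.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Smallest power of two >= n (returns 1 for n == 0).
   Assumes n is small enough that the shift does not overflow. */
size_t next_pow2(size_t n) {
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* Copy a rows x cols row-major matrix into the top-left corner of a
   zero-filled p x p matrix, where p = next_pow2(max(rows, cols)).
   Returns NULL if allocation fails. The caller frees the result. */
float *pad_to_pow2(const float *src, size_t rows, size_t cols, size_t *p_out) {
    size_t p = next_pow2(rows > cols ? rows : cols);
    float *dst = calloc(p * p, sizeof *dst);
    if (!dst) return NULL;
    for (size_t i = 0; i < rows; i++)
        memcpy(dst + i * p, src + i * cols, cols * sizeof *dst);
    *p_out = p;
    return dst;
}
```

The zero rows and columns do not change the product's top-left block, but they do cost extra work: a 513×513 input is padded all the way up to 1024×1024.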

These drawbacks guide our thesis: for a compiler engineer, the naive O(n³) matmul is faster in practice than Strassen’s O(n^2.8074) algorithm.

1.1 Naive matrix multiplication and Cubic Complexity

This section covers the x exponent used to describe matmul complexity in O(n^x).

The exponent of a matrix multiplication. Taken from (He & Williams, 2023)

In a nutshell, naive matrix multiplication is considered O(n³) because it performs on the order of n³ scalar operations when multiplying two n×n matrices.

Consider a single matrix entry, c_ij. We perform n multiplications and n-1 additions to find this entry. So the work per entry is O(n).

The whole matrix has n rows and n columns. Therefore, there are n² entries to compute. Since each entry costs O(n), the total compute cost for the entire matrix is:

n² × O(n) = O(n³)

If you’re a programmer then it’s best to visualize the cubic complexity as the three nested for loops of a matmul:

Triple nested for loop demonstrates cubic complexity
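A minimal sketch of such a loop nest, assuming row-major float arrays (the name naive_matmul is ours for illustration and may differ from the repository's):

```c
#include <stddef.h>

/* C = A * B for n x n row-major matrices. */
void naive_matmul(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++) {           /* n rows of C            */
        for (size_t j = 0; j < n; j++) {       /* n columns of C         */
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)     /* n multiply-adds each   */
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

Each of the three loops runs n times, so the innermost multiply-add executes n × n × n times.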

1.2 Strassen’s Algorithm

*Note: Strassen’s algorithm is designed for square matrices whose dimensions are a power of 2.

Strassen’s Algorithm. Taken from page 2 of (Strassen, 1969)

In 1969, Strassen observed that multiplying two 2×2 matrices needs only 7 multiplications as opposed to 8 (He & Williams, 2023).
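To make the seven products concrete, here is a sketch of the scalar 2×2 case using the standard form of Strassen's identities. The function name strassen_2x2 and the flat row-major layout are our assumptions for illustration.

```c
/* C = A * B for 2 x 2 row-major matrices, using 7 multiplications
   and 18 additions/subtractions instead of 8 multiplications and 4 additions.
   Index map: [0] = row 1 col 1, [1] = row 1 col 2, [2] = row 2 col 1, [3] = row 2 col 2. */
void strassen_2x2(const float A[4], const float B[4], float C[4]) {
    float m1 = (A[0] + A[3]) * (B[0] + B[3]);
    float m2 = (A[2] + A[3]) * B[0];
    float m3 = A[0] * (B[1] - B[3]);
    float m4 = A[3] * (B[2] - B[0]);
    float m5 = (A[0] + A[1]) * B[3];
    float m6 = (A[2] - A[0]) * (B[0] + B[1]);
    float m7 = (A[1] - A[3]) * (B[2] + B[3]);

    C[0] = m1 + m4 - m5 + m7;
    C[1] = m3 + m5;
    C[2] = m2 + m4;
    C[3] = m1 - m2 + m3 + m6;
}
```

The identities never rely on commutativity of multiplication, which is why the same formulas work when the four entries are themselves matrix blocks.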

The complexity of Strassen’s algorithm is derived by analyzing the recursion needed to multiply large n×n matrices, where n is a power of 2.

Strassen’s Algorithm complexity analysis. Taken from (He & Williams, 2023)
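As a worked version of that analysis, the recurrence can be written out as below. This is a sketch: it assumes each level performs 7 half-size multiplications plus 18 half-size matrix additions or subtractions, and that T(n) counts scalar operations.

```latex
\begin{aligned}
T(n) &= 7\,T\!\left(\tfrac{n}{2}\right) + 18\left(\tfrac{n}{2}\right)^{2}, \qquad T(1) = 1 \\
     &\Longrightarrow\; T(n) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.8074}\right)
\end{aligned}
```

The solution follows from the master theorem because log₂ 7 ≈ 2.8074 exceeds the exponent 2 of the addition term. Replacing the 7 with 8 (the ordinary block matmul) gives n^(log₂ 8) = n³.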

In practice, Strassen’s algorithm is implemented via matrix partitions and recursion.

One needs helper functions for element-wise addition and subtraction:

Element-wise matrix addition and subtraction
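A minimal sketch of these helpers, assuming square row-major matrices stored contiguously. The names mat_add and mat_sub are hypothetical, and DataType defaults to float here.

```c
#include <stddef.h>

#ifndef DataType
#define DataType float   /* assumption: float unless the build overrides it */
#endif

/* out = x + y, element-wise, over n x n matrices. */
void mat_add(const DataType *x, const DataType *y, DataType *out, size_t n) {
    for (size_t i = 0; i < n * n; i++)
        out[i] = x[i] + y[i];
}

/* out = x - y, element-wise, over n x n matrices. */
void mat_sub(const DataType *x, const DataType *y, DataType *out, size_t n) {
    for (size_t i = 0; i < n * n; i++)
        out[i] = x[i] - y[i];
}
```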

Now we implement Strassen’s recursive algorithm.

First, we write a base case to end recursion, plus we allocate memory for Strassen’s temporary variables:

Strassen’s algorithm needs more memory than naive matmuls and is recursive

Next we compute the algorithm’s constants:

We proceed to perform Strassen’s subtractions, additions and recursive calls:

Addition, subtraction and recursive calls in Strassen’s algorithm

Finally, we combine our results and free memory:
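Putting the steps above together, one possible end-to-end sketch looks like this. It is not the repository's code: the quadrant-copying layout, the single scratch allocation, the cutoff parameter, and all function names are our assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

#ifndef DataType
#define DataType float   /* assumption: float unless the build overrides it */
#endif

/* o = x + y and o = x - y over len contiguous elements (o may alias x). */
static void add(const DataType *x, const DataType *y, DataType *o, size_t len) {
    for (size_t i = 0; i < len; i++) o[i] = x[i] + y[i];
}
static void sub(const DataType *x, const DataType *y, DataType *o, size_t len) {
    for (size_t i = 0; i < len; i++) o[i] = x[i] - y[i];
}

/* Plain triple loop, used as the recursion's base case. */
void naive_matmul(const DataType *A, const DataType *B, DataType *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            DataType sum = 0;
            for (size_t k = 0; k < n; k++) sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Copy quadrant (qi, qj) of the n x n matrix M into the h x h buffer Q, h = n / 2. */
static void get_quad(const DataType *M, DataType *Q, size_t n, size_t qi, size_t qj) {
    size_t h = n / 2;
    for (size_t i = 0; i < h; i++)
        for (size_t j = 0; j < h; j++)
            Q[i * h + j] = M[(qi * h + i) * n + qj * h + j];
}
/* Inverse of get_quad: write the h x h buffer Q into quadrant (qi, qj) of M. */
static void put_quad(DataType *M, const DataType *Q, size_t n, size_t qi, size_t qj) {
    size_t h = n / 2;
    for (size_t i = 0; i < h; i++)
        for (size_t j = 0; j < h; j++)
            M[(qi * h + i) * n + qj * h + j] = Q[i * h + j];
}

/* C = A * B for n x n row-major matrices, n a power of two.
   Blocks of size <= cutoff fall back to naive_matmul.
   Returns 0 on success, -1 if an allocation fails. */
int strassen(const DataType *A, const DataType *B, DataType *C, size_t n, size_t cutoff) {
    if (n <= cutoff || n <= 1) { naive_matmul(A, B, C, n); return 0; }
    size_t h = n / 2, len = h * h;
    /* 8 quadrants + 7 products + 2 scratch blocks = 17 buffers of h * h. */
    DataType *buf = malloc(17 * len * sizeof *buf);
    if (!buf) return -1;
    DataType *a[4], *b[4], *m[7];
    for (int q = 0; q < 4; q++) {          /* a[0]=A11 a[1]=A12 a[2]=A21 a[3]=A22 */
        a[q] = buf + q * len;
        b[q] = buf + (4 + q) * len;
        get_quad(A, a[q], n, q / 2, q % 2);
        get_quad(B, b[q], n, q / 2, q % 2);
    }
    for (int k = 0; k < 7; k++) m[k] = buf + (8 + k) * len;
    DataType *s = buf + 15 * len, *t = buf + 16 * len;
    int rc = 0;
    add(a[0], a[3], s, len); add(b[0], b[3], t, len); rc |= strassen(s, t, m[0], h, cutoff);
    add(a[2], a[3], s, len);                          rc |= strassen(s, b[0], m[1], h, cutoff);
    sub(b[1], b[3], t, len);                          rc |= strassen(a[0], t, m[2], h, cutoff);
    sub(b[2], b[0], t, len);                          rc |= strassen(a[3], t, m[3], h, cutoff);
    add(a[0], a[1], s, len);                          rc |= strassen(s, b[3], m[4], h, cutoff);
    sub(a[2], a[0], s, len); add(b[0], b[1], t, len); rc |= strassen(s, t, m[5], h, cutoff);
    sub(a[1], a[3], s, len); add(b[2], b[3], t, len); rc |= strassen(s, t, m[6], h, cutoff);
    if (rc == 0) {
        /* C11 = M1 + M4 - M5 + M7 */
        add(m[0], m[3], s, len); sub(s, m[4], s, len); add(s, m[6], s, len);
        put_quad(C, s, n, 0, 0);
        /* C12 = M3 + M5 */
        add(m[2], m[4], s, len); put_quad(C, s, n, 0, 1);
        /* C21 = M2 + M4 */
        add(m[1], m[3], s, len); put_quad(C, s, n, 1, 0);
        /* C22 = M1 - M2 + M3 + M6 */
        sub(m[0], m[1], s, len); add(s, m[2], s, len); add(s, m[5], s, len);
        put_quad(C, s, n, 1, 1);
    }
    free(buf);
    return rc ? -1 : 0;
}
```

Calling strassen(A, B, C, n, 1) recurses all the way down to 1×1 blocks, while a larger cutoff such as 64 hands small blocks to the plain triple loop. Note that every level allocates and copies 17 half-size buffers, which is work the naive matmul never does.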

2.0 Comparing Naive Matmul and Strassen’s Algorithm

This section compares accuracy and speed. We check results using this function, where epsilon is our error margin:
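A comparison routine of this kind might look like the sketch below. The names matrices_match and max_abs_diff are hypothetical, and the absolute-difference test is our assumption about how epsilon is applied.

```c
#include <stddef.h>

/* Largest absolute element-wise difference between two arrays of len floats. */
double max_abs_diff(const float *x, const float *y, size_t len) {
    double worst = 0.0;
    for (size_t i = 0; i < len; i++) {
        double d = (double)x[i] - (double)y[i];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}

/* Returns 1 if every pair of elements agrees to within epsilon, 0 otherwise.
   A NaN difference counts as a mismatch. */
int matrices_match(const float *x, const float *y, size_t len, double epsilon) {
    for (size_t i = 0; i < len; i++) {
        double d = (double)x[i] - (double)y[i];
        if (d < 0.0) d = -d;
        if (!(d <= epsilon)) return 0;
    }
    return 1;
}
```

Reporting max_abs_diff alongside the pass/fail result shows how far off the worst entry is, which is more informative than a single boolean when tuning epsilon.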

2.1 Integer vs Floating-Point Accuracy

We set our DataType macro to int or float:
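One way to wire up such a switch is sketched below. The USE_INT flag is a hypothetical name of ours; only the DataType macro itself comes from the article.

```c
/* Compile with -DUSE_INT to build the integer version
   (USE_INT is a hypothetical flag chosen for this sketch). */
#ifdef USE_INT
#define DataType int
#else
#define DataType float
#endif

/* Every matrix buffer is then declared through the macro, for example:
   DataType *A = malloc(n * n * sizeof(DataType)); */
```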

For integers, Strassen’s algorithm with epsilon equal to 1e-7 works great!

However, for floating point, larger matrices demand looser (larger) error margins.

As matrix size increases, then we expect fewer correct decimal places
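One mechanism that can contribute to this loss of digits is absorption: when a large and a small float are added, the small one's low bits are rounded away. Strassen's algorithm adds and subtracts whole submatrices before multiplying, so it has more chances to hit this than the naive loop. A tiny illustration, with absorb being a name we made up:

```c
/* Returns (big + small) - big, computed in single precision.
   Mathematically the answer is small; in float it may not be. */
float absorb(float big, float small) {
    volatile float sum = big + small;   /* volatile forces rounding to float */
    volatile float diff = sum - big;
    return diff;
}
```

With big = 1e8f and small = 1.0f the result is 0.0f rather than 1.0f, because neighbouring floats near 1e8 are 8 apart. This is an illustration of the mechanism only, not a measurement of the article's runs.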

For instance, for 512*512 matmuls, only 1 decimal place could be correct:

Naive matmul and Strassen’s comparison for 512*512 matmuls. Only one decimal place is correct for Strassen’s

Section conclusion: Strassen’s algorithm experiences extreme floating-point error accumulation compared to naive matrix multiplication.

2.2 Speed Comparison

Strassen’s algorithm should be faster in practice, right? Wrong. For a 512*512 matrix, our implementation takes 10 seconds while naive matmul takes 0.9 seconds.
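Timings like these can be collected with a small harness. The sketch below assumes CPU time from the standard clock() function is an acceptable measure; time_matmul and the matmul_fn signature are our own names.

```c
#include <stddef.h>
#include <time.h>

/* Any matmul with this signature can be timed. */
typedef void (*matmul_fn)(const float *A, const float *B, float *C, size_t n);

/* Reference implementation to time: the plain triple loop. */
void naive_matmul(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* CPU seconds consumed by one call of fn, measured with clock(). */
double time_matmul(matmul_fn fn, const float *A, const float *B, float *C, size_t n) {
    clock_t start = clock();
    fn(A, B, C, n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A single run is noisy, so it is worth repeating each measurement a few times and comparing the fastest runs. Compiler optimization flags also change the gap considerably.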

Strassen is 10 times slower and less precise when implemented out-of-the-box

Strassen’s speed depends on the silliest hyperparameters. For instance, our recursion’s naive base case is pretty awful:

Naive base case for Strassen’s algorithm makes it 10 times slower than naive matmul

Let’s stop recursion at matrices of dimension 64 instead:

New recursion base case makes Strassen 2 times faster than naive matmul and improves accuracy by a lot

Two rather unintuitive observations occur to us:

Up to dimension ~64, it’s much faster to perform a naive matmul than use Strassen’s algorithm (which should be faster on paper).
Exiting Strassen’s recursion early improves accuracy and makes it faster than naive matmul.
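Counting scalar operations gives some intuition for these observations. The sketch below assumes 7 recursive products and 18 half-size additions or subtractions per level, with a naive base case at or below the cutoff; the function names are ours.

```c
/* Scalar multiplications: n^3 in the naive base case,
   otherwise 7 recursive half-size products. */
unsigned long long strassen_mults(unsigned long long n, unsigned long long cutoff) {
    if (n <= cutoff || n <= 1) return n * n * n;
    return 7ULL * strassen_mults(n / 2, cutoff);
}

/* Scalar additions and subtractions: n^2 (n - 1) in the naive base case,
   otherwise 7 recursive calls plus 18 half-size matrix additions. */
unsigned long long strassen_adds(unsigned long long n, unsigned long long cutoff) {
    if (n <= cutoff || n <= 1) return n * n * (n - 1);
    unsigned long long h = n / 2;
    return 7ULL * strassen_adds(h, cutoff) + 18ULL * h * h;
}
```

For n = 512, the naive matmul needs 134,217,728 multiplications, a cutoff of 64 needs 89,915,392, and recursing down to 1×1 needs 40,353,607. The additions move the other way: at n = 4, full recursion performs 198 additions against 48 for the naive loop. These counts ignore the allocation and copying each level performs, so they hint at the trade-off rather than predict the timings.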

3.0 Conclusion

These observations guide our conclusion: compiler engineers are better off writing a naive matmul than using Strassen’s algorithm.


References

Strassen, V. (1969). Gaussian elimination is not optimal. Numer. Math. 13, 354–356.

He, A. & Williams, E. (2023). Computational Complexity of Matrix Multiplication. Cornell CS 6810 Fall 2023.

LeetArxiv
