LowSpecEng 3 hours ago

OP here.

I'm a student working on an older laptop without a dedicated GPU. I wanted to run neural network experiments that require large matrix multiplications (2000x2000 and up), but standard numpy.dot (backed by Intel MKL on my install) was the bottleneck in my iteration time.

I realized that for my specific inference tasks (and fuzzy clustering), I didn't need full FP32 precision.

I wrote a C99 kernel (exposed to Python via ctypes) that does Monte Carlo Outer-Product Sampling. Instead of performing all N^3 multiply-adds, it samples k column/row index pairs uniformly at random and sums the corresponding rescaled outer products, which gives an unbiased estimate of the true product. It uses OpenMP for parallelism and is laid out for L1/L2 cache locality.
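The actual kernel is C, but in NumPy terms the estimator it implements looks roughly like this (a sketch of the math, not the code from the repo):

```python
import numpy as np

def approx_matmul(A, B, k, rng):
    """Monte Carlo outer-product sampling with a uniform distribution.

    Picks k column/row index pairs (with replacement), sums the sampled
    outer products, and rescales by n/k so the estimator is unbiased:
    E[C_hat] = A @ B.
    """
    n = A.shape[1]                    # inner (shared) dimension
    idx = rng.integers(0, n, size=k)  # k indices sampled uniformly
    # A[:, idx] @ B[idx, :] sums the k sampled outer products in one call
    return (n / k) * (A[:, idx] @ B[idx, :])

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 200))

C_exact = A @ B
C_approx = approx_matmul(A, B, k=2000, rng=rng)
rel_err = np.linalg.norm(C_approx - C_exact) / np.linalg.norm(C_exact)
print(f"relative Frobenius error: {rel_err:.3f}")
```

The C version just does the same sum with an OpenMP parallel loop over the sampled pairs instead of one big BLAS call.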

The result: on my i5 machine it gets a ~4.1x speedup over NumPy at roughly 5-10% relative error (tunable via the sampling rate).
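To give a feel for the accuracy knob, here's a pure-NumPy mockup of the same estimator (again, not the shipped kernel) showing how the error shrinks as you raise the number of sampled pairs k:

```python
import numpy as np

def approx_matmul(A, B, k, rng):
    # Uniform outer-product sampling, rescaled by n/k to stay unbiased.
    n = A.shape[1]
    idx = rng.integers(0, n, size=k)
    return (n / k) * (A[:, idx] @ B[idx, :])

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 200))
C = A @ B

errs = {}
for k in (100, 500, 2000, 8000):
    C_hat = approx_matmul(A, B, k, rng)
    errs[k] = np.linalg.norm(C_hat - C) / np.linalg.norm(C)
    print(f"k={k:5d}  relative error ~ {errs[k]:.3f}")
```

For uniform sampling the error falls off on the order of sqrt(n/k); sampling pairs with probability proportional to the column/row norms (the Drineas-Kannan-Mahoney scheme) tightens the bound, at the cost of one extra pass to compute the norms.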

It's obviously not for scientific simulation or finance, but for stochastic ML approaches, it feels like a free hardware upgrade.

The binary is in the repo if you want to test the speedup. I'm curious whether this approach (probabilistic BLAS) is used in production anywhere else.