Dubbed Linear-Complexity Multiplication (L-Mul), the technique reduces AI model power consumption by replacing energy-intensive floating-point multiplications with simpler integer additions. The method promises significant energy savings without compromising accuracy, but it requires specialised hardware to fully realise its benefits.
L-Mul approximates intensive floating-point multiplications: instead of multiplying 123.45 by 67.89 outright, it breaks the operation down into smaller, more manageable steps built on addition.
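To make that concrete, here is a minimal Python sketch of the idea; the frexp decomposition and the fixed correction constant are illustrative assumptions rather than the authors' exact kernel. Each number is split into an exponent and a mantissa, and the product is estimated by adding mantissas and exponents plus a small correction, skipping the full mantissa multiplication:

```python
import math

def l_mul_approx(x: float, y: float, correction: float = 0.0625) -> float:
    """Approximate x * y by adding exponents and mantissas (L-Mul-style sketch)."""
    if x == 0.0 or y == 0.0:
        return 0.0
    mx, ex = math.frexp(abs(x))      # |x| = mx * 2**ex, with 0.5 <= mx < 1
    my, ey = math.frexp(abs(y))
    fx = 2.0 * mx - 1.0              # rewrite |x| as (1 + fx) * 2**(ex - 1)
    fy = 2.0 * my - 1.0
    # Key step: (1 + fx) * (1 + fy) ~= 1 + fx + fy + correction, which trades
    # the fx * fy mantissa multiplication for additions and a constant offset.
    mantissa = 1.0 + fx + fy + correction
    sign = -1.0 if (x < 0.0) != (y < 0.0) else 1.0
    # In hardware the exponent scaling is just an integer add / bit shift.
    return sign * mantissa * 2.0 ** (ex + ey - 2)

print(l_mul_approx(123.45, 67.89))   # ~8405.8, vs the exact 8381.02 (~0.3% off)
```

The correction constant stands in, on average, for the dropped cross term between the two mantissas; in the bit-level formulation the paper targets, the whole estimate reduces to integer additions.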
An approximation like this speeds up the calculations and uses less energy while maintaining accuracy. The results seem promising: "Applying the L-Mul operation in tensor processing hardware can potentially reduce 95 per cent energy cost by element-wise floating point tensor multiplications and 80 per cent energy cost of dot products," the researchers claim.
In practical terms, a model built on this technique would need 95 per cent less energy for its element-wise tensor multiplications and 80 per cent less for its dot products.
The algorithm's impact extends beyond energy savings. In some cases, L-Mul outperforms current 8-bit standards, achieving higher precision while using significantly less bit-level computation.
Tests across natural language processing, vision tasks, and symbolic reasoning showed an average performance drop of just 0.07 per cent -- a negligible tradeoff for the potential energy savings.
Transformer-based models, the backbone of large language models like GPT, could benefit significantly from L-Mul. The algorithm integrates seamlessly into the attention mechanism, one of the most computationally intensive parts of these models. Tests on popular models such as Llama, Mistral, and Gemma even revealed accuracy gains on certain vision tasks.
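As a rough illustration of where it would slot in, here is a hypothetical sketch that swaps the element-wise multiplications inside attention's matrix products for the same L-Mul-style approximation; the l_mul, l_matmul, and attention helpers below are this article's own illustration, not the authors' released code:

```python
import numpy as np

def l_mul(a: np.ndarray, b: np.ndarray, correction: float = 0.0625) -> np.ndarray:
    """Element-wise L-Mul-style approximate multiply (illustrative sketch)."""
    ma, ea = np.frexp(np.abs(a))     # |a| = ma * 2**ea, with 0.5 <= ma < 1
    mb, eb = np.frexp(np.abs(b))
    mantissa = (2 * ma - 1) + (2 * mb - 1) + 1 + correction
    # np.sign() zeroes out any product involving 0; exponents simply add.
    return np.sign(a) * np.sign(b) * mantissa * np.exp2(ea + eb - 2)

def l_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Matrix product whose multiplies use l_mul; the summations stay exact."""
    return l_mul(A[:, :, None], B[None, :, :]).sum(axis=1)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention with L-Mul-approximated matrix products."""
    scores = l_matmul(Q, K.T) / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return l_matmul(weights, V)

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4, 8))
print(np.max(np.abs(l_matmul(Q, K.T) - Q @ K.T)))  # modest, input-dependent error
print(attention(Q, K, V).shape)                    # (4, 8): drop-in compatible
```

Because every multiply inside the two matrix products goes through l_mul while the accumulations remain ordinary additions, this is the kind of substitution the researchers describe for the attention mechanism.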
At an operational level, L-Mul's advantages become even more apparent. The research shows that multiplying two float8 numbers (a format many AI models operate in today) takes 325 operations, while L-Mul needs only 157 -- less than half.
"L-Mul is both more efficient and more accurate than fp8 multiplication," the study concludes. But nothing is perfect, and the technique has one major Achilles' heel: it calls for a particular type of hardware, and today's chips aren't optimised to take full advantage of it.
Plans for specialised hardware that natively supports L-Mul calculations may already be in motion.
"To unlock the full potential of our proposed method, we will implement the L-Mul and L-Matmul kernel algorithms on hardware level and develop programming APIs for high-level model design," the researchers say.