I came across a paper today describing Large Language Models that work without matrix multiplication. It’s a pretty big deal, because it undermines Nvidia’s whole competitive advantage: their ability to build GPUs that multiply matrices at insane speeds. Here are my “pre-read” thoughts.
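For context on what “without matrix multiplication” could even mean: my guess (and it’s only a guess until I actually read the paper) is that the trick is something like ternary weights, where every entry of a weight matrix is -1, 0, or +1, so the “multiplication” collapses into additions, subtractions, and skips. Here’s a minimal sketch of that idea, with made-up names and numbers, not the paper’s actual method:

```python
import numpy as np

def ternary_matvec(W, x):
    """Compute y = W @ x using only additions and subtractions,
    assuming every entry of W is -1, 0, or +1 (ternary weights)."""
    y = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                y[i] += x[j]   # +1 weight: just add the input
            elif W[i, j] == -1:
                y[i] -= x[j]   # -1 weight: just subtract it
            # 0 weight: skip entirely
    return y

# Tiny usage example with made-up numbers
W = np.array([[1, 0, -1],
              [-1, 1, 0]])
x = np.array([0.5, -2.0, 3.0])
print(ternary_matvec(W, x))    # matches np.dot(W, x), but no multiply happened
```

If something like this is what the paper does, then the hardware story changes: you no longer need silicon that excels at fused multiply-adds, just silicon that excels at adds.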
The first question that arises is, “how far down can we really scale this?” Stripping away the matrix-multiplication requirement is already amazing, but how much further can we keep removing computational requirements without getting rid of the architecture itself?
Another question is “what are the drawbacks?” The brief summary I read claims performance on par with typical transformers, but surely there is a catch? It doesn’t require the same amount of memory or compute, so what goes up in exchange for these advantages?
Other questions will come as I read this paper, but regardless, it’s insanely cool.
The world is moving a little too fast, even for me
-Pardan