In tests on an Amazon search dataset that included some 70 million queries and more than 49 million products, Shrivastava, Medini and colleagues showed that their approach, "merged-average classifiers via hashing" (MACH), required a fraction of the training resources of some state-of-the-art commercial systems.
Shrivastava, an assistant professor of computer science at Rice, said: "Our training times are about 7-10 times faster, and our memory footprints are 2-4 times smaller than the best baseline performances of previously reported large-scale, distributed deep-learning systems. Product search is challenging, in part, because of the sheer number of products. There are about a million English words, for example, but there are easily more than 100 million products online."
MACH takes a different approach from current training algorithms. Shrivastava explained it with a thought experiment in which the 100 million products are randomly divided into three classes, which take the form of buckets.
"I'm mixing, let's say, iPhones with chargers and T-shirts all in the same bucket. It's a drastic reduction from 100 million to three."
In the thought experiment, the 100 million products are randomly sorted into three buckets in two different worlds, which means that products can wind up in different buckets in each world. A classifier is trained to assign searches to the buckets rather than the products inside them, meaning the classifier only needs to map a search to one of three classes of product.
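The thought experiment can be made concrete with a small sketch. This is an illustration, not the researchers' code: the function names, the toy sizes and the seed are all hypothetical. It also assumes a decoding step that the passage above does not spell out, namely that a product is a candidate when it sits in the predicted bucket in every world.

```python
import random

def assign_buckets(num_products, num_buckets, num_worlds, seed=0):
    """Build one random product -> bucket table per world (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        [rng.randrange(num_buckets) for _ in range(num_products)]
        for _ in range(num_worlds)
    ]

def candidates(tables, predicted_buckets):
    """Products whose bucket matches the predicted bucket in every world.

    Assumption: narrowing by intersection is how the bucket predictions
    are turned back into products in this toy version.
    """
    num_products = len(tables[0])
    return [
        p for p in range(num_products)
        if all(table[p] == b for table, b in zip(tables, predicted_buckets))
    ]

# Toy sizes standing in for "100 million products, three buckets, two worlds".
tables = assign_buckets(num_products=12, num_buckets=3, num_worlds=2, seed=0)

# Pretend each world's classifier is perfect and returns the bucket
# that actually holds product 5.
target = 5
predicted = [table[target] for table in tables]
matching = candidates(tables, predicted)
print(f"predicted buckets: {predicted}, candidate products: {matching}")
```

Each world on its own leaves about a third of the products as candidates. Requiring agreement across both worlds shrinks that set, and adding more worlds shrinks it further.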
In their experiments with Amazon's training database, Shrivastava, Medini and colleagues randomly divided the 49 million products into 10,000 classes, or buckets, and repeated the process 32 times. That reduced the number of parameters in the model from around 100 billion to 6.4 billion. Training the model also took less time and less memory than some of the best reported training times on models with a comparable number of parameters, including Google's Sparsely-Gated Mixture-of-Experts (MoE) model, Medini said.
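A rough sketch shows why the bucketed setup is so much smaller and how the separate repetitions might be combined. The averaging step is an assumption taken from the method's name ("merged-average"); the function name and the toy probabilities are hypothetical. The arithmetic covers only the number of classifier outputs, since the full parameter counts depend on model dimensions the article does not give.

```python
def merged_average_scores(tables, bucket_probs):
    """Score each product by averaging, across repetitions, the probability
    that repetition's classifier gave to the product's bucket.

    tables[r][p]       -> bucket of product p in repetition r
    bucket_probs[r][b] -> probability of bucket b from repetition r
    """
    num_reps = len(tables)
    num_products = len(tables[0])
    return [
        sum(bucket_probs[r][tables[r][p]] for r in range(num_reps)) / num_reps
        for p in range(num_products)
    ]

# Toy example: 4 products, 2 buckets, 2 repetitions (made-up values).
tables = [[0, 0, 1, 1], [0, 1, 0, 1]]
bucket_probs = [[0.9, 0.1], [0.2, 0.8]]
scores = merged_average_scores(tables, bucket_probs)
best = max(range(len(scores)), key=scores.__getitem__)

# Output sizes with the figures quoted in the article.
num_products = 49_000_000          # "more than 49 million products"
num_buckets = 10_000
num_repetitions = 32
bucket_outputs = num_buckets * num_repetitions   # 320,000 outputs in total
reduction = num_products / bucket_outputs        # roughly 153x fewer outputs
print(best, bucket_outputs, round(reduction))
```

In the toy example only one product lands in the highest-scoring bucket in both repetitions, so it receives the highest average score.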
He said MACH's most significant feature is that it requires no communication between parallel processors. In the thought experiment, that is what's represented by the separate, independent worlds.