Source: analyticsindiamag.com
There are two fundamentally different and complementary ways of accelerating machine learning workloads:
1. By vertical scaling, or scaling up, where one adds more resources to a single machine
2. By horizontal scaling, or scaling out, where one adds more nodes to the system
In recent work, researchers at Delft University of Technology, Netherlands, describe in detail the current state of the art in distributed machine learning and how these systems affect computation latency and other attributes.
The advantages of distributed ML are many, and a full treatment is beyond the scope of this article. Here, we list popular toolkits and techniques that enable distributed machine learning:
MapReduce and Hadoop
MapReduce is a framework developed at Google for processing data in a distributed setting. During the map phase, all data is split into (key, value) tuples; in the subsequent reduce phase, these tuples are grouped by key to generate a single output value per key. MapReduce, and its open-source implementation Hadoop, rely heavily on the distributed file system in every phase of execution.
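To make the two phases concrete, here is a minimal, single-process sketch of the MapReduce idea in plain Python (the function names and the word-count task are illustrative, not part of Hadoop's API). In a real Hadoop job, the map and reduce functions run on different nodes and exchange data through the distributed file system.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (key, value) tuple for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/group: collect all values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce: produce a single output value per key.
    return {key: sum(values) for key, values in grouped.items()}

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(map_phase(docs)))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```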
Apache Spark
The linear-algebra transformations at the heart of many machine learning algorithms are highly iterative in nature, and the map-and-reduce paradigm is not ideal for such iterative tasks. Apache Spark was developed to address this.
The key difference is that MapReduce tasks must write all intermediate data to disk between steps, whereas Spark can keep the data in memory, saving expensive disk reads on every iteration.
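The difference matters most for iterative algorithms. The sketch below, assuming a local PySpark installation, caches a toy dataset in memory once and then runs many gradient-descent passes over it; with classic MapReduce, each pass would have to re-read the data from disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-example").getOrCreate()
sc = spark.sparkContext

# Toy regression data: (x, y) pairs with y = 2x + 1, x in [0, 1).
points = sc.parallelize([(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)])
points.cache()          # keep the dataset in memory across iterations
n = points.count()

w, b, lr = 0.0, 0.0, 0.5
for _ in range(200):
    # Each pass reuses the cached in-memory RDD; no disk round trip per iteration.
    gw, gb = points.map(
        lambda p: ((w * p[0] + b - p[1]) * p[0], (w * p[0] + b - p[1]))
    ).reduce(lambda a, c: (a[0] + c[0], a[1] + c[1]))
    w -= lr * gw / n
    b -= lr * gb / n

print(f"fitted w={w:.2f}, b={b:.2f}")   # should approach w=2, b=1
spark.stop()
```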
Baidu AllReduce
AllReduce uses common high-performance computing technology to iteratively train stochastic gradient descent models on separate mini-batches of the training data. Baidu claims a linear speedup when applying this technique to train deep learning networks.
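Baidu's contribution popularised the ring-allreduce pattern for deep learning, but the underlying operation is ordinary MPI. A minimal sketch of data-parallel SGD using mpi4py follows (the model, here a single weight vector, and the gradient function are placeholders): each worker computes a gradient on its own mini-batch, the gradients are summed across all workers with Allreduce, and every worker applies the same averaged update.

```python
# Run with e.g.: mpirun -np 4 python allreduce_sgd.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)    # each worker sees different data
w = np.zeros(10)                          # all workers start from the same weights

def local_gradient(w, rng):
    # Placeholder: gradient of a least-squares loss on one random mini-batch.
    x = rng.standard_normal((32, 10))
    y = x @ np.ones(10)
    return x.T @ (x @ w - y) / len(y)

lr = 0.1
for step in range(100):
    grad = local_gradient(w, rng)
    total = np.empty_like(grad)
    # Sum gradients from every worker, then divide by the number of workers.
    comm.Allreduce(grad, total, op=MPI.SUM)
    w -= lr * total / size

if rank == 0:
    print("final weights:", np.round(w, 2))  # should approach all-ones
```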
Horovod
Horovod, like Baidu's approach, adds a layer of AllReduce-based MPI training to TensorFlow. Horovod uses the NVIDIA Collective Communications Library (NCCL) for increased efficiency when training on NVIDIA GPUs. However, Horovod lacks fault tolerance and therefore suffers from the same scalability issues as Baidu's approach.
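In practice, Horovod requires only a few additions to an ordinary Keras training script. A minimal sketch is below, assuming Horovod is installed with its TensorFlow/Keras integration; exact API details vary slightly between Horovod and TensorFlow versions.

```python
# Run with e.g.: horovodrun -np 4 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each local worker to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers and wrap the optimizer so
# that gradients are averaged with an (NCCL-backed) allreduce at every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt,
              metrics=["accuracy"])

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(x_train, y_train, batch_size=64, epochs=1,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```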
Caffe2
This deep learning framework distributes machine learning through AllReduce algorithms. It does this by using NCCL between GPUs on a single host, and custom code between hosts based on Facebook’s Gloo library.
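Caffe2 has since been folded into PyTorch, and the same Gloo library backs PyTorch's distributed package. As a rough illustration of the idea (using torch.distributed rather than Caffe2's own API), the sketch below averages gradients across processes with an allreduce over the Gloo backend.

```python
# Launch with e.g.: torchrun --nproc_per_node=4 gloo_allreduce.py
import torch
import torch.distributed as dist

def main():
    # The Gloo backend handles inter-process (and inter-host) communication;
    # NCCL would be chosen instead for GPU-to-GPU transfers.
    dist.init_process_group(backend="gloo")
    world_size = dist.get_world_size()

    model = torch.nn.Linear(10, 1)
    data, target = torch.randn(32, 10), torch.randn(32, 1)

    loss = torch.nn.functional.mse_loss(model(data), target)
    loss.backward()

    # Average each parameter's gradient across all workers.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size

    if dist.get_rank() == 0:
        print("gradients averaged across", world_size, "workers")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```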
Microsoft Cognitive Toolkit
This toolkit offers multiple methods of data-parallel distribution. Many of them use the Ring AllReduce approach described above, making the same trade-off of linear scalability over fault tolerance.