The Apache Software Foundation (ASF) recently announced that SINGA, a framework for distributed deep learning, has graduated to top-level project (TLP) status, signifying the project’s maturity and stability. SINGA has already been adopted by companies in several sectors, including banking and healthcare.
Originally developed at the National University of Singapore, SINGA joined ASF’s incubator in March 2015. SINGA provides a framework for distributing the work of training deep-learning models across a cluster of machines, reducing the time needed to train a model. In addition to its use as a platform for academic research, SINGA has been used in commercial applications by Citigroup and CBRE, as well as in several healthcare applications, including an app to aid patients with pre-diabetes.
The success of deep-learning models has been driven by very large datasets, such as ImageNet with its millions of labeled images, and by increasingly complex models; Google’s BERT natural-language model contains up to 340 million parameters and is trained on a corpus of about 3.3 billion words. Training at this scale often requires hours, if not days, to complete. To speed up the process, researchers have turned to parallel computing, which distributes the work across a cluster of machines. According to Professor Beng Chin Ooi, leader of the research group that developed SINGA:
It is essential to scale deep learning via distributed computing as…deep learning models are typically large and trained over big datasets, which may take hundreds of days using a single GPU.
There are two broad parallelism strategies for distributed deep learning: data parallelism, where multiple machines work on different subsets of the input data, and model parallelism, where multiple machines train different sections of the neural-network model. SINGA supports both strategies, as well as a combination of the two. Both introduce communication and synchronization overhead to coordinate the work among the machines in the cluster, and SINGA implements several optimizations to minimize that overhead.
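To make the data-parallel strategy concrete, here is a minimal sketch in plain NumPy; it is not SINGA’s actual API, and the toy linear model, worker count, and learning rate are illustrative assumptions. Each simulated worker computes a gradient on its own shard of the batch, and the gradients are averaged, the equivalent of an all-reduce, before a single shared parameter update.

```python
# Sketch of synchronous data parallelism (plain NumPy; not SINGA's API).
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy linear model: predict y = X @ w by minimizing mean squared error.
n_samples, n_features, n_workers = 512, 8, 4
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w
w = np.zeros(n_features)  # parameters are replicated on every worker

def local_gradient(X_shard, y_shard, w):
    """MSE gradient computed by one worker on its own shard of the batch."""
    residual = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ residual / len(y_shard)

lr = 0.05
for step in range(200):
    # Data parallelism: each worker sees a different slice of the batch.
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [local_gradient(X_s, y_s, w) for X_s, y_s in shards]
    # The coordination overhead lives here: gradients must be communicated
    # and averaged across all workers before anyone applies the update.
    w -= lr * np.mean(grads, axis=0)

print("distance from true weights:", np.linalg.norm(w - true_w))
```

Model parallelism would instead place different sections of the network on different machines and exchange activations rather than gradients, so combining the two strategies, as SINGA allows, is a matter of balancing those two kinds of communication cost.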
Acceptance as a top-level project means that SINGA has passed several milestones related to software quality and community, which in theory makes the software more attractive as a solution. However, one possible barrier to adoption is that instead of building on an existing API for modeling neural networks, such as Keras, SINGA’s designers chose to implement their own. By contrast, the Horovod framework open-sourced by Uber lets developers distribute training of existing models written for the two most popular deep-learning frameworks, TensorFlow and PyTorch; PyTorch in particular is the framework used in a majority of recent research papers.
ASF has several other top-level distributed data-processing projects that support machine learning, including Spark and Ignite. Unlike these, SINGA is designed specifically for deep learning’s large models. ASF is also home to MXNet, a deep-learning framework similar to TensorFlow and PyTorch, which is still in incubator status. AWS touted MXNet as its framework of choice in late 2016, but MXNet still hasn’t achieved widespread popularity, hovering at just under 2% in KDnuggets’ polls.
Apache SINGA version 2.0 was released in April 2019. The source code is available on GitHub, and open issues can be tracked in SINGA’s Jira project. According to ASF, upcoming features include “SINGA-lite for deep learning on edge devices with 5G, and SINGA-easy for making AI usable by domain experts (without deep AI background).”