9 classic papers on distributed machine learning systems; the golden decade of deep learning hardware | AI frontier updates…

"AI System Frontier" mainly recommends content on AI systems, compilers, large models, and hardware.

1. Nvidia Chief Scientist: The past, present and future of deep learning hardware

How exactly can deep learning hardware keep improving performance? Few people are better placed to answer this question than Nvidia Chief Scientist Bill Dally. In a talk given before the H100 GPU was released, he reviewed the state of deep learning hardware. In his view, we have nearly played out the cards in our hand, which means we have to start developing new technologies.

He highlights four directions worth focusing on: first, new numerical representations, such as logarithmic numbers and quantization schemes more ingenious than EasyQuant; second, continued research on sparsity; third, memory circuits and communication circuits; and finally, improvements to existing process technology.
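
As a rough illustration of the first point (this toy example is mine, not from the talk), the appeal of a logarithmic number representation is that multiplication in the linear domain becomes addition of the stored logarithms, which is far cheaper in hardware than a full multiplier:

```python
import math

# Toy illustration of a logarithmic number system: store log2(|x|) instead of x,
# so multiplying two values reduces to adding their stored exponents.
def to_log(x):
    return math.log2(x)          # store the base-2 logarithm (positive values only here)

def log_mul(lx, ly):
    return lx + ly               # multiply in the linear domain == add in the log domain

def from_log(lx):
    return 2.0 ** lx

a, b = 3.0, 5.0
product = from_log(log_mul(to_log(a), to_log(b)))
print(product)                   # ~15.0, recovered via an addition instead of a multiplication
```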

Link: https://mp.weixin.qq.com/s/ofWCgv-ksJjqGH0qlbW8nw

2. Large Scale Distributed Deep Networks

Link: https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf

Authors: Paul Tucker, Ke Yang, Quoc V. Le, Andrew Y. Ng

Published: 2012

Abstract: Recent advances in unsupervised feature learning and deep learning have shown that large models can greatly improve performance if they are successfully trained. This paper discusses how to train deep neural networks with billions of parameters using tens of thousands of CPU cores. The authors developed the DistBelief software framework, which can train large models using computing clusters with thousands of machines.

The authors also developed two algorithms for large-scale distributed training for the framework: (i) Downpour SGD, an asynchronous stochastic gradient descent method that supports large-scale model replication; (ii) Sandblaster — a framework that supports multiple distributed batch optimizers, including the distributed implementation of L-BFGS. Both Downpour SGD and Sandblaster L-BFGS can increase the scale and speed of deep neural network training.

Using this framework, the authors trained a deep neural network 30 times larger than any reported in the prior literature and achieved SOTA performance on the ImageNet object recognition task, which covers 16 million images and 21,000 object categories. In addition, they applied the same methods to a commercial speech recognition service, showing that they also speed up the training of smaller deep networks. Although the methods target large-scale neural network training, the underlying algorithms apply to any gradient-based machine learning algorithm.
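
To make the asynchronous parameter-server pattern behind Downpour SGD concrete, here is a minimal single-process sketch. The class and function names are hypothetical; the real DistBelief system shards parameters across many server machines and runs a full model replica per worker.

```python
import threading
import numpy as np

# Minimal sketch of the Downpour-SGD pattern: asynchronous workers pull a
# (possibly stale) copy of the parameters, compute a stochastic gradient,
# and push it back without synchronizing with each other.
class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()            # workers fetch a snapshot of the parameters

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad        # apply each update as soon as it arrives

def worker(ps, X, y, steps=200):
    for _ in range(steps):
        w = ps.pull()
        i = np.random.randint(len(X))
        grad = (X[i] @ w - y[i]) * X[i]     # stochastic gradient of the squared loss
        ps.push(grad)                       # no coordination with other workers

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(5.0)
y = X @ true_w
ps = ParameterServer(dim=5)
threads = [threading.Thread(target=worker, args=(ps, X, y)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(ps.w)                                  # should approach [0, 1, 2, 3, 4]
```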

3. Deep Learning with COTS HPC systems

Link: http://proceedings.mlr.press/v28/coates13.pdf

Authors: Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, Andrew Ng

Published: 2013

Abstract: Scaling up deep learning algorithms improves their performance on benchmark tasks and enables them to discover more complex high-level features. Recent training of very large neural networks (with more than one billion parameters) has relied on cloud-like computing infrastructure and thousands of CPU cores.

This paper describes the technical details and performance of a system built on COTS (commercial off-the-shelf) HPC technology. Using a cluster of GPU servers with InfiniBand interconnect and MPI, the system can train a 1-billion-parameter neural network on only 3 machines in a few days, and a network with more than 11 billion parameters on only 16 machines. Since small computing clusters are much easier to obtain, this technique lets far more practitioners take part in research on very large neural networks.

(In this context, scaling up means using GPUs instead of CPUs to increase computing power; the opposite is scaling out, which means adding machines, i.e., enlarging the cluster.)
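
As a rough illustration of the MPI-style communication such a cluster relies on, the sketch below averages gradients across ranks with an allreduce. This is a generic data-parallel step rather than the specific model-parallel communication scheme of the paper, and it assumes mpi4py and an MPI runtime are available.

```python
# Gradient averaging with MPI allreduce. Run with an MPI launcher, e.g.:
#   mpiexec -n 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_grad = np.full(4, float(rank))          # stand-in for a locally computed gradient
avg_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
avg_grad /= size                              # every rank now holds the averaged gradient
print(f"rank {rank}: {avg_grad}")
```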

4. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

Link: https://assets.website-files.com/6241e60ecd4aa2049d61387c/6288655114917759436b392e_LightLDA.pdf

Authors: Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, Eric P. Xing

Published: 2013

Abstract: This paper proposes a parameter server system for distributed machine learning. The system computes under the Stale Synchronous Parallel (SSP) model, which preserves the correctness of the computation while maximizing the time workers spend on useful computation. The parameter server provides an easy-to-use shared interface for reading and writing the values (parameters and variables) of machine learning models.

In the SSP model, distributed workers may read older versions of values from a local cache instead of waiting to fetch the latest version from central storage. This significantly reduces worker waiting time and leaves more time for computation. In addition, the SSP model bounds the maximum staleness of the values read, which preserves the correctness of the machine learning algorithm.

The paper proves the correctness of computation under the SSP model and provides experimental evidence that, across a variety of machine learning problems, SSP converges faster than both fully synchronous and fully asynchronous schemes.

(Note: Nodes in such a system fall into two categories: parameter servers and workers. Parameter servers store the model parameters, while workers compute the gradients of those parameters.)
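
To make the bounded-staleness rule concrete, here is a minimal sketch with hypothetical names: a worker may keep computing against cached values only while the slowest worker is within the staleness bound; otherwise it must wait for fresher values from the server.

```python
# Minimal sketch of the Stale Synchronous Parallel (SSP) rule (illustrative only).
class SSPServer:
    def __init__(self, num_workers, staleness):
        self.staleness = staleness
        self.clocks = [0] * num_workers      # last completed iteration per worker
        self.params = {}                     # the authoritative key -> value store

    def can_use_cache(self, worker_clock):
        # A read against stale cached values is allowed only within the bound.
        return worker_clock - min(self.clocks) <= self.staleness

    def clock_tick(self, worker_id):
        self.clocks[worker_id] += 1          # the worker finished one iteration

server = SSPServer(num_workers=3, staleness=2)
server.clock_tick(0); server.clock_tick(0); server.clock_tick(0)   # worker 0 races ahead to clock 3
print(server.can_use_cache(worker_clock=3))   # False: slowest worker is at 0, gap 3 > 2, must wait
server.clock_tick(1)                          # worker 1 advances to clock 1
print(server.can_use_cache(worker_clock=3))   # still False: worker 2 remains at clock 0
```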

5. Asymptotically Exact, Embarrassingly Parallel MCMC

Link: https://arxiv.org/pdf/1311.4780.pdf

Authors: Willie Neiswanger, Chong Wang, Eric P. Xing

Published: 2014

Abstract: In machine learning, the communication cost of meeting synchronization requirements can significantly slow down parallel algorithms. This paper introduces a parallel Markov chain Monte Carlo (MCMC) algorithm that divides the data into multiple subsets and processes them independently. First, the data are arbitrarily partitioned and distributed across as many machines as desired. Each machine then draws samples from the posterior distribution given its own data subset, using any standard MCMC method such as Gibbs sampling. Finally, the samples from all machines are combined to form samples from the full-data posterior distribution.

This embarrassingly parallel algorithm lets each machine run independently on its own data subset, with no communication until the final combination stage. The paper proves that the algorithm yields asymptotically exact samples, and provides experimental evidence that it enables parallel burn-in and parallel sampling for a variety of models; a toy sketch follows the notes below. (Notes:

"Embarrassingly parallel": the parallel subtasks do not depend on one another's data and require no communication between them.

"Burn-in": at the start of sampling, the Markov chain has not yet reached its stationary distribution; this initial phase is called burn-in, and samples collected during it should be discarded.)
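
Here is a toy sketch of the three steps (partition, sample the subposteriors independently, combine), using a Gaussian-mean model for which the product-of-Gaussians combination is exact. Names are illustrative only; a real application would run a generic MCMC sampler on each shard.

```python
import numpy as np

# Toy embarrassingly parallel MCMC: infer the mean of a Gaussian with known
# unit variance. Each "machine" samples its subposterior independently; the
# draws are then combined with the parametric (product-of-Gaussians) rule.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=9000)
shards = np.array_split(data, 3)                      # step 1: partition the data

def subposterior_samples(shard, n_samples=20000):
    # With a flat prior (so prior^(1/M) is also flat) and known unit variance,
    # the subposterior is N(mean(shard), 1/len(shard)); sample it directly.
    return rng.normal(shard.mean(), np.sqrt(1.0 / len(shard)), size=n_samples)

subs = [subposterior_samples(s) for s in shards]      # step 2: run independently, no communication

# Step 3: combine. A product of Gaussians is Gaussian with precision equal to
# the sum of precisions and mean equal to the precision-weighted average.
precisions = np.array([1.0 / np.var(s) for s in subs])
means = np.array([np.mean(s) for s in subs])
combined_var = 1.0 / precisions.sum()
combined_mean = combined_var * (precisions * means).sum()
print(combined_mean, combined_var)   # close to the full-data posterior N(mean(data), 1/len(data))
```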

6. LightLDA: Big Topic Models on Modest Compute Clusters

Link: https://assets.website-files.com/6241e60ecd4aa2049d61387c/6288655114917759436b392e_LightLDA.pdf

Authors: Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P. Xing, Tie-Yan Liu, Wei-Ying Ma

Published: 2015

Abstract: It is generally believed that building large machine learning programs (such as big topic models or deep neural networks with trillions of parameters and training examples) requires industrial-scale clusters with thousands of nodes, which most machine learning practitioners and academic researchers do not have. This paper discusses how to build topic models on a web-scale corpus using a much smaller cluster.

Using a small cluster of only 8 machines, the authors trained a large topic model with 1 million topics and a 1-million-word vocabulary (about 1 trillion parameters) on a corpus of 200 billion tokens, a model scale that even large clusters with thousands of nodes had not reached.

The distributed strategy proposed in the paper is an instance of the model-parallel and data-parallel programming model underlying the general-purpose distributed machine learning framework Petuum, and it is implemented on the open-source Petuum system. Experiments show that the strategy can train a variety of models on small clusters and, compared with other methods, cuts training time roughly in proportion as the cluster grows.

7. SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs

Link: https://arxiv.org/pdf/1610.02496.pdf

Authors: Kaiwei Li, Jianfei Chen, Wenguang Chen, Jun Zhu

Published: 2017

Abstract: Latent Dirichlet Allocation (LDA) is a widely used tool for analyzing discrete count data such as text and images. LDA applications typically involve large datasets and large numbers of topics. Although distributed CPU systems have been used, GPU-based systems are preferable because of GPUs' higher computing power and memory bandwidth. However, existing GPU-based LDA systems cannot support large numbers of topics because they rely on dense data structures whose time and space complexity grow linearly with the number of topics.

This paper proposes SaberLDA, a GPU-based LDA system. SaberLDA adopts a sparsity-aware algorithm so that time complexity grows sublinearly with the number of topics, making it possible to learn a very large number of topics. To handle sparsity, the paper introduces a new data layout, a new warp-based sampling kernel, and an efficient sparse count matrix update algorithm that improves locality, raises GPU warp utilization, and reduces the memory footprint.

Experiments show that SaberLDA can learn 10,000 topics from datasets containing billions of tokens, two orders of magnitude more topics than previous GPU-based systems. With a single GPU, SaberLDA learns 10,000 topics from a billion-token dataset in a matter of hours, a task that previously required clusters of dozens of machines.
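
As a rough illustration of why sparsity makes the per-token cost sublinear in the number of topics (this is not SaberLDA's actual data layout or kernel), compare a dense per-document topic histogram with a sparse one:

```python
from collections import Counter

# With K topics, a dense per-document topic histogram costs O(K) to scan,
# while a sparse counter only touches the topics the document actually uses,
# so per-token work stays small even when K is huge.
K = 10_000
doc_topic_assignments = [3, 3, 17, 42, 42, 42, 905]   # topics assigned to this document's tokens

dense = [0] * K
for t in doc_topic_assignments:
    dense[t] += 1
nonzero_dense = sum(1 for c in dense if c)            # scanning all K entries: O(K)

sparse = Counter(doc_topic_assignments)               # only the ~4 used topics are stored
print(nonzero_dense, len(sparse))                     # both report 4, but the sparse scan is O(#used topics)
```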

8. STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning

Link: https://assets.website-files.com/6241e60ecd4aa2049d61387c/6288662506854dadbf2cb0a8_STRADS.pdf

Authors: Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, Eric P. Xing

Published: 2016

Abstract: Machine learning algorithms are often run on large-scale data. Distributed systems partition the data across machines, and each machine reads and updates the model parameters; this approach is called data parallelism. Another approach, model parallelism, partitions the model parameters themselves for non-shared parallel access and update, and periodically re-partitions them to ease communication. Model parallelism addresses two problems that data parallelism cannot solve: (1) parameters may be interdependent, so naive concurrent updates can introduce errors that slow convergence or even break the algorithm; and (2) different parameters converge at different rates, so a few slowly converging parameters can hold up the completion of the whole algorithm.

This paper proposes a programming approach called Scheduled Model Parallelism (SchMP), which schedules parameter updates according to the dependencies among parameters and their individual convergence rates, thereby accelerating the convergence of machine learning algorithms. To support SchMP at scale, the authors developed the STRADS distributed framework, which optimizes the throughput of SchMP programs.

In addition, the authors implemented four machine learning applications as SchMP programs and benchmarked them: LDA topic modeling, matrix factorization, sparse least-squares (Lasso) regression, and sparse logistic regression. SchMP programming lets each iteration make more progress, while STRADS raises iteration throughput. As a result, SchMP programs running on STRADS outperform non-model-parallel machine learning implementations: for example, SchMP LDA and SchMP Lasso converge 10x and 5x faster, respectively, than well-established baselines.
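
A hypothetical sketch of the scheduling idea (this is an illustration of the policy, not the STRADS implementation): in each round, select parameters that do not conflict with one another and that are still far from convergence, then update them in parallel.

```python
# Hypothetical scheduled-model-parallel round: prioritize parameters that are
# still changing the most, and never co-schedule parameters that depend on
# each other.
def schedule_round(progress, dependencies, max_parallel):
    """progress: param -> recent change magnitude (proxy for non-convergence).
    dependencies: param -> set of params it conflicts with."""
    chosen = []
    for p in sorted(progress, key=progress.get, reverse=True):
        if len(chosen) == max_parallel:
            break
        if all(p not in dependencies.get(q, set()) and q not in dependencies.get(p, set())
               for q in chosen):
            chosen.append(p)                 # safe to update concurrently with the others
    return chosen

progress = {"w1": 0.9, "w2": 0.8, "w3": 0.05, "w4": 0.7}
dependencies = {"w1": {"w2"}}                # w1 and w2 must not be updated together
print(schedule_round(progress, dependencies, max_parallel=2))   # -> ['w1', 'w4']
```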

9. Device Placement Optimization with Reinforcement Learning

Link: https://arxiv.org/pdf/1706.04972.pdf

Authors: Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, Jeff Dean

Published: 2017

Abstract: In recent years, the scale and compute requirements of neural network training and inference have grown. To meet these requirements, mixed hardware (such as CPUs and GPUs) is often combined into heterogeneous distributed environments. However, decisions about how to place different parts of a neural network model on different devices are usually made by human experts based on simple heuristics and intuition.

This paper proposes a method for optimizing the device placement of TensorFlow computation graphs. Its core is a sequence-to-sequence model that predicts which device should run each subset of operations in the graph. The measured execution time of the resulting placement is then used as a feedback signal to optimize the parameters of the sequence-to-sequence model.

Experiments show that the method finds non-trivial device placements for Inception-V3 on the ImageNet recognition task, for RNN LSTM language modeling, and for neural machine translation, and that the resulting placements run faster than those produced by hand-designed heuristics and conventional algorithmic methods.
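
The feedback loop can be sketched as a simple policy-gradient procedure. The sketch below replaces the paper's sequence-to-sequence policy with independent per-op softmaxes and simulates runtime, so it only illustrates the reinforcement signal, not the actual method.

```python
import numpy as np

# Sample a placement per op, "measure" its runtime, and reinforce faster placements.
rng = np.random.default_rng(0)
NUM_OPS, NUM_DEVICES = 6, 2
logits = np.zeros((NUM_OPS, NUM_DEVICES))                # learnable policy parameters

def simulated_runtime(placement):
    # Pretend even-indexed ops run faster on device 0 and odd-indexed on device 1.
    return sum(1.0 if placement[i] != i % 2 else 0.2 for i in range(NUM_OPS))

baseline = None
for step in range(500):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    placement = [rng.choice(NUM_DEVICES, p=probs[i]) for i in range(NUM_OPS)]
    runtime = simulated_runtime(placement)
    reward = -runtime                                    # faster placements get higher reward
    baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward
    advantage = reward - baseline
    for i, d in enumerate(placement):                    # REINFORCE update for each op's softmax
        grad = -probs[i]
        grad[d] += 1.0                                   # one_hot(d) - probs = d/d_logits of log prob
        logits[i] += 0.1 * advantage * grad

print(np.argmax(logits, axis=1))                         # should trend toward [0, 1, 0, 1, 0, 1]
```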

10. OneFlow: Redesign the Distributed Deep Learning Framework from Scratch

Link: https://arxiv.org/abs/2110.15032

Authors: Jinhui Yuan, Xinqi Li, Cheng Cheng, Juncheng Liu, Ran Guo, Shenghang Cai, Chi Yao, Fei Yang, Xiaodong Yi, Chuan Wu, Haoran Zhang, Jie Zhao

Published: 2021

Abstract: OneFlow is a new open-source distributed deep learning framework built on the SBP (split, broadcast, and partial-value) abstraction and the actor model. It supports various parallel paradigms, including data parallelism, model parallelism, and pipeline parallelism. Compared with existing frameworks such as PyTorch, OneFlow's SBP makes data-parallel and model-parallel programming easier, while its actor model provides a concise runtime mechanism for managing the complex dependencies imposed by resource constraints, data movement, and computation in distributed deep learning. Case studies and extensive experiments demonstrate that OneFlow is flexible, efficient, and scalable for training a variety of large DNN models, outperforming many existing frameworks.
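
A small sketch of what SBP looks like with OneFlow's global-tensor API (exact call details may vary across OneFlow versions and it must be launched across multiple devices, so treat this as illustrative):

```python
# Run with OneFlow's distributed launcher on 2 devices, e.g.:
#   python -m oneflow.distributed.launch --nproc_per_node 2 sbp_demo.py
import oneflow as flow

placement = flow.placement("cuda", ranks=[0, 1])   # the devices the global tensors live on

# split(0) shards the tensor along dim 0 across the placement (data parallelism);
# broadcast replicates it on every device (the usual layout for weights).
x = flow.randn(8, 4, placement=placement, sbp=flow.sbp.split(0))
w = flow.randn(4, 3, placement=placement, sbp=flow.sbp.broadcast)

y = flow.matmul(x, w)    # OneFlow infers the output SBP and inserts any needed communication
print(y.sbp, y.shape)
```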

11. Colossal-AI teams up with BioMap to open-source xTrimo Multimer

Recently, the Colossal-AI team (https://github.com/hpcaitech/ColossalAI) and BioMap jointly released xTrimo Multimer, a free and open-source model that accelerates protein structure prediction and supports both monomer and complex structure prediction. Compared with existing solutions, its inference speed is up to 11 times faster.

Link: https://mp.weixin.qq.com/s/RSk_y2AIDUHtM8wVCklPOQ

12. MindSpore Custom Operators: Thinking, Challenges and Practice

A full-stack deep learning solution inevitably runs into the vertical barriers between different operator optimization schemes and the horizontal barriers created by separate layers of graph abstraction. To resolve the expressiveness gap these barriers introduce, MindSpore puts forward its own solution: a unified custom-operator representation.
— — — — — — — —
Copyright notice: This article is an original article by the CSDN blogger "OneFlow Deep Learning Framework", reproduced under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/OneFlow_Official/article/details/126595666