A Talk with Leana Golubchik
TITLE: Deconstructing Distributed Deep Learning
Deep learning has made substantial strides in computer vision, speech recognition, natural language processing, and other applications. New training techniques, larger datasets, increased computing power, and easy-to-use machine learning frameworks (such as TensorFlow, PyTorch, or Caffe) all contribute to this success. An important missing piece, however, is that deep learning frameworks do not assist the user with provisioning and sharing cloud resources, or with integrating DNN training workloads into existing datacenters. Most users need to try different configurations of a job (such as the number of server/worker nodes, mini-batch size, and network capacity) to determine the resulting training performance (throughput, measured in examples per second, and training accuracy). When resources must be shared among hundreds of jobs, this trial-and-error approach quickly becomes infeasible. At a larger scale, when multiple datacenters need to manage deep learning workloads, different degrees of affinity for their resources create economic incentives to collaborate, as in cloud federations. In this talk, we present recent models that predict performance metrics (such as training throughput) and scheduling algorithms that use these metrics to guide resource allocation. We also outline the economic incentives for sharing resources for such workloads, and future research goals aimed at broadening the population of users capable of discovering deep learning models and applying them to novel applications.