{"title": "BML: A High-performance, Low-cost Gradient Synchronization Algorithm for DML Training", "book": "Advances in Neural Information Processing Systems", "page_first": 4238, "page_last": 4248, "abstract": "In distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper we propose BML, a new gradient synchronization algorithm with higher network performance and lower network cost than the current practice. BML runs on BCube network, instead of using the traditional Fat-Tree topology. BML algorithm is designed in such a way that, compared to the parameter server (PS) algorithm on a Fat-Tree network connecting the same number of server machines, BML achieves theoretically 1/k of the gradient synchronization time, with k/5 of switches (the typical number of k is 2\u223c4). Experiments of LeNet-5 and VGG-19 benchmarks on a testbed with 9 dual-GPU servers show that, BML reduces the job completion time of DML training by up to 56.4%.", "full_text": "BML: A High-performance, Low-cost Gradient\nSynchronization Algorithm for DML Training\n\nSongtao Wang1,2, Dan Li1, Yang Cheng1, Jinkun Geng1,\n\nYanshu Wang1, Shuai Wang1, Shutao Xia1,2 and Jianping Wu1\n\n1Department of Computer Science and Technology, Tsinghua University\n\n2Graduate School at Shenzhen, Tsinghua University\n\nAbstract\n\nIn distributed machine learning (DML), the network performance between ma-\nchines signi\ufb01cantly impacts the speed of iterative training. In this paper we pro-\npose BML, a new gradient synchronization algorithm with higher network per-\nformance and lower network cost than the current practice. BML runs on BCube\nnetwork, instead of using the traditional Fat-Tree topology. 
The BML algorithm is designed in such a way that, compared to the parameter server (PS) algorithm on a Fat-Tree network connecting the same number of server machines, BML theoretically achieves 1/k of the gradient synchronization time with only k/5 of the switches (the typical value of k is 2∼4). Experiments of LeNet-5 and VGG-19 benchmarks on a testbed with 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4%.

1 Introduction

Machine learning (ML) has become a core service in large companies [14]. The scale of modern ML training can be huge [17, 7, 4]. From our survey of a large internet company, a CTR (click-through rate) estimation task trains a model of >100 billion features with >1PB of training data. Given the memory size and processing capability of today's commodity machines, it is inevitable to run distributed machine learning (DML) on multiple machines [10, 18]. For instance, the internet company under survey currently uses several hundred dedicated machines to carry out the training for CTR estimation. With ever-increasing training data and model sizes, it is expected that even larger-scale DML will appear in the near future.

A typical ML training task trains a model iteratively until the parameters converge. In the widely-used gradient descent optimization method, in each iteration the algorithm uses a minibatch of training data to compute a gradient, which decides the changes to make to the parameters trained by the previous iteration. In DML, every machine iteratively trains a sub-minibatch of data and synchronizes the gradients with other machines. Ideally, more machines help reduce the training time. However, it has been shown that, when more machines are used in DML, we have to set a smaller sub-minibatch size per machine, so as to keep the aggregated minibatch over all the machines at a reasonable size.
Otherwise, the large aggregated minibatch may cause the training to quickly converge to a worse model. For instance, a recent work from Facebook discloses that their translation service cannot currently train on large minibatches without degrading model quality [14].

A side effect of a smaller sub-minibatch size per machine in larger-scale DML is that it breaks the computation/communication balance. For example, an experiment from Amazon shows that [23], if the batch size on a GPU is set to 16, the processing time per batch stays stable from 1 GPU to 128 GPUs; while if the batch size on a GPU is set to 2, the processing time per batch under 128 GPUs increases by more than 6 times compared with the time per batch under a single GPU, because of the dominating communication cost. Therefore, in order to run DML at large scale, we need to carefully design the network with minimized synchronization overhead among machines.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The widely-used network topology to run DML in today's data centers is the Clos network, or Fat-Tree [5]. Although Fat-Tree achieves great success in providing uniform network performance to cloud computing applications, it may not well match the traffic model of gradient synchronization in DML. Running the typical parameter server (PS) synchronization algorithm in Fat-Tree, each synchronization flow needs to traverse multiple hops of switches before being aggregated. This not only hurts the gradient synchronization performance, but also wastes bandwidth/link resources.

In this paper, we suggest using BCube [12] as the underlying network topology for DML training, and design a novel distributed gradient synchronization algorithm on top of BCube, called BML. BCube is a recursive network topology composed of commodity switches and servers with k (the typical value of k is 2∼4) interfaces.
The synchronization algorithm of BML is designed in such a way that, compared to the PS algorithm running on a Fat-Tree network connecting the same number of server machines, BML running on a BCube network can theoretically achieve 1/k of the gradient synchronization time, with only k/5 of the switches.

We have implemented BML in TensorFlow. We run two representative public deep learning benchmarks, namely, LeNet-5 [15] and VGG-19 [22], on a testbed with 9 dual-GPU servers. The experiment results show that BML can reduce the job completion time of DML training by up to 56.4% compared with the PS algorithm on a Fat-Tree network. The advantage of BML is higher when the sub-minibatch size per machine is smaller, which is important for large-scale DML to guarantee the model accuracy.

2 Background and Motivation

DML Models and Notations: DML can run on multiple CPUs/GPUs in a machine or on multiple machines. In this work we focus on the DML network among machines. In order to decouple the inter-machine and intra-machine communications, throughout this paper we simply take one machine as a single training worker, though the machine can be equipped with multiple GPUs.

Based on whether the training data or the model parameters are split onto multiple machines, DML can be divided into data-parallel and model-parallel approaches. In data-parallel DML, each machine uses a shard of training data to compute the gradients; while in model-parallel DML, a machine computes gradients for part of the model. In this work we focus on data-parallel DML. In each iteration, every machine trains local gradients for the entire model based on its sub-minibatch of training data, and synchronizes the gradients with other machines.
The aggregated gradients are calculated upon all the machines' local gradients, and are then applied to the model update.

According to the tradeoff between gradient freshness and computing resource utilization, there are three typical synchronization modes: 1) Bulk synchronous parallel (BSP); 2) Total asynchronous parallel (TAP); 3) Stale synchronous parallel (SSP). Given a predefined accuracy of the trained model, it is difficult to tell which synchronization mode runs the fastest in practice. BSP wastes the computation resource of some faster machines, but fully follows the sequential behavior of training on a single machine. TAP makes full utilization of the computing resource, but the convergence speed is unpredictable, with the possibility of no convergence at all [10]. SSP lies between the two, with proven convergence [9, 26]. In this work we focus on BSP synchronization, which is widely used in modern ML applications [11, 24].

Table 1: Notations used throughout the paper

  Notation | Meaning
  N        | The total number of servers in a DML network
  P        | The size of the full gradients
  TF       | The theoretical time to transmit the full gradients at full link speed
  TC       | The theoretical time to transmit a gradient piece at full link speed

We summarize the notations used throughout the paper in Table 1. N denotes the total number of server machines in a DML network, P denotes the size of the full gradients for the trained model, and TF denotes the theoretical time to transmit the full gradients at full link speed.
Many gradient synchronization algorithms divide the full gradients into multiple pieces during synchronization, and we use TC to denote the theoretical time to transmit a gradient piece at full link speed.

Figure 1: Gradient synchronization algorithms. (a) PS algorithm. (b) Ring AllReduce algorithm.

Gradient Synchronization Algorithm: There are two representative algorithms to synchronize the gradients among machines, namely, the PS algorithm [17] and the Ring AllReduce algorithm [19]. The PS algorithm is shown in Fig. 1(a), in which a logical parameter server (composed of a number of physical servers) interacts with every worker for parameter update. Each worker pushes its local gradients to the parameter server after training in an iteration, and pulls the aggregated gradients from the parameter server before going to the next iteration. With more workers in DML, the amount of traffic exchanged with the parameter server also increases.

The Ring AllReduce algorithm is widely used in HPC, as shown in Fig. 1(b). If we run the Ring AllReduce algorithm for gradient synchronization, all the machines are organized as a logical ring and the algorithm includes two stages. In the scatter-allreduce stage, it takes N − 1 steps for each machine to aggregate 1/N of the gradients; in the allgather stage, it takes N − 1 more steps for each machine to get a complete set of the updated gradients. Each step takes the time of (1/N) ∗ TF, if full link speed can be used by every machine. The theoretical gradient synchronization time (GST) of the Ring AllReduce algorithm is thus (2 ∗ (N − 1)/N) ∗ TF.

Figure 2: A Fat-Tree network with 16 servers.

Physical Network Topology: Fat-Tree is the current practice of physical network topology in most commercial data centers, as shown in Fig. 2.
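The two-stage Ring AllReduce schedule above can be sketched as a small simulation. This is an illustrative sketch only, not code from any DML framework: plain floats stand in for equal-sized gradient chunks, and `ring_allreduce` is our own naming.

```python
# Illustrative simulation of Ring AllReduce; plain floats stand in for
# equal-sized gradient chunks.
def ring_allreduce(local_chunks):
    """local_chunks: N workers' gradients, each pre-split into N chunks.
    Returns the reduced chunks held by every worker and the step count."""
    N = len(local_chunks)
    buf = [list(w) for w in local_chunks]
    steps = 0
    # Scatter-allreduce: in step s, worker i sends its partial chunk
    # (i - s) % N to its ring successor, which adds it in. After N - 1
    # steps, worker i fully owns chunk (i + 1) % N.
    for s in range(N - 1):
        sends = [((i + 1) % N, (i - s) % N, buf[i][(i - s) % N]) for i in range(N)]
        for dst, c, v in sends:
            buf[dst][c] += v
        steps += 1
    # Allgather: the fully-reduced chunks travel once around the ring,
    # overwriting the stale partial copies, in N - 1 more steps.
    for s in range(N - 1):
        sends = [((i + 1) % N, (i + 1 - s) % N, buf[i][(i + 1 - s) % N]) for i in range(N)]
        for dst, c, v in sends:
            buf[dst][c] = v
        steps += 1
    return buf, steps  # steps == 2 * (N - 1), matching the GST analysis
```

With N = 4 workers the simulation takes 6 communication steps, matching 2 ∗ (N − 1); since each step moves 1/N of the gradients, this is where the (2 ∗ (N − 1)/N) ∗ TF figure comes from.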
When running the PS algorithm for gradient synchronization in Fat-Tree, it is flexible to place the parameter servers and workers. One way is to partition the machines into two clusters, one for parameter servers and the other for workers. Another way is to implement the logical parameter server in a P2P manner, i.e., every machine plays as both a worker and a parameter server. For reducing the gradient synchronization time, the latter way makes better utilization of the network bandwidth. In every iteration, each machine is responsible for aggregating 1/N of the gradients and broadcasting them to all the other machines. Hence, during gradient synchronization each machine pushes P/N local gradients to, and pulls P/N aggregated gradients from, every other machine. Since the Fat-Tree network provides non-blocking bandwidth, the theoretical GST for the P2P-based PS algorithm in Fat-Tree is (2 ∗ (N − 1)/N) ∗ TF.

If running the Ring AllReduce algorithm in a Fat-Tree network, the theoretical GST is also (2 ∗ (N − 1)/N) ∗ TF, by utilizing the bidirectional bandwidth of every server link. Since the two gradient synchronization algorithms achieve the same GST in a Fat-Tree network and the PS algorithm is more widely implemented in modern DML frameworks [4, 1, 6], in this paper we only take the PS algorithm in Fat-Tree as the benchmark.

Motivation of BML Design: Although Fat-Tree achieves great success in providing uniform network performance to cloud computing applications, in this paper we argue that Fat-Tree does not well match the traffic pattern of DML training. When running the PS algorithm, each synchronization flow needs to traverse multiple hops of switches before being aggregated, which not only hurts the gradient synchronization performance, but also wastes bandwidth/link resources.
We seek to design a new gradient synchronization algorithm on an alternative network topology, which can achieve lower GST with lower hardware cost. The new network topology and the synchronization algorithm atop it can be used to build a server cluster purposely for DML training.

3 BML Design

3.1 BCube Topology

Figure 3: The topology of BCube(3,2).

We select BCube [12, 8] as the underlying physical network topology for DML training. BCube(n,k) is a recursive topology composed of commodity servers with k (the typical value of k is 2∼4) network interfaces and switches with n ports. Fig. 3 shows an example of the BCube(3,2) topology. Note that modern commodity GPU servers used in ML training have multiple PCIe slots. Besides installing the GPU cards, it is easy to equip k network interfaces on a GPU server. Since in this paper we assume the network is the bottleneck rather than computation power, plugging several NICs rather than GPUs into PCIe slots is reasonable. BCube switches are organized in k levels (identified from 0 to k − 1). Each server uses one interface to connect to a switch at each level. A BCube server can be denoted by an ID s = [v_{k−1}, ..., v_1, v_0] (v_i ∈ [0, n − 1], ∀i ∈ [0, k − 1]). The links connecting level-i switches are called level-i links. It is worth noting that BCube switches are not directly connected with each other. If the shortest path between two BCube servers traverses q switches, the two servers are called q-hop neighboring servers. For instance, in Fig. 3 each server has four 1-hop neighboring servers.

A BCube(n,k) network contains n^k servers and k ∗ n^(k−1) switches. For instance, a BCube(16,4) network has 65536 servers. Hence, BCube can extend to large scale on commodity servers and switches. Compared with the classical 3-layer Fat-Tree topology [5], BCube pays the cost of using more network interfaces on servers.
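The counting above can be checked with a few lines of code. This is a sketch under the paper's definitions (the function names are ours): a server ID is a k-digit base-n tuple, and 1-hop neighbors differ from it in exactly one digit, namely the servers sharing a switch with it at the level of that digit.

```python
from itertools import product

def bcube_servers(n, k):
    """All server IDs [v_{k-1}, ..., v_0], each digit in [0, n-1]: n^k servers."""
    return list(product(range(n), repeat=k))

def bcube_switches(n, k):
    """k levels of switches, n^(k-1) switches per level: k * n^(k-1) in total."""
    return k * n ** (k - 1)

def fat_tree_switches(N, n):
    """Switch count 5*N/n for a classical 3-layer Fat-Tree on N servers
    (the comparison figure used in the text)."""
    return 5 * N // n

def one_hop_neighbors(s, n):
    """1-hop neighbors of server s: change exactly one digit of its ID."""
    return [s[:i] + (v,) + s[i + 1:]
            for i in range(len(s)) for v in range(n) if v != s[i]]

# BCube(16,4): 65536 servers, as stated in the text.
assert len(bcube_servers(16, 4)) == 65536
# BCube(3,2): each server has four 1-hop neighboring servers, as in Fig. 3.
assert len(one_hop_neighbors((0, 0), 3)) == 4
# Switch-count ratio BCube : Fat-Tree is k/5 for the same number of servers.
assert bcube_switches(16, 4) / fat_tree_switches(16 ** 4, 16) == 4 / 5
```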
However, the number of switches required in BCube is much less than that in Fat-Tree. Considering that many servers in modern data centers are already equipped with at least dual ports [16], the cost of BCube can be lessened further. In order to connect a total number of N servers with n-port switches, a BCube network needs (k ∗ N)/n switches, while a Fat-Tree network needs (5 ∗ N)/n switches. Given that the typical value of k in BCube is 2∼4, the number of switches required in BCube is 40∼80% of that in Fat-Tree. Since the cost of a server NIC is much less than that of a switch, the total network cost of a BCube network is considerably less than that of a Fat-Tree network connecting the same number of servers.

3.2 Gradient Synchronization Algorithm

In every training iteration, servers take a fully-distributed way to synchronize the gradients. As listed in Table 1, we use N to denote the total number of servers in a BCube(n,k) network, with N = n^k. k gradient synchronization threads (each thread identified by e ∈ [0, k − 1]) run simultaneously on each server. The full set of gradients, with the size of P, is equally split into k ∗ N gradient pieces. The theoretical time to transmit a gradient piece at full link speed is thus TC = TF/(k ∗ N). Each synchronization thread e on a server q is responsible for aggregating one gradient piece, and thus we can use g = <e, q> to identify a gradient piece. We further use g(S) to denote the gradient piece g that is aggregated over the set of servers S. Obviously, the initial state of a gradient piece g on a server s is g(s), and the final state of the gradient piece after synchronization is g([∗, ∗, ..., ∗]).

Each gradient synchronization thread on a server runs the algorithm in two stages, namely, the aggregation stage and the broadcast stage.
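The two stages can be illustrated end to end with a toy simulation. The sketch below simplifies the real algorithm heavily (it simulates a single synchronization thread, keeps one scalar per gradient piece, and identifies each piece by the ID of the server that aggregates it); it is meant only to show that k aggregation steps followed by k broadcast steps leave every server with every fully-aggregated piece.

```python
from itertools import product

def bml_sync(n, k, grads):
    """grads[s][p]: server s's local value for gradient piece p, where both
    s and p are k-digit base-n tuples (piece p is aggregated at server p).
    Returns the state after the aggregation and broadcast stages."""
    servers = list(product(range(n), repeat=k))
    state = {s: dict(grads[s]) for s in servers}

    # Aggregation stage: in step w, piece p moves over a level-w link toward
    # the neighbor whose digit w matches p[w]; partial sums merge on the way.
    for w in range(k):
        nxt = {s: {} for s in servers}
        for s in servers:
            for p, v in state[s].items():
                t = s[:w] + (p[w],) + s[w + 1:]   # 1-hop neighbor (or s itself)
                nxt[t][p] = nxt[t].get(p, 0.0) + v
        state = nxt
    # Now server q holds exactly one piece, q, fully aggregated over all servers.

    # Broadcast stage: reverse the walk, copying (not summing) the pieces.
    for w in reversed(range(k)):
        nxt = {s: {} for s in servers}
        for s in servers:
            for p, v in state[s].items():
                for d in range(n):                # copy to all level-w neighbors
                    nxt[s[:w] + (d,) + s[w + 1:]][p] = v
        state = nxt
    return state
```

For BCube(3,2) with every local value equal to 1.0, each of the 9 servers ends the broadcast stage holding all 9 pieces, each aggregated to 9.0; the real algorithm additionally runs k such threads in parallel over disjoint link levels and splits the gradients into k ∗ N pieces.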
In the aggregation stage, the thread exchanges gradient pieces with the same thread on other servers, and aggregates one gradient piece. In the broadcast stage, the thread broadcasts its aggregated gradient piece to all the other servers. Finally, every server gets a complete set of aggregated gradients, which are used to update the parameters. The aggregation stage and the broadcast stage take k steps each. In what follows, we first use an example in BCube(3,2) to demonstrate the process of the BML algorithm. After that, we describe the generalized algorithm.

Figure 4: BML gradient synchronization algorithm. (a) Initial states of gradient pieces. (b) States after the first step of the aggregation stage. (c) States after the second step of the aggregation stage. (d) States after the first step of the broadcast stage. (e) States after the second step of the broadcast stage.

An Example of the BML Algorithm in BCube(3,2): As shown in Fig. 4(a), in a BCube(3,2) network every server runs two gradient synchronization threads. At the beginning, each server s only has the local gradients trained by itself. These gradients are split into 18 (= 9 ∗ 2) pieces, identified from <0,0,0>(s) to <1,2,2>(s). The theoretical time to transmit a gradient piece is TC = TF/18.

In the first step of the aggregation stage, every server exchanges gradient pieces with its 1-hop neighboring servers, as shown in Fig. 4(b). Take gradient synchronization thread 0 on server [0,0] as an example. It sends 3 local gradient pieces <0,∗,1>([0,0]) to server [0,1] and 3 pieces <0,∗,2>([0,0]) to server [0,2], respectively. At the same time, it also receives 3 gradient pieces <0,∗,0>([0,1]) from server [0,1] and 3 pieces <0,∗,0>([0,2]) from server [0,2].
Together with the local gradient pieces <0,∗,0>([0,0]), it aggregates the 3 gradient pieces <0,∗,0> over the 3 servers under the same level-0 switch. The partially-aggregated result is <0,∗,0>([0,∗]). Each gradient synchronization thread on every server makes a similar partial aggregation. Note that in this step, synchronization thread 0 on all servers uses level-0 links only, while thread 1 on all servers uses level-1 links only. Therefore, it takes the theoretical time of 6 ∗ TC to complete this step.

In the second step, every synchronization thread further exchanges its partially-aggregated gradient pieces with 1-hop neighboring servers at the other level, i.e., thread 0 taking level-1 links and thread 1 taking level-0 links. Fig. 4(c) shows the process. We again use synchronization thread 0 on server [0,0] as the example. It sends the partially-aggregated gradient piece <0,1,0>([0,∗]) to server [1,0] and <0,2,0>([0,∗]) to server [2,0], respectively. At the same time, it receives partially-aggregated gradient pieces <0,0,0>([1,∗]) from server [1,0] and <0,0,0>([2,∗]) from server [2,0], respectively. Together with the local partially-aggregated piece <0,0,0>([0,∗]), the fully-aggregated result for gradient piece <0,0,0> over all the 9 servers is obtained, represented as <0,0,0>([∗,∗]). Similarly, each synchronization thread makes the full aggregation of one gradient piece.
This step takes the theoretical time of 2 ∗ TC, and ends the aggregation stage.

In the first step of the broadcast stage, every thread on a server broadcasts its fully-aggregated gradient piece to 1-hop neighboring servers, as shown in Fig. 4(d). Thread 0 takes level-1 links and thread 1 takes level-0 links. We still use thread 0 on server [0,0] as the example. It broadcasts the gradient piece <0,0,0>([∗,∗]) to servers [1,0] and [2,0] simultaneously. At the same time, it receives gradient piece <0,1,0>([∗,∗]) from server [1,0] and piece <0,2,0>([∗,∗]) from server [2,0]. This step takes the theoretical time of 2 ∗ TC. After this step, each thread on a server holds 3 fully-aggregated gradient pieces.

In the second step, every thread broadcasts its 3 fully-aggregated gradient pieces to 1-hop neighboring servers at the other level, namely, thread 0 taking level-0 links while thread 1 taking level-1 links. It is easy to infer that this step takes the theoretical time of 6 ∗ TC, and the final state is shown in Fig. 4(e). This is the end of the broadcast stage. Each server gets a complete set of 18 fully-aggregated gradient pieces, which is used to update the model parameters. The gradient synchronization traffic makes full utilization of the network bandwidth, and the total theoretical GST for a BCube(3,2) network is 16 ∗ TC = (8/9) ∗ TF.

General Algorithm: Next we describe the general BML algorithm run by thread t of server a in a BCube(n,k) network. Note that k threads simultaneously run the same algorithm on each server. As aforementioned, the algorithm takes k steps in the aggregation stage and k steps in the broadcast stage. In the aggregation stage, the k steps use the level-(t mod k), level-((t + 1) mod k), ..., level-((t + k − 1) mod k) links respectively, to synchronize the gradient pieces with 1-hop neighboring servers.
Therefore, in every step, the links taken by the k gradient synchronization threads on server a do not collide with each other. In step w (w ∈ [0, k − 1]), thread t sends N/n^(w+1) gradient pieces to, and receives the same number of gradient pieces from, each of its 1-hop neighboring servers. The IDs of the exchanged gradient pieces are specified by the functions CalcGset() and CheckDigits(). Hence, step w takes the time of (N/n^(w+1)) ∗ (n − 1) ∗ TC. Taking all the k steps together, the total time of the aggregation stage is (N − 1) ∗ TC. The broadcast stage works similarly to the aggregation stage, except that in each step a thread broadcasts the fully-aggregated gradient pieces to its 1-hop neighboring servers instead of making aggregation. It takes the same time as the aggregation stage. The total GST for a BCube(n,k) network is thus 2 ∗ (N − 1) ∗ TC = (2 ∗ (N − 1)/(k ∗ N)) ∗ TF.

Therefore, compared with the theoretical GST of (2 ∗ (N − 1)/N) ∗ TF when running the PS algorithm on a Fat-Tree network, the BML algorithm on a BCube(n,k) network theoretically uses only 1/k of the GST, with less network cost. One may argue that we can also equip multiple NICs on a server to connect to a Fat-Tree network. However, this leads to more switches and several times higher network cost.

3.3 The Importance of both the BCube Topology and the BML Algorithm

We choose BCube as the underlay DML network to replace Fat-Tree for two reasons.
First, as one of the server-centric network topologies, BCube has a well-known lower network cost than Fat-Tree. Second, compared with other server-centric network topologies such as DCell [13] and FiConn [16], BCube has a nice topological feature in that it can better speed up gradient synchronization in DML. This speedup is realized by the BML algorithm, which works in a hierarchical way, so that intermediate results are aggregated to reduce the traffic load. On the contrary, if we run the traditional PS algorithm on BCube, it doubles the GST compared with running BML. This is because the PS algorithm works in a flat way, which not only occupies more links for an individual flow but also fails to make aggregation during synchronization.

Algorithm 1: Gradient synchronization algorithm of thread t on server a in a BCube(n,k) network

s.G: a set of gradient pieces G on a server s
a.GF: the full set of gradient pieces on server a
Nl(a): the set of server a's 1-hop neighboring servers under the same level-l switch
s.d[i]: the i-th digit of server s' ID [v_{k−1}, ..., v_1, v_0]
g.d[i]: the i-th digit of gradient piece g's ID <e, v_{k−1}, ..., v_1, v_0>

1: a.GF ← full gradient pieces on server a by local training
2: for w ∈ [0, k − 1] do
3:   RunAggregation(w)
4: for w ∈ [0, k − 1] do
5:   RunBroadcast(w)

function RunAggregation(w)
6: l ← (w + t) mod k
7: for each server s ∈ Nl(a) do
8:   a.G ← CalcGset(s, w, a.GF)
9:   Transmit a.G to s and receive s.G from s through the level-l link
10: a.GF ← updated full gradient set by aggregating a's local gradient pieces and the received gradient pieces from Nl(a)

function RunBroadcast(w)
11: l ← (k − 1 − w + t) mod k
12: a.G ← CalcGset(a, k − 1 − w, a.GF)
13: for each server s ∈ Nl(a) do
14:   Transmit a.G to s and receive s.G from s through the level-l link
15: a.GF ← updated full gradient set by replacing a's local gradient pieces with the received gradient pieces from Nl(a)

function CalcGset(s, w, a.GF)
16: R ← ∅
17: for each gradient piece g ∈ a.GF do
18:   if CheckDigits(g, s, w) = True then
19:     put g into R
20: return R

function CheckDigits(g, s, w)
21: for i ∈ [0, w] do
22:   j ← (i + t) mod k
23:   if g.d[j] ≠ s.d[j] or g.d[k] ≠ t then
24:     return False
25: return True

On the other side, we can run a BML-like hierarchical synchronization algorithm on a Fat-Tree network, where servers are first grouped by edge switches, then grouped by aggregation switches, and finally grouped by pods. But this synchronization algorithm results in the same GST as running the PS algorithm in Fat-Tree. The main reason is that the single NIC at each server in Fat-Tree cannot speed up the synchronization. Although there are more switches in Fat-Tree than in BCube, they are amortized since each synchronization flow in Fat-Tree traverses multiple switches. If we use k NICs at each server to connect to the Fat-Tree network, as a BCube server does, it can indeed lead to the same GST as BCube; but we then need k times more switches to support the same number of servers, which further increases the cost gap between BCube and Fat-Tree.

We use Fig. 5 to demonstrate the GSTs when running the PS-based algorithm and BML (the hierarchical synchronization algorithm) on both Fat-Tree and BCube networks. It clearly demonstrates the necessity of choosing both the BCube topology and the BML algorithm for DML.

Figure 5: GST (in units of TF) versus number of servers when running BML (hierarchical synchronization) and PS-based algorithms on BCube and Fat-Tree networks respectively.
16 servers refer to BCube(4,2) and Fat-Tree(4), while 1024 servers refer to BCube(32,2) and Fat-Tree(16).

4 Implementation and Experiments

4.1 Implementation

We implement BML in TensorFlow [4]. The gradient synchronization algorithm currently implemented in the open-source version of TensorFlow is the PS algorithm. Our implementation of BML includes 4550 lines of C++ code and 702 lines of Python code. It contains three main modules, namely, the sending module, the receiving module and the management module. The sending module gets the gradient pieces from the sending queues (enqueued by the management module) and sends them to the corresponding neighbouring servers. The receiving module receives the gradient pieces from neighbouring servers and submits them to the management module. The management module bridges the other two modules, maintains the sending queues, and, in the aggregation stage, aggregates the gradient pieces based on the remote ones and the local ones.

It is worth noting that, for deep learning, the neural network model has more than one layer. In the back-propagation algorithm [20, 21], the gradients of the model are computed layer by layer. When the gradients for one layer have been computed, they can be transmitted while the gradients for other layers are still under computation. The time of computation and transmission can thus overlap. Therefore, in our implementation we divide the gradients for each layer of the model into k ∗ N pieces in a BCube(n,k) network (with N = n^k), instead of simply dividing the gradients for the entire model into k ∗ N pieces. In this way, the gradient synchronization load on each server is well balanced.

4.2 Experiment Setting

We build a BCube(3,2) testbed with 9 dual-GPU servers and multiple 40GE switches. Each server is equipped with two Nvidia Tesla K40C GPUs, two Intel Xeon E5 CPUs, 64GB DRAM and two 40Gbps Mellanox ConnectX-3 NICs.
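The per-layer splitting described in Section 4.1 can be sketched as follows. This is an illustrative sketch only (`split_layer` is our own naming, and a flattened list of floats stands in for a layer's gradient tensor): each layer is cut into k ∗ N near-equal contiguous pieces, one per (thread, server) pair.

```python
def split_layer(grad, n, k):
    """Split one layer's flattened gradient into k * n**k near-equal
    contiguous pieces, one per (thread, server) pair in a BCube(n,k)
    network, so finished layers can be transmitted while later layers
    are still being computed (illustrative sketch)."""
    num_pieces = k * n ** k
    q, r = divmod(len(grad), num_pieces)
    pieces, start = [], 0
    for i in range(num_pieces):
        size = q + (1 if i < r else 0)   # spread any remainder evenly
        pieces.append(grad[start:start + size])
        start += size
    return pieces

layer = list(range(36))                  # toy layer with 36 parameters
pieces = split_layer(layer, n=3, k=2)    # 18 pieces for BCube(3,2)
assert len(pieces) == 18
assert sum(len(p) for p in pieces) == 36
```

Splitting every layer the same way, rather than the whole model once, keeps the piece sizes (and thus the synchronization load per server) balanced across layers of very different sizes.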
To compare BML with Fat-Tree, we also build a Fat-Tree network with the same number of GPU servers. Since the network size is not very large, we simply use a single 40GE switch to connect all the 9 servers, with each server using one NIC to connect to the switch. This fully emulates the Fat-Tree network, as the network bandwidth is non-blocking. We run the P2P-based PS algorithm for gradient synchronization in Fat-Tree, where each server plays as both a parameter server and a worker. RoCE (RDMA over Converged Ethernet) is used as the transport protocol in all these networks.

We run two representative public deep learning benchmarks, namely, LeNet-5 [15] and VGG-19 [22], in each network. LeNet-5 has a 5-layer model, with a total number of 3.27 million parameters. The size of the full gradients is about 12.5MB (with single-precision floating point). The model of VGG-19 contains 19 layers and the total number of parameters is about 143.65 million. The full gradient size is thus 548MB. The MNIST [3] and ImageNet [2] datasets are used as training data for LeNet-5 and VGG-19 respectively. To study the impact of minibatch size, we set different sub-minibatch sizes on a training server in different rounds of experiments. Since we only focus on the training speed of DML, we fix the number of iterations trained in each round of the experiment at 1000 and measure the job completion time (JCT) of the benchmark.

4.3 Results and Analysis

Figure 6: Experiment result of LeNet-5 (JCT in seconds versus sub-minibatch size).
Figure 7: Experiment result of VGG-19 (JCT in seconds versus sub-minibatch size).

LeNet-5: Fig. 6 illustrates the results for the LeNet-5 benchmark on the MNIST dataset.
We set the sub-minibatch size on each server to 128, 256, 512 and 1024 in different rounds. Compared with Fat-Tree, BML reduces the JCT by 18.7%∼56.4%. The gain comes from two causes. First, the theoretical GST in a BCube(n,k) network is 1/k of that in Fat-Tree. With k = 2 in this experiment, the GST of each training iteration in BML should be about half of that in Fat-Tree. Taking the computation time into account, BML should theoretically reduce the JCT by 0%∼50% compared with Fat-Tree. Second, the current implementation of the PS algorithm in TensorFlow maps the gradients to the parameter servers on a per-tensor basis [25]. As different tensors have different sizes, the loads on the Fat-Tree servers are not balanced. Hence, in the experiments we find that in some cases BML can reduce the JCT by more than 50%.

We also observe that, with a smaller sub-minibatch size on a server, the performance gap between BML and Fat-Tree is larger, because the communication cost accounts for a higher share of the whole training job. As introduced in Section 1, in order to scale DML to a large size without degrading the model quality, we usually have to set a relatively small sub-minibatch size per server. The experiment demonstrates that BML has a particular advantage in this scenario.

VGG-19: The results for the VGG-19 benchmark on the ImageNet dataset are shown in Fig. 7. The model size of VGG-19 is much larger than that of LeNet-5, so it takes more time to finish the 1000 iterations of training. However, the performance gap between the DML networks is very similar to that in LeNet-5. Overall, BML reduces the JCT by 29.2%∼52.1% compared with the Fat-Tree network.

5 Conclusion

In this paper we design BML, a new DML gradient synchronization algorithm with higher performance and lower cost. BML runs on the BCube topology instead of the Fat-Tree network commonly used in current data centers.
Compared with the PS algorithm running on a Fat-Tree network connecting the same number of servers, BML achieves 1/k of the GST while using only k/5 of the switches. The experiments with typical deep learning benchmarks on TensorFlow also validate the performance gains of BML.

6 Acknowledgement

The work was supported by the National Key Basic Research Program of China (973 program) under Grant 2014CB347800, and the National Natural Science Foundation of China under Grant No. 61522205, No. 61772305, No. 61432002, No. 61771273. Dan Li is the corresponding author of this paper.

References

[1] Caffe2 website. http://caffe2.ai/.

[2] ImageNet dataset. http://www.image-net.org/.

[3] The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[4] M. Abadi, P. Barham, J. Chen, et al. TensorFlow: A system for large-scale machine learning. In USENIX OSDI'16, 2016.

[5] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In ACM SIGCOMM'08, 2008.

[6] T. Chen, M. Li, Y. Li, et al. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015. http://arxiv.org/abs/1512.01274.

[7] H.-T. Cheng, L. Koc, J. Harmsen, et al. Wide & deep learning for recommender systems. In DLRS'16, 2016.

[8] R. D. S. Couto, S. Secci, M. E. M. Campista, and L. H. M. K. Costa. Reliability and survivability analysis of data center network topologies. CoRR, abs/1510.02735, 2015.

[9] W. Dai, A. Kumar, J. Wei, et al. High-performance distributed ML at scale through parameter server consistency models. In AAAI, pages 79–87, 2015.

[10] J. Dean, G. Corrado, et al. Large scale distributed deep networks. In NIPS'12, 2012.

[11] P. Goyal, P. Dollár, R. B. Girshick, et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
http://arxiv.org/abs/1706.02677.

[12] C. Guo, G. Lu, D. Li, et al. BCube: A high performance, server-centric network architecture for modular data centers. In ACM SIGCOMM'09, 2009.

[13] C. Guo, H. Wu, K. Tan, et al. DCell: A scalable and fault-tolerant network structure for data centers. In ACM SIGCOMM'08, 2008.

[14] K. Hazelwood, S. Bird, D. Brooks, et al. Applied machine learning at Facebook: A datacenter infrastructure perspective. In IEEE HPCA'18, 2018.

[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.

[16] D. Li, C. Guo, H. Wu, et al. Scalable and cost-effective interconnection of data-center servers using dual server ports. IEEE/ACM Transactions on Networking (TON), 19(1):102–114, 2011.

[17] M. Li, D. G. Andersen, J. W. Park, et al. Scaling distributed machine learning with the parameter server. In USENIX OSDI'14, 2014.

[18] R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT'10, pages 456–464, 2010.

[19] P. Patarasuk and X. Yuan. Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 69(2):117–124, 2009.

[20] D. E. Rumelhart. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Neurocomputing: Foundations of research. Chapter: Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, 1988.

[22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
http://arxiv.org/abs/1409.1556.

[23] A. Smola. Machine learning - progress and opportunities. Speech at AI World 2017, 2017. https://goo.gl/emn8np.

[24] Y. You, Z. Zhang, C. Hsieh, et al. 100-epoch ImageNet training with AlexNet in 24 minutes. arXiv preprint arXiv:1709.05011, 2017. http://arxiv.org/abs/1709.05011.

[25] H. Zhang, Z. Zheng, S. Xu, et al. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In USENIX ATC'17, 2017.

[26] H. Zhao and J. Canny. Butterfly mixing: Accelerating incremental-update algorithms on clusters. In SIAM SDM'13, pages 785–793, 2013.