Skip to content

Many Smaller GPUs - Elegant In-Model Distribution

GPUs (and TPUs) are no doubt one of our most critical resources when running neural networks.

GPUs have strict capacities and limits as physical resources. When using servers with many smaller GPUs, we often ran into the inherent physical limitations of our equipment.

Laying out a possibly large model across many smaller GPUs has thus become a requirement for us. This blog presents a few basic steps in that direction.

The outlined model can also be used to effectively test more complex and custom GPU allocation strategies.

Our objective here is to arrive at training a model representable by the graph and runnable example.

Just as before, we need to prep our environment to run any meaningful code:

import numpy as np
import tensorflow as tf
from datetime import datetime

We also define a few convenient aliases:

ks = tf.keras
kl = ks.layers
cfg = tf.config.experimental

For any effective and generalizable allocation strategy, we need to be able to reason about our resources uniformly.

Normalized virtual GPUs

We start with the new TF functionality of partitioning our physical GPUs into custom-sized, and thus easily “normalizable”, virtual GPUs.

The components, or layers of our models, can then expect the ids of the properly sized, or allocated, virtual GPUs.

Given the parameter-driven “resource” requirements of our layers, we can also develop heuristics for partitioning and allocating the physical devices before starting a training session. Such heuristics are beyond the scope of this blog.

devs = ((None, ), (1000, 1000, 1000, 1000, 1000, 1000), (1000, 1000, 1000, 1000, 1000, 1000))
cfg.set_visible_devices(cfg.get_visible_devices('CPU')[:1], 'CPU')
cfg.set_visible_devices(cfg.get_visible_devices('GPU')[:len(devs) - 1], 'GPU')
for d, ms in zip(cfg.get_visible_devices(), devs):
    vs = [cfg.VirtualDeviceConfiguration(m) for m in ms]
    cfg.set_virtual_device_configuration(d, vs)
devs = cfg.list_logical_devices('CPU')
devs += cfg.list_logical_devices('GPU')
print('devices:', [ for d in devs])
  devices: ['/job:localhost/replica:0/task:0/device:CPU:0', '/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7', '/job:localhost/replica:0/task:0/device:GPU:8', '/job:localhost/replica:0/task:0/device:GPU:9', '/job:localhost/replica:0/task:0/device:GPU:10', '/job:localhost/replica:0/task:0/device:GPU:11']

Let’s turn off “soft” allocation for now:


Pipeline of small GPUs

The model we develop here builds on a configurable “stack” of identical layers.

A most basic, custom dense layer class is all we need as the stack’s repeated component.

We aim to “lay” this stack on its side, and onto our virtual GPUs, as a functional, forward-backward propagating pipeline that could now fit in our combined GPU-space.

Each layer of the stack would, therefore, use a predetermined, or heuristically pre-calculated, virtual GPU idx:

class Layer(kl.Layer):
    def __init__(self, i, ps, **kw):
        self.idx = min(i + 1, len(devs) - 1) = ps

    def build(self, input_shape):
        s = input_shape[-1]
        with tf.device(devs[self.idx].name):
            self.w = self.add_weight(name='l_w', shape=(s, s))
            self.b = self.add_weight(name='l_b', shape=(s, ))
        return super().build(input_shape)

    def call(self, x):
        with tf.device(devs[self.idx].name):
            y = tf.einsum('bi,ij->bj', x, self.w) + self.b
        return y

A sample sequential model

A basic sequential Keras model will suffice as the container of our stack.

Once our input, as well as the output, is shaped, we simply chain our chosen number of layers together, in the middle of our basic sequential model.

The Keras model’s summary method is handy to confirm our model is laid out just as intended:

def model_for(ps):
    m = ks.Sequential()
    m.add(kl.Dense(ps.dim_hidden, input_dim=ps.dim_input, name='in'))
    for i in range(ps.num_layers):
        m.add(Layer(i, ps, name=f'lay_{i}'))
    m.add(kl.Dense(ps.dim_input, name='out'))
    m.compile(optimizer=ps.optimizer(), loss=ps.loss(), metrics=[ps.metrics()])
    return m

Symbolic parameters

Before we can run our model, we need to establish our parameters, the various dimensions and other attributes we want to use to shape the training.

A simple Python dict works best to keep things organized, unique and also sorted:

params = dict(

The drawback of string keyed dicts is just that, the strings can have typos in them and hundreds of potentially misnamed parameters, later on, will certainly cause unnecessary confusion.

Python’s automatically verified native attributes come to the rescue once again.

Here is a simple, straightforward and functional Params class:

class Params:
    def __init__(self, **kw):
        for k, v in kw.items():
            setattr(self, k, v)

Let’s create our Params instance and a truly handy training data set (with testing and verification all built-in) in just one line of code:

ps = Params(**params)
import numpy as np
d = np.ones((100, ps.dim_input))

10 million weights to validate with

Finally, we are ready to compile our model. And, just as expected, the summary of the model shows that it has over 10 million weights.

The initial values of the weights are randomly picked. Through training, we bring these arbitrary values “inline” through millions of multiplications and additions executed by our many virtual GPUs, only to verify that our input ones are, in fact, just a series of 1s:

m = model_for(ps)
  Model: "sequential"
  Layer (type)                 Output Shape              Param #
  in (Dense)                   (None, 1000)              101000
  lay_0 (Layer)                (None, 1000)              1001000
  lay_1 (Layer)                (None, 1000)              1001000
  lay_2 (Layer)                (None, 1000)              1001000
  lay_3 (Layer)                (None, 1000)              1001000
  lay_4 (Layer)                (None, 1000)              1001000
  lay_5 (Layer)                (None, 1000)              1001000
  lay_6 (Layer)                (None, 1000)              1001000
  lay_7 (Layer)                (None, 1000)              1001000
  lay_8 (Layer)                (None, 1000)              1001000
  lay_9 (Layer)                (None, 1000)              1001000
  out (Dense)                  (None, 100)               100100
  Total params: 10,211,100
  Trainable params: 10,211,100
  Non-trainable params: 0

Training the model gives us the familiar Keras output, showing a nice convergence of a trivial problem across easily configurable GPUs:

from datetime import datetime
ld ='%Y%m%d-%H%M%S')
ld = f'/tmp/logs/{ld}'
cs = [ks.callbacks.TensorBoard(log_dir=ld, histogram_freq=1)], d, callbacks=cs, epochs=10, batch_size=10)
  Train on 100 samples
  Epoch 1/10
  100/100 [==============================] - 2s 17ms/sample - loss: 0.4796 - mean_absolute_error: 0.4796
  Epoch 2/10
  100/100 [==============================] - 1s 7ms/sample - loss: 0.2872 - mean_absolute_error: 0.2872
  Epoch 3/10
  100/100 [==============================] - 1s 7ms/sample - loss: 0.2414 - mean_absolute_error: 0.2414
  Epoch 4/10
  100/100 [==============================] - 1s 7ms/sample - loss: 0.2252 - mean_absolute_error: 0.2252
  Epoch 5/10
  100/100 [==============================] - 1s 7ms/sample - loss: 0.1988 - mean_absolute_error: 0.1988
  Epoch 6/10
  100/100 [==============================] - 1s 7ms/sample - loss: 0.1984 - mean_absolute_error: 0.1984
  Epoch 7/10
  100/100 [==============================] - 1s 7ms/sample - loss: 0.1734 - mean_absolute_error: 0.1734
  Epoch 8/10
  100/100 [==============================] - 1s 7ms/sample - loss: 0.1551 - mean_absolute_error: 0.1551
  Epoch 9/10
  100/100 [==============================] - 1s 7ms/sample - loss: 0.1719 - mean_absolute_error: 0.1719
  Epoch 10/10
  100/100 [==============================] - 1s 7ms/sample - loss: 0.1527 - mean_absolute_error: 0.1527

And now let’s fire up TensorBoard and visually confirm that our stack of “dense” layers is connected just as expected.

If you haven’t run the code, an already generated graph is here.

#%load_ext tensorboard
#%tensorboard --logdir /tmp/q/logs

This concludes our blog, please see how to use the new dataset functionality by clicking on the next blog.