Cases where Multiple GPUs are beneficial

When I had to work on multiple projects at the same time, I was given a workstation with four GPUs (GeForce 1080 Ti, 11 GB each). While digging into how to use the given resources efficiently, I found that TensorFlow offers several options for assigning work to multiple GPUs. The sections below describe how to assign GPUs according to your intention, tasks, and models. Specifically, there are three cases: first, multiple tasks (scripts) on different GPUs; second, multiple models, one per GPU; and finally, one big model distributed across GPUs.

1. Multiple Tasks –> Multiple GPUs

When you have multiple projects (scripts) and each one requires a GPU, you can assign a dedicated GPU to the corresponding project (script). Without declaring a list of visible devices, TensorFlow grabs every accessible GPU. To prevent this, pass a configuration that names the GPU when creating the Session. Suppose the model has already been declared and training for a specified number of epochs follows.

# ===== Script A.py ===== #

import tensorflow as tf

# Declare the configuration
a = tf.ConfigProto()
# Expose only GPU 0 to this script
a.gpu_options.visible_device_list = "0"
# Finally, pass the configuration to the Session
sess = tf.Session(config=a)
init = tf.global_variables_initializer()
sess.run(init)

# ===== Script B.py ===== #

import tensorflow as tf

# Declare the configuration
b = tf.ConfigProto()
# Expose only GPU 1 to this script
b.gpu_options.visible_device_list = "1"
# Finally, pass the configuration to the Session
sess = tf.Session(config=b)
init = tf.global_variables_initializer()
sess.run(init)
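Alternatively, the CUDA_VISIBLE_DEVICES environment variable hides GPUs from the whole process, so TensorFlow never even initializes the other devices. A minimal sketch of the same Script A, assuming the variable is set before TensorFlow first touches CUDA:

# ===== Script A.py (environment-variable variant) ===== #

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before CUDA is initialized

import tensorflow as tf

sess = tf.Session()  # only GPU 0 is visible, so no per-session config is needed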

2. Multiple Models –> Multiple GPUs

If you need to run multiple models with one model per GPU (e.g., an ensemble), each GPU can be allocated to its own model (graph). Additionally, if the graphs share many common parts such as the input, output, dropout, and cost, declaring each graph under a name scope lets us access each component more conveniently during training. To show how shared component names work across different graphs, I only changed the size of the last layer of each graph and gave the components identical names. Although runtime varies with batch size and data, training the graphs across GPUs this way saved about one second per iteration compared with training each graph sequentially (measured on MNIST with batch size 10).

scopes = []

# ===== Graph 1 Declaration ===== #

graph1 = tf.Graph()
with graph1.as_default():

    with graph1.name_scope( "g1" ) as scope:

        scopes.append( "g1" + "/" )

        X = tf.placeholder("float", [None, 28, 28, 1], name = "input")
        Y = tf.placeholder("float", [None, 10], name = "label")
        w = init_weights([3, 3, 1, 64])        # 3x3x1 conv, 64 outputs

        w2 = init_weights([3, 3, 64, 64])      # 3x3x64 conv, 64 outputs

        w3 = init_weights([3, 3, 64, 128])     # 3x3x64 conv, 128 outputs

        w4 = init_weights([128 * 4 * 4, 2400]) # FC 128 * 4 * 4 inputs, 2400 outputs

        w_o = init_weights([2400, 10])         # FC 2400 inputs, 10 outputs (labels)


        p_keep_conv = tf.placeholder("float", name = "p_keep_conv")
        p_keep_hidden = tf.placeholder("float", name = "p_keep_hidden")
        py_x = model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden)

        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = py_x, labels = Y), name = "cross_entropy" )
        train_op = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost, name = "train_step" )
        predict_op = tf.argmax(py_x, 1, name = "predict_op")
        initialize = tf.global_variables_initializer()

# ===== Graph 2 Declaration ===== #

graph2 = tf.Graph()
with graph2.as_default():

    with graph2.name_scope( "g2" ) as scope:

        scopes.append( "g2" + "/" )

        X = tf.placeholder("float", [None, 28, 28, 1], name = "input")
        Y = tf.placeholder("float", [None, 10], name = "label")
        w = init_weights([3, 3, 1, 64])        # 3x3x1 conv, 64 outputs

        w2 = init_weights([3, 3, 64, 64])      # 3x3x64 conv, 64 outputs

        w3 = init_weights([3, 3, 64, 128])     # 3x3x64 conv, 128 outputs

        w4 = init_weights([128 * 4 * 4, 4800]) # FC 128 * 4 * 4 inputs, 4800 outputs

        w_o = init_weights([4800, 10])         # FC 4800 inputs, 10 outputs (labels)


        p_keep_conv = tf.placeholder("float", name = "p_keep_conv")
        p_keep_hidden = tf.placeholder("float", name = "p_keep_hidden")
        py_x = model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden)

        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = py_x, labels = Y), name = "cross_entropy" )
        train_op = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost, name = "train_step" )
        predict_op = tf.argmax(py_x, 1, name = "predict_op")
        initialize2 = tf.global_variables_initializer()

# Session for each graph and GPU


sessions = []

b = tf.ConfigProto()
b.gpu_options.visible_device_list = "1"
sess1 = tf.Session(graph=graph1, config=b)
sess1.run(initialize)
sessions.append(sess1)

c = tf.ConfigProto()
c.gpu_options.visible_device_list = "2"
sess2 = tf.Session(graph=graph2, config=c)
sess2.run(initialize2)  # run graph2's own init op; graph1's op would fail here
sessions.append(sess2)

# Training by matching sessions to graphs with GPUs

for i in range(1000):

    for j in range(len(sessions)):

        training_batch = zip(range(0, len(trX), batch_size), range(batch_size, len(trX)+1, batch_size))

        for start, end in training_batch:

            # Run the train step together with the cost so the weights are actually updated
            _, ys = sessions[j].run([sessions[j].graph.get_operation_by_name( scopes[j] + 'train_step' ),
                                     sessions[j].graph.get_tensor_by_name( scopes[j] + 'cross_entropy:0' )],
                                     feed_dict={sessions[j].graph.get_tensor_by_name( scopes[j] + 'input:0' ): trX[start:end],
                                                sessions[j].graph.get_tensor_by_name( scopes[j] + 'label:0' ): trY[start:end],
                                                sessions[j].graph.get_tensor_by_name( scopes[j] + 'p_keep_conv:0' ): 0.8,
                                                sessions[j].graph.get_tensor_by_name( scopes[j] + 'p_keep_hidden:0' ): 0.75})


        test_indices = np.arange(len(teX)) # Get A Test Batch

        np.random.shuffle(test_indices)
        test_indices = test_indices[0:test_size]

        if j == 0 :
            print("Running on GPU # 1 ")
        else:
            print("Running on GPU # 2 ")

        print(i, np.mean(np.argmax(teY[test_indices], axis=1) == sessions[j].run(sessions[j].graph.get_tensor_by_name( scopes[j] + 'predict_op:0' ) ,
                                        feed_dict={sessions[j].graph.get_tensor_by_name( scopes[j] + 'input:0' ): teX[test_indices],
                                                   sessions[j].graph.get_tensor_by_name( scopes[j] + 'label:0' ): teY[test_indices],
                                                   sessions[j].graph.get_tensor_by_name( scopes[j] + 'p_keep_conv:0' ): 1.,
                                                   sessions[j].graph.get_tensor_by_name( scopes[j] + 'p_keep_hidden:0' ): 1.}) ) )
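One caveat: the loop above drives the two sessions from a single Python loop, so the GPUs take turns instead of computing simultaneously. Because sess.run releases the GIL while TensorFlow executes, a simple way to get true overlap is to give each session its own thread. A minimal sketch, where train_one_epoch is a hypothetical helper wrapping the inner batch loop above:

import threading

def train_one_epoch(sess, scope):
    # hypothetical helper: one pass over the training data for one session/GPU
    training_batch = zip(range(0, len(trX), batch_size), range(batch_size, len(trX)+1, batch_size))
    for start, end in training_batch:
        sess.run(sess.graph.get_operation_by_name( scope + 'train_step' ),
                 feed_dict={sess.graph.get_tensor_by_name( scope + 'input:0' ): trX[start:end],
                            sess.graph.get_tensor_by_name( scope + 'label:0' ): trY[start:end],
                            sess.graph.get_tensor_by_name( scope + 'p_keep_conv:0' ): 0.8,
                            sess.graph.get_tensor_by_name( scope + 'p_keep_hidden:0' ): 0.75})

threads = [threading.Thread(target=train_one_epoch, args=(sessions[j], scopes[j]))
           for j in range(len(sessions))]
for t in threads: t.start()
for t in threads: t.join()  # both GPUs now train their epoch concurrently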

3. One Big Model –> Distributed GPUs

Once the size of the model exceeds the memory of a single GPU, the model can be split into multiple parts, each of which is executed on a different GPU.

# ===== Graph Declaration with per-GPU placement ===== #

graph1 = tf.Graph()

with graph1.as_default():

    # GPU Assignments within a Graph


    with tf.device( '/gpu:1' ) :

        X = tf.placeholder("float", [None, 28, 28, 1], name = "input")
        Y = tf.placeholder("float", [None, 10], name = "label")

        w = init_weights([3, 3, 1, 64])     # 3x3x1 conv, 64 outputs

        w2 = init_weights([3, 3, 64, 64])   # 3x3x64 conv, 64 outputs

        w3 = init_weights([3, 3, 64, 128])  # 3x3x64 conv, 128 outputs


    with tf.device( '/gpu:2' ) :

        w4 = init_weights([128 * 4 * 4, 2400]) # FC 128 * 4 * 4 inputs, 2400 outputs

        w_o = init_weights([2400, 10])         # FC 2400 inputs, 10 outputs (labels)


        p_keep_conv = tf.placeholder("float", name = "p_keep_conv")
        p_keep_hidden = tf.placeholder("float", name = "p_keep_hidden")
        py_x = model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden)

        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = py_x, labels = Y), name = "cross_entropy" )
        train_op = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost, name = "train_step" )
        predict_op = tf.argmax(py_x, 1, name = "predict_op")
        initialize = tf.global_variables_initializer()  # create the init op inside graph1

# Assign the graph to a Session

# Soft placement lets ops without a GPU kernel fall back to the CPU
config = tf.ConfigProto(allow_soft_placement=True)

with tf.Session(graph=graph1, config=config) as sess:

    sess.run(initialize)

    for i in range(100):
        training_batch = zip(range(0, len(trX), batch_size), range(batch_size, len(trX)+1, batch_size))

        for start, end in training_batch:

            sess.run(train_op, feed_dict={X: trX[start:end], Y: trY[start:end], p_keep_conv: 0.8, p_keep_hidden: 0.5})

        test_indices = np.arange(len(teX)) # Get A Test Batch

        np.random.shuffle(test_indices)
        test_indices = test_indices[0:test_size]

        print(i, np.mean(np.argmax(teY[test_indices], axis=1) ==
                         sess.run(predict_op, feed_dict={X: teX[test_indices],
                                                         Y: teY[test_indices],
                                                         p_keep_conv: 1.0,
                                                         p_keep_hidden: 1.0})))
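To check where each op actually ended up, TensorFlow can log its placement decisions when the session is created. A short sketch using the same graph:

# Log op-to-device assignments (printed to stderr at session creation)
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
sess = tf.Session(graph=graph1, config=config)
sess.run(initialize)  # placements such as /device:GPU:1 appear in the log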

Wrap up

I hope the methods described above help improve performance when running code on multiple GPUs. On the other hand, although GPUs are powerful for computation-intensive work, they can be slower than a CPU when running conditional statements. So, optimally assigning each task or job to the appropriate device is another way to boost performance.
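For example, inside the same graph you can pin branch-heavy or lightweight steps to the CPU with tf.device. A hypothetical sketch (the learning-rate switch below is an invented example, not part of the model above):

with graph1.as_default():
    with tf.device('/cpu:0'):
        # data-dependent control flow tends to run better on the CPU
        lr = tf.cond(cost > 1.0,
                     lambda: tf.constant(0.001),
                     lambda: tf.constant(0.0001))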