
Multi-GPU TensorFlow Python Code Example

Using multiple GPUs with TensorFlow on the SCC

This example demonstrates how to request multiple GPUs and use them to train a deep neural network with TensorFlow. The Python code is adapted from the TensorFlow Keras distributed training tutorial.

Request multiple GPUs (e.g., 2) on the SCC

The following command launches an interactive session with the requested resources. Alternatively, you can request the same resources in a batch job script.

qrsh -pe omp 1 -l h_rt=1:00:00 -l gpus=2 -l gpu_c=3.5 -P your_project_name

Check the available GPUs

nvidia-smi

Tue Mar 23 20:34:17 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K40m          On   | 00000000:04:00.0 Off |                    0 |
| N/A   31C    P8    20W / 235W |      0MiB / 11441MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          On   | 00000000:82:00.0 Off |                    0 |
| N/A   32C    P8    20W / 235W |      0MiB / 11441MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Load modules on the SCC

We load the relevant modules before launching Python.

module load python3/3.8.6
module load tensorflow/2.3.1

Install TensorFlow Datasets

This Python package is not installed on the SCC by default, so we install it ourselves with pip, following the SCC guidelines for installing Python packages. As recommended, we install it into our project space. Afterwards, we amend PYTHONPATH (and PATH) so that the newly installed package can be found.

pip install --prefix=/projectnb/projectname/pythonlibs tensorflow-datasets
export PYTHONPATH=/projectnb/projectname/pythonlibs/lib/python3.8/site-packages/:$PYTHONPATH
export PATH=/projectnb/projectname/pythonlibs/bin:$PATH

Start a Python Session

python

Import TensorFlow and TensorFlow Datasets

import tensorflow_datasets as tfds
import tensorflow as tf

import os
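
If you want to confirm that Python picked up the copy of tensorflow_datasets installed above, rather than one from elsewhere on the search path, a quick optional check is to print the package's version and location:

print(tfds.__version__)
print(tfds.__file__)  # should point into /projectnb/projectname/pythonlibs/...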

Check the available GPU IDs by reading an environment variable

gpu_ids = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
gpu_ids

['0', '1']
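
Optionally, you can also confirm that TensorFlow itself sees both GPUs; a minimal check using the tf.config API:

gpus = tf.config.list_physical_devices("GPU")
print(len(gpus), "GPUs visible to TensorFlow")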

Download the dataset

Here the data is downloaded to the default location. You can adjust this so that the data is downloaded to a location of your choice, as shown below.

datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets["train"], datasets["test"]
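
If you would rather keep the dataset files in your project space instead of the default location (~/tensorflow_datasets), tfds.load accepts a data_dir argument; the path below is only a placeholder for a location of your own:

datasets, info = tfds.load(
    name="mnist",
    with_info=True,
    as_supervised=True,
    data_dir="/projectnb/projectname/tensorflow_datasets",  # placeholder path
)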

Define distribution strategy

strategy = tf.distribute.MirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

print("Number of devices: {}".format(strategy.num_replicas_in_sync))

Number of devices: 2
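
By default, MirroredStrategy uses every GPU visible to the job. If you need to restrict it to a subset of the devices, you can pass an explicit device list; a minimal sketch:

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])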

Set up the input pipeline

# You can also do info.splits.total_num_examples to get the total
# number of examples in the dataset.

num_train_examples = info.splits["train"].num_examples
num_test_examples = info.splits["test"].num_examples

BUFFER_SIZE = 10000

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync


def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255

    return image, label


train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
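
An optional tweak for heavier input pipelines is to prefetch batches so that preprocessing overlaps with training on the GPUs; tf.data.experimental.AUTOTUNE is the spelling available in TensorFlow 2.3 (newer releases also accept tf.data.AUTOTUNE):

train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
eval_dataset = eval_dataset.prefetch(tf.data.experimental.AUTOTUNE)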

Create the model

with strategy.scope():
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10),
        ]
    )

    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(),
        metrics=["accuracy"],
    )

Define the callbacks

Here the checkpoint directory can be set to a location of your choice.

# Define the checkpoint directory to store the checkpoints
checkpoint_dir = "./training_checkpoints"
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Function for decaying the learning rate.
# You can define any decay function you need.


def decay(epoch):
    if epoch < 3:
        return 1e-3
    elif epoch >= 3 and epoch < 7:
        return 1e-4
    else:
        return 1e-5


# Callback for printing the LR at the end of each epoch.
class PrintLR(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        print(
            "\nLearning rate for epoch {} is {}".format(
                epoch + 1, model.optimizer.lr.numpy()
            )
        )


callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir="./logs"),
    tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_prefix, save_weights_only=True
    ),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR(),
]

Train and evaluate

model.fit(train_dataset, epochs=12, callbacks=callbacks)

Epoch 1/12
WARNING:tensorflow:From /share/pkg.7/tensorflow/2.3.1/install/lib/SCC/../python3.8/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.
WARNING:tensorflow:From /share/pkg.7/tensorflow/2.3.1/install/lib/SCC/../python3.8/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
  1/469 [..............................] - ETA: 0s - loss: 2.2931 - accuracy: 0.1172WARNING:tensorflow:From /share/pkg.7/tensorflow/2.3.1/install/lib/SCC/../python3.8/site-packages/tensorflow/python/ops/summary_ops_v2.py:1277: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
WARNING:tensorflow:From /share/pkg.7/tensorflow/2.3.1/install/lib/SCC/../python3.8/site-packages/tensorflow/python/ops/summary_ops_v2.py:1277: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
  2/469 [..............................] - ETA: 13s - loss: 2.2534 - accuracy: 0.1797WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0060s vs `on_train_batch_end` time: 0.0508s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0060s vs `on_train_batch_end` time: 0.0508s). Check your callbacks.
466/469 [============================>.] - ETA: 0s - loss: 0.2458 - accuracy: 0.9296 
Learning rate for epoch 1 is 0.0010000000474974513
469/469 [==============================] - 2s 4ms/step - loss: 0.2448 - accuracy: 0.9299
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Epoch 2/12
458/469 [============================>.] - ETA: 0s - loss: 0.0850 - accuracy: 0.9753
Learning rate for epoch 2 is 0.0010000000474974513
469/469 [==============================] - 2s 4ms/step - loss: 0.0848 - accuracy: 0.9754
Epoch 3/12
457/469 [============================>.] - ETA: 0s - loss: 0.0567 - accuracy: 0.9835
Learning rate for epoch 3 is 0.0010000000474974513
469/469 [==============================] - 2s 4ms/step - loss: 0.0567 - accuracy: 0.9836
Epoch 4/12
463/469 [============================>.] - ETA: 0s - loss: 0.0364 - accuracy: 0.9900
Learning rate for epoch 4 is 9.999999747378752e-05
469/469 [==============================] - 2s 4ms/step - loss: 0.0362 - accuracy: 0.9901
Epoch 5/12
459/469 [============================>.] - ETA: 0s - loss: 0.0334 - accuracy: 0.9906
Learning rate for epoch 5 is 9.999999747378752e-05
469/469 [==============================] - 2s 4ms/step - loss: 0.0335 - accuracy: 0.9906
Epoch 6/12
468/469 [============================>.] - ETA: 0s - loss: 0.0318 - accuracy: 0.9914
Learning rate for epoch 6 is 9.999999747378752e-05
469/469 [==============================] - 2s 4ms/step - loss: 0.0318 - accuracy: 0.9914
Epoch 7/12
467/469 [============================>.] - ETA: 0s - loss: 0.0302 - accuracy: 0.9918
Learning rate for epoch 7 is 9.999999747378752e-05
469/469 [==============================] - 2s 4ms/step - loss: 0.0302 - accuracy: 0.9918
Epoch 8/12
462/469 [============================>.] - ETA: 0s - loss: 0.0279 - accuracy: 0.9926
Learning rate for epoch 8 is 9.999999747378752e-06
469/469 [==============================] - 2s 4ms/step - loss: 0.0278 - accuracy: 0.9926
Epoch 9/12
456/469 [============================>.] - ETA: 0s - loss: 0.0277 - accuracy: 0.9928
Learning rate for epoch 9 is 9.999999747378752e-06
469/469 [==============================] - 2s 4ms/step - loss: 0.0275 - accuracy: 0.9927
Epoch 10/12
458/469 [============================>.] - ETA: 0s - loss: 0.0275 - accuracy: 0.9928
Learning rate for epoch 10 is 9.999999747378752e-06
469/469 [==============================] - 2s 4ms/step - loss: 0.0273 - accuracy: 0.9929
Epoch 11/12
466/469 [============================>.] - ETA: 0s - loss: 0.0272 - accuracy: 0.9928
Learning rate for epoch 11 is 9.999999747378752e-06
469/469 [==============================] - 2s 4ms/step - loss: 0.0272 - accuracy: 0.9928
Epoch 12/12
466/469 [============================>.] - ETA: 0s - loss: 0.0271 - accuracy: 0.9931
Learning rate for epoch 12 is 9.999999747378752e-06
469/469 [==============================] - 2s 4ms/step - loss: 0.0270 - accuracy: 0.9931
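
The eval_dataset defined earlier can now be used to measure performance on the test set. Following the pattern in the TensorFlow tutorial, you can restore the most recent checkpoint written by the ModelCheckpoint callback and then call evaluate; a short sketch:

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

eval_loss, eval_acc = model.evaluate(eval_dataset)
print("Eval loss: {}, Eval accuracy: {}".format(eval_loss, eval_acc))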

Conclusion

We have now successfully trained our model using multiple GPUs. To finish up, close the Python session with exit(), then end the interactive job session with exit.

exit()
exit

Contact Information

Help: help@scv.bu.edu

Note: RCS example programs are provided "as is" without any warranty of any kind. The user assumes the entire risk of quality, performance, and repair of any defect. You are welcome to copy and modify any of the given examples for your own use.