Background

CHPC recommends conda for installing Python, but using conda caused errors for me because conda ships its own cuDNN, which seems to interfere with the system-installed CUDA. The alternative is to get Python either from module load or from Linuxbrew, which is the version I use. After choosing a Python, I recommend using virtualenv.

Install

# module load python/3.7.3.lua # but I am using python from linuxbrew
python3 -m venv tfvenv
source tfvenv/bin/activate # activate the venv so pip installs into it, not system-wide
module load cuda/10.0.lua
module load cudnn/7.6.2.lua
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64 # <= is needed if you are using python from linuxbrew
pip3 install tensorflow-gpu
# I also highly recommend
pip3 install ipython ipdb keras

Testing

# get a CHPC GPU-enabled node (interactive shell with one GPU)
srun --account=owner-gpu-guest --partition=kingspeak-gpu-guest --nodes=1 \
    --ntasks=1  --gres=gpu:1  --pty /bin/bash -l
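
Once you are on the node, it is worth a quick sanity check that TensorFlow actually sees the GPU. A minimal sketch, assuming TensorFlow 2.x (which is what pip resolves tensorflow-gpu to against CUDA 10.0):

import tensorflow as tf

# Should print at least one physical GPU; an empty list means the
# CUDA/cuDNN modules or LD_LIBRARY_PATH are not set up correctly.
print(tf.config.experimental.list_physical_devices('GPU'))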

If the model doesn't run, or Python segfaults, most likely TensorFlow is running out of host memory. A quick fix is to request more memory from srun by specifying --ntasks=2; this also gives you two cores on the node, sweet deal!

Now you can check whether the installation went okay in either of two ways.

  1. Simple matmul
    import tensorflow as tf
    tf.config.set_soft_device_placement(True)
    tf.debugging.set_log_device_placement(True)
    c = []
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
    print(c)
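    # With log_device_placement enabled, TensorFlow logs where each op runs;
    # the MatMul above should be reported on a device ending in GPU:0.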
    
  2. Simple CNN
    git clone --depth=1 git@github.com:keras-team/keras.git
    cd keras/examples
    nvprof python cifar10_cnn.py
    

    You can press Ctrl+C to cut the training short; you should still see the nvprof log on exit. (If you'd rather skip the clone, see the self-contained sketch right after this list.)
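
A minimal self-contained alternative to the cloned example, assuming the keras package installed above (numpy comes along with tensorflow): a tiny convnet trained on random data standing in for CIFAR-10. Save it as, say, tiny_cnn.py (the name is a placeholder) and run it under nvprof the same way.

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

# Random data standing in for CIFAR-10: 32x32 RGB images, 10 classes.
x = np.random.rand(256, 32, 32, 3).astype('float32')
y = np.random.randint(0, 10, size=(256,))

# One conv layer is enough to exercise cuDNN kernels under nvprof.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    Flatten(),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x, y, batch_size=32, epochs=1)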

FAQ

  1. The TensorFlow training gets killed when run through nvprof.
    When I run TensorFlow training without nvprof it works, but when run via nvprof it gets killed. The workaround is to let TensorFlow grow its GPU memory allocation on demand instead of grabbing it all up front:
    export TF_FORCE_GPU_ALLOW_GROWTH=true
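
    The same behavior can also be turned on from inside the script; a small sketch using the TF 2.x config API:

    import tensorflow as tf

    # Equivalent to TF_FORCE_GPU_ALLOW_GROWTH=true: grow GPU memory on
    # demand. Must run before anything touches the GPU.
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)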
    

Happy Coding ❤!