Installing TensorFlow on chpc
Background
chpc recommends conda for installing Python, but using conda caused errors for me because conda ships its own cuDNN, which seems to interfere with the system-installed CUDA. The alternative is to get Python either from module load or from linuxbrew, which is the version I use. Whichever Python you choose, I recommend using virtualenv.
Install
# module load python/3.7.3.lua # but I am using python from linuxbrew
python3 -m venv tfvenv
source tfvenv/bin/activate # activate the virtualenv so the pip3 installs below go into it
module load cuda/10.0.lua
module load cudnn/7.6.2.lua
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64 # <= is needed if you are using python from linuxbrew
pip3 install tensorflow-gpu
# I also highly recommend
pip3 install ipython ipdb keras
Testing
# get chpc gpu enabled node
srun --account=owner-gpu-guest --partition=kingspeak-gpu-guest --nodes=1 \
--ntasks=1 --gres=gpu:1 --pty /bin/bash -l
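Once you have a shell on the GPU node, it is worth a quick check that TensorFlow actually sees the device before running anything heavy. A minimal sketch, assuming a TensorFlow 2.x-era install (the experimental API below also exists in late 1.x releases); the file name check_gpu.py is just an example:
# check_gpu.py - quick sanity check that TensorFlow can see the GPU
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
# list_physical_devices returns an empty list if no GPU is visible
print("GPUs visible to TensorFlow:", tf.config.experimental.list_physical_devices('GPU'))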
If the model doesn’t run or Python segfaults, most likely TensorFlow is running out of host memory. The quick fix is to request more memory from srun by specifying
--ntasks=2
This will also give you two cores on the node, sweet deal!
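If two tasks are still not enough, srun also accepts an explicit memory request via --mem (for example --mem=16G), assuming the partition allows it.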
Now you can check whether the installation works in either of two ways.
- Simple matmult
import tensorflow as tf
# fall back to CPU if an op has no GPU kernel, and log where each op runs
tf.config.set_soft_device_placement(True)
tf.debugging.set_log_device_placement(True)
c = []
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
c.append(tf.matmul(a, b))
print(c)
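If everything is wired up correctly, the device-placement log should show the MatMul op landing on the GPU, and the printed result is a single 2x2 tensor: [[22, 28], [49, 64]].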
- Simple DNN network
git clone --depth=1 git@github.com:keras-team/keras.git
cd keras # the CIFAR10 script may live under examples/, in which case cd keras/examples
nvprof python cifar10_cnn.py
You can press
Ctrl+C
to cut the training short, and you should still see the nvprof log.
FAQ
- The TensorFlow training gets killed when run through nvprof.
When I run TensorFlow training without nvprof it works, but when I run it via nvprof it gets killed. See the comment; basically you can set:
export TF_FORCE_GPU_ALLOW_GROWTH=true
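If you prefer to keep the allow-growth behaviour inside the script rather than in the environment, the same effect can be had from Python. A minimal sketch, assuming a TensorFlow 2.x-style install; run it before any ops touch the GPU:
# enable memory growth so TensorFlow allocates GPU memory on demand
# instead of grabbing it all up front (equivalent to TF_FORCE_GPU_ALLOW_GROWTH=true)
import tensorflow as tf

for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)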
Happy Coding ❤!