
Compiling and Installing TensorFlow in a Docker Container

2018-08-03 10:17
Copyright notice: Reposting is welcome; please credit the source: https://blog.csdn.net/weixin_32820767 https://blog.csdn.net/weixin_32820767/article/details/81382516

1. Build an image, specifying the image name and tag.

docker build . -t xxx/yyy:v1
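The post does not show the Dockerfile itself; a minimal starting point might look like the sketch below (the base image and package list are my assumptions, not the author's actual file):

```dockerfile
# Hypothetical minimal Dockerfile; Bazel, CUDA, cuDNN and NCCL are added
# by hand in the later steps of this post.
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y \
        build-essential curl git python python-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /root
```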

2. Once the build finishes, run the image to get a container.

Note that the GPU will be used inside the container, so the runtime parameter is required. CUDA 9.0 is already installed on the host, so the container can use the host's CUDA.

docker run -v $HOME/data:/data --runtime=nvidia -it xxx/yyy:v1
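Once inside the container, it is worth sanity-checking that the NVIDIA runtime actually exposed the GPU. A small guard sketch (the function name is mine):

```shell
# Report the GPUs visible in this container, or explain why none are.
check_gpu() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L    # one line per visible GPU
  else
    echo "nvidia-smi not found (was --runtime=nvidia passed?)"
    return 1
  fi
}
check_gpu || true
```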

3. Install Bazel; see the separate post on installing Bazel in a Docker container.

4. Install the TensorFlow Python dependencies:

I am using Python 2.7, so:

apt-get install python-numpy python-dev python-pip python-wheel

5. Install the CUDA toolkit.

Steps to install nvprof inside the container (CUDA 9.0):
apt-get install wget
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
apt-get update
apt-get install cuda-libraries-9-0
apt-get install cuda-command-line-tools-9-0

Note:
By default, /usr/local/cuda is a symlink to /usr/local/cuda-9.0. If you are on CUDA 9.1, install the 9.1 packages instead and create the symlink from /usr/local/cuda to /usr/local/cuda-9.1.

After installation, check whether it succeeded:

nvcc -V

should print the CUDA version number.
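A script-friendly version of this check, which extracts just the release number (assuming the `release X.Y` wording that CUDA 9.0's `nvcc -V` output uses):

```shell
# Print the CUDA release reported by nvcc, or "none" if nvcc is missing.
cuda_release() {
  if command -v nvcc >/dev/null 2>&1; then
    nvcc -V | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p'
  else
    echo "none"
  fi
}
cuda_release
```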

6. CUPTI, which ships with the CUDA toolkit:

You also need to append its path to the LD_LIBRARY_PATH environment variable:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64
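One caveat with the export above: if LD_LIBRARY_PATH was previously unset, it leaves a leading colon, which makes the loader also search the current directory. A slightly more careful form:

```shell
# Append the CUPTI directory without producing a stray leading ':'.
CUPTI_DIR=/usr/local/cuda/extras/CUPTI/lib64
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$CUPTI_DIR"
```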

7. cuDNN SDK (v3 or later).

Version 7.0 is recommended. See the NVIDIA documentation for details; downloading cuDNN requires registering for an NVIDIA developer account.

root@4f2a38da633a:~/transformer# dpkg -i libcudnn7_7.1.4.18-1+cuda9.0_amd64.deb
(Reading database ... 17528 files and directories currently installed.)
Preparing to unpack libcudnn7_7.1.4.18-1+cuda9.0_amd64.deb ...
Unpacking libcudnn7 (7.1.4.18-1+cuda9.0) over (7.1.4.18-1+cuda9.0) ...
Setting up libcudnn7 (7.1.4.18-1+cuda9.0) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...

root@4f2a38da633a:~/transformer# dpkg -i libcudnn7-dev_7.1.4.18-1+cuda9.0_amd64.deb
Selecting previously unselected package libcudnn7-dev.
(Reading database ... 17528 files and directories currently installed.)
Preparing to unpack libcudnn7-dev_7.1.4.18-1+cuda9.0_amd64.deb ...
Unpacking libcudnn7-dev (7.1.4.18-1+cuda9.0) ...
Setting up libcudnn7-dev (7.1.4.18-1+cuda9.0) ...
update-alternatives: using /usr/include/x86_64-linux-gnu/cudnn_v7.h to provide /usr/include/cudnn.h (libcudnn) in auto mode

root@4f2a38da633a:~/transformer# dpkg -i libcudnn7-doc_7.1.4.18-1+cuda9.0_amd64.deb
Selecting previously unselected package libcudnn7-doc.
(Reading database ... 17534 files and directories currently installed.)
Preparing to unpack libcudnn7-doc_7.1.4.18-1+cuda9.0_amd64.deb ...
Unpacking libcudnn7-doc (7.1.4.18-1+cuda9.0) ...
Setting up libcudnn7-doc (7.1.4.18-1+cuda9.0) ...

The deb packages above did not work for me; I had to fall back to the first method from the official site: 2.3.1. Installing from a Tar File.

tar -xzvf cudnn-9.0-linux-x64-v7.tgz
cp cuda/include/cudnn.h /usr/local/cuda/include
root@4f2a38da633a:~/transformer# cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
root@4f2a38da633a:~/transformer# chmod a+r /usr/local/cuda/include/cudnn.h
root@4f2a38da633a:~/transformer# /usr/local/cuda/lib64/libcudnn*
Segmentation fault (core dumped)

The last line produced a segmentation fault. From looking around, others have already explained that this problem can be ignored:
1 Segmentation fault (core dumped)
2 cuDNN segfaults when executing .so files

Check that libcudnn.so.7 is present:

ls -lh /usr/local/cuda-9.0/lib64/libcudnn.so.7
-rwxr-xr-x 1 root root 318M Aug  1 09:26 /usr/local/cuda-9.0/lib64/libcudnn.so.7

8. One more thing to install: NCCL.

# dpkg -i nccl-repo-ubuntu1604-2.2.13-ga-cuda9.0_1-1_amd64.deb
Selecting previously unselected package nccl-repo-ubuntu1604-2.2.13-ga-cuda9.0.
(Reading database ... 17586 files and directories currently installed.)
Preparing to unpack nccl-repo-ubuntu1604-2.2.13-ga-cuda9.0_1-1_amd64.deb ...
Unpacking nccl-repo-ubuntu1604-2.2.13-ga-cuda9.0 (1-1) ...
Setting up nccl-repo-ubuntu1604-2.2.13-ga-cuda9.0 (1-1) ...
# apt update
# apt install libnccl2 libnccl-dev

I recommend setting up the NCCL symlinks at this point; otherwise the configure step will run into trouble later. If you cannot find where NCCL was installed, use the usual Linux methods for locating a library / checking whether it is installed; and if an existing symlink points to the wrong place, fix it in place as well.

ln -s /usr/include/nccl.h /usr/local/cuda-9.0/targets/x86_64-linux/include/nccl.h
ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2.2.13 /usr/local/cuda-9.0/targets/x86_64-linux/lib/libnccl.so.2
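To locate where the NCCL files actually landed before creating those links, a small search helper can be used (the helper name and search roots are illustrative, not from the original post):

```shell
# Illustrative helper: search the given directories for files whose names
# start with a library prefix (e.g. "libnccl" or "nccl.h").
find_lib() {
  _name=$1; shift
  find "$@" -name "${_name}*" 2>/dev/null
}

# Typical usage on this setup:
#   find_lib libnccl /usr/lib /usr/local
#   find_lib nccl.h /usr/include
```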

9. Clone TensorFlow

root@4f2a38da633a:~/transformer# git clone https://github.com/tensorflow/tensorflow
Cloning into 'tensorflow'...
remote: Counting objects: 396000, done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 396000 (delta 4), reused 9 (delta 0), pack-reused 395980
Receiving objects: 100% (396000/396000), 181.86 MiB | 659.00 KiB/s, done.
Resolving deltas: 100% (314846/314846), done.
Checking connectivity... done.
root@4f2a38da633a:~/transformer# cd tensorflow/
root@4f2a38da633a:~/transformer/tensorflow# ./configure

10. Enter the configuration step: configure

root@4f2a38da633a:~/transformer/tensorflow# ./configure
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.16.0 installed.
Please specify the location of python. [Default is /usr/bin/python]: /usr/bin/python2.7

Found possible Python library paths:
/usr/local/lib/python2.7/dist-packages
/usr/lib/python2.7/dist-packages
Please input the desired Python library path to use.  Default is [/usr/local/lib/python2.7/dist-packages]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon AWS Platform support? [Y/n]: n
No Amazon AWS Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]:

Please specify the location where CUDA 9.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:

Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:

Do you wish to build TensorFlow with TensorRT support? [y/N]: n
No TensorRT support will be enabled for TensorFlow.

Please specify the NCCL version you want to use. If NCCL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 2.2]:

Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:/usr/local/cuda-9.0/targets/x86_64-linux

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,7.0]: 6.0

Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:

Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
--config=mkl            # Build with MKL support.
--config=monolithic     # Config for mostly static monolithic build.
Configuration finished

You can see that configure has completed; the NCCL part was the most troublesome. The idea there is to symlink the NCCL header and library into the CUDA directory tree so that configure can find them.

11. Build

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

This step produced errors; in short, many CUDA header files were missing from the include directory, so I simply copied the host machine's entire include directory over:

Compile TensorFlow v1.8.0 cuda/include/cublas_v2.h: No such file or directory
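A sketch of that copy (assuming cuda-9.0 lives at the same path on host and container; `4f2a38da633a` is the container from the prompts above):

```shell
# Illustrative helper: mirror one include directory into another prefix.
sync_headers() {
  src=$1; dst=$2
  mkdir -p "$dst"
  cp -a "$src/." "$dst/"
}

# From the host, the same idea via docker cp:
#   docker cp /usr/local/cuda-9.0/include 4f2a38da633a:/usr/local/cuda-9.0/
```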

Reference: the official documentation:
Installing TensorFlow from Sources
