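I wrote a Dockerfile to build a Docker image. The base image has an old version of the CUDA drivers installed, so I need to uninstall them and install newer ones. This is the relevant part of the Dockerfile: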
ENTRYPOINT [ "/bin/bash", "-l", "-c" ]
RUN sudo yum remove -y xorg-x11-drv-nvidia nvidia-kmod cuda-drivers /usr/local/cuda-10.0
RUN rm -rf /usr/local/cuda-10.0
RUN wget https://developer.download.nvidia.com/compute/cuda/11.4.1/local_installers/cuda-repo-rhel7-11-4-local-11.4.1_470.57.02-1.x86_64.rpm
RUN sudo rpm -i cuda-repo-rhel7-11-4-local-11.4.1_470.57.02-1.x86_64.rpm
RUN sudo yum -y install nvidia-driver-latest-dkms cuda cuda-drivers; sudo yum clean all;
RUN sudo nvidia-smi
RUN export PATH=/usr/local/cuda-11.4/bin${PATH:+:${PATH}}
RUN export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
RUN wget http://developer.download.nvidia.com/compute/redist/cudnn/v8.2.2/cudnn-11.4-linux-x64-v8.2.2.26.tgz
RUN tar -zxf cudnn-11.4-linux-x64-v8.2.2.26.tgz
RUN sudo cp cuda/include/cudnn*.h /usr/local/cuda/include && sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64 && sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
RUN export LIBRARY_PATH=/usr/local/cuda/lib64${LIBRARY_PATH:+:${LIBRARY_PATH}}
RUN cd .. && rm -rf cuda && rm -rf cudnn-11.4-linux-x64-v8.2.2.26.tgz
RUN pip install torch==1.9.0 torchvision==0.10.0 torchaudio==0.9.0
However, after building the image, I get an error when I run nvidia-smi:
Failed to initialize NVML: Driver/library version mismatch
Can you give me any suggestions for solving this problem?
PS: The OS is CentOS 7.
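From the information you've provided I can't say exactly what the problem is, but in my case the problem was using this container as the base image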
nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
and then still installing nvidia-driver-525 on top of it. After removing that line, it worked fine :)
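
To illustrate what "removing that line" amounts to, here is a minimal sketch rather than a verified fix for the asker's exact setup. It assumes the host already has a working NVIDIA driver and the NVIDIA Container Toolkit, which makes the host's user-space driver libraries and nvidia-smi available inside the container at run time. A driver installed inside the image can end up at a different version than the host's kernel module, which is one common way to get the "Driver/library version mismatch" error.

```dockerfile
# Base image taken from the answer above; it ships the CUDA toolkit and cuDNN,
# but intentionally no NVIDIA driver.
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

# Deliberately absent: any driver package such as nvidia-driver-525,
# cuda-drivers, or nvidia-driver-latest-dkms. The kernel module belongs to
# the host, and the container must use user-space libraries that match it.

# If extra paths are needed, set them with ENV. A `RUN export VAR=...` only
# affects the shell of that single RUN step and is lost in later layers.
# (Assumption: /usr/local/cuda is where the toolkit lives in this image.)
ENV PATH=/usr/local/cuda/bin:${PATH}

# Do not call nvidia-smi in a RUN step: the GPU is normally not exposed
# during `docker build`. Check it when the container runs instead.
CMD ["nvidia-smi"]
```

Assuming the image is tagged `my-cuda-image` (a hypothetical name), running `docker run --rm --gpus all my-cuda-image` should then print the host driver's version table. For the CentOS 7 Dockerfile in the question, the analogous change would probably be to install only a toolkit package (I believe `cuda-toolkit-11-4` is the driver-free meta-package in NVIDIA's repository) instead of `cuda`, `cuda-drivers`, and `nvidia-driver-latest-dkms`, and to replace the `RUN export ...` lines with `ENV`.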