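I'm trying to train a MultiLayerNetwork model on a GPU server, and I run into memory allocation problems a number of iterations after calling model.fit().
I have read the memory management documentation, enabled workspaces for training mode, and reduced the batch size to 32, but the error still occurs.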
System details:
NVIDIA-SMI 410.72  Driver Version: 410.72  CUDA Version: 10.0
03:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
04:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
82:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
83:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
Java options: -Xms1G -Xmx2G -Dorg.bytedeco.javacpp.maxbytes=14G -Dorg.bytedeco.javacpp.maxphysicalbytes=14G
Maven CUDA dependencies:
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>deeplearning4j-cuda-10.0</artifactId>
<version>1.0.0-beta4</version>
</dependency>
<dependency>
<groupId>org.nd4j</groupId>
<artifactId>nd4j-cuda-10.0-platform</artifactId>
<version>1.0.0-beta4</version>
</dependency>
Error:
java.lang.OutOfMemoryError: Cannot allocate new PointerPointer(4): totalBytes = 227M, physicalBytes = 14558M
at org.bytedeco.javacpp.PointerPointer.<init> (PointerPointer.java:126)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.calculateOutputShape
...
Caused by: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (14558M) > maxPhysicalBytes (14336M)
at org.bytedeco.javacpp.Pointer.deallocator (Pointer.java:589)
at org.bytedeco.javacpp.Pointer.init (Pointer.java:125)
at org.bytedeco.javacpp.PointerPointer.allocateArray (Native Method)
at org.bytedeco.javacpp.PointerPointer.<init> (PointerPointer.java:118)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.calculateOutputShape
...
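The first thing you should do is update to the current version, 1.0.0-beta6.
Then you should look at how much memory your model actually needs in order to process larger batches.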
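As a rough illustration of the numbers involved, the sketch below first reproduces the limit from the error message (14G expressed in M, compared with the reported physicalBytes) and then makes a back-of-the-envelope estimate of activation memory for one minibatch. The class name, the helper methods and the layer widths are hypothetical and not taken from the question. The estimate only counts float32 activations and ignores parameters, gradients, updater state and workspace overhead, so treat it as a lower bound rather than a measurement.

```java
public class MemoryBudget {

    // Convert GiB to MiB, the unit used in the JavaCPP error message.
    static long gibToMib(long gib) {
        return gib * 1024L;
    }

    // Lower bound (in bytes) for the float32 activations of one minibatch:
    // batchSize * (sum of layer output widths) * 4 bytes.
    // Assumption: dense layers only, nothing else counted.
    static long activationBytes(int batchSize, int[] layerWidths) {
        long perExample = 0;
        for (int width : layerWidths) {
            perExample += width;
        }
        return (long) batchSize * perExample * Float.BYTES;
    }

    public static void main(String[] args) {
        // -Dorg.bytedeco.javacpp.maxphysicalbytes=14G from the question
        long limitMib = gibToMib(14);
        // physicalBytes value reported in the stack trace
        long usedMib = 14558;
        System.out.println("limit = " + limitMib + "M, over by "
                + (usedMib - limitMib) + "M");

        // Hypothetical layer output widths, for illustration only.
        int[] hypotheticalWidths = {1024, 512, 10};
        for (int batch : new int[]{32, 64, 128}) {
            System.out.println("batch " + batch + ": at least "
                    + activationBytes(batch, hypotheticalWidths) + " bytes of activations");
        }
    }
}
```

The first line printed shows that the process was about 222M over the configured 14336M limit. Note that totalBytes in the same message is only 227M, so most of the physical memory in use was not allocated through JavaCPP's tracked pointers.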