After installing the new CUDA 4.0 driver and SDK, many of the SDK tests fail (for example fastWalshTransform, matrixMul, and reduction). This is the output of ./deviceQuery:
Device 0: "GeForce GTX 570"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1279 MBytes (1341325312 bytes)
(15) Multiprocessors x (32) CUDA Cores/MP: 480 CUDA Cores
GPU Clock Speed: 1.57 GHz
Memory Clock rate: 2100.00 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 4 / 0
For example, the output of reduction is:
=> FAILED
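Solution: this was (and still is) a hardware problem; driver updates do not fix it. It is probably some kind of memory issue, and it is a common one: several of our NVIDIA cards have shown it (even Tesla cards!). So far, the only fixes we have found are to reboot the machine or to raise the voltage slightly.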
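If a memory fault is suspected, one rough way to look for it is to write a known pattern into a large block of device memory and read it back. The following is only a minimal sketch under assumptions: the helper names (`pattern`, `fillPattern`, `checkPattern`, `check`), the 256 MiB buffer size, and the seed values are all hypothetical choices, not part of the SDK. A clean run does not prove the card is healthy, since caching and the limited buffer size mean many faults would go unnoticed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: the value expected at word i for a given seed.
__host__ __device__ inline unsigned int pattern(unsigned int i, unsigned int seed) {
    return (i * 2654435761u) ^ seed;
}

// Write the pattern into every word of the buffer (grid-stride loop).
__global__ void fillPattern(unsigned int* buf, unsigned int n, unsigned int seed) {
    unsigned int step = blockDim.x * gridDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += step)
        buf[i] = pattern(i, seed);
}

// Read every word back on the device and count the mismatches.
__global__ void checkPattern(const unsigned int* buf, unsigned int n,
                             unsigned int seed, unsigned int* errors) {
    unsigned int step = blockDim.x * gridDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += step)
        if (buf[i] != pattern(i, seed)) atomicAdd(errors, 1u);
}

// Abort with a message if a runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    const unsigned int n = 64u * 1024u * 1024u;  // 256 MiB of 32-bit words (assumed size)
    const unsigned int seeds[] = {0x00000000u, 0xFFFFFFFFu, 0xA5A5A5A5u, 0x5A5A5A5Au};
    const int numSeeds = sizeof(seeds) / sizeof(seeds[0]);

    unsigned int* buf = NULL;
    unsigned int* devErrors = NULL;
    check(cudaMalloc((void**)&buf, n * sizeof(unsigned int)), "cudaMalloc buf");
    check(cudaMalloc((void**)&devErrors, sizeof(unsigned int)), "cudaMalloc errors");

    unsigned int total = 0;
    for (int s = 0; s < numSeeds; ++s) {
        unsigned int hostErrors = 0;
        check(cudaMemset(devErrors, 0, sizeof(unsigned int)), "cudaMemset");
        fillPattern<<<64, 256>>>(buf, n, seeds[s]);
        checkPattern<<<64, 256>>>(buf, n, seeds[s], devErrors);
        check(cudaGetLastError(), "kernel launch");
        // This blocking copy also waits for both kernels to finish.
        check(cudaMemcpy(&hostErrors, devErrors, sizeof(unsigned int),
                         cudaMemcpyDeviceToHost), "cudaMemcpy errors");
        printf("seed 0x%08X: %u mismatching words\n", seeds[s], hostErrors);
        total += hostErrors;
    }

    cudaFree(buf);
    cudaFree(devErrors);
    return total == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Running it several times, before and after a reboot, may help tell an intermittent hardware fault apart from a software setup problem.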
Try this code; it may work.
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#define threads 10  // threads in the single block that is launched
#define numbers 20  // number of elements to compute

// Each thread writes i * i for its share of the indices and folds the
// result into a running maximum with atomicMax.
__global__ void calculate(int* rez, int* max) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int step = blockDim.x * gridDim.x;  // grid-stride loop
    for (int i = idx; i < numbers; i += step) {
        rez[i] = i * i;
        atomicMax(max, rez[i]);
    }
}

int main() {
    int size = numbers * sizeof(int);
    int hostRez[numbers];
    int hostMax = 0;
    int* devRez;
    int* devMax;

    // Allocate device buffers and initialise the maximum to 0.
    cudaMalloc((void**)&devRez, size);
    cudaMalloc((void**)&devMax, sizeof(int));
    cudaMemcpy(devMax, &hostMax, sizeof(int), cudaMemcpyHostToDevice);

    calculate<<<1, threads>>>(devRez, devMax);

    // Copy the results back; these blocking copies wait for the kernel.
    cudaMemcpy(&hostMax, devMax, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hostRez, devRez, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < numbers; i++) {
        std::cout << hostRez[i] << std::endl;
    }
    std::cout << hostMax << std::endl;

    cudaFree(devRez);
    cudaFree(devMax);
    return 0;
}
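The program above never inspects the return codes of the CUDA calls, so a failing card and a working one can look the same until the printed numbers are read by hand. As a hedged variation, the sketch below wraps each runtime call in a check and compares the device results against values computed on the host. The names `CUDA_CHECK`, `squares`, `expectedMax`, `kNumbers`, and `kThreads` are hypothetical and chosen only for this illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: abort with file and line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

const int kNumbers = 20;  // same element count as the program above
const int kThreads = 10;  // same thread count as the program above

// Same computation as before: out[i] = i * i, plus a running maximum.
__global__ void squares(int* out, int* maxOut, int n) {
    int step = blockDim.x * gridDim.x;
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += step) {
        out[i] = i * i;
        atomicMax(maxOut, out[i]);
    }
}

// Host-side reference for the largest square among 0 .. n-1.
int expectedMax(int n) { return n > 0 ? (n - 1) * (n - 1) : 0; }

int main() {
    int hostOut[kNumbers];
    int hostMax = 0;
    int* devOut = NULL;
    int* devMax = NULL;

    CUDA_CHECK(cudaMalloc((void**)&devOut, kNumbers * sizeof(int)));
    CUDA_CHECK(cudaMalloc((void**)&devMax, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(devMax, &hostMax, sizeof(int), cudaMemcpyHostToDevice));

    squares<<<1, kThreads>>>(devOut, devMax, kNumbers);
    CUDA_CHECK(cudaGetLastError());       // reports launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(&hostMax, devMax, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaMemcpy(hostOut, devOut, kNumbers * sizeof(int),
                          cudaMemcpyDeviceToHost));

    // Compare every element and the maximum against the host reference.
    int mismatches = 0;
    for (int i = 0; i < kNumbers; ++i) {
        if (hostOut[i] != i * i) ++mismatches;
    }
    if (hostMax != expectedMax(kNumbers)) ++mismatches;
    printf("%s (%d mismatches)\n", mismatches == 0 ? "PASSED" : "FAILED", mismatches);

    CUDA_CHECK(cudaFree(devOut));
    CUDA_CHECK(cudaFree(devMax));
    return mismatches == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

An error message from one of the checks points at the driver or runtime setup, whereas a FAILED line with no reported error means the card returned wrong data, which fits the hardware explanation better.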