cuda.jit matrix multiplication crashes

Question

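I am trying to write a cuda.jit matrix multiplication with an upper limit on the number of thread blocks: it can only be one. I also know that my multiplication has the form X * Xtranspose.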

import os
import timeit

import numpy as np
from numba import cuda


def matmul_gpu(X, Y):
    # Allocate the output matrix in GPU memory using cuda.to_device
    #
    # invoke the dot kernel with 1 threadBlock with 1024 threads
    #
    # copy the output matrix from GPU to cpu using copy_to_host()
    gpu_mat1 = cuda.to_device(X)
    gpu_mat2 = cuda.to_device(Y)
    res = np.zeros(shape=(X.shape[0], Y.shape[1]), dtype=np.float32)
    gpu_mult_res = cuda.to_device(res)
    threads_per_block = 1024
    blocks_per_grid = 1
    matmul_kernel[blocks_per_grid, threads_per_block](
        gpu_mat1, gpu_mat2, gpu_mult_res)
    mult_res = gpu_mult_res.copy_to_host()
    return mult_res


@cuda.jit
def matmul_kernel(A, B, C):
    num_of_threads = cuda.gridsize(1)
    tid = cuda.grid(1)
    rows_num = A.shape[0]
    cols_num = A.shape[1]
    step = int(np.math.ceil(num_of_threads / cols_num))
    row = int(np.math.floor(tid / cols_num))
    col = int(tid % cols_num)
    for row_start_idx in range(0, rows_num, step):
        if row_start_idx + row < rows_num and col < cols_num:
            C[row_start_idx + row, col] += A[row_start_idx + row, tid] * B[tid, col]

It crashes for matrices of size 128,256 or 256,128, and the traceback shows these errors in this order.

...
 Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
...
Call to cuMemFree results in UNKNOWN_CUDA_ERROR

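It works for very large sizes such as 1024,2048 and 2048,1024, and it works well for inputs whose dimensions are identical, but with differing dimensions it sometimes raises the errors above. For nearly equal dimensions it almost never throws an error, except for 256*256, which I just noticed, so the problem should be related to that.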

Code to help with debugging:

# this is the comparison function - keep it as it is, don't change X or Y.
def matmul_comparison():
    X = np.random.randn(1000, 1024)
    Y = np.random.randn(1024, 1000)

    def timer(f):
        return min(timeit.Timer(lambda: f(X, Y)).repeat(3, 5))

    # print('Python:', timer(matmul_trivial)) we will not consider this since it takes infinite time :)
    #print('Numpy:', timer(np.matmul))
    #print('Numba:', timer(matmul_numba))
    print('CUDA:', timer(matmul_gpu))


if __name__ == '__main__':
    os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so'
    os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda-9.0/nvvm/libdevice/'
    matmul_comparison()

Answer 1

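Some general comments:

I would not claim that anything works unless its numerical correctness has been verified.
I think error checking is good practice. Even with numba Python CUDA code, the cuda-memcheck tool can check for various errors. Your code throws errors even for the sizes that you say work correctly.
Naive matrix multiplication has a fairly typical form and is covered in many places, such as the CUDA programming guide. If I were working on this, I would use that as a starting point wherever possible.
From a performance standpoint, arbitrarily restricting CUDA code to run in a single thread block of 1024 threads is a bad idea. I can't imagine why you would want to do that.
Nevertheless, if I wanted to write a CUDA algorithm that works with an arbitrary grid arrangement, the canonical technique would be a grid-stride loop.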
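To make the grid-stride idea concrete, here is a minimal host-side sketch in plain Python (no GPU involved). It only replays the index arithmetic: each simulated thread starts at its own id and advances by the total thread count, and a flat index is then split into a row and a column. The function names `grid_stride_assignments` and `flat_to_row_col` are made up for this illustration and are not part of numba.

```python
def grid_stride_assignments(total_items, num_threads):
    """For each simulated thread id, list the flat indices that thread
    would visit in a grid-stride loop (start at tid, step by num_threads)."""
    return [list(range(tid, total_items, num_threads))
            for tid in range(num_threads)]


def flat_to_row_col(flat_index, cols_num):
    """Split a flat output index into (row, col) for a row-major matrix."""
    row = flat_index // cols_num
    col = flat_index - row * cols_num
    return row, col


if __name__ == "__main__":
    # A 3 x 5 output matrix handled by only 4 simulated threads.
    rows_num, cols_num, num_threads = 3, 5, 4
    work = grid_stride_assignments(rows_num * cols_num, num_threads)
    for tid, indices in enumerate(work):
        cells = [flat_to_row_col(i, cols_num) for i in indices]
        print("thread", tid, "->", cells)
```

Because the strides never overlap, every output cell is visited by exactly one thread, whatever the ratio between the matrix size and the thread count.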

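Regarding your code, a few issues stand out immediately:

1. For canonical matrix multiplication, I would normally expect the computation extents to be derived from the result (C) matrix rather than from the A matrix. If you restrict yourself to the X*Xt case, then I suppose you can use A. In the general case you cannot.
2. It is evident to me that you have indexing problems. I am not going to try to sort them all out, or even identify them all, but here is one of them. Given your choice of grid size, the tid variable ranges over 0..1023, and that cannot be correct for this indexing pattern: B[tid, col] (except when B has 1024 rows).
3. It looks to me as though multiple threads may write to the same output location in the C matrix. CUDA does not sort this out for you. You should not expect multiple threads writing to the same output location to work correctly unless you have taken steps to make it so, via atomics or a classical parallel reduction. I don't want to introduce either method into this problem, so I consider the basic approach troublesome.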
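The indexing concern can be checked without a GPU. The sketch below replays the index arithmetic of the kernel in the question on the host and counts accesses that would fall outside A, B, or C. It assumes one block of 1024 threads, as in the question; `count_out_of_bounds` is a hypothetical helper written for this illustration, not a numba facility.

```python
import math


def count_out_of_bounds(a_shape, b_shape, num_threads=1024):
    """Replay the question kernel's indexing and count accesses that
    would land outside A, B, or C (C has shape A.rows x B.cols)."""
    rows_num, cols_num = a_shape
    c_cols = b_shape[1]
    step = math.ceil(num_threads / cols_num)
    bad = {"A": 0, "B": 0, "C": 0}
    for tid in range(num_threads):
        row = tid // cols_num
        col = tid % cols_num
        for row_start_idx in range(0, rows_num, step):
            # Same guard as the kernel in the question.
            if row_start_idx + row < rows_num and col < cols_num:
                if tid >= a_shape[1]:                       # A[r, tid]
                    bad["A"] += 1
                if tid >= b_shape[0] or col >= b_shape[1]:  # B[tid, col]
                    bad["B"] += 1
                if col >= c_cols:                           # C[r, col]
                    bad["C"] += 1
    return bad


if __name__ == "__main__":
    # The failing case from the question: X is 128 x 256, Y is 256 x 128.
    print(count_out_of_bounds((128, 256), (256, 128)))
    # A case where every index happens to stay in range.
    print(count_out_of_bounds((1024, 1024), (1024, 1024)))
```

For the 128 x 256 input all three counters are nonzero, which is consistent with the memory errors reported in the question, while the 1024 x 1024 input stays in range only because the thread count happens to match the matrix dimensions.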

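There may be other problems as well. But because of consideration 3 above, I would rather not try to patch your code; instead I would start from canonical naive matrix multiplication and then use a grid-stride loop.

Here is an example incorporating these ideas: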

$ cat t59.py
import numpy as np
from numba import cuda,jit


@cuda.jit
def matmul_kernel(A, B, C):
    num_of_threads = cuda.gridsize(1)
    tid = cuda.grid(1)
    rows_num = C.shape[0]
    cols_num = C.shape[1]
    idx_range = A.shape[1]
    for mid in range(tid, rows_num*cols_num, num_of_threads):
        row = mid // cols_num
        col = mid - (row*cols_num)
        my_sum = 0.0
        for idx in range(0, idx_range):
            my_sum += A[row, idx] * B[idx, col]
        C[row, col] = my_sum

def matmul_gpu(X, Y):
    # Allocate the output matrix in GPU memory using cuda.to_device
    #
    # invoke the dot kernel with 1 threadBlock with 1024 threads
    #
    # copy the output matrix from GPU to cpu using copy_to_host()
    gpu_mat1 = cuda.to_device(X)
    gpu_mat2 = cuda.to_device(Y)
    res = np.zeros(shape=(X.shape[0], Y.shape[1]), dtype=np.float32)
    gpu_mult_res = cuda.to_device(res)
    threads_per_block = 1024
    blocks_per_grid = 1
    matmul_kernel[blocks_per_grid, threads_per_block](
        gpu_mat1, gpu_mat2, gpu_mult_res)
    mult_res = gpu_mult_res.copy_to_host()
    return mult_res

wA = 256
hA = 128
wB = hA
hB = wA


mA = np.ones(shape=(hA,wA), dtype=np.float32)
mB = np.ones(shape=(hB,wB), dtype=np.float32)
mC = matmul_gpu(mA,mB)
print(mC)
$ cuda-memcheck python t59.py
========= CUDA-MEMCHECK
[[ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]
 ...,
 [ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]]
========= ERROR SUMMARY: 0 errors
$
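The run above uses all-ones inputs, so every product term is 1 and a misplaced index would go unnoticed. A stricter check, in the spirit of the first general comment, compares the result against NumPy on random inputs. The following is a minimal sketch; `check_matmul` and its tolerances are assumptions made for this illustration, and the function accepts any callable with the signature `f(X, Y)`.

```python
import numpy as np


def check_matmul(matmul_fn, shapes, rtol=1e-3, atol=1e-3, seed=0):
    """Compare matmul_fn(X, Y) with a float64 NumPy reference for each
    (m, k, n) in shapes. Returns a dict mapping shape -> bool."""
    rng = np.random.default_rng(seed)
    results = {}
    for (m, k, n) in shapes:
        X = rng.standard_normal((m, k)).astype(np.float32)
        Y = rng.standard_normal((k, n)).astype(np.float32)
        # Reference computed in float64 to limit rounding error.
        expected = X.astype(np.float64) @ Y.astype(np.float64)
        got = matmul_fn(X, Y)
        results[(m, k, n)] = bool(
            np.allclose(got, expected, rtol=rtol, atol=atol))
    return results


if __name__ == "__main__":
    # np.matmul stands in for the function under test here.
    print(check_matmul(np.matmul, [(128, 256, 128), (256, 128, 256)]))
```

On a machine with a CUDA device, passing `matmul_gpu` from the listing above in place of `np.matmul` would exercise the kernel on the non-square shapes that crashed in the question.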
