CUDA Thrust Kernels 可以在多个流上并行运行吗?

问题描述 投票:0回答:1

我正在尝试在不同的 CUDA 流上并行地在两个不同的设备向量上启动

thrust::fill
。然而,当我查看 NSight Systems 中的内核启动时,它们似乎是序列化的。这是我正在使用的基本示例。

#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/execution_policy.h>

#define gpuErrchk(ans)                        \
    {                                         \
        gpuAssert((ans), __FILE__, __LINE__); \
    }
inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
    if(code != cudaSuccess)
        {
            fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
            if(abort) exit(code);
        }
}

int main(void)
{
    cudaStream_t stream1, stream2;
    gpuErrchk(cudaStreamCreate(&stream1));
    gpuErrchk(cudaStreamCreate(&stream2));

    const size_t size = 10000000;

    int* d_test1_ptr;
    int* d_test2_ptr;
    gpuErrchk(cudaMalloc((void**)&d_test1_ptr, size * sizeof(int)));
    gpuErrchk(cudaMalloc((void**)&d_test2_ptr, size * sizeof(int)));

    thrust::device_ptr<int> d_test1(d_test1_ptr);
    thrust::device_ptr<int> d_test2(d_test2_ptr);

    for(int i = 0; i < 100; i++)
        {
            thrust::fill(thrust::cuda::par.on(stream1), d_test1, d_test1 + size, 2);
            thrust::fill(thrust::cuda::par.on(stream2), d_test2, d_test2 + size, 2);
        }

    gpuErrchk(cudaStreamSynchronize(stream1));
    gpuErrchk(cudaStreamSynchronize(stream2));

    gpuErrchk(cudaFree(d_test1_ptr));
    gpuErrchk(cudaFree(d_test2_ptr));

    gpuErrchk(cudaStreamDestroy(stream1));
    gpuErrchk(cudaStreamDestroy(stream2));

    std::cout << "Completed execution of dummy functions on different streams." << std::endl;

    return 0;
}

这是 NSight 的结果。看起来好像一直有

cudaStreamSynchronize()
的呼叫,但我不知道为什么。

NSight Image

我已经查看了获取 CUDA Thrust 以使用您选择的 CUDA 流 看来它们的发布是并行的。我什至尝试使用他们的确切代码,但内核仍在序列化。

如果您需要更多信息,请告诉我。

asynchronous concurrency cuda gpu thrust
1个回答
0
投票

Thrust 在过去 5 年里经历了一些重大变化。 其中一些已记录在此处

最接近的问题是,您对

thrust::fill
的调用也会发出
cudaStreamSynchronize()
,这可以在您的图形分析器时间线输出/附件中看到。 您还可以使用
nsys profile --stats=true ...
运行代码,CLI 输出将指示对
cudaStreamSynchronize()
的 202 次调用。 其中两个用于代码中的显式调用,另外 200 个对应于每个推力算法的启动。

如果我们按照评论中的建议使用执行策略的

nosync
变体“修复”该问题,我们可以在分析器中看到少量重叠:

# cat t234.cu
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/execution_policy.h>

#define gpuErrchk(ans)                        \
    {                                         \
        gpuAssert((ans), __FILE__, __LINE__); \
    }
inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
    if(code != cudaSuccess)
        {
            fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
            if(abort) exit(code);
        }
}

int main(void)
{
    cudaStream_t stream1, stream2;
    gpuErrchk(cudaStreamCreate(&stream1));
    gpuErrchk(cudaStreamCreate(&stream2));

    const size_t size = 10000000;

    int* d_test1_ptr;
    int* d_test2_ptr;
    gpuErrchk(cudaMalloc((void**)&d_test1_ptr, size * sizeof(int)));
    gpuErrchk(cudaMalloc((void**)&d_test2_ptr, size * sizeof(int)));

    thrust::device_ptr<int> d_test1(d_test1_ptr);
    thrust::device_ptr<int> d_test2(d_test2_ptr);

    for(int i = 0; i < 100; i++)
        {
            thrust::fill(thrust::cuda::par_nosync.on(stream1), d_test1, d_test1 + size, 2);
            thrust::fill(thrust::cuda::par_nosync.on(stream2), d_test2, d_test2 + size, 2);
        }

    gpuErrchk(cudaStreamSynchronize(stream1));
    gpuErrchk(cudaStreamSynchronize(stream2));

    gpuErrchk(cudaFree(d_test1_ptr));
    gpuErrchk(cudaFree(d_test2_ptr));

    gpuErrchk(cudaStreamDestroy(stream1));
    gpuErrchk(cudaStreamDestroy(stream2));

    std::cout << "Completed execution of dummy functions on different streams." << std::endl;

    return 0;
}
root@hpe-dl385-gen10-005:~/bobc# nvcc -o t234 t234.cu
root@hpe-dl385-gen10-005:~/bobc# nsys nvprof --print-gpu-trace ./t234
WARNING: t234 and any of its children processes will be profiled.

Completed execution of dummy functions on different streams.
Generating '/tmp/nsys-report-7d14.qdstrm'
[1/3] [========================100%] report59.nsys-rep
[2/3] [========================100%] report59.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

 Start (ns)   Duration (ns)  CorrId   GrdX   GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd     Device      Ctx  Strm                                                  Name
 -----------  -------------  ------  ------  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------  ---  ----  ----------------------------------------------------------------------------------------------------
 677,159,755         44,577     135  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 677,201,836        155,392     149  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 677,347,884        163,072     163  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 677,504,076        165,632     177  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 677,661,740        165,760     191  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
//////SNIP///////////////////
 700,211,075        168,096   2,207  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 700,368,355        167,424   2,221  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 700,525,379        169,248   2,235  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 700,683,012        167,391   2,249  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 700,841,348        168,832   2,263  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 700,998,852        166,368   2,277  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 701,155,044        169,248   2,291  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 701,312,356        167,904   2,305  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 701,471,268        166,912   2,319  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 701,627,076        172,896   2,333  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 701,784,772        171,105   2,347  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 701,942,789        166,144   2,361  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 702,100,421        167,936   2,375  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 702,258,565        168,800   2,389  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 702,415,269        169,056   2,403  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 702,572,358        167,200   2,417  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 702,729,701        167,649   2,431  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 702,887,366        168,992   2,445  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 703,044,326        166,208   2,459  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 703,202,214        167,264   2,473  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 703,359,750        167,424   2,487  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 703,516,327        168,511   2,501  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 703,672,743        169,888   2,515  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 703,831,047        169,664   2,529  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 703,988,775        167,744   2,543  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 704,145,543        166,816   2,557  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 704,303,943        164,608   2,571  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 704,461,127        164,993   2,585  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 704,618,663        167,584   2,599  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 704,775,592        167,008   2,613  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 704,932,712        171,040   2,627  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 705,089,928        166,592   2,641  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 705,248,744        169,024   2,655  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 705,405,544        166,240   2,669  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 705,563,016        168,321   2,683  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 705,720,136        166,785   2,697  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 705,877,736        167,201   2,711  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 706,034,121        170,528   2,725  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 706,192,201        168,896   2,739  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 706,350,569        169,153   2,753  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 706,507,722        168,191   2,767  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 706,666,122        169,120   2,781  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 706,824,394        167,040   2,795  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 706,981,066        171,104   2,809  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 707,139,466        170,432   2,823  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 707,297,194        165,248   2,837  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 707,454,475        166,464   2,851  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 707,612,715        167,327   2,865  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 707,769,835        167,232   2,879  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 707,927,755        166,272   2,893  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 708,085,387        167,488   2,907  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
 708,242,027        164,672   2,921  19,532     1     1   256     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…

Generated:
    /root/bobc/report59.nsys-rep
    /root/bobc/report59.sqlite
#

我已经削减了一些探查器输出,但是查看最后几个内核调用,我们发现倒数第二个调用在时间线上的 708,085,387 ns 处开始,持续时间为 167,488ns,这意味着该内核在时间线上为 708,252,875ns,这是在下一个内核在时间线上的 708,242,072ns 开始之后,因此大约有 10us 的重叠。

您看不到更多重叠的原因之一是因为每个内核在其大部分执行持续时间内都可以填充 GPU,这是由于内核启动的网格尺寸较大、输入向量长度较大所致(10,000,000)。 Thrust 倾向于采用将每个元素关联到一个线程的

for_each
策略进行并行化,因此 10,000,000 个线程足以填充任何当前的 GPU,而不会为另一个内核的执行留下“空间”。 (探查器输出显示每个推力算法内核使用近 20,000 个块,每个块有 256 个线程。这对我来说意味着每个线程正在处理 2 个元素。)这是人们尝试见证内核并发性时的常见问题。

您可以尝试通过减少线程(即减少每个向量的元素)来缩小有效的内核启动,看看是否可以看到更多的重叠。 然后,您将开始解决这样的问题:最终内核启动变得如此之短,以至于内核启动延迟(大约 10us)消除了很多见证内核重叠的机会。

要点是,很难看到两个这样的内核之间有很多内核重叠,而这两个内核几乎不为每个元素做任何工作。

© www.soinside.com 2019 - 2024. All rights reserved.