I am trying to obtain profiling data for cuFFT library calls such as plan and exec. I am using nvprof (the command-line profiling tool) with the --print-api-trace option. It prints the timing of every API except the cuFFT APIs. Do I need to change any flag to get the cuFFT profiling data, or do I need to use events and measure it myself?
According to the nvprof documentation, for api-trace-mode:
API trace mode shows the timeline of all CUDA runtime and driver API calls
cuFFT is neither a CUDA runtime API nor a CUDA driver API. It is a library of FFT routines, whose documentation is here.
Of course, you can still use nvprof, the command-line profiler, or the Visual Profiler to collect data about how cuFFT uses the GPU.
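If you specifically want host-visible timings for the plan and exec calls without relying on a profiler, you can measure them yourself. Below is a minimal sketch, assuming a 1D single-precision complex-to-complex transform; the size N is arbitrary and error checking is omitted. Plan creation is mostly CPU work, so a host timer is used for it, while the asynchronous exec call is bracketed with CUDA events:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int N = 1 << 20;                  // transform size, chosen only for illustration
    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * N);
    cudaMemset(d_data, 0, sizeof(cufftComplex) * N);

    // Plan creation runs largely on the host, so a host timer is sufficient here.
    auto t0 = std::chrono::steady_clock::now();
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    auto t1 = std::chrono::steady_clock::now();
    double planMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // cufftExec* launches its kernels asynchronously, so bracket it with CUDA
    // events and synchronize on the stop event before reading the elapsed time.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float execMs = 0.0f;
    cudaEventElapsedTime(&execMs, start, stop);

    printf("plan: %.3f ms, exec: %.3f ms\n", planMs, execMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}

Compile with nvcc and link against cuFFT (-lcufft).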
NVIDIA's Nsight Systems (nsys) and Nsight Compute (ncu) are the newer tools you will want to look into. Both offer GUI and CLI versions.
Nsight Systems (nsys) allows system-wide benchmarking with relatively low overhead, enabling developers to view performance statistics and identify bottlenecks. The statistics include memory transfers between host and device, kernel runtimes, and stream and device synchronization times.
Run this command to start a profiling session:
nsys profile --output <report-output-file> --gpu-metrics-devices=all <your-executable(s)>
After the profiling session ends, a .nsys-rep file will be generated for analysis. You can import it into the GUI, or run the following command:
nsys stats <your-.nsys-rep-file>
Here is a snippet of the output:
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ----------- ------------ ----------------------------------------------------------------------------------------------------
70.1 4,178,454,702 72 58,034,093.1 57,898,957.5 1,244,084 265,042,382 52,417,885.5 fft_step_2_2(Complex *, Complex *, int, int, bool)
17.3 1,033,805,010 72 14,358,402.9 13,011,089.0 12,910,621 40,751,504 5,340,200.6 calculate_w(Complex *, int)
3.8 228,587,479 72 3,174,826.1 1,063,724.5 1,056,748 152,983,858 17,903,830.7 copy_first_half_2(Complex *, int)
2.0 118,304,890 72 1,643,123.5 1,553,585.0 1,251,029 2,656,751 414,602.5 fft_step_2(Complex *, Complex *, int, int, bool)
1.9 113,549,795 2 56,774,897.5 56,774,897.5 38,970,469 74,579,326 25,179,264.3 copyDoubleToComplex(double *, Complex *, int)
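The cuFFT plan and exec calls themselves are host-side library calls rather than CUDA runtime or driver API calls, so they do not appear as rows in a summary like the one above; only the kernels they launch do. If you want to see where those phases fall on the Nsight Systems timeline, one option is to wrap them in NVTX ranges, which nsys can record alongside the CUDA trace. A rough sketch, assuming a C2C transform; the function and range names are placeholders, and depending on your CUDA toolkit the header may be nvToolsExt.h (linked with -lnvToolsExt) rather than nvtx3/nvToolsExt.h:

#include <cuda_runtime.h>
#include <cufft.h>
#include <nvtx3/nvToolsExt.h>   // older toolkits: <nvToolsExt.h>, link with -lnvToolsExt

void forward_fft(cufftComplex *d_data, int n) {
    nvtxRangePushA("cufft_plan");            // appears as a named range on the nsys timeline
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    nvtxRangePop();

    nvtxRangePushA("cufft_exec");
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();                 // keep the launched kernels inside the range
    nvtxRangePop();

    cufftDestroy(plan);
}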
Compared to Nsight Systems, Nsight Compute (ncu) benchmarks at a much lower level: cache hits/misses, memory access statistics, block utilization, and so on. Consequently, it incurs substantial overhead and each profiling session runs more slowly.
To start a profiling session:
ncu -o <report-output-file> <your-executable>
After the session ends, a .ncu-rep file will be generated. You can import it into the GUI, or extract the statistics with ncu -i:
ncu -i <your-.ncu-rep-file>
Example output:
complexMultiplyKernel(double2 *, double2 *, int) (256, 1, 1)x(32, 1, 1), Context 1, Stream 14, Device 0, CC 7.5
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 6.73
SM Frequency Ghz 1.36
Elapsed Cycles cycle 2,107,144
Memory Throughput % 80.69
DRAM Throughput % 80.69
Duration ms 1.54
L1/TEX Cache Throughput % 14.76
L2 Cache Throughput % 28.73
SM Active Cycles cycle 2,043,466.81
Compute (SM) Throughput % 23.53
----------------------- ----------- ------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing DRAM in the Memory Workload Analysis section.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 32
Function Cache Configuration CachePreferNone
Grid Size 256
Registers Per Thread register/thread 28
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
# SMs SM 68
Threads thread 8,192
Uses Green Context 0
Waves Per SM 0.24
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 64
Block Limit Shared Mem block 16
Block Limit Warps block 32
Theoretical Active Warps per SM warp 16
Theoretical Occupancy % 50
Achieved Occupancy % 11.83
Achieved Active Warps Per SM warp 3.78
------------------------------- ----------- ------------
OPT Est. Local Speedup: 76.35%
The difference between calculated theoretical (50.0%) and measured achieved occupancy (11.8%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
----- --------------------------------------------------------------------------------------------------------------
OPT Est. Local Speedup: 50%
The 4.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the
hardware maximum of 8. This kernel's theoretical occupancy (50.0%) is limited by the number of blocks that
can fit on the SM. This kernel's theoretical occupancy (50.0%) is limited by the required amount of shared
memory.
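The occupancy hints above follow from the launch configuration: a block size of 32 caps this kernel at 50% theoretical occupancy. If you want to sanity-check how a different block size would affect your own kernels before re-profiling, the CUDA occupancy API can be queried at runtime. A small sketch; myKernel and the candidate block sizes are placeholders for illustration:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {      // placeholder kernel for illustration
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int sizes[] = {32, 128, 256};
    for (int blockSize : sizes) {
        int blocksPerSM = 0;
        // Maximum resident blocks per SM for this kernel at the given block size
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);
        float occupancy = 100.0f * blocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor;
        printf("block size %4d -> %2d blocks/SM, theoretical occupancy %.1f%%\n",
               blockSize, blocksPerSM, occupancy);
    }
    return 0;
}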
It worked. Instead of using nvprof, I used the CUDA_PROFILE environment variable.