I am trying to obtain profiling data for cuFFT library calls such as plan and exec. I am using nvprof (the command-line profiling tool) with the --print-api-trace option. It prints the timing of every API except the cuFFT APIs. Do I need to change any flag to get the cuFFT profiling data, or do I need to use events and measure it myself?
According to the nvprof documentation, for api-trace-mode:
API trace mode shows the timeline of all CUDA runtime and driver API calls
cuFFT is neither a CUDA runtime API nor a CUDA driver API. It is a library of FFT routines, whose documentation is here.
Of course, you can still use nvprof, the command-line profiler, or the Visual Profiler to collect data about how cuFFT uses the GPU.
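If you specifically want host-visible timings for the plan and exec calls without relying on a profiler, you can measure them yourself. Below is a minimal sketch, assuming a 1D single-precision complex-to-complex transform; the size N is arbitrary and error checking is omitted. Plan creation is mostly CPU work, so a host timer is used for it, while the asynchronous exec call is bracketed with CUDA events:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int N = 1 << 20;                  // transform size, chosen only for illustration
    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * N);
    cudaMemset(d_data, 0, sizeof(cufftComplex) * N);

    // Plan creation runs largely on the host, so a host timer is sufficient here.
    auto t0 = std::chrono::steady_clock::now();
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    auto t1 = std::chrono::steady_clock::now();
    double planMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // cufftExec* launches its kernels asynchronously, so bracket it with CUDA
    // events and synchronize on the stop event before reading the elapsed time.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float execMs = 0.0f;
    cudaEventElapsedTime(&execMs, start, stop);

    printf("plan: %.3f ms, exec: %.3f ms\n", planMs, execMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}

Compile with nvcc and link against cuFFT (-lcufft).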
NVIDIA's Nsight Systems (nsys) and Nsight Compute (ncu) are the newer tools you will want to look into. Both offer GUI and CLI versions.
Nsight Systems (nsys) allows system-wide benchmarking with relatively low overhead, enabling developers to view performance statistics and identify bottlenecks. The statistics include memory transfers between host and device, kernel runtimes, and stream and device synchronization times.
Run this command to start a profiling session:
nsys profile --output <report-output-file> --gpu-metrics-devices=all <your-executable(s)>
After the profiling session ends, a .nsys-rep file will be generated for analysis. You can import it into the GUI, or run the following command:
nsys stats <your-.nsys-rep-file>
Here is a snippet of the output:
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ----------- ------------ ----------------------------------------------------------------------------------------------------
70.1 4,178,454,702 72 58,034,093.1 57,898,957.5 1,244,084 265,042,382 52,417,885.5 fft_step_2_2(Complex *, Complex *, int, int, bool)
17.3 1,033,805,010 72 14,358,402.9 13,011,089.0 12,910,621 40,751,504 5,340,200.6 calculate_w(Complex *, int)
3.8 228,587,479 72 3,174,826.1 1,063,724.5 1,056,748 152,983,858 17,903,830.7 copy_first_half_2(Complex *, int)
2.0 118,304,890 72 1,643,123.5 1,553,585.0 1,251,029 2,656,751 414,602.5 fft_step_2(Complex *, Complex *, int, int, bool)
1.9 113,549,795 2 56,774,897.5 56,774,897.5 38,970,469 74,579,326 25,179,264.3 copyDoubleToComplex(double *, Complex *, int)
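The cuFFT plan and exec calls themselves are host-side library calls rather than CUDA runtime or driver API calls, so they do not appear as rows in a summary like the one above; only the kernels they launch do. If you want to see where those phases fall on the Nsight Systems timeline, one option is to wrap them in NVTX ranges, which nsys can record alongside the CUDA trace. A rough sketch, assuming a C2C transform; the function and range names are placeholders, and depending on your CUDA toolkit the header may be nvToolsExt.h (linked with -lnvToolsExt) rather than nvtx3/nvToolsExt.h:

#include <cuda_runtime.h>
#include <cufft.h>
#include <nvtx3/nvToolsExt.h>   // older toolkits: <nvToolsExt.h>, link with -lnvToolsExt

void forward_fft(cufftComplex *d_data, int n) {
    nvtxRangePushA("cufft_plan");            // appears as a named range on the nsys timeline
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    nvtxRangePop();

    nvtxRangePushA("cufft_exec");
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();                 // keep the launched kernels inside the range
    nvtxRangePop();

    cufftDestroy(plan);
}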
Compared to Nsight Systems, Nsight Compute (ncu) benchmarks at a much lower level: cache hits/misses, memory access statistics, block utilization, and so on. Consequently, it incurs substantial overhead and each profiling session runs more slowly.
To start a profiling session:
ncu -o <report-output-file> <your-executable>
After the session ends, a .ncu-rep file will be generated. You can import it into the GUI, or extract the statistics with ncu -i:
ncu -i <your-.ncu-rep-file>
Example output:
complexMultiplyKernel(double2 *, double2 *, int) (256, 1, 1)x(32, 1, 1), Context 1, Stream 14, Device 0, CC 7.5
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 6.73
SM Frequency Ghz 1.36
Elapsed Cycles cycle 2,107,144
Memory Throughput % 80.69
DRAM Throughput % 80.69
Duration ms 1.54
L1/TEX Cache Throughput % 14.76
L2 Cache Throughput % 28.73
SM Active Cycles cycle 2,043,466.81
Compute (SM) Throughput % 23.53
----------------------- ----------- ------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing DRAM in the Memory Workload Analysis section.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 32
Function Cache Configuration CachePreferNone
Grid Size 256
Registers Per Thread register/thread 28
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
# SMs SM 68
Threads thread 8,192
Uses Green Context 0
Waves Per SM 0.24
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 64
Block Limit Shared Mem block 16
Block Limit Warps block 32
Theoretical Active Warps per SM warp 16
Theoretical Occupancy % 50
Achieved Occupancy % 11.83
Achieved Active Warps Per SM warp 3.78
------------------------------- ----------- ------------
OPT Est. Local Speedup: 76.35%
The difference between calculated theoretical (50.0%) and measured achieved occupancy (11.8%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
----- --------------------------------------------------------------------------------------------------------------
OPT Est. Local Speedup: 50%
The 4.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the
hardware maximum of 8. This kernel's theoretical occupancy (50.0%) is limited by the number of blocks that
can fit on the SM. This kernel's theoretical occupancy (50.0%) is limited by the required amount of shared
memory.
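The occupancy hints above follow from the launch configuration: a block size of 32 caps this kernel at 50% theoretical occupancy. If you want to sanity-check how a different block size would affect your own kernels before re-profiling, the CUDA occupancy API can be queried at runtime. A small sketch; myKernel and the candidate block sizes are placeholders for illustration:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {      // placeholder kernel for illustration
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int sizes[] = {32, 128, 256};
    for (int blockSize : sizes) {
        int blocksPerSM = 0;
        // Maximum resident blocks per SM for this kernel at the given block size
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);
        float occupancy = 100.0f * blocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor;
        printf("block size %4d -> %2d blocks/SM, theoretical occupancy %.1f%%\n",
               blockSize, blocksPerSM, occupancy);
    }
    return 0;
}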
It worked. Instead of using nvprof, I used the CUDA_PROFILE environment variable.