在下面的代码片段中,nvcc (CUDA 12.5)“有帮助地”重新排序了时钟语句。这会导致计时偏差 26 倍。
#include <cuda.h>
#include <stdio.h>
__device__ int dedupe(int a) {
constexpr auto all = -1u;
const auto start = clock64();
const auto dupemask = __match_any_sync(all, a);
const auto leader = __clz(dupemask);
const auto end = clock64();
const auto time = int(end - start);
printf("tid: %i, dupemask: $%x, leader: %i, time: %i\n", threadIdx.x, dupemask, leader, time);
return leader;
}
__global__ void dostuff() {
const auto tid = threadIdx.x;
const auto leader = dedupe(tid);
}
int main() {
dostuff<<<1, 32>>>();
return cudaDeviceSynchronize();
}```
The `dedupe` function compiles to:
CS2R R2, SR_CLOCKLO // 开始 = 时钟 S2R R8, SR_TID.X // tid = threadIdx.x MATCH.ANY R9, R8 // __match_any_sync CS2R R4, SR_CLOCKLO // 结束 = 时钟 <<-- reordered FLO.U32 R10, R9 // __clz ...
Giving me a time of 15 cycles, however the real latency of this snippet is closer to 400 cycles.
**How do I force nvcc to never reorder the call to clock?**
*Note that replacing the `clock64()` with a volatile asm statement does not work:*
设备 uint64_t myclock() { uint64_t 结果; asm 易失性(“mov.u64 %0,%%clock64;”:“=l”(结果)::“内存”); 返回结果; }
技巧是将时钟分配和之间的所有分配标记为
volatile
。
...
const auto volatile start = clock64();
const auto volatile dupemask = __match_any_sync(all, a);
const auto volatile leader = __clz(dupemask);
const auto volatile end = clock64();
...
CS2R R2, SR_CLOCKLO
S2R R8, SR_TID.X
MATCH.ANY R9, R8
FLO.U32 R10, R9
IADD3 R10, -R10, 0x1f, RZ
CS2R R4, SR_CLOCKLO
这会产生以下程序集,并给出该片段的正确时间:396 个时钟周期。
请注意,仅将
clock64()
作业标记为 volatile
是不够的,有时有效,但并非总是如此。