如何阻止nvcc重新排序时钟指令？

Question

在下面的代码片段中，nvcc (CUDA 12.5)“有帮助地”重新排序了时钟语句。这会导致计时偏差 26 倍。

#include <cuda.h>
#include <stdio.h>

__device__ int dedupe(int a) {
  constexpr auto all = -1u;
  const auto start = clock64();
  const auto dupemask = __match_any_sync(all, a);
  const auto leader = __clz(dupemask);
  const auto end = clock64();
  const auto time = int(end - start);
  printf("tid: %i, dupemask: $%x, leader: %i, time: %i\n", threadIdx.x, dupemask, leader, time);
  return leader;
}

__global__ void dostuff() {
    const auto tid = threadIdx.x;
    const auto leader = dedupe(tid);
}

int main() {
  dostuff<<<1, 32>>>();
  return cudaDeviceSynchronize();
}``` 

The `dedupe` function compiles to:

CS2R R2, SR_CLOCKLO // 开始 = 时钟 S2R R8, SR_TID.X // tid = threadIdx.x MATCH.ANY R9, R8 // __match_any_sync CS2R R4, SR_CLOCKLO // 结束 = 时钟 <<-- reordered FLO.U32 R10, R9 // __clz ...


Giving me a time of 15 cycles, however the real latency of this snippet is closer to 400 cycles.  

**How do I force nvcc to never reorder the call to clock?**

*Note that replacing the `clock64()` with a volatile asm statement does not work:*

设备 uint64_t myclock() { uint64_t 结果； asm 易失性(“mov.u64 %0，%%clock64;”：“=l”(结果)::“内存”); 返回结果； }

Answer 1

技巧是将时钟分配和之间的所有分配标记为

volatile

。

  ...
  const auto volatile start = clock64();
  const auto volatile dupemask = __match_any_sync(all, a);
  const auto volatile leader = __clz(dupemask);
  const auto volatile end = clock64();
  ...

 CS2R R2, SR_CLOCKLO 
 S2R R8, SR_TID.X 
 MATCH.ANY R9, R8 
 FLO.U32 R10, R9 
 IADD3 R10, -R10, 0x1f, RZ 
 CS2R R4, SR_CLOCKLO

这会产生以下程序集，并给出该片段的正确时间：396 个时钟周期。

请注意，仅将

clock64()

作业标记为

volatile

是不够的，有时有效，但并非总是如此。

如何阻止nvcc重新排序时钟指令？

问题描述投票：0回答：1

1个回答

最新问题

如何阻止nvcc重新排序时钟指令？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1