如何阻止nvcc重新排序时钟指令?

问题描述 投票:0回答:1

在下面的代码片段中,nvcc (CUDA 12.5)“有帮助地”重新排序了时钟语句。这会导致计时偏差 26 倍。

#include <cuda.h>
#include <stdio.h>

__device__ int dedupe(int a) {
  constexpr auto all = -1u;
  const auto start = clock64();
  const auto dupemask = __match_any_sync(all, a);
  const auto leader = __clz(dupemask);
  const auto end = clock64();
  const auto time = int(end - start);
  printf("tid: %i, dupemask: $%x, leader: %i, time: %i\n", threadIdx.x, dupemask, leader, time);
  return leader;
}

__global__ void dostuff() {
    const auto tid = threadIdx.x;
    const auto leader = dedupe(tid);
}

int main() {
  dostuff<<<1, 32>>>();
  return cudaDeviceSynchronize();
}``` 

The `dedupe` function compiles to:

CS2R R2, SR_CLOCKLO // 开始 = 时钟 S2R R8, SR_TID.X // tid = threadIdx.x MATCH.ANY R9, R8 // __match_any_sync CS2R R4, SR_CLOCKLO // 结束 = 时钟 <<-- reordered FLO.U32 R10, R9 // __clz ...


Giving me a time of 15 cycles, however the real latency of this snippet is closer to 400 cycles.  

**How do I force nvcc to never reorder the call to clock?**

*Note that replacing the `clock64()` with a volatile asm statement does not work:*

设备 uint64_t myclock() { uint64_t 结果; asm 易失性(“mov.u64 %0,%%clock64;”:“=l”(结果)::“内存”); 返回结果; }

c++ cuda volatile nvcc
1个回答
0
投票

技巧是将时钟分配和之间的所有分配标记为

volatile

  ...
  const auto volatile start = clock64();
  const auto volatile dupemask = __match_any_sync(all, a);
  const auto volatile leader = __clz(dupemask);
  const auto volatile end = clock64();
  ...
 CS2R R2, SR_CLOCKLO 
 S2R R8, SR_TID.X 
 MATCH.ANY R9, R8 
 FLO.U32 R10, R9 
 IADD3 R10, -R10, 0x1f, RZ 
 CS2R R4, SR_CLOCKLO 

这会产生以下程序集,并给出该片段的正确时间:396 个时钟周期。

请注意,仅将

clock64()
作业标记为
volatile
是不够的,有时有效,但并非总是如此。

© www.soinside.com 2019 - 2024. All rights reserved.