Why don't I see more false sharing when different threads write to the same variable?

Question

I am trying to understand a simple example and wonder why I do not see more false sharing than perf c2c reports.

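In my example (a matrix multiplication), two places could cause false sharing (apart from constant reads):

Different threads in a thread pool decrement an atomic counter to find out which data they should work on.
Threads write their results to the same array at different offsets (with a stride of 4 bytes). Importantly, each array position is written exactly once, and no other thread needs that value.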
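The two access patterns above can be sketched as follows. This is a minimal illustration, not the thread pool from the question: `parallel_for` and `squares` are hypothetical names, and the counter is decremented with a compare-exchange loop (rather than a plain atomic decrement) so that it never drops below zero.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: calls task(i) exactly once for every i in [0, count).
// Work items are claimed dynamically by num_threads worker threads.
template <typename Task>
void parallel_for(std::size_t count, std::size_t num_threads, Task task) {
    assert(num_threads > 0);
    // Every worker modifies this one counter, so its cache line is truly shared.
    std::atomic<std::size_t> remaining{count};
    auto worker = [&remaining, &task] {
        std::size_t left = remaining.load(std::memory_order_relaxed);
        while (left != 0) {
            // Try to claim item left-1. On failure, left is refreshed and we retry.
            if (remaining.compare_exchange_weak(left, left - 1,
                                                std::memory_order_relaxed)) {
                task(left - 1);
                left = remaining.load(std::memory_order_relaxed);
            }
        }
    };
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();  // join makes all writes visible to the caller
}

// Each slot of the shared output array is written exactly once, by whichever
// thread claimed its index. Neighbouring 4-byte slots may share a cache line.
std::vector<float> squares(std::size_t n, std::size_t num_threads) {
    std::vector<float> out(n, -1.0f);
    parallel_for(n, num_threads, [&out](std::size_t i) {
        out[i] = static_cast<float>(i) * static_cast<float>(i);
    });
    return out;
}
```

Calling `squares(100, 4)` fills a 100-element array using four workers. The counter is the only location that several threads both read and write repeatedly; the output slots are write-once.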

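I expected cache-line contention because of the following resources:

The Linux kernel documentation, which states:

A global datum accessed (shared) by many CPUs.

In the concurrent accesses to the data, there is at least one write operation: write/write or write/read cases.

An SO thread on the topic.

Based on that statement, I would expect to see false sharing when different threads merely write to separate locations of the same cache line. The relevant code for the combined writes to the array is shown below (the full code was linked from the original post):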

#include <cstddef>
#include <utility>
#include <vector>

// Computes one element of C: the dot product of a row of A (contiguous)
// with a column of B (stride N).
void fma_f32_single(const float* __restrict__ aptr, const float* __restrict__ bptr,
                    size_t M, size_t N, size_t K, float* __restrict__ cptr) {
    float c0{0};
    for (size_t i = 0; i < K; ++i) {
        c0 += (*(aptr + i)) * (*(bptr + N * i));
    }
    *cptr = c0;
}

struct pthreadpool_context {
    const float* __restrict__ a;
    const float* __restrict__ b;
    float* __restrict__ c;
    size_t M;
    size_t N;
    size_t K;
    std::vector<std::pair<size_t, size_t>> indices;
};

// Work item i computes the output element at indices[i] = (row, col).
void work(void* ctx, size_t i) {
    const pthreadpool_context* context = static_cast<const pthreadpool_context*>(ctx);
    const auto [row, col] = context->indices[i];
    const float* aptr = context->a + row * context->K;
    const float* bptr = context->b + col;
    // Increasing col by one is just a four-byte difference on c.
    float* cptr = context->c + row * context->N + col;
    // Threads write their output to cptr.
    fma_f32_single(aptr, bptr, context->M, context->N, context->K, cptr);
}
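To make the "four-byte difference" concrete, the sketch below works out which cache line an output element lands on. It assumes 64-byte cache lines and 4-byte floats, which is typical for current x86-64 parts but is an assumption here; `kAssumedLineSize`, `line_of`, `line_of_output`, and `floats_per_line` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Check the target CPU before relying on it.
constexpr std::size_t kAssumedLineSize = 64;

// Index of the cache line that contains the byte at address addr.
constexpr std::uintptr_t line_of(std::uintptr_t addr) {
    return addr / kAssumedLineSize;
}

// Cache line that holds c[row * N + col], given the address of c[0].
constexpr std::uintptr_t line_of_output(std::uintptr_t c_base, std::size_t N,
                                        std::size_t row, std::size_t col) {
    return line_of(c_base + (row * N + col) * sizeof(float));
}

// Number of consecutive output elements that fit into one cache line.
constexpr std::size_t floats_per_line() {
    return kAssumedLineSize / sizeof(float);
}
```

Under these assumptions, and with a line-aligned base address, 16 consecutive columns of one row of `c` share a cache line, so work items with neighbouring `col` values that run on different cores write to the same line.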

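When I profile the code with

perf c2c record -F 60000 ./a.out

and generate the report with

perf c2c report -c tid,iaddr

the only shared cache line shown is the one containing the atomic counter, not the ones holding the array c.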

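My questions are:

Should I see false sharing on the output array c?

The test system is an Intel Coffee Lake (no NUMA). Should I also expect no sharing on other Intel generations and on ARM machines?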

In case it is relevant, here is the full perf output (the source line threadpool.cpp:25 is the atomic decrement of the counter):

=================================================
            Trace Event Information              
=================================================
  Total records                     :     414185
  Locked Load/Store Operations      :        126
  Load Operations                   :     164311
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          1
  Loads - no mapping                :          7
  Load Fill Buffer Hit              :        477
  Load L1D hit                      :      84675
  Load L2D hit                      :          7
  Load LLC hit                      :      79115
  Load Local HITM                   :         40
  Load Remote HITM                  :          0
  Load Remote HIT                   :          0
  Load Local DRAM                   :         29
  Load Remote DRAM                  :          0
  Load MESI State Exclusive         :          0
  Load MESI State Shared            :         29
  Load LLC Misses                   :         29
  Load access blocked by data       :          0
  Load access blocked by address    :          0
  Load HIT Local Peer               :          0
  Load HIT Remote Peer              :          0
  LLC Misses to Local DRAM          :      100.0%
  LLC Misses to Remote DRAM         :        0.0%
  LLC Misses to Remote cache (HIT)  :        0.0%
  LLC Misses to Remote cache (HITM) :        0.0%
  Store Operations                  :     249874
  Store - uncacheable               :          0
  Store - no mapping                :          0
  Store L1D Hit                     :     249829
  Store L1D Miss                    :         45
  Store No available memory level   :          0
  No Page Map Rejects               :       4617
  Unable to parse data source       :          0

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :          1
  Load HITs on shared lines         :        185
  Fill Buffer Hits on shared lines  :         43
  L1D hits on shared lines          :         66
  L2D hits on shared lines          :          0
  LLC hits on shared lines          :         76
  Load hits on peer cache or nodes  :          0
  Locked Access on shared lines     :        104
  Blocked Access on shared lines    :          0
  Store HITs on shared lines        :        785
  Store L1D hits on shared lines    :        785
  Store No available memory level   :          0
  Total Merged records              :        825

=================================================
                 c2c details                     
=================================================
  Events                            : cpu/mem-loads,ldlat=30/P
                                    : cpu/mem-stores/P
  Cachelines sort on                : Total HITMs
  Cacheline data grouping           : offset,tid,iaddr

=================================================
           Shared Data Cache Line Table          
=================================================
#
#        ----------- Cacheline ----------      Tot  ------- Load Hitm -------    Total    Total    Total  --------- Stores --------  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss      N/A       FB       L1       L2    LclHit  LclHitm    RmtHit  RmtHitm       Lcl       Rmt
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  .......  ........  .......  ........  ........
#
      0      0x7fff33e60700     0     151  100.00%       40       40        0      970      185      785      785        0        0       43       66        0        36       40         0        0         0         0

=================================================
      Shared Cache Line Distribution Pareto      
=================================================
#
#        ----- HITM -----  ------- Store Refs ------  --------- Data address ---------                                     ---------- cycles ----------    Total       cpu                                  Shared                         
#   Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A              Offset  Node  PA cnt            Tid        Code address  rmt hitm  lcl hitm      load  records       cnt                          Symbol  Object        Source:Line  Node
# .....  .......  .......  .......  .......  .......  ..................  ....  ......  .............  ..................  ........  ........  ........  .......  ........  ..............................  .....  .................  ....
#
  ----------------------------------------------------------------------
      0        0       40      785        0        0      0x7fff33e60700
  ----------------------------------------------------------------------
           0.00%    2.50%   23.44%    0.00%    0.00%                0x34     0       1    84530:a.out      0x5e87bc063dfe         0       241       130      203         1  [.] ThreadPool::QueueTask(void  a.out  atomic_base.h:628   0
           0.00%    2.50%   24.84%    0.00%    0.00%                0x34     0       1    84533:a.out      0x5e87bc063a32         0       306       188      233         1  [.] ThreadMain(std::stop_token  a.out  atomic_base.h:628   0
           0.00%    0.00%   25.61%    0.00%    0.00%                0x34     0       1    84532:a.out      0x5e87bc063a32         0         0       167      225         1  [.] ThreadMain(std::stop_token  a.out  atomic_base.h:628   0
           0.00%    0.00%   26.11%    0.00%    0.00%                0x34     0       1    84534:a.out      0x5e87bc063a32         0         0       193      228         1  [.] ThreadMain(std::stop_token  a.out  atomic_base.h:628   0
           0.00%   42.50%    0.00%    0.00%    0.00%                0x38     0       1    84533:a.out      0x5e87bc0639f0         0       132       128       33         1  [.] ThreadMain(std::stop_token  a.out  threadpool.cpp:25   0
           0.00%   35.00%    0.00%    0.00%    0.00%                0x38     0       1    84534:a.out      0x5e87bc0639f0         0       133       121       28         1  [.] ThreadMain(std::stop_token  a.out  threadpool.cpp:25   0
           0.00%   17.50%    0.00%    0.00%    0.00%                0x38     0       1    84532:a.out      0x5e87bc0639f0         0       124       111       20         1  [.] ThreadMain(std::stop_token  a.out  threadpool.cpp:25   0

perf c2c did not show any further instances of cache-line sharing on the Coffee Lake target, but it did show shared cache lines on other targets (an Alder Lake laptop and a Graviton 3 instance).

Answer 1

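The key to getting more detailed statistics is the sampling rate. The Alder Lake laptop has a maximum sampling frequency of 120 Hz. (I am not sure how sampling works on Graviton 3.)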

On Intel machines, the frequency can be set with the -F X option, and the system-wide maximum can be read from

/proc/sys/kernel/perf_event_max_sample_rate
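As a small illustration, the limit can also be read programmatically before choosing a value for -F. This is a sketch that assumes a Linux system where the file above exists and holds a single integer; `read_long_from_file` and `perf_max_sample_rate` are hypothetical helper names.

```cpp
#include <cassert>
#include <fstream>
#include <optional>
#include <string>

// Reads one integer from a text file. Returns std::nullopt if the file
// cannot be opened or does not start with a number.
std::optional<long> read_long_from_file(const std::string& path) {
    assert(!path.empty());
    std::ifstream in(path);
    long value = 0;
    if (in >> value) {
        return value;
    }
    return std::nullopt;
}

// System-wide upper bound for the perf sampling frequency, if available.
std::optional<long> perf_max_sample_rate() {
    return read_long_from_file("/proc/sys/kernel/perf_event_max_sample_rate");
}
```

If `perf_max_sample_rate()` returns a value below the frequency passed to -F, the requested rate exceeds what the system currently allows.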
