I am trying to understand a simple example and wonder why I do not see more false sharing than what
perf c2c
reports.
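In my example (a matrix multiplication), two places could cause false sharing (apart from constant reads):
I expected cache line contention based on the following sources (Linux kernel documentation, SO thread): contention arises when global data is accessed (shared) by many CPUs and the concurrent accesses include at least one write, i.e. the write/write or write/read case.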
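The write/write condition described above can be made concrete by counting how many cache lines of an output array would be written by more than one thread. The following is only a sketch under stated assumptions: a 64-byte cache line, a contiguous array split into equal contiguous chunks per thread, and the hypothetical helper names same_cache_line and lines_with_multiple_writers. None of this is taken from the original threadpool code, whose partitioning may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>

// Assumed cache line size; 64 bytes is common on x86 and many ARM cores.
constexpr std::size_t kLineSize = 64;

// True if two byte addresses fall into the same cache line.
inline bool same_cache_line(std::uintptr_t a, std::uintptr_t b) {
    return a / kLineSize == b / kLineSize;
}

// Hypothetical partitioning: thread t writes elements [t*chunk, (t+1)*chunk)
// of a contiguous array starting at byte address `base`.
// Returns the number of cache lines written by more than one thread,
// i.e. the candidates for write/write false sharing.
inline std::size_t lines_with_multiple_writers(std::uintptr_t base,
                                               std::size_t elem_size,
                                               std::size_t count,
                                               std::size_t threads) {
    if (threads == 0 || count == 0 || elem_size == 0) return 0;
    const std::size_t chunk = (count + threads - 1) / threads;  // ceil division
    std::map<std::uintptr_t, std::set<std::size_t>> writers;    // line -> thread ids
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t owner = i / chunk;
        const std::uintptr_t first = (base + i * elem_size) / kLineSize;
        const std::uintptr_t last = (base + (i + 1) * elem_size - 1) / kLineSize;
        for (std::uintptr_t line = first; line <= last; ++line) {
            writers[line].insert(owner);
        }
    }
    std::size_t shared = 0;
    for (const auto& entry : writers) {
        if (entry.second.size() > 1) ++shared;
    }
    return shared;
}
```

For example, with 8-byte elements, 100 elements and 4 threads, each chunk is 200 bytes, so every chunk boundary falls inside a cache line and three lines have two writers. With 128 elements and a line-aligned base, the boundaries coincide with line boundaries and no line has more than one writer. Whether such lines show up as HITM events in perf c2c also depends on timing and on the sampled loads, which this sketch does not model.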
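When I profile the code with
perf c2c record -F 60000 ./a.out
and generate the report with perf c2c report -c tid,iaddr
the only shared cache line shown is the one containing the atomic counter, not the ones holding the array c.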
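My questions are:
Should I see false sharing on the output array
c?
The test system is an Intel Coffee Lake (no NUMA). Should I also expect to see no sharing on other Intel generations and on ARM machines?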
Output of perf c2c report (threadpool.cpp:25
is the atomic decrement of the counter):
=================================================
Trace Event Information
=================================================
Total records : 414185
Locked Load/Store Operations : 126
Load Operations : 164311
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 1
Loads - no mapping : 7
Load Fill Buffer Hit : 477
Load L1D hit : 84675
Load L2D hit : 7
Load LLC hit : 79115
Load Local HITM : 40
Load Remote HITM : 0
Load Remote HIT : 0
Load Local DRAM : 29
Load Remote DRAM : 0
Load MESI State Exclusive : 0
Load MESI State Shared : 29
Load LLC Misses : 29
Load access blocked by data : 0
Load access blocked by address : 0
Load HIT Local Peer : 0
Load HIT Remote Peer : 0
LLC Misses to Local DRAM : 100.0%
LLC Misses to Remote DRAM : 0.0%
LLC Misses to Remote cache (HIT) : 0.0%
LLC Misses to Remote cache (HITM) : 0.0%
Store Operations : 249874
Store - uncacheable : 0
Store - no mapping : 0
Store L1D Hit : 249829
Store L1D Miss : 45
Store No available memory level : 0
No Page Map Rejects : 4617
Unable to parse data source : 0
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 1
Load HITs on shared lines : 185
Fill Buffer Hits on shared lines : 43
L1D hits on shared lines : 66
L2D hits on shared lines : 0
LLC hits on shared lines : 76
Load hits on peer cache or nodes : 0
Locked Access on shared lines : 104
Blocked Access on shared lines : 0
Store HITs on shared lines : 785
Store L1D hits on shared lines : 785
Store No available memory level : 0
Total Merged records : 825
=================================================
c2c details
=================================================
Events : cpu/mem-loads,ldlat=30/P
: cpu/mem-stores/P
Cachelines sort on : Total HITMs
Cacheline data grouping : offset,tid,iaddr
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total --------- Stores -------- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss N/A FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
#
0 0x7fff33e60700 0 151 100.00% 40 40 0 970 185 785 785 0 0 43 66 0 36 40 0 0 0 0
=================================================
Shared Cache Line Distribution Pareto
=================================================
#
# ----- HITM ----- ------- Store Refs ------ --------- Data address --------- ---------- cycles ---------- Total cpu Shared
# Num RmtHitm LclHitm L1 Hit L1 Miss N/A Offset Node PA cnt Tid Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node
# ..... ....... ....... ....... ....... ....... .................. .... ...... ............. .................. ........ ........ ........ ....... ........ .............................. ..... ................. ....
#
----------------------------------------------------------------------
0 0 40 785 0 0 0x7fff33e60700
----------------------------------------------------------------------
0.00% 2.50% 23.44% 0.00% 0.00% 0x34 0 1 84530:a.out 0x5e87bc063dfe 0 241 130 203 1 [.] ThreadPool::QueueTask(void a.out atomic_base.h:628 0
0.00% 2.50% 24.84% 0.00% 0.00% 0x34 0 1 84533:a.out 0x5e87bc063a32 0 306 188 233 1 [.] ThreadMain(std::stop_token a.out atomic_base.h:628 0
0.00% 0.00% 25.61% 0.00% 0.00% 0x34 0 1 84532:a.out 0x5e87bc063a32 0 0 167 225 1 [.] ThreadMain(std::stop_token a.out atomic_base.h:628 0
0.00% 0.00% 26.11% 0.00% 0.00% 0x34 0 1 84534:a.out 0x5e87bc063a32 0 0 193 228 1 [.] ThreadMain(std::stop_token a.out atomic_base.h:628 0
0.00% 42.50% 0.00% 0.00% 0.00% 0x38 0 1 84533:a.out 0x5e87bc0639f0 0 132 128 33 1 [.] ThreadMain(std::stop_token a.out threadpool.cpp:25 0
0.00% 35.00% 0.00% 0.00% 0.00% 0x38 0 1 84534:a.out 0x5e87bc0639f0 0 133 121 28 1 [.] ThreadMain(std::stop_token a.out threadpool.cpp:25 0
0.00% 17.50% 0.00% 0.00% 0.00% 0x38 0 1 84532:a.out 0x5e87bc0639f0 0 124 111 20 1 [.] ThreadMain(std::stop_token a.out threadpool.cpp:25 0
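perf c2c does not show more instances of cache line sharing on the Coffee Lake target, but it does show shared cache lines on the other targets (an Alder Lake laptop and a Graviton 3 instance). On the Intel machines, the sampling frequency can be set with
-F X
and the maximum value can be read from
/proc/sys/kernel/perf_event_max_sample_rate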
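To check whether a requested -F value exceeds the kernel limit before recording, the limit can be read programmatically. This is a minimal sketch: the helper names parse_sample_rate, read_max_sample_rate and exceeds_limit are hypothetical, and the file is assumed to hold a single positive integer.

```cpp
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Parses a positive integer sample rate from text such as "100000\n".
inline std::optional<long> parse_sample_rate(const std::string& text) {
    std::istringstream in(text);
    long rate = 0;
    if (in >> rate && rate > 0) return rate;
    return std::nullopt;
}

// Reads the limit from the given path; returns nullopt if the file
// cannot be opened or does not contain a positive integer.
inline std::optional<long> read_max_sample_rate(
    const std::string& path = "/proc/sys/kernel/perf_event_max_sample_rate") {
    std::ifstream file(path);
    if (!file) return std::nullopt;
    std::string line;
    std::getline(file, line);
    return parse_sample_rate(line);
}

// True if the limit is known and the requested frequency is above it.
inline bool exceeds_limit(long requested_hz, std::optional<long> limit) {
    return limit.has_value() && requested_hz > *limit;
}
```

Calling exceeds_limit(60000, read_max_sample_rate()) would then indicate whether the -F 60000 used above is higher than what the current machine allows.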