Is it possible to reduce cache misses when sequentially accessing a large vector?

Question

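I am trying to reduce cache misses when sequentially accessing a large array. The array holds about 10 million elements. Is there any way to minimize the cache misses?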

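There are two arrays: one is the raw array loaded from the data file (`ds_append_`), and the other (`ds_`) is updated from the first. The sample data types are defined as follows: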

/**
 * @brief Raw sample data structure which loads from the file.
 */
struct sample_data_app_t {
  point_double_t sample_data_{};  //!< sample data's raw data.
  int32_t cid_{};                 //!< sample data's Channel's ID.
  int32_t dbg_blk_id_{-1}, sid_{};
};

/**
 * @brief Sample data structure.
 * @details Sample data structure contains sample point read from the file,
 * mapped-xy calculated by the R/W configures, and display-xy calculated by the
 * chart configures. @see ch_config_t
 */
struct sample_data_t {
  int32_t cid_{};                //!< sample data's Channel's ID (0 ~ 8).
  point_double_t mapped_pos_{};  //!< sample data's mapped position.
  sample_data_app_t* aptr_{};    //!< sample data's append pointer.
};

/**
 * @brief Sample dataset structure.
 * @details Sample dataset structure contains a vector of sample data.
 */
struct sample_dataset_t {
  std::vector<sample_data_t> ds_;                   //!< sample dataset's sample data vector.
  std::shared_ptr<sample_data_app_t[]> ds_append_;  //!< sample dataset's sample data append vector.
  size_t array_size_{0};                            //!< sample dataset's sample data array size.
};

The following code snippet updates `ds_` from `ds_append_`:

auto update_func = [&](sample_data_t& sample) {
  const auto cid = sample.cid_;  // channel index 0 ~ 8
  const auto& ch_config = chConfigs[cid];
  const auto& sample_range = sampleDataConfig.ch_limits_[cid];

  const auto ax = sample.aptr_->sample_data_.x_;
  const auto ay = sample.aptr_->sample_data_.y_;

  const auto slen = sample_range.y_max_ - sample_range.y_min_;
  const auto ys = (ay - sample_range.y_min_) * 256.0 / slen;

  const double yy1 = std::abs(ch_config.max_spc_) * ch_config.cos_x_ * ys / 256.0;
  const double mapped_y = ch_config.os_y_ + yy1 * ch_config.scale_y_;

  const double mapped_x = ch_config.os_x_ + ax * ch_config.unit_scale_x_;
  sample.mapped_pos_ = {mapped_x, mapped_y};
};

const auto data_num = ds.ds_.size();
const auto thd_num = std::thread::hardware_concurrency();
if (data_num > kParallelThreshold) {
  tbb::parallel_for(
    tbb::blocked_range<size_t>(0, data_num, data_num / thd_num),
    [&](const tbb::blocked_range<size_t>& r) {
      for (size_t i = r.begin(); i != r.end(); i++) {
        update_func(ds.ds_[i]);
      }
    },
    tbb::static_partitioner{});
} else {
  std::for_each(ds.ds_.begin(), ds.ds_.end(), update_func);
}

Note 1: the `cid` of each `sample` is random in the range 0 to 8.

Note 2: `chConfigs[]` and `sampleDataConfig.ch_limits_[]` both have length 9.

Profiling shows several memory-bound hotspots. One of them is:

const auto cid = sample.cid_; // channel index 0 ~ 8
const auto& ch_config = chConfigs[cid];
const auto& sample_range = sampleDataConfig.ch_limits_[cid];

I am confused why the address-computation instruction `lea (%r8, %r8, 2), %r8` is reported as so heavily memory bound.

Test code repository: github -- perf_tests -- test_array_update

Answer 1

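I just ran this code through a profiler, and it is hitting the maximum DRAM bandwidth (about 30 GB/s on my machine). There is a hard limit on how much data the DRAM sticks can transfer per second.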
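To make the bandwidth limit concrete, here is a rough back-of-the-envelope sketch. It assumes `point_double_t` is a plain pair of doubles, that each `aptr_` points at a distinct, sequentially laid-out element, and that nothing stays in cache between passes; `bytes_per_element` and `min_pass_ms` are hypothetical helper names, not part of the question's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: point_double_t is a plain pair of doubles.
struct point_double_t { double x_{}, y_{}; };

struct sample_data_app_t {
  point_double_t sample_data_{};
  int32_t cid_{};
  int32_t dbg_blk_id_{-1}, sid_{};
};

struct sample_data_t {
  int32_t cid_{};
  point_double_t mapped_pos_{};
  sample_data_app_t* aptr_{};
};

// Estimated bytes crossing the memory bus for one element of one update pass.
constexpr std::size_t bytes_per_element() {
  return sizeof(sample_data_t)       // read cid_ and aptr_
       + sizeof(sample_data_t)       // write back the line holding mapped_pos_
       + sizeof(sample_data_app_t);  // read the raw x/y through aptr_
}

// Lower bound on the time of one pass over n elements, in milliseconds,
// if DRAM bandwidth is the only limit.
constexpr double min_pass_ms(std::size_t n, double bandwidth_gb_per_s) {
  return static_cast<double>(n * bytes_per_element()) /
         (bandwidth_gb_per_s * 1e9) * 1e3;
}
```

On a typical 64-bit build both structs occupy 32 bytes, so the estimate is about 96 bytes per element. For 10 million elements that is roughly 0.96 GB per pass, or about 32 ms at 30 GB/s, no matter how many threads run the loop.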

The most obvious solution is to throw money at the problem and use a machine with more RAM channels and faster DRAM sticks. (FYI, GPUs have faster DRAM.)

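From the code's point of view, you need to do more work for every byte you read or write, which is what being cache friendly means here: chain multiple operations/transformations inside each iteration instead of running several consecutive `parallel_for` loops.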
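As a minimal sketch of that idea, the two functions below produce the same result, but the fused version touches each element once instead of three times. The three stages are hypothetical stand-ins, not the transformations from the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-element stages; stand-ins for real transformations.
inline double scale_stage(double v)  { return v * 0.5; }
inline double offset_stage(double v) { return v + 3.0; }
inline double clamp_stage(double v)  { return std::fmin(v, 100.0); }

// Three passes: each loop streams the whole vector from memory again
// once the data no longer fits in cache.
inline void update_separate(std::vector<double>& data) {
  for (auto& v : data) v = scale_stage(v);
  for (auto& v : data) v = offset_stage(v);
  for (auto& v : data) v = clamp_stage(v);
}

// One fused pass: every element is loaded once, run through all stages
// while it sits in a register, and stored once.
inline void update_fused(std::vector<double>& data) {
  for (auto& v : data) v = clamp_stage(offset_stage(scale_stage(v)));
}
```

For a vector much larger than the last-level cache, the separate version moves roughly three times as many bytes through DRAM for the same arithmetic.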

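You can write more cache-friendly code by building a data-processing pipeline with C++20 ranges instead of running several raw loops. You can even process the data while reading it from the file rather than saving it to an intermediate vector; tbb even has an example of processing data in parallel while it is being read from a file.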
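The "process while reading" part can be sketched with plain iostreams, which is a simpler stand-in for a ranges or tbb pipeline, not either of those APIs. `point_t`, `map_config_t`, `load_and_map`, and the whitespace-separated "x y" input format are all assumptions made for illustration.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical point type and mapping parameters (not the question's types).
struct point_t { double x, y; };
struct map_config_t { double os_x, scale_x, os_y, scale_y; };

// Reads whitespace-separated "x y" pairs and maps each pair as soon as it
// is parsed, so no intermediate vector of raw samples is ever stored.
inline std::vector<point_t> load_and_map(std::istream& in,
                                         const map_config_t& cfg) {
  std::vector<point_t> mapped;
  double x = 0.0, y = 0.0;
  while (in >> x >> y) {
    mapped.push_back({cfg.os_x + x * cfg.scale_x,
                      cfg.os_y + y * cfg.scale_y});
  }
  return mapped;
}
```

Each value is transformed while it is still hot in a register, so the raw data never makes a round trip through DRAM. The trade-off is that the raw samples are no longer available if the mapping configuration changes later.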

I also recommend running that pipeline in small batches (for example 8 to 128 elements) so that you can use SIMD without thrashing the L1 cache.
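A minimal sketch of batching, assuming a batch size of 64 and two trivial hypothetical stages: the scratch buffer is 512 bytes, so it stays in L1 between stages, and each inner loop runs over a small contiguous range that compilers can usually auto-vectorize.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBatch = 64;  // assumed batch size: 64 doubles = 512 bytes

// Applies two stages batch by batch. The intermediate results live in a
// small stack buffer instead of a full-size temporary vector.
inline void map_in_batches(const std::vector<double>& in,
                           std::vector<double>& out,
                           double scale, double offset) {
  out.resize(in.size());
  std::array<double, kBatch> scratch{};
  for (std::size_t base = 0; base < in.size(); base += kBatch) {
    const std::size_t n = std::min(kBatch, in.size() - base);
    // Stage 1: scale into the scratch buffer.
    for (std::size_t i = 0; i < n; ++i) scratch[i] = in[base + i] * scale;
    // Stage 2: add the offset and store the final result.
    for (std::size_t i = 0; i < n; ++i) out[base + i] = scratch[i] + offset;
  }
}
```

With stages this simple a fused loop would do just as well; batching pays off when a stage is easier to write or vectorize over a whole block than per element.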
