Why is a PyTorch C++ extension so much slower than its equivalent numba version?

Question

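I have been experimenting with various options for speeding up some for-loop logic in PyTorch. The two obvious options are to use numba or to write a custom C++ extension.

As an example, I picked a "variable length delay line" from digital signal processing. It can be written simply but inefficiently with a plain Python for loop: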

def delay_line(samples, delays):
    """
    :param samples: Float tensor of shape (N,)
    :param delays: Int tensor of shape (N,)
    
    The goal is basically to mix each `samples[i]` with the delayed sample
    specified by a per-sample `delays[i]`.
    """
    for i in range(len(samples)):
        delay = int(delays[i].item())
        index_delayed = i - delay
        if index_delayed < 0:
            index_delayed = 0

        samples[i] = 0.5 * (samples[i] + samples[index_delayed])

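Knowing how badly for loops perform in Python, I hoped to get much better performance by implementing the same loop in C++. Following the tutorial, I came up with this literal translation from Python to C++: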

void delay_line(torch::Tensor samples, torch::Tensor delays) {

  int64_t input_size = samples.size(-1);

  for (int64_t i = 0; i < input_size; ++i) {
    int64_t delay = delays[i].item<int64_t>();
    int64_t index_delayed = i - delay;
    if (index_delayed < 0) {
      index_delayed = 0;
    }

    samples[i] = 0.5 * (samples[i] + samples[index_delayed]);
  }
}

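I also took the Python function and wrapped it in various jit decorators to obtain numba and torchscript versions of it (for details on the numba wrapping, see my other answer). I then benchmarked all versions, also varying whether the tensors reside on the CPU or the GPU. The results are rather surprising: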

╭──────────────┬──────────┬────────────────────╮
│ Method       │ Device   │   Median time [ms] │
├──────────────┼──────────┼────────────────────┤
│ plain_python │ CPU      │             13.481 │
│ torchscript  │ CPU      │              6.318 │
│ numba        │ CPU      │              0.016 │
│ cpp          │ CPU      │              9.056 │
│ plain_python │ GPU      │             45.412 │
│ torchscript  │ GPU      │             47.809 │
│ numba        │ GPU      │              0.236 │
│ cpp          │ GPU      │             31.145 │
╰──────────────┴──────────┴────────────────────╯

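Note: the sample buffer size is fixed to 1024; results are the median of 100 executions, to ignore artifacts from the initial jit overhead; creating the input data and moving it to the device is excluded from the measurements; full benchmark script as a gist.

The most striking result: the C++ variant is surprisingly slow. The fact that numba is two orders of magnitude faster shows that the problem can indeed be solved much faster. The fact that the C++ variant is still very close to the notoriously slow Python for loop suggests that something is not quite right.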
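The measurement procedure described in the note can be sketched roughly as follows. This is not the benchmark script from the gist, only a minimal standard-library illustration: the names `median_time_ms`, `make_inputs`, and `delay_line_list` are hypothetical, the delay line runs on plain Python lists instead of tensors, and the explicit warm-up call is an extra assumption on top of taking the median.

```python
import statistics
import time


def median_time_ms(fn, make_args, runs=100, warmup=1):
    """Return the median wall-clock time of fn(*make_args()) in milliseconds."""
    # Warm-up calls absorb one-off costs such as JIT compilation.
    for _ in range(warmup):
        fn(*make_args())

    timings = []
    for _ in range(runs):
        args = make_args()  # input creation happens outside the timed region
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1e3)
    return statistics.median(timings)


def delay_line_list(samples, delays):
    """Pure-Python stand-in for the delay line, operating on plain lists."""
    for i in range(len(samples)):
        index_delayed = max(i - delays[i], 0)
        samples[i] = 0.5 * (samples[i] + samples[index_delayed])


def make_inputs(n=1024):
    # Hypothetical test data: a ramp signal and a constant delay of 3 samples.
    return [float(i) for i in range(n)], [3] * n


elapsed = median_time_ms(delay_line_list, make_inputs)
print(f"median: {elapsed:.3f} ms")
```

Fresh inputs are created for every run because the function modifies `samples` in place. When timing tensors on a GPU, the device would additionally have to be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock, since CUDA work is launched asynchronously.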

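I am wondering what could explain the poor performance of the C++ extension. The first thing that comes to mind is missing optimization. However, I have made sure that the compilation uses optimizations. Switching from -O2 to -O3 also made no difference.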

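To isolate the overhead of the pybind11 function call, I replaced the C++ function with an empty function body, i.e. one that does nothing. This reduced the time to 2-3 μs, which means that the time really is spent in that particular function body.

Any ideas why I am observing such poor performance? Is there anything I could do on the C++ side to match the performance of the numba implementation?

Bonus question: is it expected that the GPU version is much slower than the CPU version?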

Answer 1

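I realize this is an older question, but I want to provide an answer for those who end up here looking to speed up their C++ extensions. As mentioned in the GitHub issue, the problem is that torch::Tensor::operator[] is slower than you would expect, because it has to fetch the data and convert it to the relevant type, which is slower than a typical std::vector::operator[]. The solution is to access the raw data in the Tensor directly.

For a contiguous tensor like the one in this example, that is not too difficult: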

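#include <span>  // std::span requires C++20
#include <torch/extension.h>

void delay_line_forward(torch::Tensor samples, torch::Tensor delays) {

  const int64_t input_size = samples.size(-1);

  // Raw-pointer access is only valid for contiguous memory. TORCH_CHECK stays
  // active in release builds, unlike assert.
  TORCH_CHECK(samples.is_contiguous() && delays.is_contiguous(),
              "samples and delays must be contiguous");
  // data_ptr<float>() requires float32 tensors. For an integer `delays`
  // tensor, as described in the question's docstring, use the matching type
  // instead, e.g. data_ptr<int64_t>() with std::span<int64_t>.
  std::span<float> samples_span(samples.data_ptr<float>(), input_size);
  std::span<float> delays_span(delays.data_ptr<float>(), input_size);

  for (int64_t i = 0; i < input_size; ++i) {
    int64_t delay = static_cast<int64_t>(delays_span[i]);
    int64_t index_delayed = i - delay;
    if (index_delayed < 0) {
      index_delayed = 0;
    }

    // Plain memory reads and writes, no intermediate Tensor per element.
    samples_span[i] = 0.5 * (samples_span[i] + samples_span[index_delayed]);
  }
}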

We can see that this has the desired effect (I ran into problems with GPU execution that I did not want to debug, so I only show the CPU results):

╭──────────────┬──────────┬────────────────────╮
│ Method       │ Device   │   Median time [ms] │
├──────────────┼──────────┼────────────────────┤
│ plain_python │ CPU      │              6.077 │
│ torchscript  │ CPU      │              4.273 │
│ numba        │ CPU      │              0.007 │
│ cpp          │ CPU      │              0.002 │
╰──────────────┴──────────┴────────────────────╯

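As for why GPU execution is so much slower: the code here is inherently serial, and for serial execution a CPU will always outperform a GPU. By running the operations in batches using tensor operators, this code could be parallelized, and then I imagine you would see the GPU really shine.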
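To illustrate what "inherently serial" means here, the following sketch contrasts the in-place loop from the question with a batched variant in which every output depends only on the original input. It uses plain Python lists and hypothetical function names so that it runs without PyTorch. Note that the batched variant is a different filter: it does not reproduce the in-place results whenever a delayed index points at a sample that the loop has already overwritten.

```python
def delay_line_serial(samples, delays):
    """In-place loop from the question: later steps read earlier results."""
    out = list(samples)
    for i in range(len(out)):
        j = max(i - delays[i], 0)
        out[i] = 0.5 * (out[i] + out[j])  # out[j] may already be modified
    return out


def delay_line_batched(samples, delays):
    """Each output reads only the original input, so steps are independent."""
    return [
        0.5 * (samples[i] + samples[max(i - delays[i], 0)])
        for i in range(len(samples))
    ]


samples = [1.0, 2.0, 3.0, 4.0]
delays = [0, 1, 1, 2]

print(delay_line_serial(samples, delays))   # [1.0, 1.5, 2.25, 2.75]
print(delay_line_batched(samples, delays))  # [1.0, 1.5, 2.5, 3.0]
```

The two results diverge from index 2 onward, which is exactly where the serial loop reads a value it has already updated. Only the batched form maps directly onto tensor operators (for instance, indexing the original tensor with the clamped delayed indices), so parallelizing on the GPU presumes that this change in semantics is acceptable.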
