Java Vector API 对双精度列表求和的问题

问题描述 投票:0回答:1

我实现了三个简单的 sum 方法(在 scala 中),它们作用于 Array[Double] (或 double[])

    inline def sum2 =
      var sum: Double = 0.0
      var i: Int = 0
      inline val species = DoubleVector.SPECIES_256

      while i < species.loopBound(vec.length) do        
        val vec2: DoubleVector = DoubleVector.fromArray(species, vec, i)        
        sum = sum + vec2.reduceLanes(VectorOperators.ADD)
        i += species.length()
      end while      
      while i < vec.length do
        sum += vec(i)
        i += 1
      end while
      sum
    end sum2

    inline def sum: Double =
      var sum = 0.0
      var i = 0;
      while i < vec.length do
        sum = sum + vec(i)
        i = i + 1
      end while
      sum


    inline def sum3 =
      var i: Int = 0
      val species = DoubleVector.SPECIES_256

      var acc = DoubleVector.zero(species)

      while i < species.loopBound(vec.length) do
        val vec2: DoubleVector = DoubleVector.fromArray(species, vec, i)
        acc =
          acc.add(vec2) // This line, appears to break JMH. It's not clear why as it passes a unit test and compiles.
        i += species.length()
      end while

      var temp = acc.reduceLanes(VectorOperators.ADD)
      // var temp = 0.0
      while i < vec.length do
        temp += vec(i)
        i += 1
      end while
      temp
    end sum3

然后设置一个基准,看看是否有差异。


@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(value = 1)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
abstract class BLASBenchmark:


  private final val rand: Random = new Random(0);

  protected def randomDouble(): Double =
    return rand.nextDouble();

  protected def randomDoubleArray(n: Int): Array[Double] =
    val res = new Array[Double](n);

    for i <- 0 until n do res(i) = rand.nextDouble();
    end for
    return res;
  end randomDoubleArray

  protected def randomFloat(): Float =
    return rand.nextFloat();

  protected def randomFloatArray(n: Int): Array[Float] =
    val res = new Array[Float](n);
    for i <- 0 until n do res(i) = rand.nextFloat();
    end for
    return res;
  end randomFloatArray
end BLASBenchmark

@State(Scope.Thread)
class SumBenchmark extends BLASBenchmark:

  @Param(Array("3", "10", "100", "10000", "100000"))
  var len: String = uninitialized;

  var arr: Array[Double] = uninitialized

  @Setup(Level.Trial)
  def setup: Unit =
    arr = randomDoubleArray(len.toInt);
    ()
  end setup

  @Benchmark
  def sum_loop(bh: Blackhole) =
    val r = arr.sum
    bh.consume(r);
  end sum_loop

  @Benchmark
  def sum_vec(bh: Blackhole) =
    val r = arr.sum2
    bh.consume(r);
  end sum_vec

  /**
This doesn't work. I don't understand why.
   */
  @Benchmark
  def sum_vec_alt(bh: Blackhole) =
    val r = arr.sum3
    bh.consume(r);
  end sum_vec_alt

end SumBenchmark

结果如下(本地)

Benchmark               (len)   Mode  Cnt          Score           Error  Units
SumBenchmark.sum_loop       3  thrpt    3  650060202.104 ± 176059544.674  ops/s
SumBenchmark.sum_loop      10  thrpt    3  422309397.044 ±  19968584.370  ops/s
SumBenchmark.sum_loop     100  thrpt    3   30901788.878 ±    354896.502  ops/s
SumBenchmark.sum_loop   10000  thrpt    3     116283.523 ±      4388.696  ops/s
SumBenchmark.sum_loop  100000  thrpt    3      11487.085 ±       586.552  ops/s
SumBenchmark.sum_vec        3  thrpt    3  607547613.267 ±  25376258.940  ops/s
SumBenchmark.sum_vec       10  thrpt    3   48259596.149 ±   1205351.697  ops/s
SumBenchmark.sum_vec      100  thrpt    3    4604684.325 ±    122538.788  ops/s
SumBenchmark.sum_vec    10000  thrpt    3      45689.056 ±      1872.688  ops/s
SumBenchmark.sum_vec   100000  thrpt    3       4729.203 ±        83.013  ops/s

我阅读本文的方式是,我的矢量化版本实际上比简单的 while 循环慢得多。

我的问题;

  • 我测量的是“正确的东西”吗?我不是 JMH 的自信用户。
  • 我使用 VectorAPI 的实现是否幼稚或明显在某处导致性能下降?
  • 如果以上都不是,这会是预期的吗?它似乎破坏了 VectorAPI 的价值主张,我希望能够加速!
scala vector simd jmh
1个回答
0
投票

我相信我使用的算法确实很幼稚。

    inline def sum: Double =
      var i: Int = 0

      var acc = DoubleVector.zero(Matrix.doubleSpecies)
      val sp = Matrix.doubleSpecies
      val l = sp.length()

      while i < sp.loopBound(vec.length) do
        acc = acc.add(DoubleVector.fromArray(Matrix.doubleSpecies, vec, i))
        i += l
      end while
      var temp = acc.reduceLanes(VectorOperators.ADD)
      // var temp = 0.0
      while i < vec.length do
        temp += vec(i)
        i += 1
      end while
      temp
    end sum

主要变化是;

  • 删除 length 方法并从热循环中调用 Specires
  • 不分配中间数据结构 - 直接分配回累加器
  • 沿车道积累,然后在终点减少
  • 选择物种为“首选物种”

这提供了更令人满意的结果。

umBenchmark.sum_loop                  3     N/A  thrpt    3  454987372.518 ±  4974554.176  ops/s
SumBenchmark.sum_loop                100     N/A  thrpt    3   22881036.167 ±   429259.728  ops/s
SumBenchmark.sum_loop             100000     N/A  thrpt    3      10733.807 ±       65.014  ops/s
SumBenchmark.sum_vec                   3     N/A  thrpt    3  454811646.671 ± 12166349.908  ops/s
SumBenchmark.sum_vec                 100     N/A  thrpt    3   40353921.106 ±   547651.178  ops/s
SumBenchmark.sum_vec              100000     N/A  thrpt    3      40141.621 ±      714.686  ops/s
SumBenchmark.sum_vec_alt               3     N/A  thrpt    3  505591602.643 ± 36682920.662  ops/s
SumBenchmark.sum_vec_alt             100     N/A  thrpt    3  110847968.173 ±  1363183.523  ops/s
SumBenchmark.sum_vec_alt          100000     N/A  thrpt    3      42821.901 ±      244.293  ops/s

对于非常大的向量,为 4 倍(左右)。

© www.soinside.com 2019 - 2024. All rights reserved.